DIMACS
DIMACS REU 2025

General Information

Student: Rohit Bhagat
Office: CoRE 446
School: Rutgers University
E-mail: rohit.bhagat@rutgers.edu
Project: Learning, Random Walks, and Error Correction over Deletion Channels
Mentor: Salim El Rouayheb

Project Description

This project aims to develop efficient codes for communicating over the deletion channel, whose capacity remains a long-standing open problem in information theory. The goal of this project is to exploit the redundancy of the English language to create a code that is resilient to deletions, including in the adversarial case. This will involve the use of graph structures and random walks as well as LLMs.


Weekly Log

Week 1:
I moved in to Rutgers this week and familiarized myself with the program through orientations. My mentor and I had a preliminary meeting to discuss the general direction we would like to take the project in. I began some background reading on the topic, including Hamming codes and VT codes. I also started reading chapters of Cover's Elements of Information Theory. Aside from that, this week was mostly about getting situated and meeting all the great people here at the DIMACS REU program. I truly enjoy being around like-minded people and learning about their interests and project topics. We also run our own social events, and we've definitely started forming a nice community.
Week 2:
I continued covering more background material, such as reading Chapters 2 and 3 of Cover's Elements of Information Theory and Cover's Information Theory lectures. I also prepared and delivered Presentation 1. The main work for this week was experimenting with ChatGPT's ability to recover text with deleted characters. I wrote a Python workflow that takes a recent news article and deletes each character independently with a fixed probability. The deleted text is then fed into ChatGPT, which tries to recover the original. I kept increasing the deletion probability until ChatGPT could no longer accurately reconstruct the original text. This suggests that there is inherent redundancy in the English language that can be exploited to develop codes for the deletion channel.
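As a rough sketch, the deletion step amounts to dropping each character independently with probability p (the function below is illustrative, not the exact workflow code):

    import random

    def delete_chars(text: str, p: float) -> str:
        """Delete each character of `text` independently with probability p."""
        return "".join(ch for ch in text if random.random() >= p)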
Week 3:
I wrote a web scraper (sketched below) to easily obtain texts from the Internet (recent news articles) to feed into ChatGPT. I also streamlined the workflow for feeding prompts into ChatGPT, and I finished reading Chapter 3 of Cover and began Chapter 7. I also came up with a basic (though not very efficient) code for the deletion channel, which can probably be optimized. Since the idea is to map binary data to English text and transmit the English text over the channel, I thought about mapping every sequence of 16 bits to a 4-letter English word. Also, we had Culture Day on Friday, for which I gave two presentations.
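A minimal sketch of that kind of scraper, assuming requests and BeautifulSoup (sites and selectors vary, so this just collects paragraph text):

    import requests
    from bs4 import BeautifulSoup

    def scrape_article_text(url: str) -> str:
        """Fetch a news article and return its visible paragraph text."""
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Most news sites wrap the article body in <p> tags.
        return "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))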
Week 4:
I continued scraping news articles and feeding deleted texts into ChatGPT with different deletion probabilities to analyze ChatGPT's probability of error. I've started looking for a phase shift: a critical deletion probability such that the probability of error is essentially 0 (around 10^-6 or lower) for all deletion probabilities below it and non-zero for all deletion probabilities above it. This requires a massive text, on the order of millions of characters, and the text must be "clean" of grammatical errors and of specific unpredictable information such as numbers. I've begun using AI to generate such clean texts. I also continued reading Chapter 7 of Cover.

My mentor and I also discussed possible ways of encoding binary into English. One possibility is mapping fixed-length blocks of bits into words by indexing into an English dictionary (sketched below). This will be our first mapping attempt, and hopefully it performs better than benchmark methods from coding theory such as repetition codes. We also discussed removing all spaces from a text, since the text usually remains decipherable without them; this would be a good way to increase efficiency.
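A minimal sketch of the indexing idea, assuming a word list with at least 2^16 entries (the block length and word list are illustrative; the actual mapping is still being worked out):

    def encode_bits(bits: str, wordlist: list[str], block: int = 16) -> str:
        """Map each `block`-bit chunk to a word by indexing into `wordlist`.
        Assumes len(wordlist) >= 2**block and len(bits) is a multiple of `block`."""
        words = [wordlist[int(bits[i:i + block], 2)] for i in range(0, len(bits), block)]
        return " ".join(words)

    def decode_bits(text: str, wordlist: list[str], block: int = 16) -> str:
        """Invert the mapping: look up each word's index and emit its bit block."""
        index = {w: i for i, w in enumerate(wordlist)}
        return "".join(format(index[w], f"0{block}b") for w in text.split())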
Week 5:
This week was focused on obtaining a very large amount of sample text. We want to verify ChatGPT's ability to handle deletion probabilities on the order of 10^-6, and this requires a text file of at least 1 million characters. I used AI to generate massive human-like texts, and I've started using the OpenAI API to significantly speed up the process of generating texts and reconstructing deleted texts. Next week will be geared towards finding a phase shift, as described earlier, and trying out basic codes mapping binary to English. I also spent this week learning about the GPT transformer architecture and reading more about information theory.
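As a sketch of the reconstruction call through the OpenAI Python SDK (the model name and prompt below are placeholders, not the exact ones I use):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def reconstruct(deleted_text: str, model: str = "gpt-4o") -> str:
        """Ask the model to restore a text that had characters randomly deleted."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Characters were randomly deleted from a passage. "
                            "Reconstruct the original passage as faithfully as possible "
                            "and reply with the reconstructed text only."},
                {"role": "user", "content": deleted_text},
            ],
        )
        return response.choices[0].message.content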
Week 6:
This week, I was able to create a pipeline that reads in plain English text, deletes characters in an i.i.d. fashion, sends the deleted text to GPT via an API call, and retrieves the reconstructed version. Some pre-processing was needed to clean the texts, such as normalizing Unicode characters. I tested deletion probabilities from 10^-4 to 10^-1 and observed a phase transition below which GPT is able to reconstruct texts with a sufficiently low probability of error. I also implemented a rudimentary code (discussed earlier) mapping binary strings of a fixed length to 5-letter words separated by spaces, using a dictionary. GPT was able to recover deleted characters fairly well, but there are some limitations. Next week, I will look into mapping binary strings to English sentences, for which GPT should perform much better due to the inter-word correlations present in proper sentences.
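The sweep itself is simple once the earlier pieces are in place. A rough sketch, reusing the delete_chars and reconstruct functions sketched above and using a generic character-level similarity as the error measure (the actual metric and number of trials may differ):

    import difflib

    def reconstruction_error(original: str, reconstructed: str) -> float:
        """Character-level mismatch in [0, 1]; 0 means a perfect reconstruction."""
        return 1.0 - difflib.SequenceMatcher(None, original, reconstructed).ratio()

    def sweep(text: str, probabilities: list[float]) -> dict[float, float]:
        """Estimate the reconstruction error at each deletion probability."""
        results = {}
        for p in probabilities:
            deleted = delete_chars(text, p)   # i.i.d. deletion (Week 2 sketch)
            restored = reconstruct(deleted)   # GPT API call (Week 5 sketch)
            results[p] = reconstruction_error(text, restored)
        return results

    # Example: deletion probabilities spanning 10^-4 to 10^-1.
    # sweep(clean_text, [1e-4, 1e-3, 1e-2, 1e-1])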
Week 7:
I spent the majority of this week reading about various topics in information theory. Namely, I learned about Huffman codes, LZW (Lempel-Ziv-Welch) coding, and the entropy of the English language. I also read a survey paper on results for the deletion channel. Aside from reading, I jotted down a rough sketch of what I want my final presentation to look like. Next week, I'll focus on creating and delivering my final presentation, possibly adding some new results beyond what I currently have.

Presentations