DIMACS REU 2018

General Information

Student: Elise Catania
Office: 442 CoRE
School: University of Rochester
Major: Mathematics
Minor: Spanish
Clusters: Chemistry, American Sign Language, and Psychology
E-mail: ecatania@u.rochester.edu
Project: Codes for Storage with Queues for Access

Project Description

In industry, there is an emphasis on big data as well as machine learning, and computers must frequently process extremely large data sets. Instead of depending on one computer to complete a job, the job is parallelized: it is divided into k tasks distributed among multiple servers, or "workers." After parallelizing, we must account for stragglers, the slowest workers. To mitigate stragglers, the k tasks are encoded into n tasks, where k is less than or equal to n. For each job sent to the servers, the first k of the n tasks to finish recover all of the information needed to complete the job. The goal of the project is to find the optimal expected time for a job to be completed with respect to random parameters. Combinatorial designs as well as order statistics of random variables will be used to find the most efficient method for completing a job.
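The k-out-of-n idea can be sketched with a small Monte Carlo estimate. This is a minimal illustration, assuming i.i.d. exponential task times with rate mu; the distribution, the parameter values, and the function name are my assumptions, not the project's actual simulator:

```python
import random

def job_completion_time(n, k, mu=1.0, trials=100_000, seed=0):
    """Average time until the first k of n parallel tasks finish,
    assuming (for this sketch) i.i.d. exponential task times with rate mu."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        finish = sorted(rng.expovariate(mu) for _ in range(n))
        total += finish[k - 1]  # k-th order statistic: the job is done once k tasks finish
    return total / trials
```

For exponential task times, order statistics give a closed form to check against: E[T] = (1/mu)(H_n − H_{n−k}), where H_m is the m-th harmonic number.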


Sequence {123, 145, 167, 246, 257, 347, 356}

Here is a visualization of tasks arriving at servers. Blocks of the same color represent tasks of the same job. Once all seven jobs are sent to the servers, the sequence repeats itself.
queue-filling scheme

Weekly Log

Week 1:
The first day of the program served as an orientation. We had time to meet our fellow students, faculty, and mentors. I met with my mentor, Dr. Emina Soljanin, and she explained the project I will be working on this summer, "Codes for Storage with Queues for Access." I also had the opportunity to meet Amir Behrouzi-Far, a graduate student in Dr. Soljanin's lab. In addition, Dr. Soljanin recommended several papers to read to gain more insight into the project, including Matroids You Have Known and The Many Names of (7 3 1).
Week 2:
On Monday, I gave the first presentation on my project, "Codes for Storage with Queues for Access". I enjoyed sharing the details of the project with my fellow REU students. It was also very interesting to see the projects my peers are working on. This week, I have been studying the (7 3 1) balanced incomplete block design (BIBD) using the Fano plane. The Fano plane has seven points and seven lines, with three points on each line. We use the Fano plane as a model for the scenario where there are seven servers in total and tasks are sent to three of the seven servers at a time. We can label the points as seen in the diagram below. Reading off the three points on each of the seven lines, we get the set {123, 145, 167, 246, 257, 347, 356}. Across these seven triples, each pair of points occurs exactly once. We capitalize on this structural property of the Fano plane, since it allows for minimal overlap between the servers. I am in the process of strategically assigning tasks to arrive at servers in a particular order. My goal is to see whether the ordering of the sequence affects job completion time. We will run sequences through our simulation to see the job completion time associated with each sequence.
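The pair property is easy to check by machine. A small sketch, with the triples encoded as strings (the function name and encoding are mine):

```python
from itertools import combinations

# The seven triples read off the lines of the Fano plane.
fano = ["123", "145", "167", "246", "257", "347", "356"]

def is_steiner_triple_system(blocks, points="1234567"):
    """Check that every pair of points appears in exactly one block --
    the (7 3 1)-BIBD property that gives minimal overlap between servers."""
    counts = {pair: 0 for pair in combinations(sorted(points), 2)}
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] += 1
    return all(c == 1 for c in counts.values())
```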

Week 3:
This week consisted of many simulations. There are 5,040 permutations of the set we get from the Fano plane. However, since tasks arrive cyclically at the servers, many of these sequences are equivalent: the 5,040 orderings reduce to 720 unique cyclic sequences. Running 720 sequences through our simulation is much more manageable than 5,040. We ran the (7 3 1)-BIBD for k=1,2,3 in order to see which k was optimal for this design. For the (7 3 1)-BIBD, k=2 proves to be the most efficient. We also ran the (7 3 1)-BIBD against random assignment in order to see whether the (7 3 1)-BIBD would in fact yield more efficient job completion times. Just as we thought, we can do better than random. Random means that out of m servers, we choose n servers at random. Another BIBD that we wanted to test was the (10 4 2) design. We ran the (10 4 2) for k=1,2,3,4 and then ran it against random. Once again, the optimal k value was 2. It seems that the most efficient k is roughly in the middle of all the possible k values for each design. We also looked at fork designs as well as the "round robin" design. Round robin is a scheme where tasks are sent to the first n servers, then the next n, and so on. The fork design seems to be the best, then comes round robin, and random comes in last. We generated many plots supporting this information (the proof is in the pudding, and if you would like some pudding, then stay tuned).
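The reduction from 5,040 orderings to 720 cyclic classes can be verified by brute force: take the lexicographically smallest rotation of each permutation as its class representative. A quick sketch (the helper name is mine):

```python
from itertools import permutations

blocks = ("123", "145", "167", "246", "257", "347", "356")

def canonical(seq):
    """Lexicographically smallest rotation -- one representative per cyclic class."""
    return min(seq[i:] + seq[:i] for i in range(len(seq)))

# Since the seven blocks are distinct, each cyclic class has exactly 7 members,
# so 7! = 5040 orderings collapse to 5040 / 7 = 720 classes.
classes = {canonical(p) for p in permutations(blocks)}
```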
Week 4:
I am studying the exponential, shifted exponential, Pareto type 1, and Pareto type 2 distributions. I am looking at characteristics of these four distributions such as the probability density function (PDF), cumulative distribution function (CDF), complementary cumulative distribution function (CCDF), expected value, and standard deviation. In our old model, service time was exponentially distributed with rate μ, rexp(μ). Parallelizing the job into k tasks then makes each task's service time exponentially distributed with rate μk, rexp(μk). We are now considering a more realistic model with two components. One component is Δ, the job size. The other component is τ, the server randomness, which is distributed rexp(μ). The new model requires a balancing factor α, so that neither job-specific randomness nor server randomness dominates the other. The new model for service time is αΔ/k + (1−α)τ. Our next step is to run simulations for constant Δ.
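The two-component model can be sampled directly. A minimal sketch, assuming constant Δ and illustrative parameter values (the function name and all numbers are my assumptions):

```python
import random

def service_time_sample(delta, k, mu, alpha, rng):
    """One draw from the two-component service model:
    a deterministic per-task share of the job size, alpha*delta/k,
    plus server-side randomness (1-alpha)*tau with tau ~ Exp(mu).
    alpha in [0, 1] balances job-size vs. server variability."""
    tau = rng.expovariate(mu)
    return alpha * delta / k + (1 - alpha) * tau

rng = random.Random(1)
samples = [service_time_sample(delta=1.0, k=3, mu=2.0, alpha=0.5, rng=rng)
           for _ in range(200_000)]
# Expected value: alpha*delta/k + (1-alpha)/mu = 0.5/3 + 0.5/2
mean = sum(samples) / len(samples)
```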
Week 5:
This week we were introduced to a vertex covering problem, still incorporating servers and service time. In the problem, we have a combination of systematic servers, which each contain only one type of file, and coded servers. If there are k files, one needs k coded servers to retrieve the information in one of the files. We assume that each server has a service rate of 1.0, and we are looking to maximize the service capacity of the system. We can think of this problem as a graph with nodes and edges, and from the graph representation we can write a system of inequalities, one from each node. At first, we approached this problem from a linear algebra perspective, trying to manipulate the system of inequalities that we get from each of the coded nodes. However, since we were working with inequalities rather than equations, the linear algebra approach wasn't as friendly as we initially hoped. It seemed that solving this problem with linear programming would be the most efficient method.
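To illustrate why linear programming fits, here is a toy two-variable LP solved by enumerating vertices of the feasible polygon. The constraints below are hypothetical stand-ins, not our actual node inequalities, and a real instance would use a proper LP solver:

```python
from itertools import combinations

def solve_2d_lp(A, b, c, eps=1e-9):
    """Maximize c.x over {x >= 0, A x <= b} by enumerating candidate vertices
    (pairwise intersections of boundary lines). Fine for toy instances only."""
    # Boundary lines: each constraint a.x = b_i, plus the axes x = 0 and y = 0.
    lines = list(zip(A, b)) + [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)]
    best = None
    for ((a1, a2), r1), ((a3, a4), r2) in combinations(lines, 2):
        det = a1 * a4 - a2 * a3
        if abs(det) < eps:
            continue  # parallel lines, no intersection
        x = (r1 * a4 - r2 * a2) / det
        y = (a1 * r2 - a3 * r1) / det
        if x < -eps or y < -eps:
            continue  # violates nonnegativity
        if any(ai * x + aj * y > ri + eps for (ai, aj), ri in zip(A, b)):
            continue  # violates some constraint
        val = c[0] * x + c[1] * y
        if best is None or val > best[0]:
            best = (val, x, y)
    return best

# Hypothetical constraints on two demand rates (lam_a, lam_b) -- numbers are
# illustrative only: lam_a <= 1.5, lam_b <= 1.5, lam_a + lam_b <= 2.
A = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
b = [1.5, 1.5, 2.0]
val, la, lb = solve_2d_lp(A, b, c=(1.0, 1.0))  # maximize the total rate
```

An optimum of a linear objective over a polytope is always attained at a vertex, which is exactly what the simplex method exploits at scale.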
Week 6:
This week we continued to work on the vertex covering problem. I read Dr. Soljanin's paper On the Service Capacity Region of Accessing Erasure Coded Content. I have also started to prepare for the final presentation. Dalton and I will be presenting either next Thursday or Friday.
Week 7:
This Thursday and Friday are final presentations and Dalton and I will be presenting on Friday. I have been working on the presentation: designing the slides, organizing our simulation results, and choosing the main points and important results we would like to present to the fellow REU students and faculty.
Week 8:
I am looking at the "mice and elephants" problem. In the first few weeks, we made the simplifying assumption that every job is the same size. However, this isn't very realistic. Now, we would like to see what happens when we have a mix of very large jobs ("elephants") and very small jobs ("mice"). We would like to know how the mice-and-elephants situation affects the existing hierarchy of best to worst schemes. We think that the combinatorial design will remain the most efficient. To see whether our hypothesis is true, we will run many simulations this week.
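One simple way to model the mix is a two-point job-size distribution: mostly mice, with an occasional elephant. A sketch with illustrative parameters (the function name and all numbers are my assumptions):

```python
import random

def job_size(rng, p_elephant=0.05, mouse=1.0, elephant=100.0):
    """Hypothetical 'mice and elephants' job-size mix: with probability
    p_elephant draw a huge job, otherwise a tiny one."""
    return elephant if rng.random() < p_elephant else mouse

rng = random.Random(7)
sizes = [job_size(rng) for _ in range(100_000)]
# E[size] = 0.95*1 + 0.05*100 = 5.95: a few elephants dominate the average,
# which is what stresses the scheduling schemes.
mean = sum(sizes) / len(sizes)
```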
Week 9:
This is the final week at the REU. I am working on the final paper and poster. In addition, we are continuing to work on the mice and elephants example. Ideally, if we had more time to work on this problem, and made our work a little more rigorous, we could have meaningful results, which would be very exciting. Since it is the last week, I have been reflecting on my time here at DIMACS. I have enjoyed meeting students from all over the country as well as a group of students from the Czech Republic. I had never met so many students prone to spontaneous outbursts of math discussion!

Presentations


Additional Information

Acknowledgements

I would like to thank the NSF for supporting my research. Thank you to my mentors Dr. Emina Soljanin and graduate student Amir Behrouzi-Far for their guidance throughout my project. In addition, I would like to thank Dr. Lazaros Gallos and the DIMACS team for making this program possible.