Name: | Ishaan Ivaturi |
---|---|
Email: | iai9 (at) scarletmail.rutgers.edu |
Education: | Rutgers University |
Partner: | David Wang |
Mentor: | Dr. James Abello, Haoyang Zhang |
Project: | Graph Cities |
The goal of this project is to find interesting and/or unique subgraph motifs in massive graphs using the Graph Cities visualization tool, then generate examples of those motifs and eventually train a model to classify them.
For the first week, I explored the Graph City model built from patent citations and found six interesting/unique patterns. I became somewhat comfortable with the Graph Cities interface and also delivered a presentation on my goals for this project.
For the second week, I started work on a C++ programming assignment. My task is to write a program that finds the connected components of the graph, forms larger "meta-nodes" from them, and then counts the number of vertices and edges in each "meta-node" as well as the edges between adjacent "meta-nodes". I familiarized myself with the input and output formats and planned the assignment as a sequence of methods to write. I also got familiar with the Boost library, since I will be using it to find the connected components of the graph.
For the third week, I continued work on the C++ programming assignment. I completed the methods that read the input files and populate the Boost disjoint-sets data structure, then used it to find all the connected components in the graph. To keep the runtime as low as possible, I avoided structures like maps and sets, since they carry unnecessary lookup overhead. Instead, I used a vector to map the large, non-contiguous vertex numbers to local vertex numbers that start from 0 and increment, and did the same to map representative local vertex numbers to node numbers.
This week, I implemented a multi-threaded version of the programming assignment. I first timed all of my methods on graphs of various sizes to find the bottleneck. It turned out to be the method that counts the vertices, edges, and meta-edges, so that method stood to benefit most from multi-threading. I learned about multi-threading in C++ from some online tutorials, then adapted the function accordingly. Updating the vertex and edge counts was straightforward, since I could lock individual vector entries, but finding the edges between meta-nodes required a map. Preallocating space for every possible meta-node connection would use too much memory, so I instead gave each thread its own map and merged the maps at the end. The merge time offset much of the gain from multi-threading, so the overall improvement was noticeable but modest. Overall, we determined that two threads was a good tradeoff between speed and program safety.
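The per-thread-map strategy can be sketched like this. It is an illustrative reconstruction, not the assignment code: the names (`count_meta_edges`, `meta_of`, the packed 64-bit key) and the assumption that meta-node IDs fit in 32 bits are mine.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using MetaEdgeCounts = std::unordered_map<uint64_t, long>;

// Pack an unordered meta-node pair into one 64-bit key (assumes IDs < 2^32).
uint64_t key(uint32_t a, uint32_t b) {
    if (a > b) std::swap(a, b);
    return (uint64_t(a) << 32) | b;
}

// Each thread counts meta-edges for its slice of the edge list into a
// private map (no locking in the counting phase); the maps are merged
// serially afterwards -- the merge step that offset part of the speedup.
MetaEdgeCounts count_meta_edges(
        const std::vector<std::pair<uint32_t, uint32_t>>& edges,
        const std::vector<uint32_t>& meta_of,  // vertex -> meta-node
        unsigned num_threads) {
    std::vector<MetaEdgeCounts> partial(num_threads);
    std::vector<std::thread> workers;
    std::size_t chunk = (edges.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(edges.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) {
                uint32_t mu = meta_of[edges[i].first];
                uint32_t mv = meta_of[edges[i].second];
                if (mu != mv) ++partial[t][key(mu, mv)];  // cross-meta-node edges only
            }
        });
    }
    for (auto& w : workers) w.join();

    MetaEdgeCounts total;  // serial merge of the per-thread maps
    for (const auto& p : partial)
        for (const auto& kv : p) total[kv.first] += kv.second;
    return total;
}
```

Because each thread touches only its own map, the counting phase needs no synchronization at all; the cost is moved entirely into the merge.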
For week 6, I tested the effect of file compression on the speed of my program. I compressed the CSV files with gzip, then used the Boost library's gzip decompressor to read each file line by line in my program. The runtime cost of decompression outweighed the savings from reading a smaller file, so operating on the compressed files was actually about 20 percent slower. I compiled the file-reading times into a table and made a bar graph showing the effect of compression on progressively larger files.