Name: | Ishaan Ivaturi |
---|---|
Email: | iai9 (at) scarletmail.rutgers.edu |
Education: | Rutgers University |
Partner: | David Wang |
Mentor: | Dr. James Abello, Haoyang Zhang |
Project: | Graph Cities |
The goal of this project is to find interesting and/or unique subgraph motifs in massive graphs using the Graph Cities visualization tool, then generate examples of those motifs and eventually train a model to classify them.
For the first week, I explored the Graph City model built from patent citations and found six interesting/unique patterns. I became somewhat comfortable with the Graph Cities interface and also delivered a presentation on my goals for this project.
For the second week, I started work on a C++ programming assignment. My task is to write a program that finds the connected components of the graph, forms larger "meta-nodes" from them, and then counts the number of vertices and edges in each "meta-node" as well as the edges between adjacent "meta-nodes". I familiarized myself with the input and output formats and planned the assignment as a sequence of methods to write. I also got familiar with the Boost library, since I will be using it to find the connected components of the graph.
For the third week, I continued work on the C++ programming assignment. I completed the methods that read the input files and populate the Boost disjoint-sets data structure, then used it to find all the connected components in the graph. To keep the runtime as low as possible, I avoided structures like maps and sets, since they carry unnecessary lookup overhead. Instead, I used a vector to map the large, non-contiguous vertex numbers to local vertex numbers that start from 0 and increment, and did the same to map representative local vertex numbers to node numbers.
This week, I implemented a multi-threaded version of the programming assignment. I first timed all of my methods on graphs of various sizes to find the bottleneck. It turned out to be the method that counts the vertices, edges, and meta-edges, so that method stood to benefit most from multi-threading. I learned about multi-threading in C++ from some online tutorials, then adapted the function accordingly. Updating the vertex and edge counts was straightforward, since I could lock individual vector entries, but finding the edges between meta-nodes required a map. Preallocating space for every possible meta-node connection would use too much memory, so I instead gave each thread its own map and merged the maps at the end. The merge time offset much of the gain from multi-threading, so the overall improvement was noticeable but modest. Overall, we determined that two threads was a good tradeoff between speed and program safety.
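The per-thread-map strategy can be sketched like this. It is an illustrative reconstruction, not the assignment code: the names (`count_meta_edges`, `meta_of`, the packed 64-bit key) and the assumption that meta-node IDs fit in 32 bits are mine.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using MetaEdgeCounts = std::unordered_map<uint64_t, long>;

// Pack an unordered meta-node pair into one 64-bit key (assumes IDs < 2^32).
uint64_t key(uint32_t a, uint32_t b) {
    if (a > b) std::swap(a, b);
    return (uint64_t(a) << 32) | b;
}

// Each thread counts meta-edges for its slice of the edge list into a
// private map (no locking in the counting phase); the maps are merged
// serially afterwards -- the merge step that offset part of the speedup.
MetaEdgeCounts count_meta_edges(
        const std::vector<std::pair<uint32_t, uint32_t>>& edges,
        const std::vector<uint32_t>& meta_of,  // vertex -> meta-node
        unsigned num_threads) {
    std::vector<MetaEdgeCounts> partial(num_threads);
    std::vector<std::thread> workers;
    std::size_t chunk = (edges.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(edges.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) {
                uint32_t mu = meta_of[edges[i].first];
                uint32_t mv = meta_of[edges[i].second];
                if (mu != mv) ++partial[t][key(mu, mv)];  // cross-meta-node edges only
            }
        });
    }
    for (auto& w : workers) w.join();

    MetaEdgeCounts total;  // serial merge of the per-thread maps
    for (const auto& p : partial)
        for (const auto& kv : p) total[kv.first] += kv.second;
    return total;
}
```

Because each thread touches only its own map, the counting phase needs no synchronization at all; the cost is moved entirely into the merge.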
For week 6, I tested the effect of file compression on the speed of my program. I compressed the CSV files with gzip, then used the Boost library's gzip decompressor to read each file line by line in my program. The runtime cost of decompression outweighed the savings from reading a smaller file, so operating on the compressed files was actually about 20 percent slower. I compiled the file-reading times into a table and made a bar graph showing the effect of compression on progressively larger files.