WikiMap: Processing

Raj
3 min read · Jun 1, 2017


This continues the WikiMap series. If you haven’t already, read the first one, WikiMap: Genesis!

I finished the last post with a series of updates as my optimization tests completed. By the end of it, I had created a decent command-line tool whose arguments let me either test or run the full WikiMap generation function. For those following along on my GitHub, note that the file is misnamed: although it is called MultiThreaded.py, it doesn’t use threading; it uses multiprocessing.

I chose to run with 10 processes, which has worked out OK so far. Unfortunately, a few hours into each run there was always a small bug I had to patch before restarting. Backing up the Neo4j database in a tarball prior to building the relationships has cut the turnaround between tests from a few hours to a few minutes, which has sped things up significantly.
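The backup trick itself is nothing fancy. A minimal sketch of the idea in Python (the paths here are assumptions, and Neo4j should be stopped before snapshotting or restoring so the store files are consistent):

```python
import shutil
import tarfile

# Assumed locations -- adjust to wherever your Neo4j install keeps its data.
DATA_DIR = "/var/lib/neo4j/data"
SNAPSHOT = "/tmp/wikimap-nodes.tar.gz"

def snapshot():
    """Archive the nodes-only database so a failed run can be rolled back."""
    with tarfile.open(SNAPSHOT, "w:gz") as tar:
        tar.add(DATA_DIR, arcname="data")

def rollback():
    """Discard a broken relationship build and restore the snapshot."""
    shutil.rmtree(DATA_DIR)
    with tarfile.open(SNAPSHOT, "r:gz") as tar:
        tar.extractall(path="/var/lib/neo4j")
```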

As I type this, the estimated processing time for the map generation is fluctuating between 9 and 13 days, which is far better than the 134 days I started off expecting. While the code runs in the background, I’ll take this opportunity to summarize some of the things I have learned thus far.

The advantage of structured code

This may seem like a no-brainer, but as my GitHub history can attest, the structure developed with time. By applying proper coding techniques, I was able to reuse more code and keep it maintainable. Modifying the main program to take command-line arguments was quite simple compared with the initial work of making the code safe to parallelize in the first place.
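For illustration, the test-versus-full-run switch can be as simple as an argparse setup like this (the flag names and helper functions are hypothetical, not the ones actually in MultiThreaded.py):

```python
import argparse

def run_timing_test(processes):
    print(f"Running timing test with {processes} processes...")  # placeholder

def build_full_map(processes):
    print(f"Building the full WikiMap with {processes} processes...")  # placeholder

def main():
    parser = argparse.ArgumentParser(description="WikiMap generation")
    parser.add_argument("--test", action="store_true",
                        help="run a timing test instead of the full build")
    parser.add_argument("--processes", type=int, default=10,
                        help="number of worker processes to use")
    args = parser.parse_args()
    if args.test:
        run_timing_test(args.processes)
    else:
        build_full_map(args.processes)

if __name__ == "__main__":
    main()
```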

Multiprocessing vs. Multithreading

To the advanced reader, this is self-explanatory. Multiprocessing runs entirely separate processes, each with its own interpreter and memory, while multithreading runs several threads inside one process, all taking turns. Multiprocessing therefore gets around the Global Interpreter Lock, while multithreaded Python code is still bound by it. For CPU-heavy work like this, multiprocessing was definitely the way to go.
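In Python terms, the pattern looks roughly like this (parse_page is a stand-in for the real per-page work, not WikiMap’s actual function):

```python
from multiprocessing import Pool

def parse_page(page_id):
    # Stand-in for CPU-heavy work such as parsing one page of a Wikipedia dump.
    return sum(i * i for i in range(10000))

if __name__ == "__main__":
    # Each worker is a separate process with its own interpreter and memory,
    # so the Global Interpreter Lock never serializes the workers.
    with Pool(processes=10) as pool:
        results = pool.map(parse_page, range(100))
    print(f"processed {len(results)} pages")
```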

There’s a time to use “hacky” code…

Had I followed the intended usage of the Neo4j driver to the letter, the code would have taken significantly longer to run. It is possible that, with transactions done right, I could shorten the time even further. However, that has its own issues: the database would absorb a massive load of statements all at once while sitting idle the rest of the time. In this situation, where time is very much the constraint, the hacky session solution seemed like the better way to go.
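Roughly, the contrast looks like this with the official Neo4j Python driver (the labels, property names, and credentials are assumptions for illustration, not WikiMap’s actual schema):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

links = [("Python", "Guido_van_Rossum"), ("Python", "CPython")]
CYPHER = ("MATCH (a:Page {title: $src}), (b:Page {title: $dst}) "
          "CREATE (a)-[:LINKS_TO]->(b)")

with driver.session() as session:
    # The "hacky" route: each statement goes straight through the session
    # as its own auto-commit transaction, so work trickles into the
    # database steadily instead of piling up.
    for src, dst in links:
        session.run(CYPHER, src=src, dst=dst)

    # The by-the-book route: batch statements into an explicit transaction.
    # Fewer round trips, but the database absorbs the whole batch at commit.
    tx = session.begin_transaction()
    for src, dst in links:
        tx.run(CYPHER, src=src, dst=dst)
    tx.commit()

driver.close()
```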

… but good code comes first

I already said this in my first lesson learned, but seriously, I cannot emphasize enough how important well-designed, well-planned code is for long-term usability. Not doing so from the start probably added at least a week to this project.

Use tests when possible

My initial way to measure time was “let’s run this for a bit and see how far it gets.” Needless to say, that was not very effective. Once I implemented a testing protocol and gathered data, I could make faster and more justifiable coding decisions. Increasing the scale of my tests further validated the data I gathered and let me choose the theoretically fastest process count for my computer.
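The harness doesn’t need to be elaborate. A timing sweep along these lines (the workload is a stand-in, and the process counts are just an example range) is enough to pick a pool size from data rather than guesswork:

```python
import time
from multiprocessing import Pool

def work(n):
    # Stand-in for one unit of the real per-page workload.
    return sum(i * i for i in range(50000))

if __name__ == "__main__":
    for procs in (2, 4, 8, 10, 16):
        start = time.perf_counter()
        with Pool(processes=procs) as pool:
            pool.map(work, range(200))
        elapsed = time.perf_counter() - start
        print(f"{procs:>2} processes: {elapsed:.2f}s")
```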

Names matter

I’ve used “threading” and “processing” so loosely throughout this series that a casual reader will be thoroughly confused. I apologize for this and, when I have the time, I’ll go back and use each term correctly. I’ll even refactor the code base when I can.

The next post will be after the processing finishes and after I have gathered actual data to present.

The next one is up! WikiMap: (Semi) Conclusion!
