WikiMap: Optimization

Raj Shrimali
May 23, 2017

This continues the saga of my Wikipedia Map project. Read the first one here or the previous one here. I finished the last post thinking that there was no benefit to threads. Internally, I couldn't accept that, so I left a bigger test running overnight, one with 1000 nodes. I'm glad I did, because the results were strikingly similar to the 500-node run.

First, let’s see the graphs from last time again to refresh our memory.

Median Values for 500 Node Run
Mean Values for 500 Node Run

Initially, I thought there was no pattern to be drawn. The following graphs changed my mind.

The patterns are identical, which makes me believe this is the pattern to expect. The only real conclusion is that 12 and 4 processes are the fastest, with 8 close behind.

I will stick to 12 processes in the future to minimize the chance that all of the processes get stuck on a huge text line. Unfortunately, I started the timing a little late, but I will update this post when I have numbers for the theoretical lower bound I want to calculate. From rough estimates, I remember node creation taking roughly 5–6 hours, which would put the lower bound at roughly 40 days, which is about where I am currently.

[Update May 23, 9:36 AM] It seems that the code I wrote to time node creation runs slower than before, despite being nearly identical apart from using two processes for node generation. It might have to do with the way I handle sessions; I also "properly" used the .consume() method on the result of each query, which could have something to do with it. I'll update the post once I figure out what the issue is.
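For context, the "proper" pattern I moved to looks roughly like this. This is a sketch, not my exact code; the URI, credentials, and helper name are placeholders (and in 2017-era driver versions the import path was neo4j.v1):

```python
from neo4j import GraphDatabase  # neo4j.v1 in 2017-era driver versions

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def run_statement(query, **params):
    # One short-lived session per statement, with the result fully
    # consumed before moving on. Correct "by the book", but the
    # per-session overhead adds up over millions of statements.
    with driver.session() as session:
        session.run(query, **params).consume()
```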

[Update May 24, 8:10 AM] Creating a node is perhaps the simplest operation I can run; it requires no sub-querying or any other work apart from the actual "create". The time Neo4J took to create 6,257,507 nodes was 18,678.45 seconds, which comes out to 0.00298 seconds per node. This is higher than the "average-case" time: I was using my computer heavily while the operation ran, to the point where everything started lagging, and that lag would carry over into the measurement, inflating the estimate.
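In Cypher terms, that operation boils down to a single bare CREATE. A minimal sketch, assuming a :Page label and a title property (my placeholder names, not necessarily the project's actual schema):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # No MATCH, no sub-query: the write is the entire statement.
    session.run("CREATE (:Page {title: $title})", title="Philosophy")

# Sanity check on the measured rate:
# 18,678.45 s / 6,257,507 nodes ≈ 0.00298 s per node.
```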

We know that the computer must perform at least two queries for every relationship-creation operation: one to find the starting node and one to find the ending node. Before running map creation I'll obviously build all the necessary indexes on my Neo4J properties to improve search time, but with n this large, I can still expect a higher search time for each node.
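Building such an index is a one-line schema statement. In Neo4j 3.x syntax, with the same placeholder :Page/title names as above, it would look something like:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Neo4j 3.x index syntax: turns the MATCH on each relationship
    # endpoint from a full label scan into an index lookup.
    session.run("CREATE INDEX ON :Page(title)")
```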

[Update May 24, 9:00 PM] I confirmed my earlier hypothesis that the slowdown was due to my "proper" usage of sessions and transactions. By switching back to my earlier method of only closing a session every 100,000 statements, I cut the runtime by over 66%; although tests are still running at the time of this update, I was seeing 1000-node runtimes as low as 150 seconds. Doing the math, that comes out to 150/1000 * 6,257,507 = 938,626 seconds, or ~11 days. 11 days is far more acceptable, and is a period I am willing to wait.
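The fix amounts to reusing one session across many statements instead of opening and closing one per query. A sketch of the batched pattern, again with my placeholder schema names rather than the project's real ones:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

BATCH_SIZE = 100_000  # statements per session, per the update above

def create_links(pairs):
    """pairs: iterable of (source_title, target_title) tuples."""
    session = driver.session()
    for i, (src, dst) in enumerate(pairs, start=1):
        session.run(
            "MATCH (a:Page {title: $src}), (b:Page {title: $dst}) "
            "CREATE (a)-[:LINKS_TO]->(b)",
            src=src, dst=dst,
        )
        # Recycle the session only every BATCH_SIZE statements,
        # instead of once per query.
        if i % BATCH_SIZE == 0:
            session.close()
            session = driver.session()
    session.close()
```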

[Update May 25, 2017, 6:30 AM] My tests finished running, and the change is incredible. Let the following graph summarize the new time requirements:

Processing Times for 1000 Nodes after dramatically reducing the number of sessions created.

There is a much clearer downward trend in this data compared to the earlier graphs. By reducing the number of sessions created, I saved over 200 seconds per run. The mean time for 12 processes was 176.34 seconds, while for 10 processes it came out to 167.19 seconds. Although 10 was lower than 12, the overall trend makes me think that 12 is, for all intents and purposes, equal to 10 in terms of speed. The medians seem to support this theory: the 12-process median of 162.74 seconds sits right next to the 10-process median of 164.74 seconds.

Regardless, a difference of 10 seconds per 1000 nodes scales to (only) about 17 hours over the full graph, which really isn't much in the grand scheme of processing. With 12 processes I can expect the code to run in 12.8 days, while with 10 I can expect 12.1 days. To decide which to use, I ran another test with 10 iterations and 1500 nodes using 10 and 12 processes. I'll update this post when that finishes running as well.
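The extrapolation is straightforward: scale the per-1000-node time up to the full node count.

```python
TOTAL_NODES = 6_257_507

def full_run_days(seconds_per_1000_nodes):
    # 86,400 seconds in a day
    return seconds_per_1000_nodes / 1000 * TOTAL_NODES / 86_400

print(full_run_days(176.34))  # 12 processes -> ~12.8 days
print(full_run_days(167.19))  # 10 processes -> ~12.1 days
print(full_run_days(10.0))    # the 10 s gap alone -> ~0.72 days ≈ 17 hours
```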

[Final Update May 25, 9:00 AM] Once the tests were done, 10 and 12 had nearly equal processing times. 12 processes had a mean time of 251.98 seconds with a standard deviation of 26.20 seconds, while 10 processes had a mean time of 249.19 seconds with a standard deviation of 23.19 seconds. In the interest of RAM and my own computer's sanity, I'm going with 10 processes instead of 12, since 10 is equal to, if not slightly better than, 12. There will be more in this series, but this is the final optimization step I'm taking for this project.

The next one is up! Read WikiMap: Processing!
