WikiMap: Pivot

Raj
May 22, 2017 · 5 min read


This is intended as an update to my previous post. If you haven’t done so already, read WikiMap: Genesis!

As always, the latest version of the source code is available on my GitHub.

I left the last post with an eye towards transactions as a way to improve efficiency and hopefully speed up the process. While I was at it, I decided to adopt some better programming practices, such as guarding the entry point with

if __name__ == '__main__':
    # ...

I also implemented a Manager and an exit handler so that when I inevitably hit ^C, I wouldn’t orphan the child processes.
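Roughly, that setup looks like the sketch below. The Worker class, the sentinel-based loop, and the worker count are illustrative stand-ins for my actual parser processes, not the exact code:

import atexit
from multiprocessing import Manager, Process

class Worker(Process):
    """Illustrative stand-in for a relationship-creating worker process."""
    def __init__(self, queue):
        super(Worker, self).__init__()
        self.queue = queue

    def run(self):
        while True:
            item = self.queue.get()
            if item is None:            # sentinel value tells the worker to stop
                break
            # ... parse `item` and create its relationships here ...

def shutdown(workers):
    """Exit handler: make sure a ^C doesn't leave orphaned child processes."""
    for w in workers:
        if w.is_alive():
            w.terminate()
        w.join()

if __name__ == '__main__':
    manager = Manager()
    queue = manager.Queue()
    workers = [Worker(queue) for _ in range(8)]
    atexit.register(shutdown, workers)
    for w in workers:
        w.start()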

The downside is that this caused incredible overhead. I don't know whether it was the transactions or the better programming practices, but the rate of relationship creation slowed dramatically. The code achieved roughly 100k relationships in fifteen minutes, compared to the 1M/hour rate of the previous iteration (250k per fifteen minutes).

I suspected that this might have to do with the “better practices” programming introducing additional, unnecessary overhead. To test this, I undid quite a bit of it: I removed the Manager, removed the if __name__ == '__main__': guard, and reverted to a normal global queue variable instead of passing the queue in as a parameter and overloading the __init__() method of my Process classes. Unfortunately, this didn't produce any real change. In fact, in my loose benchmark, the rate went down, if anything. It's very possible that I made some other mistake while performing these measurements, but the bottom line is that introducing transactions did not increase speed by any measure.

Transactions did, however, introduce an error. On the call to tx.commit(), the process was sometimes unable to obtain a lock, most likely because the database was being accessed by another process at the same time. I am running the Community Edition of Neo4j, so it is very possible that there is a cap on the number of concurrent connections. The error raised does not even appear in the Neo4j Python transaction documentation; for the curious, it is a

neo4j.exceptions.TransientError: LockClient can't wait on RWLock
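For context, each worker commits its batch in a pattern roughly like the one below. This is a sketch assuming the official Neo4j Python driver; the connection details, the Page label, and the LINKS_TO relationship type are placeholders rather than my exact schema:

from neo4j import GraphDatabase              # import path varies by driver version
from neo4j.exceptions import TransientError

# placeholder connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def commit_batch(pairs):
    """Create one batch of relationships inside a single explicit transaction."""
    with driver.session() as session:
        tx = session.begin_transaction()
        try:
            for src, dst in pairs:
                tx.run("MATCH (a:Page {title: $src}), (b:Page {title: $dst}) "
                       "MERGE (a)-[:LINKS_TO]->(b)",
                       src=src, dst=dst)
            tx.commit()                       # this is where the TransientError surfaces
        except TransientError:
            tx.rollback()                     # another worker holds the lock
            raise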

This occurred regardless of whether I used session.commit_transaction() or tx.commit(). It is probably because committing a transaction takes significantly longer than committing a single statement to the database, leaving a larger window for collisions between the processes. This made me suspect that there was an “ideal” number of workers. I had been using 8, but I upped it to 12 in hopes of a little gain. To try to determine the ideal worker count, I modified my code a little for testing. Essentially, I iterated from 2 to 12 workers, running a timer on how long it takes the program to finish creating all relationships for 1000 nodes. Once the control sequence exits, it runs a query to delete all relationships and then restarts with a new number of workers, saving the time taken to a file.
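In code, the harness is conceptually just the loop below; run_workers() is a hypothetical stand-in for kicking off the relationship-creation pass, and driver is the same connection object as in the earlier sketch:

import time

def benchmark(node_limit=1000, min_workers=2, max_workers=12):
    """Time the relationship-creation pass for each worker count,
    wiping relationships between runs so each starts from a clean slate."""
    with open("worker_timings.csv", "w") as out:
        for n_workers in range(min_workers, max_workers + 1):
            start = time.time()
            run_workers(n_workers, node_limit)     # hypothetical entry point
            elapsed = time.time() - start

            # delete all relationships before the next run
            with driver.session() as session:
                session.run("MATCH ()-[r]->() DELETE r")

            out.write("%d,%f\n" % (n_workers, elapsed))
            out.flush()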

I chose to use a simple time.time() call before and after the code block instead of the timeit library due to time constraints: I do not have the time to run several trials for each worker count. However, given the relatively enormous amount of time each iteration takes, I do not believe this will be an issue. The performance of the computer should average out over a run, so my error should be quite low. A few seconds' difference is nothing when we're dealing with minutes and potentially hours.

Note: for the purposes of timing, I left the “good practices” in the code. Since the time was similar with and without them, I felt there was no harm in leaving them there. Strangely, after one run, there seemed to be no discernible pattern.

Since my single-iteration hypothesis did not pan out, I re-ran the code with 5 trials per worker count. At up to 9 minutes per run, that would be 9 * 12 * 5 = 540 minutes, roughly nine hours of testing. In the interest of time, I cut the node count in half, so each run only processed 500 nodes.

500-node run with 5 trials of each worker count

The two top subplots use median values; the two bottom subplots use mean values. Due to the low trial count (n=5), means are subject to influence by outliers. Both plots tell a similar story: there is not much difference in processing time across the number of processes. The error bars show a 95% confidence interval based on the standard error, computed as 1.96 * stdev(five trials) / sqrt(5). The fitted axis zooms in on the data, while the zeroed axis shows the proper scale.
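Concretely, given the five timings for one worker count, the half-width of each error bar is just the following (the numbers below are made-up placeholders, not my measured times):

import statistics
from math import sqrt

def ci95_halfwidth(trials):
    """95% confidence interval half-width via the normal approximation."""
    return 1.96 * statistics.stdev(trials) / sqrt(len(trials))

# five made-up timings (in seconds) for one worker count
print(ci95_halfwidth([412.0, 398.5, 405.2, 420.1, 401.3]))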

Although this shows that there isn't a significant difference between process counts, the runs with more processes all have a smaller mean and median than the run with two processes. Additionally, there is a critical flaw in the test: it does not exercise edge-case performance. As I mentioned in the previous post, certain lines take a very long time to process. Since those lines weren't included in the test, it is hard to say this data conclusively proves anything.

I stopped here to reconsider some basic assumptions and do some calculations. I need to determine what the ideal time for this computation would be, which will be the subject of my next post. I'll try to find a lower bound on data-processing time, which may help me determine how close to “ideal” I am and whether I should try to find someone else's computer to run the computation on.

The next story is up! Read WikiMap: Optimization!
