Racing through Wikipedia with Python

Parul Baweja
4 min readApr 8, 2018

--

Have you heard of WikiRace? Basically, it’s a game that challenges players to see how many clicks it takes to get from one Wikipedia page to another.

Example: It takes 3 clicks to get from ‘Python’ to ‘Ringo Starr.’

I built a Python tool to solve the WikiRace game. The CLI script takes a ‘Source’ and ‘Destination,’ and returns a path of Wikipedia page titles from the former to the latter. Let’s visualize with a graph:

Each node of the graph has a title (i.e. ‘Python’) and children (links to other pages i.e. ‘Monty Python’, ‘CPython’, etc). The core question is: how can we traverse this graph to find the shortest path from the start node to the end node?

The project was a great exploration of concurrency, asynchronous programming and breadth-first search. But, importantly, my workflow was a good case-study in iterative problem-solving.

My MVPs

  1. A Synchronous BFS — In this iteration, I made an API request to WikiMedia at each node, received its children and added them to the queue, until I found the one that matches the destination.

For two topics with say 4 pages between and tons of children to search, this version took 30+ minutes to complete. 30 MINUTES! Why? Because I was waiting for a response from the WikiMedia API before continuing my search.

Send a request. Receive a response. Add to queue. Repeat.

Let’s go asynchronous.

2. An Asynchronous BFS — Instead of making one API request at a time, I used Python’s asyncio library to make multiple requests at once, with responses arriving not necessarily in order.

Ok, I got it down to 4 minutes. Much, much better. But, why not search from both sides?

3. An Asynchronous Double-Ended BFS — Lastly, I initiated a BFS from both directions, so the search would end somewhere in the middle. This required API requests to WikiMedia for links out from the start page and links to the end page.

1.2 seconds. Now 1000x faster than MVP 1.

Notes on Concurrency

To tackle this problem, I investigated using multiple processes, multi-threading and coroutines. Here’s what I found.

Multiple processes don’t share memory, so this was a fail fast idea for my needs — I needed to continue to check whether the other BFS had found a matching node.

Multiple threads in Python may involve the following issues:

  • Briefly, Python has a Global Interpreter Lock (GIL), which carefully controls execution: only one thread can execute in the interpreter at once. This prevents the typical issues of race conditions or deadlock. I liked David Beazley’s talk on this — slides provided!
  • Context-switching overhead — simply stated, the OS must switch between threads to reach completion. Furthermore, threads require all register state to be saved on context switch; its cache state must also be persisted into main memory.
  • Extra resources — Each thread requires a new stack.
  • Context switches are done “randomly” or at non-deterministic intervals by the OS. Rather, a strategic approach involves context switching when a blocking operation (such as I/O) occurs.

What did Python’s asyncio library provide?

  • Deterministic context switching — I define when to context switch by calling ‘await’ when blocked on an I/O resource. This is more efficient and strategic!
  • Light weight — I can spin up 100s of coroutines, while threads remain slow and more demanding in terms of resources.

Lessons Learned

  • My initial MVP, although not performant, was really crucial in helping me gain momentum and problem-solve. Once I had a working solution down, I could narrow and resolve pain points.
  • Asynchronous programming in terms of coroutines is incredibly useful for any repetitive I/O tasks.
  • Concurrent does not mean parallel.
  • Race conditions — Let’s say two threads have access to the same variable (the number 1). They both want to modify it (add 1). If they both access it at the ~same~ time, then what will the variable look like after (2 or 3)?
  • Deadlock — Two threads have access to the same resources and need them to complete, but do not relinquish to the other thread. Neither completes.

--

--