Cheating at Wikirace: An insight into big networks with Wikipedia

Hadrien Croubois
6 min read · Jul 22, 2018


What is a Wikirace and why would you want to cheat?

A Wikirace (IPA: /ˈwɪ.ki.rɛɪs/) is a race between any number of participants, using links to travel from one Wikipedia page to another. The first person to reach the destination page, or the person that reaches the destination using the fewest links, wins the race. Intermediary pages may also be required.

Obviously, there are (many) variations of the rules. I focus here on the rules used at my university, even though this choice doesn't affect the analysis that follows. We play on the French domain (https://fr.wikipedia.org) and are not allowed to use any portal, category or list page.

I enjoy this game, but from a computer scientist's point of view, it feels wrong. Why painfully do something that a computer would do better and faster? After all, this is just a graph traversal. The exciting part is that knowledge of the graph is implicit. You don't know what the links are, but you expect some of them to exist (like from physics to chemistry, or from Rome to Italy). While human players tend to follow conceptual paths, there are very efficient shortcuts (like using month/year pages) that they never use.

This whole project started with a simple question:

What would a shortest path look like, and how to find it?

Parsing and saving the adjacency graph

Every computer scientist knows that finding the shortest path in a graph can be done using Dijkstra's algorithm. While there have been improvements, in our case all the links (edges) have the same weight, so Dijkstra essentially degenerates into a breadth-first search. No need for A* or other fancy algorithms; we stay simple in that regard.
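As a rough sketch of that search (this is not the code from the repository; the graph is assumed to be a Python dict mapping each page to the pages it links to):

from collections import deque

def shortest_path(graph, source, target):
    """Breadth-first search over a dict mapping page -> iterable of linked pages."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        if page == target:
            # walk the parent pointers back to the source
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for neighbour in graph.get(page, ()):
            if neighbour not in parents:
                parents[neighbour] = page
                queue.append(neighbour)
    return None  # the target is not reachable from the source

A call like shortest_path(graph, 'Rome', 'Chimie') then returns the full chain of pages to click through.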

While Wikipedia (fr.wikipedia.org) contains about 2 million pages, only a small fraction of these contain more than a few hundred links. The graph's adjacency matrix is therefore very sparse and should be stored accordingly.

My first approach was to use a relational database management system to store it. In PostgreSQL, I built two tables:

-- sequences used for the primary keys
CREATE SEQUENCE page_key;
CREATE SEQUENCE link_key;

CREATE TABLE pages (
  id integer PRIMARY KEY DEFAULT nextval('page_key'::regclass),
  address text UNIQUE NOT NULL,
  processed boolean DEFAULT FALSE
);

CREATE TABLE links (
  id integer PRIMARY KEY DEFAULT nextval('link_key'::regclass),
  source integer REFERENCES pages(id),
  target integer REFERENCES pages(id)
);

We need a logic to manage pages. While we traverse the graph, we find links to many pages. Some are already in the database, others not. In all cases, we need an identifier to point to these pages. This means that each time we discover a page, we should check whether it is already registered and insert it if it is not. All registered pages that have not been marked as processed should be processed at some point.

This led me to build the following plpgsql functions:

CREATE OR REPLACE FUNCTION get_page_id (_address text)
RETURNS integer AS $pageID$
DECLARE
  pageID integer;
BEGIN
  -- search page id
  SELECT pages.id
  FROM pages
  WHERE pages.address = _address
  INTO pageID;
  -- if page does not exist, create it
  IF NOT FOUND THEN
    INSERT INTO pages(address)
    VALUES (_address)
    ON CONFLICT (address) DO NOTHING
    RETURNING pages.id INTO pageID;
  END IF;
  -- return id
  RETURN pageID;
END;
$pageID$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION get_page_to_process ()
RETURNS text AS $address$
DECLARE
  address text;
BEGIN
  SELECT pages.address
  FROM pages
  WHERE NOT pages.processed
  ORDER BY RANDOM()
  LIMIT 1
  INTO address;
  RETURN address;
END;
$address$ LANGUAGE plpgsql;
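From the ingester side, these are then just regular SQL calls. For instance, with psycopg2 (the psycopg2 dependency and the connection parameters are my own assumptions, not part of the original setup):

import psycopg2

# hypothetical connection parameters
conn = psycopg2.connect(dbname="wikimap", user="wikimap")
with conn, conn.cursor() as cur:
    # ask the database for a random unprocessed page
    cur.execute("SELECT get_page_to_process()")
    address = cur.fetchone()[0]
    # look up (or register) the id of a freshly discovered page
    cur.execute("SELECT get_page_id(%s)", ("Paris",))
    page_id = cur.fetchone()[0]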

All that is left is to implement the ingestion mechanism. Its job is pretty simple (a Python sketch follows the list):

  • Get a page address
  • Download and parse the page
  • Isolate the <div id='content'>
  • Process the links in it using a regexp (to eliminate all invalid pages)
  • Send the link list back to the database
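A minimal sketch of such an ingester, using only the Python standard library; the regexps and the way the content div is isolated are simplifications of mine, not necessarily what WikiMap2 does:

import re
import urllib.parse
import urllib.request

WIKI = "https://fr.wikipedia.org"
LINK_RE = re.compile(r'href="/wiki/([^"#?]+)"')
# namespaces that are not valid targets (portals, categories, files, ...)
EXCLUDED = re.compile(r'^(Portail|Cat%C3%A9gorie|Aide|Fichier|Mod%C3%A8le|Wikip%C3%A9dia|Sp%C3%A9cial|Discussion):')

def ingest(address):
    """Download one article and return the set of article links found in its content."""
    url = WIKI + "/wiki/" + urllib.parse.quote(address)
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8")
    # crude isolation of the content div: everything after its opening tag
    content = html.split('<div id="content"', 1)[-1]
    return {match.group(1)
            for match in LINK_RE.finditer(content)
            if not EXCLUDED.match(match.group(1))}

The resulting set of addresses is what gets sent back to the database.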

The processing function was not that complicated, but turned out to be very slow:

CREATE OR REPLACE FUNCTION process_page (_source text, _targets text[])
RETURNS integer AS $count$
DECLARE
  sourceID integer;
  count integer;
BEGIN
  -- get source id
  SELECT get_page_id(_source)
  INTO sourceID;
  -- cleanup links for source page
  DELETE FROM links
  WHERE links.source = sourceID;
  -- insert new links (and count them)
  WITH rows AS (
    INSERT INTO links (source, target)
    SELECT sourceID, get_page_id(target)
    FROM unnest(_targets) AS target
    RETURNING 1
  )
  SELECT count(*)
  FROM rows
  INTO count;
  -- update page status
  UPDATE pages
  SET processed = TRUE
  WHERE pages.id = sourceID;
  -- return number of links
  RETURN count;
END;
$count$ LANGUAGE plpgsql;
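From Python, reusing the cursor from the earlier psycopg2 snippet (again an assumption of mine), a whole page is committed in a single call, since psycopg2 adapts a Python list to the text[] parameter:

cur.execute("SELECT process_page(%s, %s)", (source, list(targets)))
link_count = cur.fetchone()[0]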

While this approach works, the database felt very slow, and graph traversal would prove to be an issue. I considered switching to an RDF store queried with SPARQL, but finding and installing a SPARQL database engine proved to be harder than necessary.

Scaling up with Python and ZMQ

I know, scaling up and Python do not usually go together. However, Python is perfect for quick prototyping and offers modules for regexps, URL parsing, XML traversal, HTTP requests, and more.

The second idea was simple: have a Python object hold the database, with forward and backward indexing (the thing that killed the SQL approach), and persist this object using pickle.
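As an illustration of what that object can look like (the actual class in WikiMap2 is probably organised differently), a minimal store with forward and backward adjacency and pickle persistence:

import pickle
from collections import defaultdict

class GraphStore:
    def __init__(self):
        self.forward = defaultdict(set)   # page -> pages it links to
        self.backward = defaultdict(set)  # page -> pages that link to it
        self.processed = set()            # pages whose links have been ingested

    def add_links(self, source, targets):
        self.processed.add(source)
        for target in targets:
            self.forward[source].add(target)
            self.backward[target].add(source)

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)

The backward index is what the parenthesis above refers to: answering "who links here?" cheaply, which the SQL tables could not do fast enough.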

A Python daemon would hold the database in memory and would expose sockets for the ingesters to work with it. It would save the database regularly (and when stopping).

The ingesters, on the other hand, would be stateless. They ask the core for the address of a page to process (REQ-REP), do their work, and send the results back to it (PUSH-PULL). ZMQ is my go-to library for this kind of parallelism, and once again it proved successful.
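Stripped down to the messaging logic, the core's event loop could look like the sketch below; the port numbers and the JSON message format are my own choices, graph is an instance of the store sketched above, and next_page() is a hypothetical helper returning a pending address:

import zmq

context = zmq.Context()

# REQ-REP socket: ingesters ask for a page address to process
rep = context.socket(zmq.REP)
rep.bind("tcp://*:5555")

# PUSH-PULL socket: ingesters push back the links they found
pull = context.socket(zmq.PULL)
pull.bind("tcp://*:5556")

poller = zmq.Poller()
poller.register(rep, zmq.POLLIN)
poller.register(pull, zmq.POLLIN)

while True:
    events = dict(poller.poll())
    if rep in events:
        rep.recv()                    # work request, content ignored
        rep.send_string(next_page())  # reply with a pending page address
    if pull in events:
        result = pull.recv_json()     # e.g. {"source": ..., "targets": [...]}
        graph.add_links(result["source"], result["targets"])

Each ingester simply connects a REQ socket to the first port and a PUSH socket to the second.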

Core process (on the right) and 8 ingesters serving it (on the left)

Analysis of the partial graph

Until the whole graph is processed we have two categories of pages:

  • Processed pages have known outbound links. They are the core of the directed graph, and it is mostly easy to move around between them.
  • Pending pages are not processed yet, so we do not know any of their outbound links. They are the edge of the graph, and while we can reach them from the core, we are stuck if we land on any of them.

During the ingestion process, I kept track of the processed vs accessible pages. Accessible pages are all pages, processed or pending, that have been identified as accessible from our entry point.

The image shows the (experimental) evolution of the page counts during the ingestion process. So far I have processed 300,000 pages. The database contains 1,368,446 additional pages, for a total of 1,668,446 entries, and 30,659,409 links are recorded. According to Wikipedia (fr.wikipedia.org), there are today 2,003,413 articles.

In a wikirace, going to an accessible page is not an issue, as at least one of the processed pages gives us a link to it. The processed pages are much fewer; they are the core of the graph we use, and moving around fast in this core is essential.

If the source page is not in the core, you have to ingest it and hope to find a link to the core; this rarely proves to be a problem. On the other hand, if the target page is not listed, there is nothing we can do.

Some examples

Conclusion

As expected, our engine finds paths where no human would dare to look (especially using dates). The diameter of the core must be very low: it is very rare to see any race use more than 8 links. Now that we have ingested the graph, it will be easy to get statistics.

The code is available on my GitHub: https://github.com/Amxx/WikiMap2
