Fast Wikipedia traversal algorithm and its applications in NLP and keyphrase extraction

Raka Dalal
Published in Udemy Tech Blog
Feb 2, 2021
Bidirectional Search

Udemy is home to over 155,000 courses taught by 70,000 real-world experts in over 65 languages. As you can imagine, this makes search and recommendations very important to help ensure the 40 million learners on Udemy can easily navigate the platform and discover the courses most relevant to the skills they wish to learn. To improve the accuracy of search and recommendations, an ongoing project at Udemy is tagging lectures of courses with relevant keyphrases.

While lecture titles can reveal such keyphrases, relying solely on them is not ideal, as a title may contain buzzwords or topics that are not deeply explored within the course content. To mitigate this, we also collect the bookmarks that learners add to lectures. Lecture bookmarks on Udemy generally mark the beginning of a new topic, and the first few sentences after a bookmark usually repeat important keyphrases for that topic or skill, so we extract the caption data in a window after each bookmark. There is a suite of algorithms for keyphrase extraction, but most of these algorithms are imperfect because they return false positives (phrases that satisfy the conditions of the algorithm but are not relevant keyphrases). For example, let’s assume a keyphrase extraction algorithm returns the following keyphrases for a particular lecture of a course in the Data Science category:

  • Instruments
  • Artificial Neural Network
  • Backpropagation
  • Hidden Layer
  • Artist
  • Softmax Function

Instruments and Artist (false positives) are not related to Data Science, whereas the rest of the phrases are all related to Data Science. Our task now is to filter out these false positives to provide a more accurate set of keyphrases.

To help with this, we turn to an unlikely source: Wikipedia. We decided to use the vast quantity of information on Wikipedia to filter out false positives in a given list of keyphrases, because Wikipedia allows us to quantify the relatedness of each input keyphrase to a known course category (e.g., Data Science). False positives can then be detected as the phrases that are less related to the course category. This methodology can be used in any tagging project, or simply to measure the relatedness of any two keyphrases.
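To make this filtering step concrete, here is a minimal sketch of what such a filter could look like, assuming a relatedness query of the kind built in the rest of this post; the function and threshold names are illustrative rather than our production code:

```python
from typing import Callable, List

def filter_keyphrases(
    keyphrases: List[str],
    category: str,
    degree_of_separation: Callable[[str, str], int],
    max_degree: int = 2,
) -> List[str]:
    """Keep only the keyphrases that are 'close' to the course category in the
    Wikipedia graph. `degree_of_separation` is the fast query described later
    in this post; `max_degree` is a tunable threshold (2 is an arbitrary choice)."""
    return [
        phrase
        for phrase in keyphrases
        if degree_of_separation(phrase, category) <= max_degree
    ]
```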

In this blog post, we are going to introduce a very fast algorithm for querying the relatedness of any two Wikipedia articles provided as input.

Wikipedia Knowledge Graph

Wikipedia is a treasure trove of information, containing data on literally every topic under the sun. As a result, Wikipedia is the first stop for almost everyone whenever information is sought on a topic. Interestingly, Wikipedia can be used for many other purposes as well, since the well-formed structure of the Wikipedia corpus allows the extraction of interesting underlying facts about language and topics.

Unidirectional and Bidirectional Graph

One way to use Wikipedia is to study the underlying network induced by the articles. In Wikipedia, each article has several hyperlinks that point to other related articles. All of these relations form a graph in which each Wikipedia article corresponds to a node. There exists a directed edge from node A to node B if the article corresponding to node B can be reached via a hyperlink present in the article corresponding to node A. We will call this graph the “unidirectional Wikipedia graph” since its edges are directed. Similarly, we form a second graph with an undirected edge between nodes A and B whenever at least one of the two articles can be reached via a hyperlink in the other; in other words, the direction of the hyperlink is ignored. We will call this graph the “bidirectional Wikipedia graph” as its edges are undirected. In the figure shown above, the left diagram depicts a unidirectional graph while the right diagram illustrates a bidirectional graph. Notice that in the left diagram (unidirectional graph), Artificial Intelligence and Data Science are not connected to each other, whereas they are connected via Deep Learning in the right diagram (bidirectional graph).

It is clear that the bidirectional graph has a symmetric adjacency matrix and significantly more edges than the unidirectional graph, but both graphs nevertheless provide information about the relatedness of any two Wikipedia articles. More precisely, articles that are close must be related to each other, while articles that are far apart are unrelated. For our application, it was important to extract this graph and process it so that, for any two nodes, we can query the degree of separation (the length of the shortest path) and the number of shortest paths with very fast response times. Our goal is similar to Six Degrees of Wikipedia and Degrees of Wikipedia, but we want to use this application offline, with far more tunable hyperparameters, added features, and fast responses. To achieve this, we built and used the entire Wikipedia Knowledge Graph from scratch!

Building the Wikipedia Graph

Flow diagram to create Wikipedia graph from Wikidumps

We can use the latest Wikidumps to create the Wikipedia graph from scratch. Note that Wikidumps contain Wikipedia in the form of compressed .bz2 files that become humongous when uncompressed. Thankfully, it is possible to process the dump files without uncompressing them, and there is a great tutorial on how to programmatically download and parse Wikipedia articles this way. We download all the pages-articles.bz2 files and use the mwparserfromhell package to parse each Wikipedia article and recover its hyperlinks.
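As a rough sketch of this parsing step (assuming the raw wikitext of a single article has already been streamed out of a pages-articles.bz2 file, as in the tutorial mentioned above), mwparserfromhell recovers the hyperlinks roughly like this:

```python
import mwparserfromhell

def extract_links(wikitext: str) -> list:
    """Titles of the Wikipedia articles hyperlinked from one article,
    given its raw wikitext."""
    wikicode = mwparserfromhell.parse(wikitext)
    titles = []
    for link in wikicode.filter_wikilinks():   # [[Target article|display text]] markup
        title = str(link.title).strip()
        if ":" not in title:                   # crude filter for File:, Category:, etc.
            titles.append(title)
    return titles

# extract_links("[[Deep learning]] uses an [[Artificial neural network|ANN]] ...")
# -> ['Deep learning', 'Artificial neural network']
```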

Often, keyphrases appear as bold or italic text within an article but do not have a Wikipedia article of their own. For example, “hidden layers” appears as an italic term in the article “Artificial Neural Networks”, but there is no Wikipedia article entitled “hidden layers” even though it is a technical phrase and a potential keyphrase. Therefore, for each article we also use regular expressions to extract its bold and italic phrases. Both the unidirectional and bidirectional Wikipedia graphs can be extended by adding these bold and italic phrases as nodes, with undirected edges connecting each phrase to the article in which it appears. We use this information to create the adjacency matrices of the Wikipedia graphs, which are stored as scipy sparse matrices in CSR format. Since the adjacency matrices of both Wikipedia graphs are sparse, scipy.sparse.csr is an extremely memory-efficient way to store them. We also store a bijective dictionary that maps the row/column index of the sparse adjacency matrix to the title of the Wikipedia article and vice versa.
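Below is a minimal sketch of this construction; the regular expressions and helper names are illustrative simplifications rather than our exact production code:

```python
import re
import numpy as np
import scipy.sparse as sp

BOLD_RE = re.compile(r"'''(.+?)'''")                 # '''bold''' wiki markup
ITALIC_RE = re.compile(r"(?<!')''([^'].*?)''(?!')")  # ''italic'' markup, skipping bold

def extract_emphasized_phrases(wikitext: str) -> list:
    """Bold and italic phrases of a single article, via simple regexes."""
    return BOLD_RE.findall(wikitext) + ITALIC_RE.findall(wikitext)

def build_adjacency(edges, titles):
    """Build the CSR adjacency matrix of the extended Wikipedia graph together
    with the bijective mapping between matrix indices and node titles.
    `titles` lists every node (articles plus bold/italic phrases) and
    `edges` is an iterable of (source_title, target_title) pairs."""
    index_of = {t: i for i, t in enumerate(titles)}   # title -> row/column index
    title_of = {i: t for t, i in index_of.items()}    # row/column index -> title
    rows, cols = [], []
    for src, dst in edges:
        rows.append(index_of[src])
        cols.append(index_of[dst])
    n = len(titles)
    adj = sp.csr_matrix(
        (np.ones(len(rows), dtype=np.int8), (rows, cols)), shape=(n, n)
    )  # duplicate links are summed; only the sparsity pattern matters for the search
    # the bidirectional graph keeps an edge whenever either direction exists
    adj_undirected = ((adj + adj.T) > 0).tocsr()
    return adj, adj_undirected, index_of, title_of
```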

Wikipedia Graph Statistics

As we can clearly see from the average degree, the adjacency matrices of the extended Wikipedia graphs (both unidirectional and bidirectional) are sparse, and scipy.sparse.csr is well suited for storing them.

Querying the shortest paths between nodes

Now, we come to the most interesting part of this project! We define the degree of separation between two nodes A and B to be the length of the shortest path between A and B. Given two nodes A and B as input, we want to answer the following two types of queries quickly on either the extended unidirectional or the extended bidirectional Wikipedia graph:

  1. What is the degree of separation between A and B, and how many paths of shortest length exist between A and B? (Query 1)
  2. What is the degree of separation between A and B, and what are the exact paths of shortest length between A and B? (Query 2)

Evidently, the second query is strictly harder than the first, since we can simply count the exact paths to answer the first query. Therefore, we expect to answer the first query faster than the second. So how do we approach this problem?

An initial idea is to simply precompute and store the required information for all pairs of nodes so that queries can be answered in O(1) time. This is extremely memory-expensive, however, since there are around 500 trillion pairs of nodes in Wikipedia. Hence this approach is infeasible.

Our approach is to compute query responses on the fly: given the source and target as input, we design a graph search algorithm that uses the stored adjacency matrices to compute the degree of separation and the shortest paths between the source and target.

Breadth-first Search from A (source): We call A the source and B the target. This matters in the extended unidirectional Wikipedia graph because the edges are directed, so the answer to the query with source A and target B is not the same as the answer to the query with source B and target A. Our first attempt is to implement a breadth-first search (BFS) starting from the source until we find the target, as illustrated in the gif below. During the implementation, care must be taken to mark visited nodes so that infinite loops are avoided.

Breadth-first Search

In order to answer the first query (number of shortest paths between source and target), we track the number of shortest paths from source to each of the k-hop neighbors of A via dynamic programming. This is possible because for any k-hop neighbor X of source:

Number of shortest length paths from source to X = sum( Number of shortest length paths from source to Y such that Y is a (k-1)-hop neighbor of source and X is a one-hop neighbor of Y)

This simply means that the number of shortest paths from source to X is the sum of the shortest paths from the source to those visited nodes whose neighbor is X.

The second query can also be computed similarly by tracking the ancestors of each node.
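Here is a minimal sketch of this BFS with shortest-path counting over the stored CSR adjacency matrix (working with node indices rather than titles, and simplified relative to our implementation). For Query 2, the same loop would additionally record each node’s ancestors at the points marked in the comments.

```python
from collections import deque
from scipy.sparse import csr_matrix

def bfs_degree_and_count(adj: csr_matrix, source: int, target: int):
    """Plain BFS from `source` over the CSR adjacency matrix.
    Returns (degree of separation, number of shortest paths to `target`),
    or (None, 0) if the target is unreachable."""
    dist = {source: 0}      # degree of separation from the source
    count = {source: 1}     # number of shortest paths from the source
    queue = deque([source])
    while queue:
        u = queue.popleft()
        # once we pop nodes at the target's level, the target's count is final
        if target in dist and dist[u] >= dist[target]:
            break
        # the neighbors of u sit contiguously in the CSR index arrays
        for v in adj.indices[adj.indptr[u]:adj.indptr[u + 1]]:
            v = int(v)
            if v not in dist:                # first time v is reached
                dist[v] = dist[u] + 1
                count[v] = count[u]          # (Query 2: record u as an ancestor of v)
                queue.append(v)
            elif dist[v] == dist[u] + 1:     # another shortest path into v
                count[v] += count[u]         # (Query 2: also record u as an ancestor of v)
    return dist.get(target), count.get(target, 0)
```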

Below are examples of the time taken for some queries (Query 1) using BFS:

Performance of BFS

Clearly, the time taken for the above example queries is too long. For a degree of separation of 3, a query response time of 42 minutes for a standard query (source: Book, target: Data Science) is simply unacceptable.

To improve the efficiency of our algorithm we need to reduce the search space.

Bidirectional Search

As illustrated in the gif below, the idea is to expand the neighbors from both the source and the target in alternating iterations until we find some common nodes.

Bidirectional Search

Note that in the extended unidirectional Wikipedia graph, this process involves quickly looking up:

  1. While searching from source: the neighbors of a particular node
  2. While searching from target: the set of nodes whose neighbor is a particular node.

It is possible to look up both of these quickly by using the scipy.sparse.csr and scipy.sparse.csc formats.
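Both lookups reduce to slicing the index arrays of the stored matrices; here is a small sketch (the saved file name is hypothetical):

```python
import scipy.sparse as sp

adj_csr = sp.load_npz("wikipedia_adjacency.npz").tocsr()  # adjacency matrix built earlier
adj_csc = adj_csr.tocsc()                                 # same graph, stored column-wise

def out_neighbors(u: int):
    """Nodes that article u links to: the non-zero columns of row u (fast in CSR)."""
    return adj_csr.indices[adj_csr.indptr[u]:adj_csr.indptr[u + 1]]

def in_neighbors(v: int):
    """Nodes that link to article v: the non-zero rows of column v (fast in CSC)."""
    return adj_csc.indices[adj_csc.indptr[v]:adj_csc.indptr[v + 1]]
```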

Just like the previous section, we keep track of the number of shortest paths from source to nodes visited from source and also track the number of shortest paths from target to nodes visited from target. Once we find an intersection, we compute the number of paths of shortest length between source and target as follows:

Total paths = sum ( Paths from source to X * Paths from X to target)

where the sum is over each node X in the intersection.
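Putting this together, here is a minimal sketch of the bidirectional search with shortest-path counting. It expands one full BFS level at a time and, for the unidirectional graph, takes the transposed adjacency matrix for the backward direction; this is a simplification of our implementation.

```python
from scipy.sparse import csr_matrix

def _neighbors(adj: csr_matrix, u: int):
    return adj.indices[adj.indptr[u]:adj.indptr[u + 1]]

def bidirectional_degree_and_count(adj: csr_matrix, adj_rev: csr_matrix,
                                   source: int, target: int):
    """Expand full BFS levels from both ends until the searches meet.
    `adj` follows out-edges from the source; `adj_rev` (the transpose, in CSR
    form) follows in-edges from the target. For the undirected graph, pass the
    same matrix twice. Returns (degree of separation, number of shortest paths)."""
    if source == target:
        return 0, 1
    dist_s, count_s = {source: 0}, {source: 1}
    dist_t, count_t = {target: 0}, {target: 1}
    frontier_s, frontier_t = [source], [target]

    while frontier_s and frontier_t:
        # expand the side with the smaller frontier (strict alternation also works)
        expand_source = len(frontier_s) <= len(frontier_t)
        graph = adj if expand_source else adj_rev
        frontier = frontier_s if expand_source else frontier_t
        dist, count = (dist_s, count_s) if expand_source else (dist_t, count_t)
        other_dist = dist_t if expand_source else dist_s
        next_frontier = []
        for u in frontier:
            for v in _neighbors(graph, u):
                v = int(v)
                if v not in dist:                # first visit on this side
                    dist[v] = dist[u] + 1
                    count[v] = count[u]
                    next_frontier.append(v)
                elif dist[v] == dist[u] + 1:     # another shortest path into v
                    count[v] += count[u]
        if expand_source:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
        # meeting nodes: every shortest path crosses exactly one of them
        meeting = [v for v in next_frontier if v in other_dist]
        if meeting:
            degree = dist_s[meeting[0]] + dist_t[meeting[0]]
            total = sum(count_s[v] * count_t[v] for v in meeting)
            return degree, total
    return None, 0  # the target is unreachable from the source
```

For the extended bidirectional graph, adj and adj_rev are the same symmetric matrix, so a single CSR matrix suffices.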

Again, the exact paths for such queries can be computed similarly by keeping track of the ancestors of each node. Finally, our algorithm returns the degree of separation between the source and target and the number of shortest paths between them, as illustrated in the plot below:

An illustrative example of the degree of separation (4) and the total number of shortest paths (18) between Data Science (Source, Green) and Instrument (Target, Red).

The following table shows the time taken using the bidirectional search algorithm compared to the BFS algorithm.

Comparison of performance of BFS and Bidirectional Search

This is a huge improvement! The bidirectional search algorithm clearly reduces the query time dramatically. This is unsurprising, as the theoretical time complexity of bidirectional search is O(b^(d/2)) as opposed to O(b^d) for BFS, where b is the maximum out-degree in the Wikipedia graph and d is the degree of separation between the source and target.

Now, we will show summary statistics to illustrate the drastic improvement in the time taken when the bidirectional search algorithm is used instead of the vanilla BFS algorithm. Note that this improvement applies to degrees of separation greater than 1, since both algorithms behave identically when the degree of separation between the source and target is just 1. The experiments were conducted in the following manner:

  1. We take 200 random pairs of Wikipedia articles for each degree of separation of 1, 2, and 3 in the Extended Unidirectional Wikipedia Graph.
  2. We take 200 random pairs of Wikipedia articles for each degree of separation of 1 and 2 in the Extended Bidirectional Wikipedia Graph.
  3. We take 3 pairs of (source, target) with a degree of separation of 4 in the Extended Unidirectional Wikipedia Graph.

For each set, we compare the average time (over the pairs in that set) taken by the bidirectional search and the BFS algorithm. All experiments were run on a machine with a 2.3 GHz 8-core processor.

Extended Unidirectional Wikipedia Graph
Extended Bidirectional Wikipedia Graph
Extended Unidirectional Wikipedia Graph (4 degree of separation)

What We Learned

And that is how you create Wikipedia graphs and how they help us track and tag keyphrases so that our learners can discover the skills and topics most relevant to them. It’s a lot of information, so to recap: this blog post presented a bidirectional search methodology with which the degree of relatedness between any two Wikipedia keywords can be calculated very efficiently. Having such a fast algorithm also enabled us to add even more nodes to the Wikipedia graph, such as bold and italic phrases. And we saw that this algorithm is an exponential improvement over traditional BFS.

I hope you found this unconventional use of Wikipedia as eye-opening as some of the countless obscure facts found in Wikipedia articles. If so, you’re in for a treat: in subsequent posts, we will share how Wikipedia graphs can drastically improve keyword extraction, and we are going to open source the software. Stay tuned!

Acknowledgments

Thanks a lot to Hamed Hasheminia for providing exceptional guidance throughout the project. I also want to thank Marni Deshong, Gulsen Kutluoglu, Chuong (Tom) Do, Bixing Yan and Sam Cohan for their invaluable feedback.
