The Practicality of a Wikipedia-based Search Engine

Prince Okpoziakpo
INST414: Data Science Techniques
6 min read · Apr 10, 2023

Introduction

Wikipedia is a free online encyclopedia with over 60 million articles written in more than 300 languages. As one of the most popular sources of information on the web, Wikipedia acts as its own network of information. To improve the user experience, a search engine using natural language processing (NLP) could be implemented to speed up information retrieval on Wikipedia. This raises the question: is a Wikipedia-based search engine practical? The objective of this analysis is to identify patterns in the data returned by Wikipedia’s default search tool that establish a precedent for developing a Wikipedia-centric search engine, and to assess whether such an engine is, in fact, practical.

Sources and Tools

The data for this project was collected directly from Wikipedia, with only topics related to “LeBron James” being analyzed due to computational constraints. The project utilized several Python libraries, including the wikipedia library for data collection, the pandas library for data manipulation and analysis, the scikit-learn library for data processing and classification, the NetworkX library to create the graph model, and the Gephi software for generating the final visualizations.

The wikipedia library was used as the primary tool for gathering information from the Wikipedia website. The pandas library was used to store the data in a tabular format and to manipulate its structure. The scikit-learn library was used to calculate the similarity between documents and to define the edges of the graph. The NetworkX library was chosen for its simple, powerful interface and its support for graph metadata, and was used to create the graph model. Finally, Gephi was used to visualize the resulting graph.

Data Collection and Cleaning

The first step in the project was to collect the LeBron James data using the wikipedia API. The following code snippet retrieves the first 100 article titles that the current Wikipedia search tool returns for the query. The query parameters are: suggestion, which tells the tool whether or not to augment the search query to match existing records in the database; and results, which is simply the number of records we want returned. Because suggestion=True makes the call return a tuple of (titles, suggestion), the [0] index keeps just the list of titles.

import wikipedia as wiki

# Search for pages related to LeBron James; with suggestion=True the call
# returns a tuple of (titles, suggestion), so [0] keeps the list of titles
query = "LeBron James"
titles = wiki.search(query, suggestion=True, results=100)[0]
Wikipedia articles returned when searching for “LeBron James”

The second step in the data collection process was storing each article’s content in an accessible way. The following code snippet uses the DataFrame object from the pandas library to store the content of each article, along with some metadata, and uses a truncated hash of the article’s title as the index.

import pandas as pd
import hashlib

# Create a DataFrame to store the 'title' and 'content' of each article
pages_df = pd.DataFrame(columns=["title", "content"])

# Iterate over the titles; store the title and page content in the DataFrame
for title in titles:
    bytes_ = title.encode("UTF-8")            # get the bytes representation of the title
    index = hashlib.sha1(bytes_).hexdigest()  # get the hash-value of the title
    index = index[:10]                        # truncate the hash-value

    # Get the content of the article
    page_content = wiki.page(title, auto_suggest=False).content

    # Store the article contents in the DataFrame, keyed by the truncated hash
    pages_df.loc[index] = [title, page_content]
DataFrame containing contents of Wikipedia articles related to LeBron James

The raw data collected using the wikipedia API was already formatted and ready for analysis. Therefore, no further cleaning was necessary on the content of the Wikipedia articles themselves.
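One practical caveat worth noting (my addition, not something reported in the original run): a title returned by the search can occasionally fail to resolve to a single page, for example when it points to a disambiguation page. A minimal, hypothetical variant of the collection loop that simply skips such titles might look like this:

# Hypothetical variant of the collection loop that skips problem titles
for title in titles:
    try:
        page_content = wiki.page(title, auto_suggest=False).content
    except (wiki.exceptions.DisambiguationError, wiki.exceptions.PageError):
        continue  # the title is ambiguous or has no page; skip it

    index = hashlib.sha1(title.encode("UTF-8")).hexdigest()[:10]
    pages_df.loc[index] = [title, page_content]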

Data Analysis and Graph Construction

The first step in the data analysis process was data preprocessing. Each Wikipedia page had to be tokenized and vectorized before the actual analysis, which was accomplished using the TfidfVectorizer module available in the scikit-learn library. Once the corpus was processed, the next step was calculating the cosine similarity between each pair of documents in the corpus. Cosine similarity is a metric that measures the similarity between two non-zero vectors defined in an inner product space. Other metrics can be used to quantify the similarity between documents, but cosine similarity is the simplest and most readily understood.
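Concretely, the cosine similarity of two vectors is their dot product divided by the product of their magnitudes. A tiny illustration with made-up count vectors (not part of the original analysis):

import numpy as np

# Two small, made-up term-count vectors
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# Dot product divided by the product of the vector norms
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 3))  # ~0.73

The same computation, applied to every pair of TF-IDF vectors in the corpus, is what the scikit-learn snippet below produces.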

from sklearn.feature_extraction.text import TfidfVectorizer

# get the documents we want to compare for similarity
corpus = pages_df["content"]

# vectorize the documents
tfidf = TfidfVectorizer().fit_transform(corpus)

# compute the pairwise cosine similarity
pairwise_similarity = tfidf * tfidf.T
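This shortcut works because TfidfVectorizer L2-normalizes each row by default, so multiplying the TF-IDF matrix by its transpose yields exactly the pairwise cosine similarities. As a quick sanity check (my addition, not part of the original analysis), the result can be compared against scikit-learn’s built-in helper:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# The explicit pairwise helper should agree with the tfidf * tfidf.T shortcut
assert np.allclose(sk_cosine_similarity(tfidf), pairwise_similarity.toarray())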

The second step was to store these pairwise comparisons in the DataFrame that would be used to create the graph.

# create a DataFrame to store the cosine similarity values,
# keyed by the truncated hash IDs used as the article index
cosine_similarity = pd.DataFrame(
    pairwise_similarity.toarray(),
    index=pages_df.index,
    columns=pages_df.index)
Cosine Similarity Matrix for Each Pair of Wikipedia Articles
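As a quick usage example (my addition, not in the original write-up), the matrix can be queried directly, for instance to find the most similar pair of distinct articles:

import numpy as np

# Find the most similar pair of distinct articles in the corpus
sim = cosine_similarity.values.copy()
np.fill_diagonal(sim, 0.0)  # ignore the trivial self-similarity
i, j = np.unravel_index(sim.argmax(), sim.shape)
print(pages_df.iloc[i]["title"], "<->", pages_df.iloc[j]["title"], round(sim[i, j], 3))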

The third step of the data analysis process was to create a network of these documents in order to identify those with the strongest relationships. The NetworkX library was used in tandem with pandas to create the network of articles.

The nodes of the graph represent the individual articles returned by the current Wikipedia search tool. Because the project is about creating a tool that is based on the similarity between articles, it was logical that the nodes of the graph would be the articles themselves. A node’s importance is based on its average cosine similarity score; the more similar an article is to other articles, the more important that article is.
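The graph-construction code below reads an average_similarity attribute for each node; the original write-up does not show how that value was produced, so here is one plausible way to compute it (an assumption on my part), appending the row-wise average, excluding the self-similarity of 1.0 on the diagonal, as a column of the similarity DataFrame:

# Average similarity of each article to every other article,
# excluding the self-similarity entry of 1.0 on the diagonal
n_articles = len(cosine_similarity)
cosine_similarity["average_similarity"] = (
    (cosine_similarity.sum(axis=1) - 1.0) / (n_articles - 1)
)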

The edges of the graph represent the cosine similarity measure for any two given articles. As stated previously, there are other metrics available to measure the similarity between documents, but cosine similarity is the most convenient for the scope of this project.

# import the NetworkX library used to create the graph
import networkx as nx

# pages_df is already indexed by the truncated hash IDs, so no re-indexing is needed
# create a nx.Graph object
pages_graph = nx.Graph()
page_ids = cosine_similarity.index.to_list()  # get the list of article IDs

# iterate over the page IDs
for left_node in page_ids:
    # add the article, its title, and its average similarity to the graph
    pages_graph.add_node(
        left_node,
        title=pages_df.loc[left_node]["title"],
        average_similarity=cosine_similarity.loc[left_node]["average_similarity"]
    )

    # iterate over the other page IDs
    for right_node in page_ids:
        # avoid self-loops (when a node has an edge to itself)
        if left_node != right_node:
            # add an edge weighted by the cosine similarity score
            pages_graph.add_edge(
                left_node,
                right_node,
                cosine_similarity=cosine_similarity.loc[left_node, right_node]
            )

# Sample node:edge
print(f"f511669021 -> 972d8cef69 - Cosine Similarity: {pages_graph.get_edge_data('f511669021', '972d8cef69')}")

# Print every node with its attributes
for n in pages_graph.nodes.data():
    print(n)
Sample of Nodes from the “LeBron James” Wikipedia Graph
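The original write-up does not show how the graph was handed off to Gephi. One common route (an assumption on my part, including the file name) is to export the NetworkX graph to GEXF, a format Gephi opens directly:

# Export the graph, including its node and edge attributes, to a GEXF file
# that can be opened in Gephi (the file name here is illustrative)
nx.write_gexf(pages_graph, "lebron_james_articles.gexf")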

The third and final step was visualizing the network. This was done using Gephi, an open-source graph visualization and exploration tool. The final visualization is shown below:

Network of Wikipedia Articles on “LeBron James”

The nodes with the highest contrast are the articles that have the least similarity to the other documents in the graph. Several of these articles, however, are among the first dozen results returned by the current Wikipedia search tool. This suggests that the current search tool might not be surfacing the most relevant articles at the top of its results page.

Bugs and Limitations

The primary bug encountered during this project arose in the preprocessing phase. I had initially tried to tokenize and vectorize the data manually, but this was computationally inefficient and processing time ballooned as the corpus grew. Fortunately, the scikit-learn library offers a class, TfidfVectorizer, that abstracts this process and makes it far more efficient.

The biggest limitation faced during this project was selecting an appropriate metric for measuring how similar documents are. The bag-of-words approach discards semantic properties that are relevant to the documents being studied, so a considerable amount of information was lost during the vectorization process, which may have skewed the results of the article comparisons.

Conclusion

The objective of this project was to investigate the practicality of a Wikipedia-based search engine. Simply stated, the results of this test are inconclusive, as there are several other approaches to this problem that were not addressed in this project. However, a relevant finding was the inefficiency of the current Wikipedia search tool: it does not appear to return the most relevant articles first, at least from a bag-of-words perspective. Thus, while a Wikipedia-based search engine should still be considered, the current search tool would benefit from improvement.

All the code and data used to complete this project can be found at the following GitHub repository: developing-a-wikipedia-search-engine.
