Similarity Between Wikipedia Articles

Prince Okpoziakpo
INST414: Data Science Techniques
6 min read · Apr 20, 2023

For this project, we were tasked with collecting results for at least three “query items” from a Web-based data source and ranking them by how similar they are to one another. I chose “Computer Science”, “Neuroscience”, and “Mathematics” because I have insight into each of these domains and they have some interesting overlap. These results and methods could be incorporated into building the training dataset of Intelligent Tutoring Systems (ITS) to improve their ability to bridge gaps between topics in a domain, under the assumption that items similar to the others in a result set are, to some degree, representative of the set as a whole. This technology could be useful to teachers as a supplement to in-class learning, to academics trying to connect their field to another, or to business professionals trying to understand how their business model would translate into another industry.

Sources and Tools

The primary tools used for this project were the wikipedia API, the pandas data analytics library, the hashlib library, the scikit-learn machine learning library, the networkx graph analysis library, and the Gephi visualization software:

  • The wikipedia API is a Python wrapper for the MediaWiki API; it handles the underlying HTTP requests (e.g., GET and POST) to Wikipedia’s servers and was used to retrieve Wikipedia articles.
  • The pandas library is a popular data analysis library and was used to store data in tabular format.
  • The scikit-learn library is an industry-level machine learning library and was used to perform natural language processing operations and to calculate the similarity between Wikipedia articles.
  • The NetworkX library is a popular graph creation and analysis library, used to create all the graphs in this project. This library was used because of its simple and powerful interface, and its support for graph metadata.
  • Gephi is an open-source graph visualization software, and was used to visualize interesting properties of the graphs that were generated.
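
For reference, this stack can be pulled together with a few imports. This is a minimal setup sketch; the package versions are not pinned here, and the wiki alias for the wikipedia package is simply the naming convention assumed in the snippets below.

# Install once: pip install wikipedia pandas scikit-learn networkx
# (Gephi is installed separately as a desktop application; hashlib is in the standard library.)
import hashlib                                                # hash IDs for articles
import wikipedia as wiki                                      # MediaWiki API wrapper
import pandas as pd                                           # tabular data storage
import networkx as nx                                         # graph construction
from sklearn.feature_extraction.text import TfidfVectorizer  # text vectorization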

Data Cleaning

The first step was collecting the data from the API. As I was working with several query items, I wrote functions that retrieve the data using the wikipedia API and handle any errors encountered while collecting it:

import hashlib
import pandas as pd
import wikipedia as wiki

def get_hash_id(k):
    """Returns a 10-character hash ID for a given string."""
    bytes_ = k.encode("UTF-8")
    hash_id = hashlib.sha1(bytes_).hexdigest()
    hash_id = hash_id[:10]
    return hash_id

def get_articles(query):
    """Returns a DataFrame with articles on the specified query."""
    # create a DataFrame to store the title and content of each article
    df = pd.DataFrame(columns=["title", "content"])
    # get the article titles from Wikipedia
    titles = wiki.search(query, suggestion=True, results=100)[0]
    # iterate through the titles, storing the content of each one in the DataFrame
    for title in titles:
        hash_id = get_hash_id(title)
        try:
            content = wiki.page(title, auto_suggest=False).content
        except Exception:
            # pages that fail to load (e.g., disambiguation errors) are kept with empty content
            content = ''
        df.at[hash_id, "title"] = title
        df.at[hash_id, "content"] = content
    return df

# Retrieve the search results for "Computer Science", "Neuroscience", and
# "Mathematics" from Wikipedia
comp_sci = get_articles("Computer Science")
neuro_sci = get_articles("Neuroscience")
maths = get_articles("Mathematics")

The second step in the process was converting the Wikipedia articles from strings to numeric vectors, the format scikit-learn requires for the document similarity calculations. scikit-learn provides the TfidfVectorizer, which tokenizes each document (splits its text into terms) and vectorizes it (converts it into a vector of TF-IDF weights).
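
As a quick illustration (separate from the project pipeline), TfidfVectorizer turns a small corpus into a sparse document-term matrix in a single call; the toy documents below are made up purely for the example.

from sklearn.feature_extraction.text import TfidfVectorizer

# a toy corpus, just to show the shape of the output
docs = ["the brain processes information",
        "computers process information",
        "algebra studies abstract structures"]
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)  # (3, number_of_unique_terms): one TF-IDF vector per document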

Once the documents were tokenized and vectorized, I used cosine similarity to calculate the similarity between documents. I chose this metric because it has a simple implementation and provides a good estimate of how similar two documents are. However, it only compares the direction of the vectors, not their magnitude, and it leaves no room for proper semantic analysis, since word order and sentence structure are lost when the text is converted to vectors.
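
To make the magnitude point concrete, here is a small check, illustrative only and using scikit-learn’s cosine_similarity helper rather than the project code: scaling a vector changes its length but not its cosine similarity to another vector.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])   # same direction as a, twice the magnitude
c = np.array([[0.0, 1.0, 3.0]])

print(cosine_similarity(a, b))    # [[1.0]] -- identical despite different lengths
print(cosine_similarity(a, c))    # < 1.0  -- different direction, lower similarity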

from sklearn.feature_extraction.text import TfidfVectorizer

def similarity_matrix(corpus, column=None):
    """
    corpus: a pandas DataFrame that contains the documents.
    column: the column to be used for pairwise comparison.

    Returns a DataFrame with pairwise comparisons (cosine similarity) for each document.
    """
    docs = corpus[column].to_numpy()                      # store the relevant documents
    tfidf = TfidfVectorizer().fit_transform(docs)         # vectorize the documents
    # TF-IDF rows are L2-normalized, so the dot product equals cosine similarity
    pairwise_similarity = tfidf @ tfidf.T
    pairwise_similarity = pairwise_similarity.toarray()   # convert to a 2D numpy array
    df = pd.DataFrame(
        pairwise_similarity,
        index=corpus.index,
        columns=corpus.index
    )
    # average similarity of each document to the documents in the result set
    df["avg_sim"] = pairwise_similarity.mean(axis=1)
    return df

The final step was storing the cosine similarity matrix for each of the result sets in variables that could be referenced later.

# Get the similarity matrix for each of the query item result sets
comp_sci_sim_matrix = similarity_matrix(comp_sci, "content")
neuro_sci_sim_matrix = similarity_matrix(neuro_sci, "content")
maths_sim_matrix = similarity_matrix(maths, "content")

Graph Construction

The graphs for visualizing the network were created using the networkx and pandas libraries. To streamline the process, I wrote a create_graph function that encapsulates the necessary steps.

The create_graph function takes in two DataFrames: df, which contains the metadata and content of the documents, and sim_matrix, which contains pairwise comparisons for each document in df. It creates a NetworkX graph, adds a node for each document in df, and then adds an edge between every pair of document nodes, weighted by their similarity.

import networkx as nx

def create_graph(df, sim_matrix):
    """
    df: a pandas DataFrame containing the metadata and data of the documents.
    sim_matrix: a pandas DataFrame containing pairwise comparisons for each document in 'df'.
    """
    ids = df.index
    g = nx.Graph()
    for left_node in ids:
        # add the node to the graph, annotated with its title and
        # its average similarity to the other documents
        g.add_node(
            left_node,
            title=df.loc[left_node]["title"],
            avg_sim=sim_matrix.loc[left_node]["avg_sim"]
        )
        # add an edge between this document and every other document
        # (nx.Graph stores each undirected edge only once)
        for right_node in ids:
            if left_node != right_node:
                sim = sim_matrix.loc[left_node, right_node]
                g.add_edge(left_node, right_node, similarity=sim)
    return g

Once the graphs were created, the next step was to write each one out as a .graphml file, a graph file format that Gephi can import.

nx.write_graphml(
create_graph(comp_sci, comp_sci_sim_matrix),
"comp_sci.graphml"
)

nx.write_graphml(
create_graph(neuro_sci, neuro_sci_sim_matrix),
"neuro_sci.graphml"
)

nx.write_graphml(
create_graph(maths, maths_sim_matrix),
"maths.graphml"
)

The converted files were imported into Gephi, and the resulting visualizations are shown below. Each graph visualization is accompanied by the top 10 most relevant items from its query item's result set, where relevance is based on the average cosine similarity metric.
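
For reference, a top-10 list can be pulled straight from the avg_sim column of each similarity matrix. This is a minimal sketch using pandas' nlargest, not necessarily the exact code used to produce the tables below.

# titles of the 10 documents most similar, on average, to the rest of the result set
top10_ids = comp_sci_sim_matrix["avg_sim"].nlargest(10).index
print(comp_sci.loc[top10_ids, "title"])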

Computer Science:

“Computer Science” Top 10 Most Relevant Items
Results for “Computer Science”

Neuroscience:

Results for “Neuroscience”

Mathematics:

Results for “Mathematics”

Bugs and Limitations

This project has the following limitations:

  • It only considers three query items.
  • The similarity metric used, Cosine Similarity, doesn’t account for the magnitude of the vectors nor does it consider proper semantic analysis.
  • The graphs created only show pairwise relationships between documents and do not account for the context of the entire dataset.

Conclusion

This article described a project that involved collecting results for at least three “query items” from a Web-based data source and ranking them by how similar they are. The results could be useful for building the training dataset of Intelligent Tutoring Systems (ITS) to enhance their ability to bridge gaps between topics in a domain. The project used tools such as scikit-learn and networkx to perform natural language processing operations and to calculate the similarity between Wikipedia articles, and the resulting visualizations were created with Gephi.

In conclusion, this project provides a foundation for future research that could expand the number of query items and utilize more sophisticated similarity metrics to enhance the accuracy of the results.

All the code and data used to complete this project can be found in the following GitHub repository: Wikipedia-article-comparison.
