Network of Wikipedia Links

Yang Lu
INST414: Data Science Techniques
4 min read · Feb 24, 2022

A non-obvious insight I wanted to extract from my data was how connected different topics are to each other. The idea was inspired by the Theory of Everything, which seeks to describe the interactions between forces. I wanted to see how topic A interacts with topic B, measured by whether the two can be linked through Wikipedia, a principle borrowed from The Wiki Game. This insight does not add much to decision making; it is more of a tool for fun exploration. At most, it could help you decide whether you could win a bet that you can make a connection between UMD College Park and Earth Day, or some other random topic.

The source of my network data is an unofficial Wikipedia API. The nodes represent Wikipedia pages, with University of Maryland as the origin, and each edge indicates that the destination page is hyperlinked from the source's wiki page (source, destination format). This also means the network is directed. I gathered the data, stored it as a list of tuples, read the list into NetworkX, saw that there was a large amount of data, cleaned it up by removing duplicates, and filtered the nodes, keeping only those with an in-degree of at least 10. The result was a network with over 2,000 nodes. These 2,000+ nodes (7.39% of the total data) account for 88,000+ edges (44.4% of the total data), which is somewhat expected since I filtered by in-degree. Due to the size, I made the graph in Gephi after processing it in NetworkX.
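
A minimal sketch of those cleaning and filtering steps, assuming the crawl has already produced a list of (source, destination) tuples, might look like this (the sample edges and variable names are illustrative, and the real edge list was far larger):

```python
import networkx as nx

# (source, destination) tuples gathered from the Wikipedia crawl.
# A tiny illustrative sample, not the real data.
edges = [
    ("University of Maryland, College Park", "Public Ivy"),
    ("University of Maryland, College Park", "United States"),
    ("Public Ivy", "Indiana University Bloomington"),
]

# Deduplicate the edge list, then build the directed graph
G = nx.DiGraph()
G.add_edges_from(set(edges))

# Keep only nodes with an in-degree of at least 10
keep = [node for node, deg in G.in_degree() if deg >= 10]
G_filtered = G.subgraph(keep).copy()

# Export for layout and visualization in Gephi
nx.write_gexf(G_filtered, "umd_wiki_network.gexf")
```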

I used the OpenOrd layout in Gephi, as it was the fastest layout, then ran the result through Noverlap to make it more readable. Due to its large size, the structure of this graph is essentially a hairball. Interestingly, we can still make out a couple of clusters.

Looking at the giant cluster in the middle, I ranked the nodes by in-degree, colored from light to dark green.

We see that the majority of the nodes in this cluster have low in-degree. Looking at the two dark green nodes, one is VIAF and the other is University of Maryland, College Park. After some research, VIAF appears to be an identifier used by Wikipedia, which means I will need to do more extensive data cleaning next time. University of Maryland, College Park (in-degree 414, out-degree 808) was expected to be an important node, as it was the original wiki page used to gather the rest of the data. There were also important nodes outside the cluster.

One such node was United States, with an in-degree of 202 and an out-degree of 1,162.
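
For reference, pulling the highest in-degree nodes out of the filtered graph is a short NetworkX step (reusing the illustrative `G_filtered` from the sketch above):

```python
# Rank nodes by in-degree, highest first, and print the top hubs
top_hubs = sorted(G_filtered.in_degree(), key=lambda pair: pair[1], reverse=True)
for node, in_deg in top_hubs[:5]:
    print(f"{node}: {in_deg} in-links")
```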

Doing some random investigation, such as asking “What is the shortest path between Walt Disney Company and University of Maryland, College Park?”, the analysis returned UMD->Indiana University, Bloomington->Walt Disney Company.
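
That query maps directly onto NetworkX's shortest-path routine. A sketch, where the page titles are my guesses and would need to match the node labels the crawl actually stored:

```python
import networkx as nx

# Shortest directed path between the two pages
path = nx.shortest_path(
    G_filtered,
    source="University of Maryland, College Park",
    target="The Walt Disney Company",
)
print(" -> ".join(path))
```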

This originally made no sense, as Indiana University was not a link on the UMD wiki page. However, after some research on Wikipedia, I found that the connection ran UMD->Public Ivy->Indiana University. Public Ivy was a link on the UMD wiki page, but it may have been removed in the filtering process. The connection between Indiana University, Bloomington and Walt Disney Company was Bob Chapek, the CEO of Disney and an alumnus of the university. This accidental filtering of Public Ivy shows a limitation of the compromise between analyzing a large dataset and a laptop not powerful enough to handle the analysis. There were also nodes with no connections to some other nodes, such as Walt Disney Company and most fraternities and sororities.

One bug I encountered was that some links pointed to pages that do not exist (Wikipedia's red links), so I made the program skip over them.
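
With the `wikipedia` package as the wrapper (one common unofficial API; the exact library may differ), skipping red links looks roughly like this:

```python
import wikipedia  # assumption: the "wikipedia" package as the unofficial API wrapper

def get_links(title):
    """Return a page's outgoing links, or an empty list for red links."""
    try:
        return wikipedia.page(title, auto_suggest=False).links
    except wikipedia.exceptions.PageError:
        return []  # red link: the page does not exist, so skip it
```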

The main takeaway from my network analysis is that filtering the data can sometimes remove unknown but necessary nodes, and that a network can show that a relation exists between nodes without showing how they relate, as the UMD->Walt Disney Company example demonstrates.
