Visualizing Human Knowledge

A Graphical Exploration of Wikipedia


Visualizing Hidden Structure

Wikipedia is the product of approximately 100 million man-hours of contribution, which makes it a great store of human knowledge.

Furthermore, Wikipedia features exceptional contextual structure. An article, rather than a simple collection of words, contains a collection of links to related or prerequisite information.

Visual representation of a Graph illustrating nodes and edges

From this connectedness arises an intriguing structure that can be represented as a graph: each Wikipedia page is a node, and each link to another page is an edge.
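As a minimal sketch of that representation, the snippet below turns a page-to-links mapping into a set of nodes and a set of directed edges. The page titles and the `build_graph` helper are made up for illustration; this is not the code used for the project.

```python
# Hypothetical page -> outgoing links mapping (illustrative titles only).
page_links = {
    "Graph theory": ["Mathematics", "Vertex (graph theory)"],
    "Mathematics": ["Graph theory"],
    "Vertex (graph theory)": ["Graph theory"],
}


def build_graph(page_links):
    """Return (nodes, edges): every page is a node, every link a directed edge."""
    nodes = set(page_links)
    edges = set()
    for source, targets in page_links.items():
        for target in targets:
            nodes.add(target)              # linked pages are nodes too
            edges.add((source, target))    # direction: source links to target
    return nodes, edges


nodes, edges = build_graph(page_links)
print(len(nodes), "nodes,", len(edges), "edges")
```

Storing edges as (source, target) pairs keeps the link direction, which matters on Wikipedia because a link from one article to another is not necessarily reciprocated.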

Graph analysis is a powerful and insightful tool when looking at large sets of interconnected data, such as social networks.

Below are some captivating examples of social graphs from around the web:


Facebook

A social graph of 500 million people and their connections, Visualizing Friendships by Facebook’s Paul Butler.

A study of Facebook’s social graph has shown that 99.6% of users are connected within 6 hops. (The Small-World Phenomenon theorizes that any two people are within Six Degrees of Separation.)
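To make "connected within 6 hops" concrete, here is a small sketch that counts hops with a breadth-first search and reports the fraction of node pairs within a given distance. The four-person `friends` network and both function names are invented for the example; the Facebook study itself worked at a vastly larger scale and may have used different methods.

```python
from collections import deque


def hop_distances(adjacency, start):
    """Breadth-first search: number of hops from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def fraction_within(adjacency, max_hops):
    """Fraction of ordered node pairs that are at most max_hops apart."""
    nodes = list(adjacency)
    total = within = 0
    for source in nodes:
        dist = hop_distances(adjacency, source)
        for target in nodes:
            if target != source:
                total += 1
                if dist.get(target, float("inf")) <= max_hops:
                    within += 1
    return within / total if total else 0.0


# Hypothetical friendship chain: a - b - c - d
friends = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(fraction_within(friends, 2))
```

In this toy chain only the pair a and d is more than two hops apart, so 10 of the 12 ordered pairs fall within two hops.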


LinkedIn

A network graph from InMaps (image credit)


Now for my explorations of Wikipedia…


Wikipedia

I’ve used a similar approach to graph how one Wikipedia page links to other pages.

The previous image is a graph visualization for a given topic on Wikipedia, containing about 3,800 pages and 570,000 links between them.

The various colors represent groups of nodes (i.e. pages) that are more densely connected. This is known as Community Structure, or Network Modularity.
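One way to quantify "more densely connected" is the modularity score of a grouping: the share of edges that fall inside communities, minus the share expected if edges were placed at random with the same node degrees. The sketch below computes that score for an undirected graph. The toy graph (two triangles joined by one bridge) and the `modularity` helper are assumptions for illustration, and the snippet only scores a given grouping; it does not find communities the way a visualization tool would.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where m is the number of edges, L_c the edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # community -> number of edges fully inside it
    degree_sum = {}  # community -> sum of degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the bridge 2-3.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {n: "A" for n in range(6)}

print(modularity(toy_edges, two_groups))  # 5/14, about 0.357
print(modularity(toy_edges, one_group))   # 0.0
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which matches the intuition that the triangles are the natural communities.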

Similarly, the spatial layout is determined by an algorithm that organizes the nodes based on their connectedness. Force-directed graph drawing simulates a spring-like attractive force along edges and a repulsive force between nodes, then iterates until the layout converges on an aesthetically pleasing equilibrium.
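The idea can be sketched in a few dozen lines. The version below loosely follows the well-known Fruchterman-Reingold scheme (repulsion between all node pairs, attraction along edges, and a shrinking step size); the constants, the function name `force_layout`, and the toy graph are assumptions, and the layout algorithm actually used for the image may differ.

```python
import math
import random


def force_layout(nodes, edges, iterations=200, seed=0):
    """Return {node: (x, y)} from a simple force-directed simulation."""
    rng = random.Random(seed)                 # fixed seed -> repeatable layout
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    if not pos:
        return {}
    k = 1.0 / math.sqrt(len(pos))             # preferred edge length
    step = 0.1                                # maximum move per iteration
    order = list(pos)
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in pos}
        # Repulsive force between every pair of nodes.
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += dx / d * f
                disp[a][1] += dy / d * f
                disp[b][0] -= dx / d * f
                disp[b][1] -= dy / d * f
        # Spring-like attractive force along every edge.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[a][0] -= dx / d * f
            disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f
            disp[b][1] += dy / d * f
        # Move each node along its net force, capped by the current step.
        for n in pos:
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            move = min(d, step)
            pos[n][0] += dx / d * move
            pos[n][1] += dy / d * move
        step *= 0.98                          # cool down so the layout settles
    return {n: (p[0], p[1]) for n, p in pos.items()}


# Toy graph: two triangles joined by one bridge edge.
layout_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
layout = force_layout(range(6), layout_edges)
for node, (x, y) in sorted(layout.items()):
    print(node, round(x, 3), round(y, 3))
```

The all-pairs repulsion loop is quadratic in the number of nodes, so a graph with thousands of pages would need a faster approximation than this sketch provides.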

The node color indicates Betweenness Centrality

The ‘Centrality’ of a node often determines its size in a network visualization. Various measures exist, including Closeness Centrality, Eigenvector Centrality, and Betweenness Centrality.

Betweenness centrality measures the proportion of shortest paths a node (or edge) belongs to, which can indicate how much traffic or flow would be expected through a particular part of a network.
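For a small unweighted, undirected graph, node betweenness can be computed with Brandes' algorithm, sketched below. The function name, the adjacency-dictionary input format, and the four-node example are assumptions for illustration. The function returns raw counts; dividing by (n-1)(n-2)/2 would turn them into the proportion described above.

```python
from collections import deque


def betweenness(adjacency):
    """Raw betweenness of each node in an unweighted, undirected graph.

    adjacency maps every node to a list of its neighbors (each node must be a key).
    """
    score = {v: 0.0 for v in adjacency}
    for s in adjacency:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in adjacency}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints, so halve the totals.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical chain of four pages: a - b - c - d
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(betweenness(chain))
```

In the chain, the two middle nodes each sit on two shortest paths between other pairs, while the end nodes sit on none.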

Analyzing the Wikipedia network reveals a hidden structure. Each colored community indicates a related topic: technology, computation, linguistics, and psychology all emerge solely from the interconnectedness of the individual articles.


The Motivation

Researching an unfamiliar topic on the internet can be challenging. Reading a Wikipedia page provides many references to other topics, which can quickly lead to confusion.

However, the underlying structure that emerges from links can help determine which topics are more important than others.

Since learning a new topic is based on prerequisite knowledge, the goal of this project was to identify a small collection of the most important articles.

It is impractical to read 3,800 articles on Wikipedia. This project tries to distinguish which linked articles are merely related, and which topics are crucial prerequisites.


Conclusion

This exploration has produced some quick, interesting results. However, it is far from complete. Wikipedia is just a small subset of human knowledge that has been transcribed to the internet.

The initial results have identified important articles and broader groups (representing general disciplines). However, the current implementation only provides a ranking of ‘importance’. It lacks a sequence, or directionality (e.g. Topic X is a prerequisite of Y).

Although this project has been a great introduction to the field of graph theory, there is still plenty to learn. My future endeavors will involve refining the methodology, and ultimately creating a tool that enables the discovery of underlying context and structure of knowledge in a given field.