Complex Network Analysis: Wikipedia Map of Science

Lucas Martiniano
Published in The Startup
Dec 18, 2020

A data science project on complex network analysis using Python and Gephi, made as an assignment for my course at Universidade Federal do Rio Grande do Norte.

Complex networks are graphs with no regular topology.

Wikipedia Science Articles Main Clusters

Data Source and the Network

This case study looks at the connections between Wikipedia science pages. The network data was extracted from the English Wikipedia in early 2020 and is publicly available at Netzschleuder. Two articles are linked if the cosine similarity of their page content is above a preset threshold. So the network nodes are the Wikipedia science pages, and the edges are formed based on the similarity of page content. Because similarity is symmetric, this produces an undirected network, which means that the connections between nodes have no direction.
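
To make the linking rule concrete, here is a minimal sketch of how edges could be derived from a cosine similarity threshold. The dataset does not say how page content was vectorized, so the bag-of-words counts, the toy page texts, the function names and the threshold of 0.2 are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(pages, threshold):
    """Link every pair of pages whose similarity is above the threshold."""
    # Assumption: plain word counts stand in for the real page vectors.
    vectors = {title: Counter(text.lower().split()) for title, text in pages.items()}
    edges = []
    for u, v in combinations(sorted(vectors), 2):
        sim = cosine_similarity(vectors[u], vectors[v])
        if sim > threshold:
            edges.append((u, v, sim))  # one undirected edge per pair
    return edges


# Hypothetical toy pages, not taken from the dataset.
pages = {
    "Algebra": "equations numbers symbols equations",
    "Geometry": "shapes numbers space",
    "Sociology": "society groups behavior",
}
edges = build_edges(pages, threshold=0.2)
print(edges)
```

Each pair is visited once, so the resulting edge list has no direction, which matches the undirected network described above.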

Visualization

It is possible to visualize complex networks using only Python, with modules such as NetworkX and Matplotlib. However, this takes a lot of work and, depending on the graph, the result is not very readable:

My failed attempt
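
For reference, a Python-only drawing can be sketched roughly as follows. This is not the code behind the image above: the built-in karate club graph is a stand-in for the Wikipedia data, and the layout, node scaling and output file name are assumptions.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import networkx as nx

# Stand-in graph (assumption); the real network would be loaded from the data file.
G = nx.karate_club_graph()

# Force-directed layout, seeded so repeated runs give the same picture.
pos = nx.spring_layout(G, seed=42)
degrees = dict(G.degree())

fig, ax = plt.subplots(figsize=(8, 8))
nx.draw_networkx_edges(G, pos, ax=ax, alpha=0.3)
nx.draw_networkx_nodes(
    G,
    pos,
    ax=ax,
    node_size=[20 * degrees[n] for n in G.nodes()],  # bigger node = higher degree
    node_color="tab:blue",
)
ax.set_axis_off()

out_path = Path(tempfile.gettempdir()) / "network.png"  # hypothetical output location
fig.savefig(str(out_path), dpi=150)
plt.close(fig)
```

On a graph with thousands of nodes, a layout like this tends to collapse into a dense blob, which is the readability problem mentioned above.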

Gephi

Gephi is free, open-source software for managing graphs and networks. It offers different layout algorithms, clustering tools, sorting by node attributes, statistical insights and many other functions. In this case study, Gephi was used to create the cluster layout and to color the nodes by science type: natural (green), formal (blue), social (red) and applied (dark gray). These types were already set in the original data source file. The resulting image is shown below. Labels were intentionally omitted to improve readability, and node size is proportional to degree.

Wikipedia Science Pages Clusters — Natural (green), Formal (blue), Social (red) and Applied (dark gray)
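
One way to move a graph from Python into Gephi is through a GEXF file, which Gephi can open directly. The sketch below assumes a tiny hand-made graph; the node names and the `science_type` attribute key are hypothetical and may differ from the field names in the original data file.

```python
import tempfile
from pathlib import Path

import networkx as nx

G = nx.Graph()
# Hypothetical nodes; "science_type" is an assumed attribute name.
G.add_node("Algebra", science_type="formal")
G.add_node("Geometry", science_type="formal")
G.add_node("Ecology", science_type="natural")
G.add_edge("Algebra", "Geometry")
G.add_edge("Geometry", "Ecology")

# Store the degree as a node attribute so it can drive node size in Gephi.
for node, deg in G.degree():
    G.nodes[node]["degree"] = deg

gexf_path = Path(tempfile.gettempdir()) / "science_pages.gexf"  # hypothetical file name
nx.write_gexf(G, str(gexf_path))

# Read the file back to check that the attributes survived the export.
reloaded = nx.read_gexf(str(gexf_path))
print(reloaded.nodes(data=True))
```

In Gephi, attributes exported this way show up as columns that can be used to color or size the nodes.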

Analysis

To understand how connections occur in the network, some measurements were computed for each node and for the network as a whole.

The charts below show the distribution of the number of connections per node, also known as degree. Most nodes have 30 or fewer links, while there are many outliers with more than 60. This is a skewed distribution, which indicates an unbalanced number of connections in the network: some science pages are widely connected, while others are too specific to relate to many topics.

Histogram and Boxplot Chart of Degree Distribution
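
The numbers behind a degree histogram and boxplot can be sketched with the standard library alone. The edge list below is a toy example, and the outlier rule (values above Q3 + 1.5 × IQR) is the usual boxplot convention, assumed here rather than taken from the original analysis.

```python
from collections import Counter
from statistics import quantiles


def degrees_from_edges(edges):
    """Count how many edges touch each node of an undirected graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def upper_outliers(degree):
    """Nodes whose degree lies above the upper boxplot fence."""
    q1, _, q3 = quantiles(degree.values(), n=4)
    fence = q3 + 1.5 * (q3 - q1)
    return {node: d for node, d in degree.items() if d > fence}, fence


# Toy graph (assumption): one hub linked to eight leaves, plus two leaf-to-leaf links.
edges = [("hub", f"leaf{i}") for i in range(8)]
edges += [("leaf0", "leaf1"), ("leaf2", "leaf3")]

degree = degrees_from_edges(edges)
histogram = Counter(degree.values())  # degree value -> number of nodes
outliers, fence = upper_outliers(degree)

print(sorted(histogram.items()))
print(outliers, fence)
```

A few highly connected nodes sitting far above the fence, with most nodes at low degree, is the same skewed pattern described for the Wikipedia network.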

Since the network data was classified into different science groups, the node connections are expected to be arranged into clusters, and so the clustering coefficients of the nodes are expected to be similar to one another. This is indeed the case for this network, as shown in the charts.

Histogram and Boxplot of Clustering Coefficient
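
For readers unfamiliar with the metric, the local clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected. The sketch below writes the definition out without a graph library so each step is visible; the adjacency dictionary is a toy example, and in practice NetworkX's `nx.clustering` does this work.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are directly linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair to close a triangle
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


# Toy undirected graph (assumption): triangle A-B-C with a pendant node D on A.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

coefficients = {node: local_clustering(adj, node) for node in adj}
print(coefficients)
```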

Network Centralities

So far, both measurements described individual nodes and their distributions. Now the network as a whole is analyzed to find its most relevant nodes. These are the most central nodes of the graph, and there are different methods to identify them.

Degree Centrality

A simple indicator that counts how many neighbors a node has. In this network, the Wikipedia science page for School Psychology, from the Social cluster, has the most neighbors.

Closeness Centrality

This measurement evaluates how close a node is to the rest of the graph, based on its shortest-path distances to the other nodes, and so identifies the nodes best placed to affect the whole network. Since the graph has many clusters, the closeness centrality of the nodes is expected to be similar across nodes, as shown in the boxplot for this metric below. “Computing”, from “Applied”, is the page with the highest closeness centrality in the network.
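
As a rough illustration of the definition, closeness can be computed with a breadth-first search: it is the number of other nodes divided by the sum of shortest-path distances to them. The sketch assumes a connected, unweighted graph, and the four-node path is a toy example.

```python
from collections import deque


def closeness(adj, source):
    """(n - 1) / sum of shortest-path distances from `source` (connected graph assumed)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:  # first visit gives the shortest distance in BFS
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    if total == 0:
        return 0.0  # isolated node
    return (len(dist) - 1) / total


# Toy path graph (assumption): a - b - c - d
adj = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}

scores = {node: closeness(adj, node) for node in adj}
print(scores)
```

The inner nodes b and c score higher than the endpoints because they are, on average, fewer steps away from everything else.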

Betweenness Centrality

Measures, for each node, how often it lies on the shortest paths between pairs of other nodes in the graph. Higher betweenness centrality indicates that the node acts as an intermediary for more connections: it is a common link between many others, which increases the network's dependency on it. The page with the highest betweenness centrality is “Population biology”, in “Natural”.

Boxplot for the Distribution of centralities in the network
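
The three centralities can be compared side by side with NetworkX. The sketch below uses a toy graph of two triangles joined through a single node, not the Wikipedia data, and the node names are hypothetical.

```python
import networkx as nx

# Toy graph (assumption): triangles a-b-c and d-e-f joined through "bridge".
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "bridge"), ("bridge", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
])

centralities = {
    "degree": nx.degree_centrality(G),            # neighbors / (n - 1)
    "closeness": nx.closeness_centrality(G),      # inverse of mean shortest-path distance
    "betweenness": nx.betweenness_centrality(G),  # share of shortest paths through the node
}

# Highest-scoring node for each measure (ties resolve to the first node inserted).
top = {name: max(values, key=values.get) for name, values in centralities.items()}
print(top)
```

In this toy graph the nodes with the most neighbors are c and d, while "bridge" has the highest closeness and betweenness, which shows how different centralities can single out different nodes, as they do in the Wikipedia network.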
