Climate Change: Wikipedia Citation Network

Complex network analyses and graph visualization using Python and Gephi

Lucas Martiniano
Geek Culture
7 min read · Aug 31, 2021


Wikipedia citation network based on Climate Change page

It’s no news that Wikipedia is a great hub for free information. It is written through an open collaborative process that speeds up content production and helps build such a widely used site. To be more precise, the platform registered over 22 billion hits last month alone.

Because pages go through multiple revisions, Wikipedia content can be expected to give, at least to a satisfactory level, a true picture of reality, making the site a source for diverse subjects.

For instance, given a certain Wikipedia page, it’s reasonable to presume that the links in its content body to other pages denote a concrete relation between those subjects. Bringing this scenario into the scope of complex network analysis, it’s possible to treat pages as nodes and links between pages as edges. So, let’s create a graph this way to analyze a specific subject and its relations to other topics.
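To make the pages-as-nodes idea concrete, here is a minimal sketch in plain Python. The `page_links` data and the `build_graph` helper are illustrative assumptions, not the code used in the study.

```python
# Hypothetical toy data: page title -> titles linked from its content body.
page_links = {
    "Climate Change": ["Greenhouse Gas", "Sea Level Rise"],
    "Greenhouse Gas": ["Climate Change", "Fossil Fuel"],
    "Sea Level Rise": ["Climate Change"],
}


def build_graph(page_links):
    """Return (nodes, edges): pages are nodes, links are directed edges."""
    nodes = set()
    edges = set()
    for source, targets in page_links.items():
        nodes.add(source)
        for target in targets:
            if target == source:
                continue  # ignore links from a page to itself
            nodes.add(target)
            edges.add((source, target))
    return nodes, edges


nodes, edges = build_graph(page_links)
print(len(nodes), "nodes,", len(edges), "edges")
```

Storing edges as (source, target) pairs keeps the direction of each citation, which matters later when counting how often a page is cited.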

Climate Change Network

Recently, the UN Intergovernmental Panel on Climate Change (IPCC) published a worrisome report on how human activity is warming the climate, with effects that might be irreversible for centuries.

Taking Wikipedia as a content source, it’s reasonable to use the Climate Change page as the seed for building a network of related subjects.

Scraping this page and its first-degree linked subjects reaches around 100 thousand pages, a very large number. However, browsing these nodes reveals duplicate pages (the same name in plural form) and very generic ones. After addressing this issue, we kept 96912 pages.
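The collection and cleaning steps could look roughly like the sketch below. Everything here is an assumption for illustration: `get_links` stands in for whatever scraper is used, and the plural rule (drop a trailing "s" when the singular title also exists) is a naive heuristic, not necessarily the rule applied in the study.

```python
def crawl(seed, get_links):
    """Collect links for the seed page and for every page the seed links to.

    get_links is a hypothetical callable: title -> iterable of linked titles.
    """
    page_links = {seed: list(get_links(seed))}
    for title in page_links[seed]:
        if title not in page_links:
            page_links[title] = list(get_links(title))
    return page_links


def canonical(title, known_titles):
    """Naive plural merge: map 'Glaciers' to 'Glacier' when both exist."""
    if title.endswith("s") and title[:-1] in known_titles:
        return title[:-1]
    return title


def merge_plurals(page_links):
    """Merge plural duplicates into their singular page, keeping all links."""
    known = set(page_links)
    for targets in page_links.values():
        known.update(targets)
    merged = {}
    for source, targets in page_links.items():
        src = canonical(source, known)
        bucket = merged.setdefault(src, set())
        bucket.update(canonical(t, known) for t in targets)
        bucket.discard(src)  # a merged page should not link to itself
    return merged


# Hypothetical stand-in for Wikipedia, used only to exercise the functions.
fake_wiki = {
    "Climate Change": ["Glacier", "Glaciers", "Greenhouse Gas"],
    "Glacier": ["Ice"],
    "Glaciers": ["Climate Change"],
    "Greenhouse Gas": ["Climate Change"],
}
raw = crawl("Climate Change", lambda title: fake_wiki.get(title, []))
merged = merge_plurals(raw)
print(sorted(merged))
```

Passing the link fetcher as a parameter keeps the crawl logic testable without any network access.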

Presuming that more meaningful nodes are more connected, we can check their degrees (number of links) to ignore pages that are not really relevant. In this case, over 14k nodes have only one connection and many more have fewer than 10 links.

Degree Histogram and Degree Histogram zoomed in
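A degree histogram like the one above can be computed with a couple of counters. This is a small sketch on made-up edges; it counts both incoming and outgoing links toward a node’s degree, which is an assumption about how "number of links" was measured.

```python
from collections import Counter


def degrees(edges):
    """Total degree per node: each edge counts once for both endpoints."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def degree_histogram(degree):
    """Map each degree value to the number of nodes that have it."""
    return Counter(degree.values())


# Hypothetical toy edges.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
degree = degrees(edges)
hist = degree_histogram(degree)
for value in sorted(hist):
    print(f"degree {value}: {hist[value]} node(s)")
```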

Since there are so many nodes, let’s proceed with pages that have 10 or more connections. This drastically lowers the number of nodes in the graph to just 3994. The result of this process is a Climate Change Wikipedia Citation Network.
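The cut can be sketched as a single filtering pass. The function name and the toy data are illustrative; note that degrees are measured on the original graph, so a kept node may end up with fewer links once its low-degree neighbors are gone.

```python
from collections import Counter


def filter_by_degree(edges, min_degree=10):
    """Keep nodes whose degree in the original graph is >= min_degree."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    # Only edges between two surviving nodes remain in the filtered graph.
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges


# Hypothetical toy edges, filtered with a low threshold for readability.
edges = [
    ("Hub", "Leaf0"), ("Hub", "Leaf1"), ("Hub", "Leaf2"),
    ("Leaf0", "Leaf1"), ("Leaf0", "Leaf2"),
]
keep, kept_edges = filter_by_degree(edges, min_degree=3)
print(sorted(keep), kept_edges)
```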

Climate Change Wikipedia Citation Network. Only the most cited nodes are labeled

What is this image?

Each dot is a Wikipedia page. Lines represent links from a page’s content body to other pages. Labels are only displayed for the most cited nodes, and the size of each node is directly proportional to the number of links that lead to it. Meanwhile, colors denote clustering patterns throughout the network.

One way to see what each cluster (color) means is to look at the most frequent words in its page titles. For instance, take the wordcloud below for the pink nodes at the bottom left of the network image. It’s evident that these pages are related in some way to the concept of Natural Resources.

Wordcloud generated by pink page titles
Wikipedia Climate Change Citation Network
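The word counts behind such a wordcloud can be approximated with a simple frequency table. The titles and the stopword list below are hypothetical examples, and the actual wordclouds may have been produced differently.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the study's list).
STOPWORDS = {"of", "the", "and", "in", "on", "for", "a", "an", "to"}


def title_word_counts(titles):
    """Count how often each word appears across a cluster's page titles."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts


# Hypothetical titles standing in for the pink cluster.
pink_titles = [
    "Natural Resource",
    "Natural Gas",
    "Resource Depletion",
    "Management of Natural Resources",
]
counts = title_word_counts(pink_titles)
print(counts.most_common(3))
```

The most frequent words are the ones a wordcloud would draw largest, which is what suggests a label for the cluster.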

Applying this technique, these are the meaningful labels for each cluster:

  • Pink (Bottom Left): Natural Resources
  • Purple (Top): Geological Topics
  • Orange (Left): Climate Change
  • Dark Green (Right): Belief
  • Cyan (Bottom): Environmental Impact
  • Light Green (Center): Science and Ecology

Clusters

The following graph images show labels only for the most cited nodes in each cluster. In graph terms, these are the nodes with the highest in-degree centrality. Despite the brief descriptions, try to draw your own conclusions about the subgraphs’ structure and the correlation between subjects.
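As a rough illustration, in-degree centrality can be computed as the number of incoming links divided by the number of other nodes. That normalization is a common convention and an assumption here; Gephi may report the raw in-degree instead. The edges are made up.

```python
from collections import Counter


def in_degree_centrality(nodes, edges):
    """Incoming links per node, divided by the number of other nodes."""
    in_degree = Counter(target for _, target in edges)
    others = len(nodes) - 1
    if others <= 0:
        return {node: 0.0 for node in nodes}
    return {node: in_degree[node] / others for node in nodes}


# Hypothetical toy graph: three pages cite A, and A cites B.
nodes = {"A", "B", "C", "D"}
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("A", "B")]
centrality = in_degree_centrality(nodes, edges)
most_cited = max(centrality, key=centrality.get)
print(most_cited, centrality[most_cited])
```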

Climate Change

This cluster contains the Climate Change node itself. Although this page was used as the seed for building the entire network, this group contains only around 16% of the whole graph structure. As expected, nodes in this cluster are the topics most directly connected to climate change: examples of causes (Greenhouse Gas), consequences (Sea Level Rise or Human Impact on Marine Life) and possible general solutions such as Reforestation.

Climate Change subgraph

Natural Resources

Discussions about renewable energy and natural resources are indeed related to climate change. However, this cluster contains only 3.65% of the network’s nodes, likely because of the finite number of known resources.

Natural Resources subgraph

It is interesting to note two subgroups. On the left are natural resources and good management practices; on the right are worrying concepts that, although not directly connected, seem to be related. Also, while Fossil Fuel and Nuclear Power are highlighted, no renewable energy source is.

Environmental Impact

Environmental Impact subgraph

It’s no coincidence that the group bordering the previous cluster is made up of the consequences of misusing natural resources. All topics in this cluster seem to be key indicators of the human fingerprint on the environment, of concerns about its consequences, and of the good management practices that might reverse it.

Environmental Impact cluster wordcloud

Science and Ecology

At the center is the light green cluster. Browsing through its nodes reveals a pattern of pages about scientific fields and about the Earth itself and its niches. This cluster ranges from History to Biology and covers almost all ecosystems and environmental aspects such as Temperature, Climate and Biodiversity. The group’s central position might be related to the diversity of knowledge fields it covers and to its nodes representing environmental niches closely related to climate change.

Geological Topics and Political States

All political states, countries, cities and almost all regions are grouped in this cluster. As a result, this is the partition with the highest number of nodes: over 25% of the entire network. Even so, it holds less than 8% of the graph’s connections.
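Shares like these can be reproduced with a short helper. The sketch below assumes an edge belongs to a cluster only when both of its endpoints do, which is one plausible reading of the percentages quoted here; the cluster labels and edges are made up.

```python
from collections import Counter


def cluster_shares(cluster_of, edges):
    """Return cluster label -> (share of nodes, share of internal edges).

    cluster_of maps each node to its cluster label. An edge is counted for
    a cluster only when both endpoints belong to it (an assumption).
    """
    node_counts = Counter(cluster_of.values())
    edge_counts = Counter()
    for source, target in edges:
        if cluster_of[source] == cluster_of[target]:
            edge_counts[cluster_of[source]] += 1
    n_nodes = len(cluster_of)
    n_edges = len(edges)
    return {
        label: (
            node_counts[label] / n_nodes,
            edge_counts[label] / n_edges if n_edges else 0.0,
        )
        for label in node_counts
    }


# Hypothetical toy partition and edges.
cluster_of = {"A": "purple", "B": "purple", "C": "green", "D": "green"}
edges = [("A", "B"), ("C", "D"), ("D", "C"), ("A", "C")]
shares = cluster_shares(cluster_of, edges)
print(shares)
```

A cluster with many nodes but few internal edges, like the one described here, shows up as a high first value paired with a low second value.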

Notice that the only country highlighted is the United States. There may be several reasons for this. Since each node is a Wikipedia topic, the most direct one is that the United States page was cited more often than those of other countries. The network also points to another possible cause: among the highlighted nodes, some of the nearest neighbors of the United States are Greenhouse Gas Emission and Kyoto Protocol. In reality, the United States accounts for a significant share of greenhouse gas emissions and has not followed the Kyoto Protocol since 2001. Taken together, these aspects suggest the country has a real impact on the climate.

What about other nodes? Wayback Machine is a service for storing web pages, which is not related at all to the project theme. It is just a really generic node and doesn’t seem to be relevant. Meanwhile, Seven Seals is a page about a biblical story. Viewing the whole network, this node is clearly an important bridge between the purple and dark green clusters.
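One simple way to check whether a node bridges two clusters is to count which clusters its neighbors belong to. This is only a sketch: the neighbor names and edges below are placeholders, not data from the actual network, and a measure such as betweenness centrality would be a more rigorous alternative.

```python
from collections import Counter


def neighbor_clusters(node, edges, cluster_of):
    """Count the clusters of all pages linked to or from the given node."""
    counts = Counter()
    for source, target in edges:
        if source == node:
            counts[cluster_of[target]] += 1
        elif target == node:
            counts[cluster_of[source]] += 1
    return counts


# Placeholder data: only the node name comes from the article.
cluster_of = {
    "Seven Seals": "purple",
    "purple_page_1": "purple",
    "green_page_1": "dark green",
    "green_page_2": "dark green",
}
edges = [
    ("Seven Seals", "green_page_1"),
    ("green_page_2", "Seven Seals"),
    ("purple_page_1", "Seven Seals"),
]
bridge_profile = neighbor_clusters("Seven Seals", edges, cluster_of)
print(bridge_profile)
```

A node whose neighbors are spread over more than one cluster is a candidate bridge between them.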

Belief, Fears and Theoretical Myths

Although this cluster has only 12.42% of the network’s nodes, it covers the largest share of edges: around 25%. Its most cited node is Global Warming which, along with Deforestation and Anoxic Event, stands for real-world phenomena. Another highly linked node, however, is Climate Fiction (a genre of audiovisual productions based on conjectures of drastic climate change), which is an abstract human concept.

Belief, Fears and Theoretical Myths Subgraph

Continuing the analysis, a very dense niche of nodes connected in a circular shape stands out. In this conglomerate, all nodes have similar sizes, indicating an equivalent number of citations. To learn more about this subgroup, a new graph was plotted with only the nodes in the niche. Exploring the labels reveals a great number of religious concepts, instances of human belief and fear, directly related to apocalyptic events brought about by mystical causes or natural disasters.

To illustrate this, take the wordcloud below, made from the most frequent words in page titles.

A great number of words are somewhat related to religiosity or commonly mystified subjects in real life.

Explore the network yourself!

Although it is interesting to dive into the network to find key indicators or draw insights, it is also necessary to grasp the big picture of the structure. That’s why an interactive version of the entire network is available on GitHub, ready to be opened in a browser. Check it out!

Screenshot of the interactive network

Coding and Development Process

All data used in this case study was extracted from Wikipedia using the Python programming language, which was also used for the whole data treatment process and the wordcloud visualizations. Meanwhile, graph plots and metrics were generated using Gephi, an open source network analysis software. For more details, check out this repository, which contains all assets, required files and the entire code needed to replicate this study. Feel free to clone it and try it out.
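One possible way to hand the graph from Python to Gephi is a CSV edge table with `Source` and `Target` columns, which Gephi’s spreadsheet import can read. This is a sketch of that hand-off under that assumption; the repository may use a different export format, and the edges below are toy examples.

```python
import csv
import io


def write_edge_table(edges, handle):
    """Write directed edges as a CSV edge table with Source/Target columns."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target"])
    for source, target in sorted(edges):
        writer.writerow([source, target])


# Toy edges written to an in-memory buffer; for a real file, use
# open("edges.csv", "w", newline="", encoding="utf-8") instead.
edges = {
    ("Climate Change", "Greenhouse Gas"),
    ("Greenhouse Gas", "Climate Change"),
}
buffer = io.StringIO()
write_edge_table(edges, buffer)
print(buffer.getvalue())
```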

This project is part of the Network Analyses course at Universidade Federal do Rio Grande do Norte. Co-written by André Habib.
