Climate Change: Wikipedia Citation Network
Complex network analysis and graph visualization using Python and Gephi
It’s no news that Wikipedia is a great hub of free information, written through an open collaborative process that speeds up content production and helps build this widely used site. To be more precise, the platform registered over 22 billion hits last month alone.
Assuming multiple revisions, Wikipedia content can be expected to give, at least to a satisfactory level, a true picture of reality, which makes the site a source for a wide range of subjects.
For instance, given a certain Wikipedia page, it’s reasonable to presume that the links in its content body to other pages denote a concrete relation between those subjects. Bringing this scenario into the scope of complex network analysis, we can treat pages as nodes and links between pages as edges. So, let’s create a graph on this basis to analyse a specific subject and its relations to other topics.
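To make the pages-as-nodes, links-as-edges idea concrete, here is a minimal sketch in plain Python. The `page_links` sample and the `build_graph` helper are hypothetical names used only for illustration; the real project may store and build its graph differently.

```python
# Hypothetical sample: each page title maps to the titles linked in its body.
page_links = {
    "Climate change": ["Greenhouse gas", "Sea level rise"],
    "Greenhouse gas": ["Climate change", "Fossil fuel"],
    "Sea level rise": ["Climate change"],
}


def build_graph(page_links):
    """Return (nodes, edges); each edge is a directed (source, target) pair."""
    nodes = set()
    edges = set()
    for source, targets in page_links.items():
        nodes.add(source)
        for target in targets:
            if target == source:
                continue  # ignore links from a page to itself
            nodes.add(target)
            edges.add((source, target))
    return nodes, edges


nodes, edges = build_graph(page_links)
print(len(nodes), "nodes and", len(edges), "edges")
```

Edges are kept directed here because a link from page A to page B does not imply a link back.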
Climate Change Network
Recently, the U.N. Intergovernmental Panel on Climate Change (IPCC) published a worrisome report about the impact of human activity on the warming of the climate, a process that might be irreversible for centuries.
Taking Wikipedia as a content source, it’s valid to use the Climate Change page as a seed for building a network of related subjects.
By scraping this page and its first-degree linked subjects, around 100 thousand pages are reached, a very large number. Browsing these nodes, however, reveals duplicate pages (same name but in the plural) and some really generic ones. After addressing this issue, we kept 96912 pages.
Presuming that more meaningful nodes are more connected, we can check their degrees (number of links) to ignore pages that are not really relevant. In this case, over 14k nodes have only one connection and many more have fewer than 10 links.
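One possible way to handle the plural duplicates and generic pages is sketched below. The exact rules used in the project are not described here, so the naive "strip a trailing s" check, the `drop_duplicates_and_generic` helper and the sample blocklist entry are all assumptions made for illustration.

```python
def drop_duplicates_and_generic(titles, generic=()):
    """Keep titles that are neither blocklisted nor a naive plural of another title.

    Assumption: a title is a plural duplicate when it ends in "s" and the same
    title without that "s" is also present (case-insensitive).
    """
    lowered = {title.lower() for title in titles}
    blocked = {title.lower() for title in generic}
    kept = []
    for title in titles:
        low = title.lower()
        if low in blocked:
            continue  # hand-picked generic page
        if low.endswith("s") and low[:-1] in lowered:
            continue  # singular form already present
        kept.append(title)
    return kept


# Hypothetical titles and blocklist, for illustration only.
titles = ["Glacier", "Glaciers", "Fossil fuel", "Fossil fuels",
          "Ecosystems", "ISBN (identifier)"]
cleaned = drop_duplicates_and_generic(titles, generic=["ISBN (identifier)"])
print(cleaned)
```

Note that "Ecosystems" survives because no singular "Ecosystem" is in the list; a stricter cleanup would need a proper lemmatizer.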
Since there are so many nodes, let’s proceed with pages that have 10 or more connections. This drastically lowers the number of nodes in the graph to just 3994. The result of this process is a Climate Change Wikipedia Citation Network.
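The degree cut can be sketched as follows. This is a single-pass filter written with the standard library only, where degree counts both incoming and outgoing links; `filter_by_degree` is a hypothetical helper, and whether the project counted degree this way is an assumption.

```python
from collections import Counter


def filter_by_degree(edges, min_degree=10):
    """Keep nodes whose total degree (in + out) is at least min_degree."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    kept_nodes = {node for node, d in degree.items() if d >= min_degree}
    # Keep only edges whose two endpoints both survived the cut.
    kept_edges = [(s, t) for s, t in edges if s in kept_nodes and t in kept_nodes]
    return kept_nodes, kept_edges


# Tiny hypothetical edge list; a threshold of 2 stands in for the real 10.
sample_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
kept_nodes, kept_edges = filter_by_degree(sample_edges, min_degree=2)
print(sorted(kept_nodes), kept_edges)
```

Because removing nodes also removes their edges, some surviving nodes may end up below the threshold; the sketch does not repeat the filter to correct for that.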
What is this image?
Each dot is a Wikipedia page. Lines represent links in a page’s content body to other pages. Labels are displayed only for the most cited nodes, and the size of each node is directly proportional to the number of links that lead to it. Meanwhile, colors denote clustering patterns throughout the network.
One way to see what each cluster (color) means is to look at the most frequent words in its page titles. For instance, take the word cloud below for the pink nodes at the bottom left of the network image. It’s evident that these pages are related to the concept of Natural Resources.
Applying this technique, these are the meaningful labels for each cluster:
- Pink (Bottom Left): Natural Resources
- Purple (Top): Geological Topics
- Orange (Left): Climate Change
- Dark Green (Right): Belief
- Cyan (Bottom): Environmental Impact
- Light Green (Center): Science and Ecology
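The word-frequency idea behind these labels can be sketched with a simple counter over page titles. The titles and the stopword list below are hypothetical, and the project’s actual word cloud tooling is not shown in this article.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"of", "the", "and", "in", "on", "for", "to", "a"}


def title_word_counts(titles):
    """Count how often each non-stopword appears across page titles."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts


# Hypothetical titles standing in for the pink cluster.
pink_titles = [
    "Natural resource",
    "Natural gas",
    "Water resources",
    "Exploitation of natural resources",
]
counts = title_word_counts(pink_titles)
print(counts.most_common(3))
```

The most frequent words are the ones a word cloud would draw largest, which is what suggests a label for the cluster.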
Clusters
The following graph images show labels only for the most cited nodes in each cluster; in graph terms, the nodes with the highest in-degree centrality. Beyond the brief descriptions given here, try to draw your own conclusions about the structure of the subgraphs and how their subjects correlate.
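For readers unfamiliar with the term, in-degree centrality can be computed as sketched below. The metrics in this article came from Gephi; this plain-Python version only illustrates the idea, using the common normalization by n - 1, and `in_degree_centrality` is a hypothetical helper.

```python
from collections import Counter


def in_degree_centrality(nodes, edges):
    """Fraction of the other nodes that link to each node."""
    n = len(nodes)
    in_degree = Counter(target for _, target in edges)
    scale = 1 / (n - 1) if n > 1 else 0.0
    return {node: in_degree[node] * scale for node in nodes}


# Hypothetical toy graph.
toy_nodes = {"Climate change", "Greenhouse gas", "Sea level rise", "Fossil fuel"}
toy_edges = [
    ("Climate change", "Greenhouse gas"),
    ("Climate change", "Sea level rise"),
    ("Greenhouse gas", "Climate change"),
    ("Greenhouse gas", "Fossil fuel"),
    ("Sea level rise", "Climate change"),
]
centrality = in_degree_centrality(toy_nodes, toy_edges)
most_cited = max(centrality, key=centrality.get)
print(most_cited, round(centrality[most_cited], 2))
```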
Climate Change
This cluster contains the Climate Change node itself. Although this page was used as the seed for building the entire network, this group contains only around 16% of the whole graph structure. As expected, nodes in this cluster are topics more directly connected to climate change: examples of causes (Greenhouse Gas), consequences (Sea Level Rise or Human Impact on Marine Life) and possible general solutions such as Reforestation.
Natural Resources
Discussions about renewable energy and natural resources do relate to climate change, but this cluster contains only 3.65% of the network’s nodes, likely because of the finite number of known resources.
It is interesting to note two subgroups: on the left, natural resources and good management practices and, on the right, worrying concepts that, although not connected, seem to be related. Also, unlike Fossil Fuel and Nuclear Power, no renewable source of energy was highlighted.
Environmental Impact
It’s no coincidence that the group bordering the previous cluster is composed of the consequences of misusing natural resources. All topics in this cluster seem to be key indicators of the human fingerprint on the environment, concerns about the consequences, and good management processes that might reverse it.
Science and Ecology
At the center is the light green cluster. Browsing through its nodes reveals a pattern of pages on fields of science and on subjects about Earth itself and its niches. This cluster ranges from History to Biology and covers almost all ecosystems and environmental aspects such as Temperature, Climate and Biodiversity. The group’s central position might be related to its diversity of pages on fields of knowledge and of nodes that represent environmental niches closely related to climate change.
Geological Topics and Political States
All political states, countries, cities and almost all regions are grouped in this cluster. As a result, this is the partition with the highest number of nodes: over 25% of the entire network. Even so, it has less than 8% of the graph’s connections.
Notice that the only country that appears highlighted is the United States. This can be due to several reasons. Since each node is a Wikipedia topic, the most explicit reason is that the United States page was cited more than other countries’ pages. The network also indicates another possible cause. Among the highlighted nodes, some of the U.S.’s nearest neighbors are Greenhouse Gas Emission and Kyoto Protocol. In reality, the United States accounts for a relevant part of greenhouse gas emissions and has not followed the Kyoto Protocol since 2001. Given all this, the country has a real impact on climate.
What about other nodes? Wayback Machine is a service for storing web pages, which does not relate at all to the project theme. It is just a really generic node and doesn’t seem to be relevant. Meanwhile, Seven Seals is a page about a biblical story. Viewing the whole network, this node is clearly an important bridge between the purple and dark green clusters.
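To compare a cluster’s share of nodes with its share of edges, something like the sketch below could be used. How edges are attributed to a cluster is not stated in this article; here an edge counts for a cluster only when both of its endpoints belong to it, which is an assumption, and `cluster_shares` is a hypothetical helper.

```python
from collections import Counter


def cluster_shares(cluster_of, edges):
    """Return {cluster: (share of nodes, share of edges)}.

    Assumption: an edge belongs to a cluster only if both endpoints do.
    """
    node_counts = Counter(cluster_of.values())
    edge_counts = Counter()
    for source, target in edges:
        if cluster_of[source] == cluster_of[target]:
            edge_counts[cluster_of[source]] += 1
    total_nodes = len(cluster_of)
    total_edges = len(edges)
    return {
        cluster: (
            node_counts[cluster] / total_nodes,
            edge_counts[cluster] / total_edges if total_edges else 0.0,
        )
        for cluster in node_counts
    }


# Hypothetical toy partition: many nodes in "purple", few internal edges.
cluster_of = {"A": "purple", "B": "purple", "C": "purple", "D": "green"}
partition_edges = [("A", "B"), ("D", "A"), ("D", "C"), ("C", "D")]
shares = cluster_shares(cluster_of, partition_edges)
print(shares)
```

A cluster with a large node share and a small edge share, as in this toy example, is made up of many pages that rarely link to one another.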
Belief, Fears and Theoretical Myths
Although this cluster has only 12.42% of the network’s nodes, it covers the largest number of edges: around 25%. Its most cited node is Global Warming which, along with Deforestation and Anoxic Event, stands for real-world phenomena. Another highly linked node, however, is Climate Fiction (a genre of audiovisual productions based on conjectures of drastic climate change), which is an abstract human concept.
Following the analysis, a very dense niche of nodes connected in a circular shape stands out. In this conglomerate, all node sizes are similar, indicating an equivalent number of citations. To get more information about this subgroup, a new graph was made plotting only the nodes in the niche. Exploring the labels reveals a great number of religious concepts, instances of human belief and fear, directly related to apocalyptic events brought about by mystical causes or natural disasters.
To illustrate this, take the word cloud below, made from the most frequent words in page titles.
A great number of words are related to religiosity or to subjects that are commonly mystified in real life.
Explore the network yourself!
Although it is interesting to dive into the network to find key indicators or draw insights, it is also necessary to grasp the big picture of the structure. That’s why an interactive version of the entire network is available on GitHub, ready to be accessed in the browser. Check it out!
Coding and Development Process
All data used in this case study was extracted from Wikipedia using the Python programming language, as were the whole treatment process and the word cloud visualizations. Meanwhile, graph plots and metrics were generated using Gephi, an open source network analysis software. For more details, check out this repository, which contains all assets, required files and the entire coding process needed to replicate this study. Feel free to clone it and try it out.
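As a rough idea of how the Python side can hand data over to Gephi, the sketch below writes an edge list as CSV with `Source` and `Target` columns, a layout that Gephi’s spreadsheet import can read as an edge table. The repository may well use a different export format; the `write_edge_csv` helper and the sample edges are assumptions.

```python
import csv
import io


def write_edge_csv(edges, fileobj):
    """Write directed edges as a CSV edge table with Source,Target columns."""
    writer = csv.writer(fileobj)
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)


# Hypothetical edges; an in-memory buffer stands in for a real file.
export_edges = [
    ("Climate change", "Greenhouse gas"),
    ("Greenhouse gas", "Fossil fuel"),
]
buffer = io.StringIO()
write_edge_csv(export_edges, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open("edges.csv", "w", newline="", encoding="utf-8")`, where the file name is only an example.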
This project is part of the Network Analyses course at Universidade Federal do Rio Grande do Norte. Co-written by André Habib.