Visual network analysis with Gephi

Tutorial 03 in a series on controversy mapping

Ethnographic Machines
Feb 12, 2019

In this tutorial, we will cover the basics of doing a visual network analysis in Gephi. We will use the list of Wikipedia articles from the Circumcision category harvested in Tutorial 02 as a point of departure and consider different ways to think about and work with these articles as networks.

Gephi is a piece of open-source software that lets you visualize, manipulate and explore networks. A number of great introductions to the software are already available, such as this one or these.

Networks (a.k.a. graphs) can be stored in a variety of file formats (for example .gexf, .gdf, or .gml) but all of them consist of a list of nodes and a list of edges connecting the nodes. Visually speaking the nodes are the dots and the edges the lines between the dots, but nodes can represent any kind of entity and edges can represent any kind of relationship between two entities. Depending on the file format, nodes and edges can also have a variety of attributes associated with them (e.g. color, category, weight, etc).
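To make the idea of "a list of nodes and a list of edges" concrete, here is a minimal sketch that writes a tiny network to GEXF-style XML using only the Python standard library. The three articles, the function name `to_gexf`, and the edge weights are made up for illustration. The XML layout follows the GEXF 1.2 draft as far as we understand it, so treat it as an approximation rather than a complete writer.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy data: a list of nodes and a list of edges.
nodes = [
    {"id": "n0", "label": "Circumcision"},
    {"id": "n1", "label": "Foreskin"},
    {"id": "n2", "label": "Anesthesia"},
]
edges = [
    {"source": "n0", "target": "n1", "weight": 1.0},
    {"source": "n0", "target": "n2", "weight": 1.0},
]

def to_gexf(nodes, edges):
    """Serialize the node and edge lists as a GEXF-style XML string."""
    gexf = ET.Element("gexf", {"xmlns": "http://www.gexf.net/1.2draft", "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"mode": "static", "defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, "nodes")
    for node in nodes:
        ET.SubElement(nodes_el, "node", {"id": node["id"], "label": node["label"]})
    edges_el = ET.SubElement(graph, "edges")
    for index, edge in enumerate(edges):
        ET.SubElement(edges_el, "edge", {
            "id": str(index),
            "source": edge["source"],
            "target": edge["target"],
            "weight": str(edge["weight"]),
        })
    return ET.tostring(gexf, encoding="unicode")

xml_text = to_gexf(nodes, edges)
print(xml_text)
```

Whatever the file format, the same two lists are what ends up on disk. Only the syntax around them changes.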

Building different networks from Wikipedia articles

So, in order to build a network, we need to think about ways in which articles on Wikipedia can be said to relate to each other. The most obvious way is the direct hyperlinks that editors of an article provide to other articles on Wikipedia when relevant. In the screenshot below the editors of the ‘Circumcision’ article have decided to make in-text references to the articles about ‘foreskin’, ‘human penis’, ‘glans’, ‘circumcision device’, ‘anesthesia’, ‘physiologic stress’, ‘elective surgery’, and so forth.

Two types of hyperlinks in the ‘Circumcision’ article on Wikipedia, namely 1) in-text links to other Wikipedia articles (highlighted words in the text) such as ‘foreskin’, ‘human penis’ or ‘glans’, and 2) external references to relevant sources and literature (highlighted numbers in superscript and hard brackets).

Based on these hyperlinks to other Wikipedia articles, we can build a network where the nodes are articles like ‘circumcision’, ‘foreskin’, or ‘anesthesia’, and the edges are direct references between them. Below is an example. Two distinct clusters are clearly visible suggesting that articles in the ‘Circumcision’ category are thematically related in two distinct ways. We can draw that conclusion because we have constructed the network in such a way that an edge between two nodes reflects a thematic relationship between two articles (the link to the ‘foreskin’ article from the text in the ‘circumcision’ article has been put there by the editors because it is thematically relevant). These two groups of thematically related articles revolve around male and female circumcision (a.k.a. female genital mutilation) respectively. Rather than being part of the same issue, the debates about male and female circumcision seem to be associated with different people, different problems, and different events.

A network of articles from the ‘Circumcision’ category on Wikipedia connected to each other by hyperlinks. Nodes colored by Louvain modularity; graph layout by ForceAtlas2.

Question: Can you explain why the network of Wikipedia articles connected by hyperlinks (above) shows us how these articles are thematically related?

You can build the network yourself using this Python script (open as a notebook in Jupyter) which calls the Wikipedia API and asks for all links from a list of articles (you will need this .json file with the results of the category scrape from Tutorial 02 as input). The script outputs two networks:

  1. A .gexf file with articles from the ‘Circumcision’ category connected to each other by wiki-links.
  2. A .gexf file with articles from the Circumcision category connected to each other AND TO ALL OTHER ARTICLES THEY CITE by wiki-links.
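The difference between the two outputs can be sketched in a few lines. This is not the script itself, just an illustration under assumptions: `links` is a hypothetical dictionary mapping each article in the category to the titles it links to, and `build_link_network` is a name invented here.

```python
# Hypothetical input: each category article and the articles it links to.
links = {
    "Circumcision": ["Foreskin", "Anesthesia", "Brit milah"],
    "Foreskin": ["Circumcision", "Human penis"],
    "Brit milah": ["Circumcision"],
}

def build_link_network(links, category_only=True):
    """Return (nodes, edges) for a directed hyperlink network.

    category_only=True keeps only links between articles in the category;
    category_only=False also keeps every other article they link to.
    """
    members = set(links)
    edges = set()
    for source, targets in links.items():
        for target in targets:
            if target == source:
                continue  # ignore self-links
            if category_only and target not in members:
                continue  # drop links that leave the category
            edges.add((source, target))  # a link is counted once
    nodes = set(members)
    for _, target in edges:
        nodes.add(target)
    return sorted(nodes), sorted(edges)

inner_nodes, inner_edges = build_link_network(links, category_only=True)
all_nodes, all_edges = build_link_network(links, category_only=False)
print(len(inner_nodes), len(inner_edges))
print(len(all_nodes), len(all_edges))
```

The second variant usually grows much larger than the first, because every article outside the category that is linked to even once becomes a node.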

We will get back to how to open and visualize these .gexf files in Gephi. But before we get to that we should ask ourselves if the most interesting relationship to map between a large number of Wikipedia articles is always their hyperlinks to each other? What if, for example, we were interested in the sources they reference in support of their claims rather than the articles they point to for further reading? In accordance with Wikipedia’s editor guidelines, each article is equipped with a set of references to external sources (provided as footnotes with a reference list at the bottom of each article).

Example from the reference section of the ‘Circumcision’ article on Wikipedia.

This Python script calls the Wikipedia API and asks for all external references from a list of articles. It then builds a network (.gexf file for Gephi) where the articles on the list are the nodes (same as the first network) but the edges reflect the volume of external references shared by two articles (although the edges look like direct links from one node to another, they are in reality projected through a set of external references that both nodes are citing). The result is visualized below.

A network of articles from the ‘Circumcision’ category on Wikipedia connected by the degree to which they reference the same external sources. Nodes colored by Louvain modularity; graph layout by ForceAtlas2.

The first thing to notice is that the edges are now weighted (represented here as varying thickness). This was not possible in the first network of articles connected by hyperlinks since a link from one article to another can only meaningfully be counted once. In this case, however, two articles could have 5% or 95% of their external references in common, which in fairness should count differently in terms of how strongly the two articles can be said to be related. Such differences in relationship strength are represented as differences in edge weight.
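Here is a minimal sketch of how such a projection through shared references could be computed. It rests on assumptions: the `references` dictionary and its placeholder source names are invented, and since we do not state here whether the script weights edges by the raw number of shared references or by the share of references in common, the sketch computes both (the share as shared references divided by all references cited by either article).

```python
from itertools import combinations

# Hypothetical data: each article mapped to the external sources it cites.
references = {
    "Circumcision": {"ref-A", "ref-B", "ref-C"},
    "Foreskin": {"ref-B", "ref-C"},
    "Female genital mutilation": {"ref-A", "ref-D"},
    "Brit milah": {"ref-E"},
}

def shared_reference_edges(references):
    """Connect two articles when they cite at least one common source."""
    edges = {}
    for a, b in combinations(sorted(references), 2):
        shared = references[a] & references[b]
        if not shared:
            continue  # no common source, so no edge
        union = references[a] | references[b]
        edges[(a, b)] = {
            "shared": len(shared),              # raw count of common sources
            "share": len(shared) / len(union),  # proportion of all sources cited
        }
    return edges

edges = shared_reference_edges(references)
for pair, weight in edges.items():
    print(pair, weight)
```

Note that the external sources never appear as nodes. They only exist in the intermediate step that decides whether two articles get an edge and how heavy it is.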

Question: Can you explain how two nodes are related to each other when they are connected by an edge in the network above? And can you explain why some edges get a different edge weight than others?

The next thing to notice is that the same pattern of two clusters centered on male and female circumcision is found in this network as well. We could be tempted to say that the analysis of the network of circumcision articles connected by shared external references is, therefore, simply confirming what we already know from the analysis of the network of circumcision articles connected by direct hyperlinks. In reality, however, it is telling us something new. What we can say now, on top of the fact that articles about male circumcision point to different related topics than articles about female circumcision, is that articles about male circumcision draw on a different body of knowledge than articles about female circumcision. They simply reference a completely different set of reports and academic publications in support of what they state to be true.

Question: Can you explain how the network of articles connected by shared references is built differently from the network of articles connected by direct hyperlinks? And can you explain why the analysis of these two networks permits us to learn different things about the way circumcision debates are presented on Wikipedia?

Deciding how to build a network (i.e. what should count as nodes and edges) is never trivial. On the contrary, it involves important analytical choices. This Python script builds a network of editors connected to articles through revising them. It is yet a third way to build a network, this time bipartite with both editors and articles as nodes, that will allow us to see if the same editors write all the articles about circumcision, or if there are different editor communities interested in different issues (male and female circumcision, perhaps). At the end of the day, it is up to us, the controversy mappers, to make a reasoned choice about what kind of relations we are going to use to make what kind of claims about regions, communities or themes in the terrain we are mapping. Richard Rogers (2018) thus encourages us to not naïvely accept the metrics offered to us by tools and media platforms but to think critically about how we repurpose them to answer specific questions.
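As a rough illustration of the bipartite idea (not the script itself), the sketch below builds an editor-article network from a hypothetical list of `(editor, article)` revision records. Using the number of revisions as edge weight is an assumption made for the example, as are the editor names and both function names.

```python
from collections import Counter

# Hypothetical revision log: one (editor, article) pair per revision.
revisions = [
    ("editor_1", "Circumcision"),
    ("editor_1", "Circumcision"),
    ("editor_1", "Foreskin"),
    ("editor_2", "Foreskin"),
    ("editor_2", "Circumcision"),
    ("editor_3", "Female genital mutilation"),
]

def build_bipartite(revisions):
    """Return (nodes, edges) with two node types: editors and articles."""
    counts = Counter(revisions)
    nodes = {}
    edges = []
    for (editor, article), n_revisions in sorted(counts.items()):
        editor_id = f"editor:{editor}"      # prefixes keep the two node
        article_id = f"article:{article}"   # types from colliding
        nodes[editor_id] = "editor"
        nodes[article_id] = "article"
        edges.append((editor_id, article_id, n_revisions))
    return nodes, edges

def editors_of(article_ids, edges):
    """All editors with at least one revision on the given articles."""
    return {editor for editor, article, _ in edges if article in article_ids}

nodes, edges = build_bipartite(revisions)
group_a = editors_of({"article:Circumcision", "article:Foreskin"}, edges)
group_b = editors_of({"article:Female genital mutilation"}, edges)
print(group_a & group_b)  # editors active in both groups of articles
```

In a bipartite network edges only ever run between the two node types, never from editor to editor or from article to article.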

Some basic operations: Layout, coloring, node size

When you open a graph file the following import report should appear (I am opening the network of Circumcision pages connected by in-text links):

Opening a .gexf file (or any other graph file format) generates an ‘import report’. If you are happy with the information displayed here, just click ‘OK’.

Generally, you should be able to recognize the number of nodes and edges displayed here. Also, if Gephi encounters problems opening the graph file these will be displayed in the ‘Issues’ window. You will normally be able to open the file anyway. Click ‘OK’ if you are happy with the information you see. The graph will open and be displayed in a random layout (i.e. nodes placed randomly in space) such as the one below.

Initial random layout of the graph in the ‘Overview’ pane.
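If you want to cross-check the node and edge counts in the import report outside Gephi, a small sketch like the following may help. It counts `node` and `edge` elements in a GEXF document with the standard library. The sample XML and the function name are invented for the example, and the sketch ignores the XML namespace so that it does not depend on the GEXF version.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature GEXF document used as sample input.
SAMPLE_GEXF = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="Circumcision"/>
      <node id="1" label="Foreskin"/>
      <node id="2" label="Anesthesia"/>
    </nodes>
    <edges>
      <edge id="0" source="0" target="1"/>
      <edge id="1" source="0" target="2"/>
    </edges>
  </graph>
</gexf>"""

def count_nodes_and_edges(root):
    """Count <node> and <edge> elements, whatever namespace they carry."""
    def local_name(element):
        return element.tag.rsplit("}", 1)[-1]
    n_nodes = sum(1 for el in root.iter() if local_name(el) == "node")
    n_edges = sum(1 for el in root.iter() if local_name(el) == "edge")
    return n_nodes, n_edges

counts = count_nodes_and_edges(ET.fromstring(SAMPLE_GEXF))
print(counts)
```

For a file on disk you would pass `ET.parse(path).getroot()` instead, where `path` points to your own .gexf file.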

You can turn node labels on and off by clicking the big ‘T’ in the menu at the bottom of the ‘Graph’ window, and you can scale the label sizes by using the slider on the right side of the menu at the bottom of the ‘Graph’ window. If you want to see all the information for a specific node you can use the ‘Edit node attributes’ tool (cursor with a small question mark) located in the menu on the left side of the ‘Graph’ window. You can zoom in and out of the graph using the mouse scroll. You can also right-click and drag the graph to reposition your view.

Layout

In order to help us explore the structure of the graph and see clusters, bridges, and structural holes, we can use a force-directed layout algorithm. Force-directed layouts (or force vector or spring-based layouts) will push nodes apart from each other. Edges between nodes will act as springs pulling these nodes together. Stronger edges (heavier edge weights) will act as stronger springs. We can choose the ForceAtlas2 Layout from the ‘Layout’ dropdown on the left side of the ‘Graph’ window:

Choosing a layout algorithm.

Before running the layout let us review some of the parameters available to us. Eventually, the idea is to iteratively keep tweaking these parameters as we see the result of the layout. This, then, is just the initial setup. ‘Scaling’ will control the size of the area over which your layout can spread. If you need more space between nodes, increasing the scaling is a good option. Gravity controls the degree to which nodes are pulled towards the center of the network. If you have parts of the network floating far away from the center, increasing gravity is a good option. Finally, if you have a cluttering of nodes overlapping each other you can turn on ‘Prevent overlap’. However, this option should not be used until the Layout is otherwise complete as it may prevent nodes from finding their place. When your parameters are set, click ‘Run’.

Setting the parameters of the ForceAtlas2 layout
The result of a ForceAtlas2 layout with Scaling 50, and Gravity set to ‘Stronger’ but decreased to 0.02.
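To get a feel for what the forces and parameters do, here is a toy force-directed layout in plain Python. It is emphatically not ForceAtlas2, only a simplified sketch under our own assumptions: repulsion between every pair of nodes is `scaling / distance`, each edge pulls with `weight * distance`, and gravity pulls every node towards the origin in proportion to its distance from it. The function name and all default values are invented for the example.

```python
import math
import random

def toy_force_layout(nodes, edges, scaling=1.0, gravity=0.0,
                     steps=500, step_size=0.05, seed=1):
    """Iteratively move nodes under repulsion, edge springs and gravity."""
    rng = random.Random(seed)
    # Start from a random layout, as Gephi does when a graph is opened.
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes each other apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = scaling / dist
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Attraction: each edge is a spring, heavier edges pull harder.
        for a, b, weight in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += weight * dx
            force[a][1] += weight * dy
            force[b][0] -= weight * dx
            force[b][1] -= weight * dy
        # Gravity: pull every node towards the center of the layout.
        for n in nodes:
            force[n][0] -= gravity * pos[n][0]
            force[n][1] -= gravity * pos[n][1]
        # Move the nodes, capping each step to keep the toy model stable.
        for n in nodes:
            fx, fy = force[n]
            move = step_size * math.hypot(fx, fy)
            cap = min(1.0, 0.1 / move) if move > 0 else 0.0
            pos[n][0] += step_size * fx * cap
            pos[n][1] += step_size * fy * cap
    return pos

pair = ["a", "b"]
spring = [("a", "b", 1.0)]
base = toy_force_layout(pair, spring)
spread = toy_force_layout(pair, spring, scaling=4.0)
pulled = toy_force_layout(pair, spring, gravity=2.0)
print(math.dist(base["a"], base["b"]))
print(math.dist(spread["a"], spread["b"]))
print(math.dist(pulled["a"], pulled["b"]))
```

With just two connected nodes the effect of the parameters is easy to see: raising the scaling leaves them further apart, while raising gravity squeezes them closer together. The real algorithm adds refinements such as degree-dependent repulsion and adaptive speed that this sketch leaves out.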

Size

Given that this is a directed network where one page points to another page through a link it could make sense to visualize the most cited nodes. We can do that in the ‘Appearance’ pane to the left of the ‘Graph’ window. Select ‘Nodes’, ‘Ranking’ and the icon with growing concentric circles. From the dropdown, select ‘In-Degree’. This will size the nodes by the volume of incoming edges from other nodes in the network. Set minimum and maximum size for the nodes and click ‘Apply’.

Setting node size in the ‘Appearance’ pane.
Nodes sized by in-degree
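What happens behind the ‘In-Degree’ ranking can be sketched as follows: count the incoming edges of each node, then map the counts onto the chosen size range. The linear mapping between minimum and maximum size is an assumption for the sake of the example, and the edge list and function names are made up.

```python
from collections import Counter

# Hypothetical directed edges (source, target) between articles.
nodes = ["Circumcision", "Foreskin", "Brit milah", "Anesthesia"]
edges = [
    ("Foreskin", "Circumcision"),
    ("Brit milah", "Circumcision"),
    ("Anesthesia", "Circumcision"),
    ("Circumcision", "Foreskin"),
]

def in_degree(nodes, edges):
    """Number of incoming edges per node (0 for nodes nobody links to)."""
    counts = Counter(target for _, target in edges)
    return {node: counts.get(node, 0) for node in nodes}

def scale_sizes(values, min_size=10.0, max_size=40.0):
    """Map values linearly onto the range [min_size, max_size]."""
    lowest, highest = min(values.values()), max(values.values())
    span = highest - lowest
    if span == 0:
        return {node: min_size for node in values}
    return {
        node: min_size + (value - lowest) * (max_size - min_size) / span
        for node, value in values.items()
    }

degrees = in_degree(nodes, edges)
sizes = scale_sizes(degrees)
print(degrees)
print(sizes)
```

Swapping `target` for `source` in `in_degree` would give the out-degree instead, which answers a different question: which articles point to many others, rather than which are pointed to.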

Color

Finally, let us add some color to the nodes. As is the case with sizing, we can color nodes by many different parameters. In this case, we will calculate the modularity of the graph and color the nodes by the resulting modularity classes. From the ‘Statistics’ pane on the right of the ‘Graph’ window, run ‘Modularity’. For now, we can use default settings. The algorithm will try to cut the graph into communities where nodes are strongly related to each other inside the community and weakly connected to nodes in other communities. Click ‘OK’ to run the algorithm.

Running the Modularity algorithm.

When the results are in we can open the ‘Appearance’ pane on the left side of the ‘Graph’ window and select ‘Nodes’, ‘Partition’ and the color palette icon. From the dropdown menu, select ‘Modularity Class’ and click ‘Apply’.

Coloring nodes by Modularity Class.
Nodes colored by modularity.
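The score that the ‘Modularity’ statistic tries to maximize can be illustrated with a short sketch. The code below only evaluates the modularity of a partition we hand to it, on an undirected and unweighted toy graph. It does not search for the best partition, which is what the Louvain method used for the figures does. The graph, the partitions and the function name are invented for the example.

```python
from collections import defaultdict

# Hypothetical toy graph: two triangles joined by a single bridge (c-d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]

def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    For each community: (share of edges inside it) minus
    (the share expected if edges were placed at random given node degrees).
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for a, b in edges:
        degree_sum[community[a]] += 1
        degree_sum[community[b]] += 1
        if community[a] == community[b]:
            internal[community[a]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}
print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

Splitting the graph along the bridge scores higher than lumping everything together, which is the intuition behind coloring by modularity class: nodes that share a color are more densely connected to each other than chance would suggest.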

You have now learned to build different networks from articles on Wikipedia and to think about the analytical implications of these choices. You have also learned how to open these networks in Gephi and conduct basic visual network analysis by using a force-directed layout and the modularity statistic to find clusters, and by using simple centrality metrics to find important nodes by different criteria. In the next tutorial, you will learn how to work with timelines in Tableau.


“Traditional social science is on the lookout for variables; ethnographers are on the lookout for patterns” (Agar 2006, 109)