How to create network visualisations with Gephi: A step by step tutorial
In this tutorial, I will give you a brief introduction to Gephi, a network visualisation tool. This guide will also briefly explain why visualising networks can be useful and when they can be used.
This tutorial will cover:
- A brief explanation of networks & visualisation
- Installation of Gephi
- Data Preparation
- Importing Data into Gephi
- Gephi’s Main Functions
- Network Layout & Visualisation
- Customisation
- Further Resources
What is a network?
A network is a collection of interconnected points or entities. A network visualisation is a way of illustrating the relationships between those points or entities.
Network visualisations are often used to represent and identify complex relationships in a more easily comprehensible way.
Networks also help in mapping out and analysing how different elements interact with each other within a given system: for example, whether two points interact strongly or weakly, or whether one point exerts more influence than another.
Most networks, regardless of their specific type or purpose, are composed of two basic components: nodes and edges.
Nodes are the fundamental units of a network, representing the entities or points within the system.
Edges are the links or connections between nodes in a network. They illustrate the relationships or interactions between points/entities.
Edges can be either directed or undirected; directed edges have a specified direction indicating the flow or relationship from one node to another, while undirected edges represent bi-directional or non-specific relationships between nodes.
For example, in a social media influence network, a directed edge would represent the influence of one user on another (say an actual influencer and a follower), while an undirected edge would indicate a more balanced relationship between individuals (such as two Facebook friends).
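As a minimal sketch of this distinction, a directed edge can be stored as an ordered pair and an undirected edge as an unordered pair. The user names and helper functions below are made up for illustration and are not part of Gephi.

```python
# Hypothetical example data: the names are illustrative only.
directed_edges = {("influencer", "follower")}     # ordered: influence flows one way
undirected_edges = {frozenset({"alice", "bob"})}  # unordered: a mutual friendship


def has_directed_edge(source, target):
    """True only if an edge points from source to target."""
    return (source, target) in directed_edges


def has_undirected_edge(a, b):
    """True regardless of the order in which the two nodes are given."""
    return frozenset({a, b}) in undirected_edges


print(has_directed_edge("influencer", "follower"))  # True
print(has_directed_edge("follower", "influencer"))  # False
print(has_undirected_edge("bob", "alice"))          # True
```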
Why use Gephi to visualise networks?
Gephi is open-source software that helps you analyse and visualise your networks. It allows you to customise, zoom in on, filter and analyse your networks, and to see the visualisation update in real time.
This is really helpful if you are still exploring an unfamiliar network, as it provides a hands-on, iterative way of uncovering insights. With Gephi, you can experiment with different layout algorithms and filters, making it an excellent tool for both learning about the intricate details of a network and visualising them.
Tutorial Case Study: Purchasing patterns of takeaway food
In this tutorial, we will be using a dataset based on a current Nesta research project as a case study. It contains information on commonly purchased ‘out of home’ meals (e.g. restaurant meals, ‘on the go’ foods and takeaways).
The meals were identified from a larger dataset of purchasing activity, where we utilised the Apriori algorithm in Python to produce a set of ‘association rules’ based on products commonly purchased together. This approach is an example of market basket analysis, a method widely used for discovering co-occurrence relationships among activities or items in large transactional datasets.
This example comes from more recent work by the ‘A Healthy Life’ data team, due to be published later this year, which aims to better understand purchasing behaviour (and therefore calorie consumption) out of home (restaurants, cafes, takeaways, etc.). This follows on from our previous work measuring food and drink purchasing using k-means clustering and identifying targets for reformulation in the in-home sector (groceries).
Installation and setup
To download Gephi to your machine, simply head over to the Gephi website. Gephi is available for Windows, Mac, and Linux, so choose the version that suits your operating system.
On launching Gephi for the first time, you might be prompted to install additional plugins or update the software. It’s a good idea to do this to ensure you have all the latest features and fixes. It might take a bit longer to open up the first time as it sets up the environment on your machine.
Stage 1: Preparing your data
The first task will be to turn your raw data into network data — that is nodes and edges — that Gephi can use to create your network.
Gephi supports a variety of file formats for importing network data, including GEXF, GDF, GML, GraphML, Pajek NET, and CSV files (for a comprehensive list and details of these formats, you can refer to Gephi’s Supported Graph Formats).
For the purpose of this tutorial, we have chosen to use CSV files to represent our network data. This is because they are fairly easy to understand and work with, whilst also allowing us to define the separate node and edge information needed for our visualisation.
- Edge Files: These contain information about the relationships or connections in our network. In our working example, we will use the products in our dataset that are frequently purchased together.
- Node Files: Although Gephi can infer nodes from edge data, providing a separate node file gives us more control over how each node is represented in the visualisation. This can be particularly useful if we have additional attributes for each item (such as category, frequency, or other specific details) that we want to include in our network graph.
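To make the two file types concrete, here is a minimal sketch of what a small node table and edge table might look like as CSV text. The `Id`, `Label`, `Source` and `Target` headers are the ones Gephi's spreadsheet importer looks for; the `Category` and `Lift` columns, the helper name `to_csv_text` and all of the values are invented for illustration.

```python
import csv
import io

# Illustrative rows only: the categories and lift value are invented.
node_rows = [
    {"Id": "Burger", "Label": "Burger", "Category": "Main"},
    {"Id": "Fries", "Label": "Fries", "Category": "Side"},
]
edge_rows = [
    {"Source": "Burger", "Target": "Fries", "Lift": 2.5},
]


def to_csv_text(rows):
    """Serialise a list of dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


nodes_csv = to_csv_text(node_rows)
edges_csv = to_csv_text(edge_rows)
print(nodes_csv)
print(edges_csv)
```

Any extra columns, such as `Category` here, should then be available as attributes for colouring or filtering once the files are imported.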
Producing the node and edge files in Python
From our larger dataset of food purchases, we used Python to identify ‘meals’ i.e. food items bought together.
We utilised the Apriori algorithm, which is used to determine the relationship between items and is often used by online retailers to suggest products to customers, to produce a set of ‘association rules’ based on products commonly purchased together.
This approach is an example of market basket analysis (MBA), a method widely used for discovering co-occurrence relationships among activities or items in large transactional datasets.
Below I have included the code to run the MBA method and generate the edges and nodes CSV files (based on dummy purchase data) using the Apriori algorithm.
(As this tutorial is focused on visualising networks I won’t talk in detail about this method but there are tutorials on applying the Apriori algorithm at the end of this tutorial.)
In this first code snippet I create an example ‘purchase dataframe’ that includes the person ID, the purchase date and the food or drink category purchased. I then convert this DataFrame to a ‘list of lists’ where each list represents the components of a single transaction from an individual person on a single day.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Example purchase data
purchase_data = {
    "person_id": [101, 101, 101, 202, 202, 202, 202,
                  202, 303, 303, 303, 404, 404, 404],
    "date": ["2024-01-01", "2024-01-01", "2024-01-01",
             "2024-01-02", "2024-01-02", "2024-01-03",
             "2024-01-03", "2024-01-03", "2024-01-04",
             "2024-01-04", "2024-01-04", "2024-01-05",
             "2024-01-05", "2024-01-05"],
    "Combined category": ["Pizza", "Soft Drink", "Salad",
                          "Burger", "Fries", "Pizza",
                          "Chicken Wings", "Soft Drink",
                          "Pasta", "Wine", "Garlic Bread",
                          "Burger", "Soft Drink", "Ice Cream"],
}
purchase_df = pd.DataFrame(purchase_data)
# Create transactions from purchase data:
# one list of categories per person per day
transactions = (
    purchase_df
    .groupby(['person_id', 'date'])['Combined category']
    .apply(list)
    .tolist()
)
Converting transaction DataFrames into purchasing patterns (‘itemsets’)
In the next code snippet I convert the transactions to frequent ‘itemsets’ (i.e. the products bought together e.g. burger and fries) using the Apriori algorithm from the MLxtend library. An itemset is considered ‘frequent’ if it appears in a minimum percentage of all transactions (min_support). This level is user-defined so for the purposes of this tutorial I have set this to 0.2 (20% of transactions).
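To illustrate what `min_support` does, the sketch below counts item combinations by brute force using only the standard library. It is not the Apriori algorithm itself (Apriori reaches the same result but prunes candidates that cannot be frequent), and the function name is my own. The transactions are the five baskets that the dummy purchase data produces, and the threshold is raised to 0.4 purely to keep the output short.

```python
from itertools import combinations

# The five transactions produced from the dummy purchase data.
transactions = [
    ["Pizza", "Soft Drink", "Salad"],
    ["Burger", "Fries"],
    ["Pizza", "Chicken Wings", "Soft Drink"],
    ["Pasta", "Wine", "Garlic Bread"],
    ["Burger", "Soft Drink", "Ice Cream"],
]


def frequent_itemsets_bruteforce(transactions, min_support):
    """Count every item combination and keep those that meet min_support."""
    counts = {}
    for basket in transactions:
        unique_items = sorted(set(basket))
        for size in range(1, len(unique_items) + 1):
            for itemset in combinations(unique_items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    total = len(transactions)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }


frequent = frequent_itemsets_bruteforce(transactions, min_support=0.4)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
# ('Burger',) 0.4
# ('Pizza',) 0.4
# ('Pizza', 'Soft Drink') 0.4
# ('Soft Drink',) 0.6
```

With only five transactions, a threshold of 0.2 means an itemset needs to appear just once, so every combination that occurs in any basket counts as frequent.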
Once we have the frequent itemsets, we generate association rules from them.
Association rules tell us the way items are commonly grouped together in transactions.
The output of this process is a DataFrame of association rules including the following fields:
- Antecedents: The ‘primary’ item(s) implying the purchase of another item(s).
- Consequents: The item(s) implied to be purchased along with the antecedents.
- Support: The proportion of transactions that contain both the antecedent and the consequent.
- Confidence: The likelihood of the consequent being purchased when the antecedent is present.
- Lift: The ratio of the observed support to that expected if the antecedent and consequent were independent.
In the association rules function, we use metrics such as ‘lift’ to measure the strength of a rule (i.e. how strong the association is) so we can then set a minimum association threshold for the rules that appear (min_threshold).
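As a worked example of these metrics, the sketch below computes support, confidence and lift by hand for the rule Burger → Fries, using the five transactions from the dummy data. The `support` helper is my own and only mirrors the definitions above; it is not how MLxtend is implemented.

```python
# The five transactions produced from the dummy purchase data.
transactions = [
    ["Pizza", "Soft Drink", "Salad"],
    ["Burger", "Fries"],
    ["Pizza", "Chicken Wings", "Soft Drink"],
    ["Pasta", "Wine", "Garlic Bread"],
    ["Burger", "Soft Drink", "Ice Cream"],
]


def support(items):
    """Proportion of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(basket) for basket in transactions) / len(transactions)


antecedent, consequent = {"Burger"}, {"Fries"}

rule_support = support(antecedent | consequent)  # both bought together
confidence = rule_support / support(antecedent)  # P(Fries | Burger)
lift = confidence / support(consequent)          # relative to independence

print(f"support={rule_support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# support=0.20, confidence=0.50, lift=2.50
```

A lift of 2.5 means Fries appear 2.5 times as often in baskets containing a Burger as they do across all baskets, so this rule would pass a minimum lift threshold of 1.1.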
Once the association rules have been created we convert these into separate DataFrames for the nodes and edge information. These are then saved as CSV files to be imported into Gephi in the next section.
# Initialize transaction encoder
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
# Convert to DataFrame
df_te = pd.DataFrame(te_ary, columns=te.columns_)
# Apply Apriori algorithm with a minimum support threshold of 0.2
# (MLxtend's default is 0.5)
frequent_itemsets = apriori(df_te, min_support=0.2, use_colnames=True)
# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.1)
# Create DataFrame for nodes
items = set()
for itemset in frequent_itemsets['itemsets']:
    items.update(itemset)
nodes = pd.DataFrame(list(items), columns=["Id"])
# Create DataFrame for edges
edges = rules[['antecedents', 'consequents', 'support',
'confidence', 'lift']].copy()
# Sort the items so a multi-item set always produces the same label
edges['Source'] = edges['antecedents'].apply(lambda x: ', '.join(sorted(x)))
edges['Target'] = edges['consequents'].apply(lambda x: ', '.join(sorted(x)))
edges = edges.drop(columns=['antecedents', 'consequents'])
# Save files to CSV
nodes.to_csv("nodes.csv", index=False)
edges.to_csv("edges.csv", index=False)
Stage 2: Importing your data into Gephi
Now you have your node and edge files ready, let’s import them into Gephi so we can start visualising our networks! Follow these steps:
Open Gephi and Start a New Project
- Launch Gephi.
- Go to ‘File’ > ‘New Project’ to start a new project.
Import the Node File
- Select ‘File’ > ‘Import Spreadsheet’.
- Choose your ‘nodes.csv’ file and click open. Make sure ‘import as’ is set to ‘Nodes table’ before clicking ‘Next’.
- Check your node field is in the right format in the final import window, then click ‘Finish’.
- An ‘import report’ window should appear once the files are imported. Here you can set the graph type as directed, undirected or mixed. For the purchasing rules we need ‘directed’. Once complete click ‘Ok.’
Import the Edge File
- Repeat the above steps for your ‘edges.csv’ file. This time, ensure you select ‘Edges table’.
- Review and finish the import process.
Review your imported data
- After importing both files, you should see your nodes and edges listed in the ‘Data Laboratory’. (See below)
- This is a good place to check your data has been imported correctly, and that there is no missing information.
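Some of this checking can also be done in Python before importing. The sketch below, using a hypothetical helper called `find_unknown_endpoints`, lists any edge endpoints that are missing from the node table. The CSV text is a small stand-in for `nodes.csv` and `edges.csv`, with illustrative values. Rules with more than one item on a side produce combined labels such as "Burger, Soft Drink", which will not match a single-item `Id` in the node file.

```python
import csv
import io

# Stand-ins for nodes.csv and edges.csv; the contents are illustrative.
nodes_text = "Id\nBurger\nFries\nSoft Drink\n"
edges_text = (
    "support,confidence,lift,Source,Target\n"
    "0.2,0.5,2.5,Burger,Fries\n"
    '0.2,1.0,5.0,"Burger, Soft Drink",Ice Cream\n'
)


def find_unknown_endpoints(nodes_file, edges_file):
    """Return edge endpoints that do not appear in the node table's Id column."""
    node_ids = {row["Id"] for row in csv.DictReader(nodes_file)}
    unknown = set()
    for row in csv.DictReader(edges_file):
        for endpoint in (row["Source"], row["Target"]):
            if endpoint not in node_ids:
                unknown.add(endpoint)
    return sorted(unknown)


print(find_unknown_endpoints(io.StringIO(nodes_text), io.StringIO(edges_text)))
# ['Burger, Soft Drink', 'Ice Cream']
```

To run this against real files, the `io.StringIO(...)` arguments could be replaced with files opened via `open("nodes.csv", newline="")` and `open("edges.csv", newline="")`.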
Stage 3: Familiarising yourself with Gephi
Navigating Gephi
Overview of Gephi’s Main Areas
Gephi is broadly split into three main sections:
- Overview: This is where your network graph is displayed and where you’ll spend most of your time visualising and interacting with your network.
- Data Laboratory: The Data Laboratory gives an Excel-like view of the data you’ve imported. Here you can view and edit your node and edge tables.
- Preview: This tab lets you see how your network will look when it’s exported. You can play around with different settings to change the appearance of your network.
Stage 4: Generating your visualisation
Choosing a layout and visualising your network
After successfully importing your data into Gephi, the next step is to bring your network to life through visualisation. This starts with the selection and application of a layout algorithm.
What is a layout algorithm?
In network visualisation, a layout algorithm is a method for organising nodes and edges in a spatially coherent manner, with the aim of best representing the underlying structure of the network. Layout algorithms optimise the placement of nodes and edges to highlight key patterns, like clusters or the centrality of specific nodes.
Selecting and applying your layout algorithm
The choice of layout algorithm depends greatly on the nature of your network and the aspects you want to highlight in your visualisation. This tutorial from Gephi provides a useful introduction to different layouts and which to use for different network structures.
For the purpose of this tutorial, we will use the Fruchterman-Reingold layout.
Fruchterman-Reingold is a ‘force-directed’ layout algorithm. This means it simulates a physical system where nodes repel each other, and edges act like springs, pulling connected nodes closer. This results in a layout where nodes are spaced more evenly, and the structure of the network emerges naturally.
It’s particularly effective where the overall distribution and clustering of nodes are more important than the specific distances between them.
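For readers curious about what happens under the hood, here is a simplified sketch of the force-directed idea in plain Python: every pair of nodes repels, every edge attracts, and a cooling ‘temperature’ limits how far nodes move at each step. This is my own minimal illustration, not Gephi’s implementation; the frame size, iteration count, cooling rate and node names are assumptions, and Gephi’s settings such as ‘Gravity’ are not modelled.

```python
import math
import random


def fruchterman_reingold(nodes, edges, size=100.0, iterations=50, seed=0):
    """Return {node: (x, y)} after a basic force-directed layout."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(0, size), rng.uniform(0, size)] for n in nodes}
    k = math.sqrt(size * size / len(nodes))  # ideal distance between nodes
    temperature = size / 10

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion between every pair of nodes: strength k^2 / distance.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[a][0] += dx / dist * force
                disp[a][1] += dy / dist * force
                disp[b][0] -= dx / dist * force
                disp[b][1] -= dy / dist * force

        # Attraction along each edge: strength distance^2 / k.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[a][0] -= dx / dist * force
            disp[a][1] -= dy / dist * force
            disp[b][0] += dx / dist * force
            disp[b][1] += dy / dist * force

        # Move each node, capped by the temperature, and keep it in the frame.
        for n in nodes:
            length = math.hypot(*disp[n]) or 1e-9
            step = min(length, temperature)
            pos[n][0] = min(size, max(0.0, pos[n][0] + disp[n][0] / length * step))
            pos[n][1] = min(size, max(0.0, pos[n][1] + disp[n][1] / length * step))

        temperature *= 0.95  # cool down so the layout settles

    return {n: tuple(p) for n, p in pos.items()}


nodes = ["Burger", "Fries", "Soft Drink", "Pizza"]
edges = [("Burger", "Fries"), ("Burger", "Soft Drink"), ("Pizza", "Soft Drink")]
layout = fruchterman_reingold(nodes, edges)
for name, (x, y) in layout.items():
    print(f"{name}: ({x:.1f}, {y:.1f})")
```

Because the starting positions are random, different seeds give different but structurally similar layouts, which is also why re-running a layout in Gephi rarely produces exactly the same picture twice.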
Select the Fruchterman-Reingold Algorithm
- First ensure that you have selected the edges file to apply the algorithm to.
- Navigate to the ‘Layout’ panel on the left side of the screen in the Overview tab.
- From the list of available algorithms, choose ‘Fruchterman-Reingold’.
Configure Algorithm Settings
- Before running the algorithm, you have the option to adjust its settings.
- For the purpose of this tutorial, start with the default settings and then experiment with adjustments like increasing or decreasing the ‘Gravity’ to see how it affects the node distribution.
Run the Algorithm
- Click the ‘Run’ button to start the Fruchterman-Reingold algorithm.
- You will see your nodes begin to move and reorganise. This process can take a few seconds to several minutes, depending on the size and complexity of your network.
Observe and Adjust
- Watch how the nodes and edges arrange themselves.
- If needed, pause the algorithm to tweak settings and then resume to see the effect of your changes. To check the results, go to the ‘Preview’ panel, make sure the edge file is selected and press the ‘Refresh’ button.
Finalise the Layout
- Once the nodes have settled into a stable configuration, click the ‘Stop’ button to finalise the layout.
Save Your Layout
- It’s a good practice to save your work at this stage. Go to ‘File’ > ‘Save’ to save your project.
Stage 5: Customising your visualisation
After applying a layout algorithm and achieving a stable configuration of your network, the next step is to customise your visualisation.
Gephi has a vast number of options for refining your network. For the purpose of this tutorial we will look at:
- Adding text and labels
- Utilising various metrics and attributes
- Highlighting clusters
- Adjusting backgrounds and colours
Customising nodes and adding labels
- Click on the ‘Data Laboratory’ panel and select the edge file. Navigate to the data table field on the left and select ‘node’ (these are the nodes generated from the edge file). In the node table you will see an empty field next to Id called ‘Label.’
- Click ‘copy data to another column’, select ‘Id’, then ‘copy to label’.
- Navigate to the ‘Preview’ settings and select ‘show labels’ under ‘node labels’. Here, you can also modify the ‘Label Font’ and ‘Label Size’. Once you are finished, refresh the preview pane to see the results.
- To customise the nodes, go to the nodes section under preview settings. Here you can change things like the transparency and colour of the nodes. To adjust the size of the nodes, navigate back to the ‘Overview’ panel, select appearance > nodes then select the size icon in the top right window. Select ‘unique’ and adjust the size from the drop down. Press ‘Apply’ before navigating back to the preview panel.
Adjusting colours based on metrics
- In the ‘Appearance’ panel, select ‘Edges’ and then ‘Ranking’ > ‘Attribute’ and choose your metric (e.g., Lift).
- This gives you the option to create a colour scale based on your metric of choice (lift in this case). Select ‘Apply’ and return to the preview panel.
- Under preview settings ensure ‘Edges > Colour’ is set to ‘original’ before refreshing to see the results.
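Conceptually, a ranking maps each value of the chosen metric onto a position along a colour gradient. The sketch below shows one simple way to do this, by linear interpolation between two RGB colours. The function name, the two colours and the lift values are made up for illustration, and Gephi’s own gradient options may behave differently.

```python
def rank_to_colour(value, low, high, start_rgb=(255, 255, 204), end_rgb=(189, 0, 38)):
    """Linearly interpolate an RGB colour for `value` within [low, high]."""
    if high == low:
        fraction = 0.0
    else:
        fraction = (value - low) / (high - low)
    fraction = min(1.0, max(0.0, fraction))  # clamp values outside the range
    return tuple(
        round(start + (end - start) * fraction)
        for start, end in zip(start_rgb, end_rgb)
    )


lifts = [1.25, 2.5, 5.0]  # made-up lift values for three edges
for lift in lifts:
    print(lift, rank_to_colour(lift, min(lifts), max(lifts)))
```

The edge with the lowest lift receives the start colour and the edge with the highest lift receives the end colour, with everything else shaded in between.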
Adding clusters to your visualisation
Modularity-Based Clustering (MBC) is a method used in network analysis to identify communities or neighbourhoods within a network. MBC calculates ‘modularity,’ a metric that quantifies how strongly a network divides into clusters.
For example, a network with high modularity would have strong connections between nodes within a cluster, but weak connections between nodes in different clusters.
- Run the ‘Modularity’ statistic in the ‘Statistics’ panel to identify clusters within your network.
- In the ‘Appearance’ panel, choose ‘Nodes’, select ‘Partition’ > ‘Modularity Class’, and apply distinct colours to each modularity class.
- This will colour-code your nodes based on their cluster, visually separating different communities in your network.
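To show what the modularity score measures, the sketch below computes it for a given partition of a small undirected, unweighted graph. It only scores a partition that you supply; it does not search for the best partition, which is the job of a community detection algorithm such as the one behind Gephi’s ‘Modularity’ statistic. The graph and node names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of an undirected, unweighted graph for a given partition.

    Q = sum over communities of (internal_edges / m) - (total_degree / 2m)^2,
    where m is the number of edges.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both ends in the community
    degree = defaultdict(int)    # sum of node degrees per community
    for a, b in edges:
        degree[community_of[a]] += 1
        degree[community_of[b]] += 1
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += 1
    return sum(
        internal[c] / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Two triangles joined by a single bridge edge (hypothetical node names).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}

print(round(modularity(edges, two_clusters), 3))  # 0.357
print(round(modularity(edges, one_cluster), 3))   # 0.0
```

Splitting the graph into its two triangles scores higher than treating everything as one cluster, which matches the intuition that most connections fall within the triangles rather than between them.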
Changing the background colour
- Switch to the ‘Preview’ tab in Gephi.
- Locate the ‘Background’ option in the left-hand settings panel.
- Click on the colour box next to ‘Background’ to choose a new colour.
Changing to a darker background allows the node labels and the edges with larger lift values to stand out better. Zooming in also helps to highlight some of the more central and better connected nodes such as cola, burgers and meal sides. These make sense as food and drink categories that would likely be purchased with other items as part of a meal.
Stage 6: Saving and exporting your visualisation
When you are happy with your customised visualisation, save your project and export your visualisation image.
Saving your project
- Go to ‘File’ in the top menu.
- Choose ‘Save’ or ‘Save As’ to save your project. Gephi projects are saved in .gephi format, which allows you to reopen and edit them later.
Exporting Your Network
- Navigate to the ‘Preview’ tab.
- Click on the ‘Export’ button and choose your preferred format (e.g. PNG, PDF).
Limitations of Gephi and useful resources
In this tutorial, we’ve covered the steps to visualise a network using Gephi, from installing the software and preparing your data to applying layout algorithms and customising your visualisation.
Limitations
While Gephi is a powerful tool, it does come with some limitations:
- Integration: As a standalone application, Gephi doesn’t integrate easily with other tools and workflows. This can be restrictive when trying to reproduce networks remotely or to integrate with data pipelines built in Python. At the end of this tutorial I suggest a couple of Python libraries as alternatives.
- Steep Learning Curve: For beginners, especially those not familiar with network theory, the learning curve can be quite steep. Understanding the nuances of its many features takes time and practice.
Useful Resources
- Gephi Documentation
- A tutorial on performing market basket analysis and Apriori in Python
- Mlxtend Python library
Other tools for visualising networks (but using Python):
I hope you found this tutorial useful, good luck making beautiful and interesting network visualisations!