A Gentle Introduction to Networkx with Python
As we know, networks are in several fields, like biology, computer science and even social sciences.
So, networks help us to understand and describe better the world, and why not, they are useful also to infer informations that we don’t know yet.
One of the most powerful tools to manage networks in Python is networkx. So, move on to see some commands.
First of all we need to import the library and then to choose which type of network we want to build:
- Graph: undirected network
- DiGraph: directed network
- MultiGraph: undirected network with self loops and parallel edges
Each type of graph will have different properties and operations available.
For instance we try to instanciate an undirected graph:
Now to give life to the network we need to add nodes and edges manually or starting from an existing dataset.
Adding information manually:
The simplest (and also boring) way to add node and attribute is shown below, where we are adding them one by one.
As we see, there is the possibility to add a node individually or directly an edge (so two nodes linked). When we add an edge to the network we can attach them some attributes. For instance, we can consider a social network where edges attributes could be years of friendship or circle of friends.
Adding information from a file:
When we have to deal with huge amount of data it is most common that we build a network starting from a dataset. For example, if we have a text file with nodes id values, networkx understand that couples of nodes will form the graph.
Basing on this dataset:
We can build and give a representation of the network in this way:
This is what we should see:
Now we can see some importat properties of a network and how we can extract information from it.
Distance
Sometimes is useful to know the the shortest path between two nodes, we can use the function shortest_path(). Other functtions are:
- shortest_path_length(): distance of the shortest path.
- average_shortest_path_length(): average distance between every pair of nodes.
- diameter(): maximum between every pair of nodes.
- eccentricity(): largest distance between a node and all the other nodes.
- periphery(): set of nodes that have eccentricity equal to the diameter.
- center(): set of nodes that have eccetricity equal to the radius (minimum eccentricity).
Clustering
The Clustering is the tendency for nodes in a network to become connected. Therefore, this allows us to understand what new connections can will be between the nodes of a network.
We can define two types of clustering:
- Local Clustering Coefficient of a node: measure of how neighbours of a node tend to form a complete graph
- Global Clustering Coefficient (Transitivity): ratio between the number of closed triangles and the total number of triangles. This measure the the clustering of the whole network.
Network Centrality
There are some measures that identify the most important nodes in the network. This property can be applied in various fields, we can think for example at telecommunications networks or computer networks, it is important to identify the important nodes for network optimizations.
Among the important metrics we must consider:
- Degree centrality: number of links incident upon a node. So, more links a node has, the more important it is.
- Closeness centrality: the average length of the shortest path between the node and all other nodes in the graph. So, important nodes are close to other nodes.
- Betweenness centrality: the number of times a node acts as a bridge along the shortest path between two other nodes. So, important nodes connect other nodes.
- Eigenvector centrality: influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes
- Percolation centrality: this measure differs from the others because it focuses more on the state of a node in the network, rather than on the topology. This is very important in dynamic networks that describe the spread of viruses or news in social networks. It is defined for a given node, at a given time, as the proportion of ‘percolated paths’ that go through that node. A ‘percolated path’ is a shortest path between a pair of nodes, where the source node is percolated (e.g., infected). The target node can be percolated or non-percolated, or in a partially percolated state. The state must be specified with a dictionary argument of the function, where nodes are keys and states are values.
Measures for link prediction
In a network it is important to analyze the relationship that exists between two nodes, especially if then you want to predict new connections in the network. Some of the metrics capable of compare pairs of nodes are:
- Preferential attachment: more connected a node is, the more likely it is to receive new links. Since nodes with high degree get more neighbors, this metric is the product of the nodes’ degree.
- Common neighbors: captures the idea that two strangers who have a friend in common are more likely to be introduced than those who don’t have any friends in common. So this metric simply indicates the number of nodes in common between two nodes.
- Jaccard Coefficient: following the same principle as the common neighbors, it is the ratio between the common neighbors of the two nodes and all the neighbors of the two nodes.
- Resource Allocation Index: fraction of information that a node can send to another node through their common neighbors. So if two nodes have nodes in common with a high degree, there will be a low ROI, vice versa if the neighbors have a low degree.
I hope this introduction to network analysis could be helpful, especially for who is at the beginning.
References:
- Network Centrality, Wikipedia
- D. Liben-Nowell, J. Kleinberg. The Link Prediction Problem for Social Networks (2004). http://www.cs.cornell.edu/home/kleinber/link-pred.pdf