Practical Data Science: Graph and network processing

Gritinai
5 min readAug 28, 2022

--

Networks and graph

Networks are the systems of interrelated objects (in the real world) while Graphs are the mathematical model for representing networks, in data science we use these algorithms to answer questions about networks.

Graphs models

A graph is a collection of vertices (nodes) and edges 𝐺 = (𝑉, 𝐸)

𝑉 = {𝐴, 𝐵, 𝐶,𝐷, 𝐸, 𝐹 }

𝐸 = {(𝐴, 𝐵 ), (𝐴, 𝐶 ), (𝐵, 𝐶 ), (𝐶,𝐷 ), (𝐷, 𝐸 ), (𝐷, 𝐹) , (𝐸, F)}

there are six nodes, labeled A through F, and with edges corresponding to the lines between these nodes in the figure. There are many different properties of graphs, let’s focus on a few of the more common distinction between different types of graphs.

Directed vs. undirected graphs

Directed are represented with arrows showing direction, while the undirected does not.

Weighted vs. unweighted graphs

For weighted edges between nodes have some real-valued weight associated with them, while for unweighted all edges are equal.

Representations of graphs

There are a few different ways that graphs can be represented in a program, which one you choose depends on your use case

E.g., are you going to be modifying the graph dynamically (adding/removing nodes/edges), just analyzing a static graph, etc?

Three main types we will consider:

  1. Adjacency list
  2. Adjacency dictionary
  3. Adjacency matrix

Adjacency list

For each node, let’s store an array of the nodes that it connects to. To do this as a “pure” list, we first need to associate each node in the graph with an index, which we can do with a simple list.

nodes = ["A", "B", "C", "D"]
print(nodes)

we can also create a dictionary of nodes to indices, like so.

nodes_dict = {"A":0, "B":1, "C":2, "D":3}
nodes_dict = {k:i for i,k in enumerate(nodes)} # same as the above, #done programmatically
print(nodes_dict)

Pros: easy to get all outgoing links from a given node, fast to add new edges (without checking for duplicates)

Cons: deleting edges or checking existence of an edge requires scan through given node’s full adjacency array

Adjacency dictionary

For each node, let’s store a dictionary of the nodes that it connects to. Let’s see how to represent the above graph, here using weights but explicitly setting them all to one.

adj_dict = {"A": {"B":1.0}, "B":{"C":1.0}, "C":{"A":1.0, "D":1.0}, "D":{}}

this representation encodes the following facts:

  • adj_list["A"] = {"B":1.0} means that node “A” has a link to node “B” (e.g. with weight 1.0)
  • adj_list["B"] = {"C":1.0} means that node “B” has a link to node “C”
  • adj_list["C"] = {"A":1.0, "D":1.0} means that node “C” has links to nodes “A” and “D”
  • adj_list["D"] = {} means that node “D” has no outgoing links

Pros: easy to add/remove/query edges (requires two dictionary lookups, so a 𝑂(1) operation)

Cons: overhead of using a dictionary over array

Adjacency matrix

It stores the connectivity of the graph as a matrix

nodes = ["A", "B", "C", "D"]
nodes_dict = {k:i for i,k in enumerate(nodes)}

Graph algorithms

This write up is an inadequate setting to properly discuss graph algorithms, which is essentially an entire subfield of computer science in and of itself. Lets list out some of them here and provide links for further reading.

  1. Dijkstra’s algorithm for computing single-source shortest path(Dijskstra’s algorithm)
  2. The PageRank algorithm for computing node importance(the algorithm that started Google Page et al., “The PageRank Citation Ranking: Bringing Order to the Web.” , the PageRank paper)
  3. The Girvan-Newman algorithm for community detection(published in 2002, one of the first methods of “modern” community detection Girvan and Newman, “Community structure in social and biological networks” , original source for Girvan-Newman algorithm)

Graph libraries

NetworkX

https://networkx.github.io/

NetworkX: Python library for dealing with (medium-sized) graphs. A simple Python interface for constructing graph, querying information about the graph, and running a large suite of algorithms but not suitable for very large graphs (all native Python, using adjacency dictionary representation)

Creating graphs

Creating an undirected or directed graph

import networkx as nx 
G = nx.Graph() # undirected graph
G = nx.DiGraph() # directed graph

we can remove nodes using the .remove_node() or .remove_nodes_from() calls. A list of nodes can be given by the .nodes() call.

# add and remove edges 
G.add_edges_from([("A","B"), ("B","C"), ("C","A"), ("C","D")]) G.remove_edge("A","B") G.add_edge("A","B") G.remove_edges_from([("A","B"), ("B","C")]) G.add_edges_from([("A","B"), ("B","C")])
# also add_node(), remove_node(), add_nodes_form(), #remove_nodes_from()

Nodes/edges and properties

NetworkX uses adjacency dictionary format internally to represent the graph. We can access these elements by just indexing into the graph object itself.

print G["C"] 
# {'A': {}, 'D': {}}

To iterate over nodes and edges we can do the following

for i in G.nodes(): # loop over nodes 
print i
for i,j in G.edges(): # loop over edges
print i,j

To get and set node/edge properties we can do this

G.node["A"]["node_property"] = "node_value" 
G.edge["A"]["B"]["edge_property"] = "edge_value"
G.nodes(data=True) # iterator over nodes returning properties G.edges(data=True) # iterator over edges returning properties

Drawing and node properties

NetworkX has some built in routines for drawing graphs, though generally speaking they do not produce the most visually compelling output.

Here let’s draw a graph using matplotlib (not the best visualization)

import matplotlib.pyplot as plt 
%matplotlib inline
nx.draw(G,with_labels=True)
plt.savefig("mpl_graph.pdf")

Algorithms

NetworkX has a large number of graph algorithms built in. A full list is here: NetworkX algorithms. Let’s look at just the algorithms we have described so far.

  1. considering the shortest path algorithm
G = nx.Graph()
G.add_edges_from([("A","B"), ("B","C"), ("C","A"), ("C","D")])
print(nx.shortest_path_length(G, source="A"))
print(nx.shortest_path(G, source="A"))

2. The PageRank implementation takes an alpha parameters which is 1−d is our notation.

G = nx.DiGraph()
G.add_edges_from([("A","B"), ("B","C"), ("C","A"), ("C","D")])
nx.pagerank(G, alpha=0.9)

3. Girvan-Newman

G = nx.Graph()
G.add_edges_from([(1,2), (1,3), (2,3), (4,5), (4,6), (5,6),
(9,10), (9,11), (10,11), (12,13), (12,14), (13,14),
(3,7), (6,7), (7,8), (8,9), (8,12)])
communities = nx.community.girvan_newman(G)
list(communities)

Thank you!!

Do like, comment and share

Check out www.gritinai.com

--

--

Gritinai

GritinAI manages companies database and bridge the gap between businesses and their customers, also conducting training's, building products and mentoring.