Computing Connected Components in Graphs using SQL & Python

15 min readNov 21, 2023

Introduction to Graphs

Before we dive into connected components, let’s talk about graphs. A graph is a data structure that represents connections or relationships between different entities. Graphs play an integral role in our daily lives, both as individuals and in businesses. From social networks and transportation systems to communication networks and project management, graphs provide a powerful framework for understanding and modeling diverse relationships.

In a graph, we have two main parts: nodes and edges. Nodes (or vertices) are like points that represent individual things, and edges (or links) are the connections/relationships between these things. Think of nodes as dots and edges as lines connecting these dots.

Types of Graphs

Now, let’s talk about the different types of graphs. There are two main types: directed and undirected graphs.

Directed Graphs: In a directed graph, edges have a direction. It means they go from one node to another, not the other way around. Think of it like a one-way street. Moreover, you can have two directed edges (in opposite directions) between two nodes allowing a two-way communication.
Undirected Graphs: In an undirected graph, edges don’t have a direction. The connection goes both ways, like a two-way street.

Understanding these types helps us describe how things are connected in various scenarios.

Examples of Directed and Undirected Graph

Bipartite and Multipartite Graphs

Moving on, there are special categories of graphs called bipartite and multipartite graphs.

Bipartite Graphs: In a bipartite graph, nodes can be split into two groups, and edges only connect nodes from the different groups. It’s like a social gathering where everyone is either a part of group A or group B, and connections only happen between the two groups. To make this clearer, let’s look at an example of a bipartite graph in real life — a movie cast and the roles they play. Actors form one group, roles another, and edges connect actors to the roles they portray.

Multipartite Graphs: Multipartite graphs take this idea further. Nodes can be divided into more than two groups, and edges connect nodes from different groups.

Connected Components in Graphs

Connected components are groups of nodes that are closely linked within a graph. In a connected component, each node is directly or indirectly connected to every other node in the same group. It’s like a little community of nodes that stick together.

Understanding connected components helps us see the structure of a graph. It allows us to find groups of nodes that are strongly connected, showing us patterns and relationships — from understanding friendships on social media to planning efficient transportation routes.

Stay with us as we explore the various ways of finding/computing connected components in graphs.

*Graph showing three connected components*

Analyzing or Finding Connected Components in Large-Scale Graphs

While identifying connected components in a small graph is straightforward, it becomes computationally challenging on a large scale. Here, we’ll explore some methods and techniques used to compute connected components in extensive systems.

1. Graph Traversal Algorithms:

Depth-First Search (DFS): It’s a graph traversal algorithm that starts at a chosen node, explores as far as possible along each branch before backtracking, and continues until all nodes are visited. DFS is like searching for a lost item in your house by looking deep into one room, then backtracking and trying another room until you find it.

Breadth-First Search (BFS): This graph traversal algorithm begins at a selected node, explores all neighbours at the current level before moving to the next level, and repeats until all nodes are covered. Imagine exploring a building floor by floor, checking each room before moving to the next level.

For the purpose of finding connected components in a graph, DFS is highly advisable if memory is a concern.

2. NetworkX Built-in Algorithm: NetworkX is a Python library designed for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks (or graphs). It provides a variety of built-in algorithms to perform common graph-related tasks efficiently including a dedicated function for computing connected components.

3. Using SQL (Recursive and Non-Recursive): This involves creating a SQL query that walks through a graph table to identify connected nodes. The non-recursive SQL approach typically employs a series of table creation to traverse connections, while the recursive SQL utilizes a recursive common table expression (CTE) to explore connected nodes iteratively. This method (both recursive and non-recursive) is suitable for scenarios where the graph data is stored in a relational database, allowing for efficient querying and analysis.

4. Using Graph Database Query: Using a Graph Database Query involves leveraging a graph database system like Neo4j or Amazon Neptune to compute connected components efficiently. In graph databases, queries are expressed using graph query languages like Cypher (for Neo4j). The query typically involves traversing relationships or paths in the graph to identify nodes that form connected components, offering a powerful and optimized approach for analyzing complex relationships within the data. This method is particularly beneficial for scenarios where data relationships are best represented and queried in a graph structure.

These methods offer diverse ways to compute connected components, catering to different needs and preferences. Choose the one that aligns best with your application’s requirements and infrastructure.

Practical Example

Now, we will be exploring practical steps in finding connected components in graphs. First, we will be generating a sample undirected bipartite graph using the NetworkX Python library. Subsequently, we will be exploring the generated graph, transforming it and storing it in different storage systems.

Tools/Technologies Requirement:

Python 3
NetworkX, Pandas and SqlAlchemy Python libraries
Postgres Database
Neo4j Graph Database & Graph Data Science Library

Generating Sample Graph

Step 1: Importing Libraries

import networkx as nx
from networkx.algorithms import bipartite
import random
import matplotlib.pyplot as plt
import pandas as pd
from sqlalchemy import create_engine

If you don’t have the networkx library in your Python environment, you can run the code below (using pip) on your terminal to install. Of course, you can also do the same for other libraries that require installation.

pip install networkx

Here, we import necessary libraries, including NetworkX for graph operations, random for generating random edges, matplotlib for visualization, pandas for data manipulation and sqlalchemy for storing data into a relational database.

Step 2: Generating an Undirected Bipartite Graph

# Generate bipartite graph with required number of components
def generate_bipartite_graph_with_components(num_nodes_per_component_per_set, num_edges_per_component, remove_isolated_nodes=False):
    G = nx.Graph()
    least_num_components = 3
    for component in range(least_num_components):
        # Create two sets of nodes for each component
        set_A = set(f'A_{component}_{i}' for i in range(num_nodes_per_component_per_set))
        set_B = set(f'B_{component}_{i}' for i in range(num_nodes_per_component_per_set))
        # Add nodes to the graph
        G.add_nodes_from(set_A, bipartite=0)
        G.add_nodes_from(set_B, bipartite=1)
        # Create random edges within each component
        for _ in range(num_edges_per_component):
            node_a = random.choice(list(set_A))
            node_b = random.choice(list(set_B))
            G.add_edge(node_a, node_b)
    if remove_isolated_nodes:
        # remove isolated nodes. This will prevent disconnected sets
        G.remove_nodes_from(list(nx.isolates(G)))
    return G


# Example usage:
num_nodes_per_component = 3
num_edges_per_component = 5

bipartite_graph = generate_bipartite_graph_with_components(num_nodes_per_component, num_edges_per_component, remove_isolated_nodes=False)

# top_nodes represent the first set in the bipartite graph
top_nodes = {n for n,d in bipartite_graph.nodes(data=True) if d['bipartite']==0}

# top_nodes represent the first set in the bipartite graph
bottom_nodes = {n for n,d in bipartite_graph.nodes(data=True) if d['bipartite']==1}

This function, generate_bipartite_graph_with_components, creates a bipartite graph using the NetworkX Python library. The graph consists of multiple components, each containing two sets of nodes (bipartite sets), with edges randomly connecting nodes between the two sets within each component. The number of nodes per set, the number of edges per component, and an option to remove isolated nodes are customizable parameters.

The function initializes an empty graph (G) and iterates over a specified number of components, creating bipartite sets (set_A and set_B) for each component. Nodes are added to the graph with appropriate bipartite attributes. Random edges are then generated within each component, connecting nodes from set A to set B. Optionally, isolated nodes can be removed to ensure connected bipartite sets.

The example usage demonstrates how to call the function and extract the top and bottom nodes of the resulting bipartite graph, distinguishing between the two sets.

Step 3: Visualizing the Bipartite Graph

# Add colors to each set of nodes
color_map = []
for node in bipartite_graph:
    if node in top_nodes:
        color_map.append('skyblue')
    else: 
        color_map.append('orange')      


# Graph vizualization
pos = nx.bipartite_layout(bipartite_graph, top_nodes)
nx.draw(bipartite_graph, pos, node_color=color_map, node_size=200, font_color='#000', edge_color='#bbb', font_size=10, with_labels=True)
plt.show()

This code visualizes the bipartite graph using networkx and matplotlib Python libraries. Nodes from the two sets are assigned different colours, and the graph is displayed with labels.

Step 4: Analyzing and Transforming the Bipartite Graph

#--------------------------- Exploration ---------------------------

# Get all the nodes in the graph
print(bipartite_graph.nodes)

# Get all the edges in the graph
print(bipartite_graph.edges)

# Get isolated nodes
isolated_nodes = set(nx.isolates(bipartite_graph))

# Bipartite graph in a dataframe structure (excluding isolated nodes)
df_exluding_isolated_nodes = pd.DataFrame(bipartite_graph.edges)

# Get top nodes without link to any bottom nodes
isolated_top_nodes = top_nodes.intersection(isolated_nodes)

# Get bottom nodes without link to any top nodes
isolated_bottom_nodes = bottom_nodes.intersection(isolated_nodes)


#--------------------------- Transformation ---------------------------

# Converting graph to a dataframe: Tranforming graph edges to dataframe automatically excludes isolated edges
df_exluding_isolated_nodes = pd.DataFrame(bipartite_graph.edges, columns=['from','to'])
# alternative approach: df_exluding_isolated_nodes = nx.to_pandas_edgelist(bipartite_graph, source='from', target='to')

# Re-introducing the isolated nodes back to the Pandas DataFrame
isolated_top_nodes_df = pd.DataFrame([[t, None] for t in isolated_top_nodes])
isolated_bottom_nodes_df = pd.DataFrame([[None, t] for t in isolated_bottom_nodes])
df_including_isolated_nodes = pd.concat([df_exluding_isolated_nodes, isolated_top_nodes_df, isolated_bottom_nodes_df])
df_including_isolated_nodes.reset_index(inplace=True, drop=True)

# Converts the graph to a dictionary of lists where each key represents unique nodes and values are its adjacent nodes
dict_graph = nx.to_dict_of_lists(bipartite_graph)

Here, we explore basic graph information, identify isolated nodes, and represent the undirected bipartite graph in a dictionary and DataFrame structures (excluding and including isolated nodes).

Step 5: Storing Graph Data in a Relational Database (PostgreSQL)

# Storing graph data in a PostgreSQL database using SQLAlchemy
# Define your PostgreSQL database connection string
# Replace with your database credentials, database name, and other connection details
username = 'your_username'
password = 'your_password'
database_name = 'your_database_name'
host = 'your_database_host'
db_connection_str = f'postgresql://{username}:{password}@{host}/{database_name}'

# Create a SQLAlchemy engine to connect to the database
engine = create_engine(db_connection_str)

# Store the DataFrame in the PostgreSQL database
table_name = 'graph'
df_including_isolated_nodes.to_sql(table_name, engine, if_exists='replace', index=False)
print(f'DataFrame stored in PostgreSQL table: {table_name}')

This step demonstrates how to store the bipartite graph data in a PostgreSQL database using the SQLAlchemy library. ElephantSQL has a free plan for their managed Postgres database. You can quickly spin up a Postgres database to try out this exercise. With few modifications, the same code can be adapted to store the graph data in other relational databases.

Step 6: Storing Graph Data in a Neo4j

To facilitate learning, we will utilize the Neo4j sandbox which remains active for 3 days as at the time of creating this tutorial. You can access the sandbox platform directly here or refer to this link for instructions on setting up a Neo4j sandbox. After setup, you should see a page similar to the image below. Expand the “Blank Sandbox” row and click the “Connection details” tab to retrieve the information necessary to establish a connection from Python.

Run the code below in Python to load the generated graph above to Neo4j. Please remember that the “top_nodes” and “bottom_nodes” objects used in the code have been created in the “Generating Graph” section. Also note that the literal strings in the “run” function below are Cypher queries native to Neo4j.

Warning: The first line in the create_graph function will wipe all data in your graph. Do not run if you want to keep any existing graph.

# Install neo4j python driver
pip install neo4j

from neo4j import GraphDatabase, basic_auth

# Function to insert nodes and relationships into Neo4j
def create_graph(tx, nodes_type_A, nodes_type_B, edges):

    # Warning: delete existing nodes and relationships 
    tx.run("MATCH (n) DETACH DELETE n")

    # insert nodes in the first set
    for node in nodes_type_A:
        tx.run("CREATE (:SetA {id: $node})", node=node)

    # insert nodes in the second set
    for node in nodes_type_B:
        tx.run("CREATE (:SetB {id: $node})", node=node)

    # add relationships in both directions to represent an undirected edge
    for edge in edges:
        tx.run("MATCH (a:SetA {id: $start}), (b:SetB {id: $end}) " 
               "CREATE (a)-[:CONNECTED]->(b)"
               "CREATE (b)-[:CONNECTED]->(a)", start=edge[0], end=edge[1])


NEO4J_URI = "bolt://54.237.152.8:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "maintainability-utility-swells"

# Create a Neo4j session and insert data
with GraphDatabase.driver(NEO4J_URI, auth=basic_auth(NEO4J_USERNAME, NEO4J_PASSWORD)) as driver:
    with driver.session() as session:
        session.execute_write(create_graph, top_nodes, bottom_nodes, bipartite_graph.edges)

Finding Connected Components in Neo4j (Graph Database)

There are two ways to approach finding connected components in Neo4j.

The first, and more straightforward, involves locating all nodes connected to a specific node. The following Cypher query illustrates the process of identifying all nodes linked (either directly or indirectly) to the node with the id “A_1_1”.

MATCH ({id : 'A_1_1'})-[*]-(connected)
RETURN connected

Cypher Query and Graph Output Illustration Showing Connected Nodes to Node A_1_1

The alternative approach aims to recognize all connected components within the graph. To perform this exercise, the Neo4j Graph Data Science library will be required. For further insights into the library, you can refer to this link. Additionally, implementation details for discovering connected components can be accessed here. Essentially, there are two key steps in achieving this objective. The first thing is to write a Cypher query to “project” our graph under the name “bipartite_graph” (Before we can apply any algorithm in the Neo4j GDS library, the graph to run on would have to be projected under a name).

Please note that projected graphs reside in memory (Graph catalog). While this might boost performance, carefully consider this when making your architectural choice.

CALL gds.graph.project(
  'bipartite_graph',
  ['SetA','SetB'],
  'CONNECTED'
)

Followed by executing the DGS wcc algorithm (Weakly Connected Components) by running the Cypher Query below. Note the use of “bipartite_graph” defined in the previous Cypher query.

CALL gds.wcc.stream('bipartite_graph')
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).id AS name, componentId
ORDER BY componentId, name

Cypher Query to Generate Connected Components in Graph

Finding Connected Components in Graph: Using NetworkX Function

# Using the built-in connected components function
components = list(nx.connected_components(bipartite_graph))

# Displaying the result
print("Connected Components (NetworkX):", components)

This is a straightforward application of the Networkx library for finding connected components by passing the bipartite_graph object produced in Step 2 of the Generating Graph section. Also, this will require memory for computation.

Finding Connected Components in Graph: Using Depth First Search (DFS) Algorithm

# Applying Depth First Search algorithm in finding connected components in a graph
def dfs(graph, start, visited, component):
    visited[start] = True
    component.append(start)
    for neighbor in graph[start]:
        if not visited[neighbor]:
            dfs(graph, neighbor, visited, component)

# dict_graph object was created in Step 4 of Generating Graph section
visited = {node: False for node in dict_graph}
components = []
for node in dict_graph:
    if not visited[node]:
        component = []
        dfs(dict_graph, node, visited, component)
        components.append(component)
print("Connected Components (DFS):", components)

We applied the DFS algorithm implemented above on the dict_graph object created in Step 4 of the Generating Graph section to output the list of connected nodes.

Finding Connected Components in Graph: Using Recursive SQL

This approach involves the use of recursion in SQL. If you are not familiar with this concept, I’ll advise you to watch this tutorial by Thoufiq.

The code is based on the “graph” table (Postgres) created in Step 5 of the “Generating Graph” section.

-- Codify the nodes for subsequent operations
drop table if exists coded_nodes;

create table coded_nodes as
(
select node_label, row_number() over(order by node_label) node_num
from
(
select "from" as node_label from graph where "from" is not null
union
select "to" as node_label from graph where "to" is not null
) as nn
);

drop table if exists coded_graph;

create table coded_graph as
(
select 
g."from" as original_from, g."to" as original_to, 
c.node_num as "from", d.node_num as "to"
from graph as g
left join coded_nodes as c
on g."from"=c.node_label
left join coded_nodes as d
on g."to"=d.node_label
);

-- A table representing undirected graph by adding a duplicate edges in flip directions
drop table if exists undirected_graph;

create table undirected_graph as
(
select "from" as from, "to" as to
from coded_graph
union
select "to" as from, "from" as to
from coded_graph
);

-- Table based on recursive sql that walks from each node to its adjancent nodes
drop table if exists walks;

create table walks as
(
with recursive walks as (
select node_num as node, node_num as front
from coded_nodes
union -- prevents duplicate front nodes will be included
select w.node, g.to
from walks as w
inner join undirected_graph as g 
on w.front = g.from
)
select * from walks
);

-- Similar to the query above with additional column named lvl that shows the level of walk before reaching a node
-- lvl starts from 0, which represents self referencing nodes
drop table if exists walks_with_levels;

create table walks_with_levels as
(
with recursive walks_with_levels as 
(
select node_num as node, node_num as front, 0 as lvl
from coded_nodes
union 
select w.node, g.to, w.lvl+1 as lvl
from walks_with_levels as w
inner join undirected_graph as g 
on w.front = g.from
where w.lvl<10
)
select * from
(
select distinct on (node, front) *
from walks_with_levels
order by node, front, lvl
) as tbl
order by lvl, node
);

-- Use the minimum "front" for each "node" as components

drop table if exists components;

create table components as
(
with comp_base as 
(
select node, min(front) as component
from walks
group by node
order by node
)
select node, 'C' || dense_rank() over(order by component) as comp
from comp_base
);

-- Identify subgraphs based on the components
drop table if exists subgraphs;

create table subgraphs as
(
select distinct c.comp, x.from, x.to
from components as c
left join undirected_graph as x
on c.node in (x.from, x.to)
);

select * from components

The initial step involves constructing a table containing distinct nodes found in the graph. Given that the undirected graph lacks specific edge directions, it is represented by duplicating rows in the graph table with inverted nodes. Subsequently, both the nodes and undirected graph tables are employed in a recursive query, creating the walk table to trace connectivity. This process can be executed with or without considering the depth of the walk (using the “lvl” field). When the depth of walk is included in the query, a “where” clause is added to prevent an endless loop. Identification of components is achieved by grouping the node column in the walk table and determining the minimum front. Lastly, subgraphs are generated for each component.

Finding Connected Components in Graph: Using Non-Recursive SQL

The non-recursive approach provides a cover from the memory-intensive nature of the recursive SQL. However, this is not without a cost. The non-recursive method involves a long laborious creation of tables (low memory, more storage required) for each step of walk or traversal through the graph.

--This table represents the original graph structure. It's the starting point for the analysis, capturing the initial state of the graph.
drop table if exists walk_0;

create table walk_0 as
(
select original_from, original_to, "from" as cur_from, "to" as cur_to
from coded_graph
);

-- This table is derived from walk_0
drop table if exists walk_1;

create table walk_1 as
(
SELECT 
original_from,
original_to,
(CASE WHEN tmp_from < cur_from THEN tmp_from ELSE cur_from END) AS cur_from,
(CASE WHEN tmp_to < cur_to THEN tmp_to ELSE cur_to END) AS cur_to
FROM
(
SELECT
original_from, 
original_to, 
cur_from, 
cur_to,
MIN(cur_from) OVER (PARTITION BY cur_to) AS tmp_to,
MIN(cur_to) OVER (PARTITION BY cur_from) AS tmp_from
FROM walk_0
) AS tmp_table
);

-- This table is derived from walk_1 and is used for further refinement of connected components
drop table if exists walk_2;

create table walk_2 as
(
SELECT 
original_from,
original_to,
(CASE WHEN tmp_from < cur_from THEN tmp_from ELSE cur_from END) AS cur_from,
(CASE WHEN tmp_to < cur_to THEN tmp_to ELSE cur_to END) AS cur_to
FROM
(
SELECT
original_from, 
original_to, 
cur_from, 
cur_to,
MIN(cur_from) OVER (PARTITION BY cur_to) AS tmp_to,
MIN(cur_to) OVER (PARTITION BY cur_from) AS tmp_from
FROM walk_1
) AS tmp_table
);

drop table if exists subgraphs;

create table subgraphs as
(
select original_from, original_to, 'C' || dense_rank() over(order by cur_from) as comp
from walk_2
);


drop table if exists components;

create table components as
(
select original_from as node, comp as component
from subgraphs
union
select original_to as node, comp as component
from subgraphs
order by component
);

Here is a non-recursive SQL approach which involves creating a new table for each walk instead of the iteration in memory employed by the recursive method. The number of walk tables to create is determined by the depth of the analysis and refinement of the connected components you want to perform.

Which is the best pick for you?

By conducting a thorough analysis, including factors such as system architecture and budget constraints, you can identify the most suitable method to address your graph-related challenges.

For instance, opting for non-recursive SQL becomes a favourable choice when dealing with substantial storage capacity while prioritizing minimal investment in memory resources.

More References:

https://www.youtube.com/watch?v=L967JqNFxkw