GPU Parallelization of Community Detection using OpenCL
Utilizing GPU through OpenCL programs to parallelize Brandes’ Algorithm for betweenness centrality.
Algorithm Pseudo-code
FOR i = 0 to n iterations
    CALL initializeCentralityGPUBuffers
    FOR vertex V in the graph
        CALL initializeDijkstraGPUBuffers
        CALL parallelDijkstra with V as source vertex
        SET current_depth as highest cost in costArray
        WHILE current_depth > 0
            CALL parallelCalculatePairDependency with depth as current_depth, and V as source vertex
            DECREMENT current_depth by 1
        END WHILE
        CALL parallelIncrementCentrality with V as source vertex
    END FOR
    // find edge with highest centrality and remove it.
    CALL removeMostCentralEdge
END FOR
Graph Data Structures for OpenCL
To store the graph in GPU buffers, vertices are assumed to be numbered sequentially and edges are stored in sparse-matrix form to conserve memory.
struct GraphData {
    // (V) Offset into edgeArray of the first edge of each vertex (CSR row pointers)
    int *vertexArray;
    // Vertex count
    int vertexCount;
    // (E) Destination vertex of each edge (CSR column indices)
    int *edgeArray;
    // (E) Source vertex of each edge (COO row indices), in the same edge order as edgeArray
    int *edgeArray2;
    // Edge count
    int edgeCount;
};
Parallelization of Dijkstra’s Algorithm
The OpenCL implementation of Dijkstra for single-source shortest paths that I am using is similar to the example in the OpenCL Programming Guide. My implementation does not consider edge weights, but it additionally keeps track of the number of shortest paths to each vertex in the sigmaArray. The vertex, edge, and per-vertex current cost information are stored in read-only arrays. The algorithm is split into two stages.
In the first stage, any vertex marked for exploration in the maskArray has its neighbors explored. Initially, only the source vertex is marked for exploration. When a vertex is explored, its neighbors' costs are updated in the updatingCostArray, which is a writable copy of the costArray.
if (maskArray[tid] != 0)
{
    maskArray[tid] = 0;
    // Get the range of edges for our vertex.
    int edgeStart = vertexArray[tid];
    int edgeEnd;
    if (tid + 1 < vertexCount)
    {
        edgeEnd = vertexArray[tid + 1];
    }
    else
    {
        edgeEnd = edgeCount;
    }
    // Iterate through all neighbors.
    for (int edge = edgeStart; edge < edgeEnd; edge++)
    {
        int nid = edgeArray[edge];
        // Relax the neighbor's tentative cost (every edge has weight 1).
        if (updatingCostArray[nid] > (costArray[tid] + 1))
        {
            updatingCostArray[nid] = (costArray[tid] + 1);
        }
        // If the neighbor sits exactly one level deeper, every shortest
        // path to this vertex extends to a shortest path to the neighbor.
        if (updatingCostArray[nid] && (updatingCostArray[nid] == (costArray[tid] + 1)))
        {
            atomic_add(&sigmaArray[nid], sigmaArray[tid]); // we found another shortest path
        }
    }
}
The second stage writes back into costArray any change from updatingCostArray that results in a lower cost for that vertex. Such a vertex is also marked in the maskArray for exploration in the next iteration of the first stage.
if (costArray[tid] > updatingCostArray[tid])
{
    // A shorter path was found; commit it and re-explore this vertex.
    costArray[tid] = updatingCostArray[tid];
    maskArray[tid] = 1;
}
updatingCostArray[tid] = costArray[tid];
Both kernels are run repeatedly until maskArray no longer contains a 1 at any index.
Parallelization of Edge Betweenness Centrality
Brandes’ algorithm for betweenness centrality observes that the pair-wise dependencies of a source vertex on the other vertices obey a recurrence: the dependency of a vertex v is the sum, over each neighbor w lying one level deeper on a shortest path from the source, of (σ_v / σ_w) · (1 + δ(w)), where σ is the shortest-path count and δ the accumulated dependency.
We start with edges farthest from the source vertex, decrementing our current depth with each iteration of the kernel. edgeArray2 provides a way to access the source vertex of an edge, given an edge index; the vertices in edgeArray are the corresponding neighbors (same edge order). For each pair of connected vertices v and w, we calculate the partial pair dependency of vertex v and accumulate it into v's entry in the dependencyArray, using atomic addition to avoid data races between threads on the GPU.
int v = edgeArray2[tid];
// Only process edges whose source vertex sits at the current depth.
if(costArray[v] != currentDepth)
    return;
int w = edgeArray[tid];
// Do we pass v to get to w?
if(costArray[w] == (costArray[v] + 1)) {
    if(sigmaArray[w] != 0) {
        float sigma_v = (float)sigmaArray[v];
        float sigma_w = (float)sigmaArray[w];
        // atomic_add_global is a helper for atomic float addition,
        // since OpenCL has no built-in atomic_add for floats.
        atomic_add_global(dependencyArray + v, (sigma_v * 1.0 / sigma_w) * (1 + dependencyArray[w]));
    }
}
After running the above kernel for each depth level, the centrality of each vertex is updated, ignoring the current source vertex.
__kernel void finalizeDependency(__global float *centrality, __global float* dependency, int vertexCount, int sourceVertex) {
// access thread id
int v = get_global_id(0);
if(v >= vertexCount)
return;
if(v == sourceVertex)
return;
// shortest paths are counted twice.
centrality[v] += dependency[v] / 2;
}