GPU Parallelization of Community Detection using OpenCL
Utilizing GPU through OpenCL programs to parallelize Brandes’ Algorithm for betweenness centrality.
Algorithm Pseudo-code
FOR i = 0 to n iterations
    CALL initializeCentralityGPUBuffers
    FOR vertex V in the graph
        CALL initializeDijkstraGPUBuffers
        CALL parallelDijkstra with V as source vertex
        SET current_depth as highest cost in costArray
        WHILE current_depth > 0
            CALL parallelCalculatePairDependency with depth as current_depth, and V as source vertex
            DECREMENT current_depth by 1
        END WHILE
        CALL parallelIncrementCentrality with V as source vertex
    END FOR
    // find edge with highest centrality and remove it.
    CALL removeMostCentralEdge
END FOR
Graph Data Structures for OpenCL
To store the graph in GPU buffers, vertices are assumed to be numbered sequentially and edges are stored in sparse-matrix form to conserve memory.
struct GraphData {
    // (V) Offset into edgeArray of the first edge of each vertex (CSR row pointers)
    int *vertexArray;
    // Vertex count
    int vertexCount;
    // (E) Destination vertex of each edge (CSR column indices)
    int *edgeArray;
    // (E) Source vertex of each edge (COO row indices), in the same edge order as edgeArray
    int *edgeArray2;
    // Edge count
    int edgeCount;
};
Parallelization of Dijkstra’s Algorithm
The OpenCL implementation of Dijkstra for single-source shortest paths that I am using is similar to the example in the OpenCL Programming Guide. My implementation does not consider edge weights, but it additionally keeps track of the number of shortest paths to each vertex in the sigmaArray. The vertex, edge, and per-vertex current cost information are stored in read-only arrays. The algorithm is split into two stages.
In the first stage, any vertex marked for exploration in the maskArray has its neighbors explored. Initially, only the source vertex is marked for exploration. When a vertex is explored, its neighbors' costs are updated in the updatingCostArray, which is a writable copy of the costArray.
if (maskArray[tid] != 0)
{
    maskArray[tid] = 0;
    // Get the range of edges for our vertex.
    int edgeStart = vertexArray[tid];
    int edgeEnd;
    if (tid + 1 < vertexCount)
    {
        edgeEnd = vertexArray[tid + 1];
    }
    else
    {
        edgeEnd = edgeCount;
    }
    // Iterate through all neighbors.
    for (int edge = edgeStart; edge < edgeEnd; edge++)
    {
        int nid = edgeArray[edge];
        // Relax the neighbor's tentative cost (every edge has weight 1).
        if (updatingCostArray[nid] > (costArray[tid] + 1))
        {
            updatingCostArray[nid] = (costArray[tid] + 1);
        }
        // If the neighbor sits exactly one level deeper, every shortest
        // path to this vertex extends to a shortest path to the neighbor.
        if (updatingCostArray[nid] && (updatingCostArray[nid] == (costArray[tid] + 1)))
        {
            atomic_add(&sigmaArray[nid], sigmaArray[tid]); // we found another shortest path
        }
    }
}
The second stage writes back into costArray any change from updatingCostArray that results in a lower cost for that vertex. Such a vertex is also marked in the maskArray for exploration in the next iteration of the first stage.
if (costArray[tid] > updatingCostArray[tid])
{
    // A shorter path was found; commit it and re-explore this vertex.
    costArray[tid] = updatingCostArray[tid];
    maskArray[tid] = 1;
}
updatingCostArray[tid] = costArray[tid];
Both kernels are run repeatedly until maskArray no longer contains a 1 at any index.
Parallelization of Edge Betweenness Centrality
Brandes’ algorithm for betweenness centrality observes that the pair-wise dependencies of a source vertex on the other vertices obey a recurrence: the dependency of a vertex v is the sum, over each neighbor w lying one level deeper on a shortest path from the source, of (σ_v / σ_w) · (1 + δ(w)), where σ is the shortest-path count and δ the accumulated dependency.
We start with edges farthest from the source vertex, decrementing our current depth with each iteration of the kernel. edgeArray2 provides a way to access the source vertex of an edge, given an edge index; the vertices in edgeArray are the corresponding neighbors (same edge order). For each pair of connected vertices v and w, we calculate the partial pair dependency of vertex v and accumulate it into v's entry in the dependencyArray, using atomic addition to avoid data races between threads on the GPU.
int v = edgeArray2[tid];
// Only process edges whose source vertex sits at the current depth.
if(costArray[v] != currentDepth)
    return;
int w = edgeArray[tid];
// Do we pass v to get to w?
if(costArray[w] == (costArray[v] + 1)) {
    if(sigmaArray[w] != 0) {
        float sigma_v = (float)sigmaArray[v];
        float sigma_w = (float)sigmaArray[w];
        // atomic_add_global is a helper for atomic float addition,
        // since OpenCL has no built-in atomic_add for floats.
        atomic_add_global(dependencyArray + v, (sigma_v * 1.0 / sigma_w) * (1 + dependencyArray[w]));
    }
}
After running the above kernel for each depth level, the centrality of each vertex is updated, ignoring the current source vertex.
__kernel void finalizeDependency(__global float *centrality, __global float* dependency, int vertexCount, int sourceVertex) {
// access thread id
int v = get_global_id(0);
if(v >= vertexCount)
return;
if(v == sourceVertex)
return;
// shortest paths are counted twice.
centrality[v] += dependency[v] / 2;
}