Optimizing N-Body Simulation with Barnes-Hut Algorithm and CUDA

5 min readJan 24, 2024

This post examines N-body simulations using the Barnes-Hut algorithm and its parallel implementation in CUDA. We will first cover the basics of N-body simulation and the Barnes-Hut Algorithm, then move on to its parallelized implementation.

N-Body Simulation

N-body simulation in physics involves modeling the behavior of a system of particles, usually under gravitational forces. It’s essential in astrophysics for studying the development of large-scale structures in the universe, including galaxy filaments and the dynamic evolution of star clusters. We will focus on direct gravitational N-body simulations in two-dimensional space.

Computing Gravitational Interactions

The key task is to update the positions and velocities of the N bodies, influenced by their mutual gravitational forces. For each body, we calculate its acceleration based on the gravitational forces acting on it. Using a direct-sum approach, which computes all pairwise forces, would need O(N²) operations per time step. In contrast, hierarchical tree-based methods, like the Barnes-Hut Algorithm, can reduce this complexity to O(N log N) for uniform distributions.

Formulas

Let’s examine the basic mathematical formulas for force calculation:

Force:

**Figure 1:** Newton’s law of universal gravitation [1]

Acceleration:

Velocity:

Position:

Barnes-Hut Algorithm

The Barnes-Hut algorithm in N-body simulations simplifies force computations using a hierarchical quadtree in two dimensions.

It involves two primary steps:

1. Quadtree Construction

A quadtree is a tree data structure where each node represents a two-dimensional space. The root node covers the entire space, and its four children represent its four quadrants. The construction process recursively divides the root node, containing all N bodies, into four child nodes (quadrants), until each node holds at most one body. Each node stores the bound it covers, along with the total mass and the center of mass of all bodies within its bound.

2. Force Computation

To compute the force on a body, we traverse the quadtree starting from the root node. If a node is far enough from the body, we approximate the force using that node’s center of mass and total mass. Whether a node is far enough from a body depends on the quotient s/d, where s is the width of the bound represented by the node, and d is the distance between the body and the node’s center of mass. If the node is not far enough, we explore each subtree node for a more precise calculation.

With an understanding of the Barnes-Hut algorithm, our next step is to explore its parallelized implementation in CUDA.

Barnes-Hut CUDA Implementation Overview

Note: This is a simplified overview. For full technical details, please refer to the paper.

Prerequisites

A basic understanding of CUDA programming is essential before delving into the implementation details.

Parallelization Suitability and Challenges

The N-body problem is inherently parallelizable, particularly in force calculations. Parallelizing the Barnes-Hut algorithm, however, presents unique challenges:

Repeated construction of an irregular tree structure.
Extensive pointer-chasing memory operations.
Typically implemented recursively.

Data Structures

Key structures include:

Body: Stored in an array with attributes like mass, radius, position, velocity, and acceleration.
Quadtree: Stored in a fixed-size array instead of a dynamic structure for efficiency.
Node Attributes: Center of mass, total mass, bounding box, and contained bodies.

Algorithm Overview

The algorithm involves four main steps (kernels) per time step:

1. Reset Quadtree

Resets each quadtree node to its default state in parallel.

2. Compute Bounding Box

Determine the bounding box for all bodies at the root node. Bodies are divided into groups, assigned to blocks, which compute the bounding box in parallel, and then update the root node’s bounding box atomically.

3. Construct Quadtree

**Figure 4:** Dynamic Parallelism Example [4]

The quadtree is constructed top-down using dynamic parallelism to avoid locks and synchronization. The process starts with a single block representing the root node that covers the entire space, which contains all N bodies. This block is then subdivided into four quadrants, with each new block managing one quadrant. The subdivision process continues until it reaches a leaf node or a block with only one body.

During this construction, a block carries out three main steps:

Compute the Center of Mass: The block calculates the center of mass and the total mass for the bodies in its assigned quadrant.
Assign Bodies to Quadrants: Before launching new blocks, bodies are assigned to specific blocks depending on which quadrant they belong to.
Launch New Blocks: The block launches four new blocks, one for each quadrant.

4. Compute Force

Forces on each body are calculated using parallel tree traversal with the constructed quadtree.

Evaluations

**Figure 5:** Barnes-Hut Implementations Runtime

Figure 5 shows the runtime for each kernel and compares them with the CPU implementation, for 5,000,000 bodies. We use the CPU/GPU time ratio as an estimate of the number of CPU cores required to match GPU performance. Notably, Kernel 4 (compute force) is by far the most time-consuming, followed by Kernel 3 (construct quadtree). The first two kernels account for less than 1% of the total runtime in both implementations.

In a 5,000,000 body simulation, the parallelized implementation has a runtime of 1.5 seconds per timestep. This highlights the effectiveness of parallelization, particularly in handling the algorithm’s most computationally intensive steps.

N-Body Simulation — Galaxy Collision Visualization

N-Body Simulation — Spiral Galaxy Visualization

The source code of the Barnes-Hut CUDA implementation discussed in this post is available here.

References

[1] Wikipedia. Newton’s law of universal gravitation.
[2] Tom Ventimiglia and Kevin Wayne. The barnes-hut algorithm.
[3] Martin Burtscher and Keshav Pingali. An efficient cuda implementation of the tree-based barnes hut n-body algorithm.
[4] J. Gómez-Luna and I. El Hajj. CUDA dynamic parallelism.