Let me start by saying that this post is going to be looong… On the flip side, it summarizes my team's current research and development efforts on accelerating CFD simulations.
Some background though…
The team at byteLAKE has created a set of highly optimized CFD kernels that leverage the speed and energy efficiency of Xilinx Alveo FPGA accelerator cards to create a high-performance platform for complex engineering analysis.
The kernels can be directly adapted to geophysical models such as the EULAG (Eulerian/semi-Lagrangian) fluid solver, which is designed to simulate all-scale geophysical flows.
The algorithms have been extended with additional quantities such as forces (implosion, explosion) and density vectors. In addition, they allow users to fully configure the boundary conditions (periodic, open).
What is CFD then?
CFD (Computational Fluid Dynamics) tools combine numerical analysis and algorithms to solve fluid flow problems. A range of industries such as automotive, chemical, aerospace, biomedical, power and energy, and construction rely on fast CFD analysis turnaround time. It is a key part of their design workflow for understanding how liquids and gases flow and interact with surfaces.
Typical applications include weather simulations, aerodynamic characteristics modelling and optimization, and petroleum mass flow rate assessment.
Why does acceleration matter?
The ever-increasing demand for accuracy and capabilities of CFD workloads produces an exponential growth of the required computational resources. Moving to heterogeneous HPC (High Performance Computing) configurations powered by Xilinx Alveo helps significantly improve performance within radically reduced energy budgets. In the end, you get the results faster and for less energy, and both of these factors help you drive the TCO down.
Kernels we adapted and optimized
- Kernel: Advection (movement of some material, dissolved or suspended in the fluid)
First-order step of the non-linear iterative upwind advection MPDATA (Multidimensional Positive Definite Advection Transport Algorithm) schemes.
- Kernel: Pseudo velocity (approximation of the relative velocity)
Computation of the pseudo velocity for the second pass of upwind algorithm in MPDATA.
- Kernel: Divergence (measures how much fluid is flowing into or out of a certain point in a vector field)
Divergence part of the matrix-free linear operator formulation in the iterative Krylov scheme.
- Kernel: Thomas algorithm (a simplified form of Gaussian elimination for tridiagonal systems of equations)
Tridiagonal Thomas algorithm for vertical matrix inversion inside preconditioner for the iterative solver. Preconditioner operates on the diagonal part of the full linear problem. Effective preconditioning lies at the heart of multiscale flow simulation, including a broad range of geoscientific applications.
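For readers who have not met the Thomas algorithm before, here is a minimal sketch of it in plain Python. It is only an illustration of the textbook method, not our FPGA kernel: the function name `thomas_solve` and the argument layout are assumptions made for this example, and the sketch does no pivoting, so it assumes a well-behaved (for example diagonally dominant) matrix.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system A x = rhs (textbook Thomas algorithm).

    lower[i] is the sub-diagonal entry of row i (lower[0] is unused),
    diag[i] is the main diagonal entry of row i,
    upper[i] is the super-diagonal entry of row i (upper[-1] is unused).
    No pivoting is performed (assumption: the matrix is well behaved).
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward sweep: eliminate the sub-diagonal row by row.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution: recover the unknowns from the last row upwards.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# Example: a 4x4 system with 2 on the diagonal and -1 next to it.
solution = thomas_solve(
    lower=[0.0, -1.0, -1.0, -1.0],
    diag=[2.0, 2.0, 2.0, 2.0],
    upper=[-1.0, -1.0, -1.0, 0.0],
    rhs=[0.0, 0.0, 0.0, 5.0],
)
```

The whole solve is two passes over the data, which is why the method costs O(n) instead of the O(n³) of general Gaussian elimination.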
Quickly about the results so far…
- 4x faster results
- 80% lower energy consumption
- Up to 6x better Performance per Watt
Optimizing CFD codes for Alveo
The goal of the work was to adapt 4 CFD kernels to the Alveo U250 FPGA. All the kernels use a 3-dimensional compute domain consisting of 7 (Thomas) to 11 (pseudo velocity) arrays. The computations are performed in a stencil fashion: computing a single element of the compute domain requires access to its neighboring elements. Since all the kernels are memory-bound algorithms, our main challenge was to achieve the highest possible utilization of the global memory bandwidth.
The Alveo U250 FPGA has 4 global memory banks, each connected to a single super logic region (SLR). To match this design, the compute domain was divided into 4 sub-domains, each assigned to a separate memory bank. Each kernel was distributed across 4 compute units, each placed in a different SLR. In this way, the memory transfers between the global memory and the compute units occurred only between connected pairs of SLR and memory bank.
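To make the stencil access pattern concrete, here is a small 1D sketch of a first-order upwind (donor-cell) step, the kind of update the advection kernel starts with. It is a simplified illustration only: the function name `upwind_step`, the periodic boundaries and the constant Courant number are assumptions for this example, and the real kernels work on 3D arrays.

```python
def upwind_step(field, courant):
    """One first-order upwind (donor-cell) step in 1D with periodic edges.

    courant = velocity * dt / dx, assumed constant with |courant| <= 1.
    Each output cell reads itself and one neighbor: that is the stencil.
    """
    n = len(field)
    new_field = [0.0] * n
    for i in range(n):
        left = field[(i - 1) % n]    # neighbor on the left (periodic wrap)
        right = field[(i + 1) % n]   # neighbor on the right (periodic wrap)
        if courant >= 0.0:
            # Flow goes left to right, so the upwind neighbor is on the left.
            new_field[i] = field[i] - courant * (field[i] - left)
        else:
            # Flow goes right to left, so the upwind neighbor is on the right.
            new_field[i] = field[i] - courant * (right - field[i])
    return new_field


shifted = upwind_step([0.0, 1.0, 0.0, 0.0], courant=1.0)
smeared = upwind_step([0.0, 1.0, 0.0, 0.0], courant=0.5)
```

Note how little arithmetic there is per element compared with the number of memory reads. That ratio is what makes these kernels memory bound.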
- Each kernel is distributed across 4 SLRs
- Each sub-domain is allocated in a different memory bank
- Halo data is exchanged between neighboring memory banks
To update the data between the memory banks, halo areas (the borders of each sub-domain) must be exchanged between neighboring sub-domains. For this purpose, we used a memory object called a pipe. A pipe stores data organized as a FIFO. Pipes can be used to stream data from one kernel to another inside the FPGA without going through the external memory, which greatly improves the overall system latency.
To minimize global memory traffic, we used fast BRAM memory. Stencil computation accesses each array many times. Since there is not enough BRAM to store 3D blocks of the compute domain, we used the 2.5D blocking technique to provide data locality: we kept only a small set of planes for each domain, organized as a queue of planes. After each iteration, only a single new plane was loaded from the global memory, while the others shifted along the queue. In this way the global memory traffic was minimized.
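The decomposition and halo exchange can be sketched in a few lines of Python. This is a host-side toy model, not FPGA code: a `deque` stands in for an on-chip pipe, the domain is 1D, the outer edges are treated as periodic, and the names `split_domain` and `exchange_halos` are made up for this example.

```python
from collections import deque


def split_domain(values, parts):
    """Split a 1D list into `parts` contiguous sub-domains of equal size.

    Assumes len(values) is divisible by `parts`.
    """
    size = len(values) // parts
    return [values[p * size:(p + 1) * size] for p in range(parts)]


def exchange_halos(subdomains, halo=1):
    """Return each sub-domain padded with halo cells from its neighbors.

    Every neighbor pair talks through a FIFO (a deque here), which plays
    the role of a pipe. Outer edges wrap around (periodic, an assumption).
    """
    parts = len(subdomains)
    to_right = [deque() for _ in range(parts)]  # pipe from p to p + 1
    to_left = [deque() for _ in range(parts)]   # pipe from p to p - 1

    # Producers: each sub-domain pushes its border cells into the pipes.
    for p, sub in enumerate(subdomains):
        to_right[p].extend(sub[-halo:])
        to_left[p].extend(sub[:halo])

    # Consumers: each sub-domain pops its halos in FIFO order.
    padded = []
    for p, sub in enumerate(subdomains):
        left_halo = [to_right[(p - 1) % parts].popleft() for _ in range(halo)]
        right_halo = [to_left[(p + 1) % parts].popleft() for _ in range(halo)]
        padded.append(left_halo + sub + right_halo)
    return padded


subdomains = split_domain(list(range(8)), parts=4)
padded = exchange_halos(subdomains)
```

Only the thin border cells travel through the FIFOs. The bulk of each sub-domain never leaves its own memory bank.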
- Each array is transferred from the global memory to the fast BRAM memory
- To minimize the data traffic, we use a memory queue across iterations
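The plane queue behind 2.5D blocking can be modeled with a fixed-size FIFO. The sketch below is a simplified illustration under stated assumptions: `load_plane` is a hypothetical callback standing in for a global memory read, planes are flat lists, the window holds 3 planes, and the "stencil" is just an average of three planes along the streaming axis.

```python
from collections import deque

WINDOW = 3  # planes kept on chip at once (illustrative choice)


def sweep_with_plane_queue(load_plane, num_planes):
    """Stream planes through a fixed-size queue and count global loads.

    load_plane(k) is a hypothetical callback that reads plane k from
    global memory. The deque models the on-chip queue of planes.
    """
    queue = deque(maxlen=WINDOW)  # the oldest plane is dropped automatically
    loads = 0
    results = []
    for k in range(num_planes):
        queue.append(load_plane(k))  # exactly one new plane per iteration
        loads += 1
        if len(queue) == WINDOW:
            below, centre, above = queue
            # Toy stencil along the streaming axis: average of 3 planes.
            results.append(
                [(a + b + c) / 3.0 for a, b, c in zip(below, centre, above)]
            )
    return results, loads


results, loads = sweep_with_plane_queue(lambda k: [float(k)] * 4, num_planes=5)
```

In this toy run, 5 plane loads produce 3 output planes. A version that reloaded all three input planes for every output plane would need 9 loads, and the gap grows with the stencil height and the number of planes.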
Another key optimization was to organize the computation in a SIMD fashion by utilizing vector data types of size 16. It allowed us to utilize a 512-bit AXI4 memory interface for global memory access.
Lastly, here is a summary of how each optimization sped up the execution of the advection kernel, ultimately cutting the execution time from almost 600 s to roughly 1 s.
- O1: 1 SLR basic implementation, loop pipelining
- O2: memory bank assignment, memory alignment
- O3: 2 SLRs and 2 memory banks, pipes for kernel communication, loop tiling
- O4: vectorization added, kernel communication through data copying
- O5: BRAM used
- O6: critical path optimized
- O7: queue on BRAM memory
- O8: optimization flags
- O9: memory pins reduction (3 SLRs), boundary conditions optimization, and reduced global memory traffic
- O10+: 4 SLRs
It is worth mentioning that the above techniques translated into a roughly 600x speedup!
To compare the results, we also highly optimized the code for CPU-only architectures. So let’s quickly jump into some details there as well…
CPU optimization for reference
Our initial CPU implementation used 2 CPUs: Intel® Xeon® E5-2695 v2, 2.40 to 3.2 GHz (2x12 cores). Then we compared the results with several other configurations, including: 1 x Intel Xeon E5-2695 CPU 2.4 GHz, IvyBridge (Ivy); Intel Xeon Gold 6148 CPU 2.4 GHz, SkyLake (Gold); and Intel Xeon Platinum 8168 CPU 2.7 GHz, SkyLake (Platinum).
To optimize the code, we applied several techniques, such as:
- utilization of all the available cores
- loop transformations
- memory alignment
- thread affinity
- data locality within nested loops
- compiler optimizations
Depending on the kernel, the above techniques translated into a speedup of almost 92x!
For the configuration with 2 CPUs (Ivy) we reached a maximum throughput of 3.7 GB/s, with a corresponding power dissipation of 142 W.
The FPGA configuration reached a throughput of almost 6 GB/s at a power dissipation slightly above 100 W.
Also, still speaking of the FPGA, the results for the advection kernel were as follows:
- Read Data: 467.92704 GB (domain 1020x510x64; time steps: 500)
- Execution time: 9.96 s
- Throughput: 46.981 GB/s
It is very important to note that we reached 98.32% of the maximum attainable throughput.
In that case, we optimized the performance to the point where the computation time was completely “hidden” behind the data transfer time at the maximum possible throughput. In other words, we reached the best possible optimization for the given CFD kernel.
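As a quick sanity check, the advection figures above can be reproduced with a few lines of Python. The three input values come straight from this post. The "implied peak" is only a back-of-the-envelope number derived from them, not a separately measured value.

```python
read_data_gb = 467.92704    # total data read, as reported above
execution_time_s = 9.96     # execution time, as reported above
fraction_of_peak = 0.9832   # 98.32% of the maximum attainable throughput

# Throughput is simply the data volume divided by the execution time.
throughput_gb_s = read_data_gb / execution_time_s

# Derived value (assumption: the 98.32% refers to this same throughput).
implied_peak_gb_s = throughput_gb_s / fraction_of_peak

print(f"throughput: {throughput_gb_s:.3f} GB/s")
print(f"implied attainable peak: {implied_peak_gb_s:.2f} GB/s")
```

The first value matches the reported throughput of about 46.98 GB/s. The second one suggests the attainable peak for this setup sits a little below 48 GB/s.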
And here are the results for various configurations:
As we can see, a single Alveo U250 card was able to outperform even Intel Xeon Platinum 8168 CPU, delivering results slightly faster and at a significantly lower energy budget.
It is important to emphasize that the presented results are for single kernels. Typical CFD applications consist of many kernels that execute various operations on the same data. Therefore, we should expect even better results, thanks to further opportunities to optimize the computations and reuse data across all the kernels.
Moving fluid simulations to heterogeneous computing architectures powered by Alveo FPGA delivers faster results within significantly reduced energy budgets. For instance, nodes equipped with Alveo U250 deliver up to 4x speedup while reducing the energy consumption by almost 80% vs. CPU-only nodes. As these algorithms are memory bound, upgrading the configuration to U280 (equipped with HBM) gives additional speedup and helps reduce the energy budgets further.
The CFD market needs speedup, scalability, and solutions ready for heterogeneous architectures. The Alveo product family addresses these challenges very well. Moreover, CFD codes fit well with Alveo architecture features such as multiple memory banks, BRAM utilization, pipelining, vectorization, etc.
FPGAs introduce certain limitations, for example:
- performance drops when we exceed the maximum amount of computation that can be executed at the highest throughput; beyond that point we need to share the resources
- the available bandwidth between the host memory and the FPGA can limit the overall performance
Therefore, FPGAs are good candidates for applications where:
- most of the computations are organized in small pieces of code that are repeated over large portions of data
- the computational load is much higher than the input/output operations, which use and gradually saturate the available bandwidth between the host memory and the FPGA
Based on this, CFD codes in general are very good candidates to benefit from the FPGA architectures.
Some comments about FPGA benefits over CPU
- Memory hierarchy: integrated global memory vs. external RAM memory
Access to global memory is a bottleneck for CFD codes (in our case)
In the FPGA, the global memory is integrated with the accelerator, which allows us to fully utilize all memory banks in parallel and gives us higher-bandwidth access (DDR4/HBM) compared with the external RAM of a CPU
- Higher parallelism in FPGA vs. higher frequency in CPU:
A large set of arithmetic and logic units (1.3 M LUTs) allows us to hide up to 90% of the computation behind the data transfers
Thanks to this, we can beat a 2.5 GHz CPU with an FPGA running at 300/500 MHz
Lower frequency is also more energy efficient
- Ultra-fast BRAM memory vs. cache memory
The small stencil structure of CFD kernels does not require the big cache memory of a CPU. A small but ultra-fast BRAM is good enough to provide data reuse and data locality, which lets us reduce the global memory traffic to a level comparable with a CPU, where the cache memory is bigger.