Go 200,000x Faster in the Field of Weather Analysis with CUDA Python (Numba)

Chen Chieh Yu

Published in RAPIDS AI

5 min read · Aug 28, 2020

Authors: Ying Jhang Wu, Chen Chieh Yu, Graham Markall

Introduction: Challenges of weather product development

One goal of the Meteorological Information Center of Taiwan's Central Weather Bureau is to develop new weather-related products and tools. Each product requires significant testing before it is complete: it often takes many iterations to go from concept to production, and most of that time is spent in development, leaving little time to step back and think about new problems. There are also few ready-made analysis tools in meteorology, so we usually have to develop each new function ourselves. Meanwhile, Python has become one of the biggest trends in the meteorological domain. Python is successful because numerous powerful libraries can easily be imported into our own programs, saving precious time.

Motivation: Another possibility for practical evaluation of weather models

Operational meteorological models are verified with long-term statistics, but assessment of how well they forecast short-term weather systems is quite limited. Therefore, I want to develop a set of tools to evaluate how well a forecasting model predicts the Meiyu front (a persistent stationary front) in the short term. We chose the Meiyu front system as the first test case. Many studies in the literature mention the importance of the strong south-westerly jet ahead of the Meiyu front. How well does the weather forecast capture the jet axis? Before we can evaluate this across many weather patterns, we must first be able to define the position of the jet axis automatically. While searching through many candidate methods, Python saves development time, but traditionally it has not been fast enough to be useful. If Python code could be accelerated, the same approach could solve more problems, such as positioning front axes, troughs, and ridge lines, while keeping development time low.

An Example: Positioning of strong wind axis on Meiyu front

We want to find the strong wind axis: the horizontal line connecting many local areas of high wind speed, whose direction helps us judge the type of weather system. Searching for it is similar to looking for a mountain ridge in terrain. An algorithm from geographic information applications (ridge detection by the steepest ascent method) gives us a starting point. Imagine a climber at each location. Each climber moves to the highest point nearby and plants a flag every time a new destination is reached, and does not stop until the current location is the highest point in its surrounding area. Finally, the total number of flags at each data point is counted. The code is as follows:

CPU-Based Version
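
A minimal sketch of the climber procedure described above, not the authors' original listing (which is available in the linked repository). It assumes a 2D array f of wind speeds, a 3x3 neighbourhood clipped at the borders, and NaN for missing data. The function name ridge_detection_cpu and the handling of ties are illustrative assumptions; the names f, count, index, and vmax follow the ones used in this article.

```python
import numpy as np


def ridge_detection_cpu(f):
    """Steepest-ascent ridge detection (illustrative sketch).

    A climber starts at every non-NaN cell of `f`, repeatedly steps to the
    highest cell in its 3x3 neighbourhood, and plants a flag on every cell
    it steps onto. Returns the number of flags planted on each cell.
    """
    ny, nx = f.shape
    count = np.zeros((ny, nx), dtype=np.int32)
    for i in range(ny):
        for j in range(nx):
            if np.isnan(f[i, j]):
                continue  # assumption: no climber starts on missing data
            ci, cj = i, j  # current position of this climber
            while True:
                # 3x3 window around the climber, clipped at the borders
                i0, i1 = max(ci - 1, 0), min(ci + 2, ny)
                j0, j1 = max(cj - 1, 0), min(cj + 2, nx)
                window = f[i0:i1, j0:j1]
                vmax = np.nanmax(window)
                if vmax <= f[ci, cj]:
                    break  # already the highest point nearby
                index = np.nanargmax(window)  # flat index inside the window
                di, dj = np.unravel_index(index, window.shape)
                ci, cj = i0 + di, j0 + dj
                count[ci, cj] += 1  # plant a flag at the new position
    return count
```

Each step moves to a strictly higher value, so every climber is guaranteed to stop. Cells with a high count are the ones many climbers pass through, which is what marks the ridge (here, the strong wind axis).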

Fig 1: Input data (left) and Ridge Detection Algorithm (4 images on the right).

Fig 2: Input image (left) and result (right).

Although a single test takes only about two minutes on the CPU, that is impractical when a massive number of weather patterns must be evaluated.

Numba could accelerate the example by 200,000x

Only a few changes were needed to go from the pure Python implementation to the CUDA implementation with Numba, and they are fairly common across applications ported to CUDA Python. Let’s look at the individual changes:

Moving allocations outside the kernel: Memory cannot be allocated inside CUDA kernels with Numba, so we allocate the result count from host-side code, and pass it in as a parameter. There are various ways of allocating on-device memory, outlined in the Memory Management documentation.
Parallelization of the outer loops across threads: The serial loops on i and j over the shape of f were replaced with thread-parallel grid-strided loops, where the indices are derived from the Thread ID within the Grid. This 2D iteration space naturally fits well with the use of a 2D Grid. The grid and gridsize functions are explained in the Kernel reference documentation.
Replacement of NumPy function calls with loops and scalar operations: Since CUDA Python doesn’t support calling np.nanargmax and np.nanmax inside a kernel, these are replaced with loops over the elements of f to compute index and vmax. A list of supported Python language features and library functions is provided in the Numba CUDA documentation.
Using atomic operations to avoid races: In the CUDA version, multiple threads can update the same element of count concurrently, which causes races. Replacing the update of each element with cuda.atomic.add updates each element atomically to ensure correctness. Synchronization and Atomic Operations contains a full list of supported atomic operations.

CUDA Python Version:

Conclusion: Insufficient computing speed will be an increasingly common problem

Insufficient computing speed is a problem we will encounter more and more often, and this simple yet efficient method allows meteorological researchers to scale their Python calculations massively. Thank you, Numba! In this example, the speedup is nearly 200,000x with minimal code changes. Some methods are not supported inside kernels, which adds some development time, but developing in Python is quick enough that replacements are easy to write. As we have shown, any Python developer can easily use the power of the GPU to accelerate their code. Try it out today!

Please see all code here: https://github.com/chychen/strong_wind_axis_python_cuda

End-to-end code