General-purpose Computing on GPU using Python

Reveal the power of the GPU

Kok Wei Chew
ViTrox-Publication
20 min readNov 3, 2022

--

Photo by Timur Garifov on Unsplash

Preface

Given the right applications and necessary time, the device that can complete the majority of the general computing tasks is known as a general-purpose computer. Personal computers, like desktops, notebooks, smartphones and tablets, are all examples of general-purpose computers. Specialized embedded computers, deployed in intelligent systems, is the other word that differs with general-purpose computers.

In terms of data throughput, Graphical Processing Units (GPUs) perform better than Central Processing Units (CPUs). A GPU is made up of hundreds of cores that work simultaneously on different pieces of data to complete the same task. Thus, a GPU may accelerate particular processes beyond the capabilities of a CPU by pushing enormous amounts of processed data through a workload. For more information about the difference between CPUs and GPUs, please refer to the link below.

Despite having strong libraries like NumPy, which is based on C, Python has long been criticised for its speed. The trade-off for its adaptability for quick prototyping is this. However, high performance computing with Python is not a challenge given the rise of artificial intelligence and the addition of more GPU-accelerated libraries to Python.

The purpose of this article is to introduce readers to general computing using Nvidia GPUs. This article will not address a method for dealing with the lowest level abstraction that requires controlling the kernel and threads. The management of multidimensional arrays — also known as tensors, matrices, or simply arrays in this tutorial — will be the major focus. As a result, we can concentrate our efforts on the right tasks and let professionals handle the lowest level of optimization.

Introduction

Python has a number of libraries and frameworks that support Nvidia GPU computing. Tensorflow and PyTorch are two of them, both of which are developed for deep learning. Array is a well-established data format supported by them. Another library, called CuPy, created by Seiya Tokui, has an API that is nearly equivalent to NumPy for array operations. Tensorflow, PyTorch and CuPy will be further covered for the demonstration of general-purpose computing later.

Table 1: Libraries in Python support generic computing with Nvidia GPU.

Preparation

We must confirm that the necessary libraries are installed before we can proceed. To take advantage of the GPU’s capability, TensorFlow and PyTorch require the GPU version of the libraries, which must be installed. With the appropriate command at the Anaconda prompt, all three libraries may be installed through PIP. For handling statistical calculations, TensorFlow requires an extra package named tensorflow_probability.

The modules are ready to be imported into the scripts for our works after the installation is complete.

Basic Operations

You may either initialize a multidimensional array directly or convert one from/to a NumPy array. Specifying the data type (single/double precision, integer) in each initialization is a good habit. The reason for this is that some processes do not support automated type casting. Furthermore, automatic type casting incurs additional resources.

If our system has more than one GPU, we may choose which GPU will be used for the calculation. While PyTorch’s default setting is CPU, TensorFlow and CuPy’s default setting is GPU with index 0.

Despite the fact that we code in a synchronous manner, the GPU executes the code in an asynchronous manner. When the user attempts to obtain the data or result as a NumPy data type, both the GPU and the CPU will sync. This suggests that the performance, which is the cycle time, can only be evaluated as a whole.

For an instance, an operation consists of S1-A-B-C-S2 with S1: conversion from NumPy, A,B,C: some computations, and S2: conversion to NumPy. Placing checkpoints between AB and BC to measure the duration for operation B alone will not be working.

Basic arithmetic like exponent/logarithm, trigonometry, statistics, and matrix operations are all provided by the aforementioned libraries, and they are all virtually identical to NumPy in this regard. Even if those functions may be divided up into several sub-modules, we may still access them without difficulty.

For arithmetic operations between arrays of different shapes, the broadcasting rules are the same as NumPy. Besides arithmetic operations, we are also constantly dealing with array manipulation like indexing and axes manipulation, like transpose.

In practically every industry, linear algebra has a wide variety of applications. Let’s start with a simple example of solving a system of linear equations with 3 unknowns. The solution involves finding the inverse matrix of the 3x3 coefficient matrix A and a matrix multiplication to the RHS, b.

The singular value decomposition (SVD) of a matrix is another typical topic in linear algebra. SVD is used in principle components analysis (PCA) to determine the principal components (eigenvectors) and their relative variances (eigenvalues). We may locate the eigens by applying SVD to the data’s covariance matrix, as illustrated in Figure 1.

Figure 1: Principle Component Analysis (PCA).

Real Word Examples

As arithmetic-intensive tasks are where the GPU shines, this section will use more complex real-world examples to highlight this fact. The examples discussed as follows include image resizing (interpolation), image denoising (deconvolution), and clustering. Although many pre-existing libraries can carry out these duties, it is not advised to recreate the wheel, and beginning from scratch would help us grasp how to manage the codes and algorithms.

Example 1: Image Resizing (Interpolation)

Highlights:
- Coordinate generation
- Type casting
- Array indexing

Image resizing with arbitrary size requires interpolation of the image with the new coordinates based on old coordinates. The single channel (grayscale) image is treated with a function f(x,y) over a d-dimensional space (x,y). The array indices of the original image is used as the reference for interpolation. For example, to resize an image with 81 pixels in a specific direction to 101 pixels, we have the old coordinates labeled as 0, 1, 2, 3 … 80. We need to ‘squeeze’ the 101 data evenly within the range of 80, in which the new coordinate will be 0, 0.8, 1.6 … 80, as shown in Figure 2. The values of image f(x,y) on any points will be calculated based on their relative location of the respective coordinates and their neighbors.

Figure 2: Coordinates for interpolation in 1 direction.

The interpolation technique that will be covered in this part is linear interpolation, sometimes known as bilinear interpolation [1] in the context of 2D, as shown in Figure 3. Spline interpolation of higher order requires solving a system of linear equations for every pixel. This is a good practice for coding and testing the superiority of GPU in compute-intense tasks. However, we will be looking at the linear interpolation only as it is sufficient for our objective.

Figure 3: Bilinear Interpolation.

The implementation for a resizing operation follows the procedures:

1) Create an array for the new coordinates using the linspace() and meshgrid() function.

2) Get the neighboring pixels (indices) for the specific coordinate using floor() and ceil() method.

3) Calculate the image value at the neighboring pixels.

4) Get the image value at the specific coordinate using formula for bilinear interpolation.

The complete code for the implementation:

Figure 4: The image is resized to arbitrary sizes.

Figure 4 shows image resized with arbitrary sizes. Table 2 and 3 shows the speed comparison for resizing a grayscale image of size 512x512 to different ratios (factors). Due to the higher abstraction of Python language, there is an overhead cost for invoking and initialization of the frameworks in Table 1. In reality, the computation is performed repeatedly for different data (images). Hence, it is reasonable for us to neglect the initialization duration and initialization will not be considered for the rest of comparison in this tutorial.

Table 2: Speed comparison for resizing to different sizes including duration for initialization. Test system: Intel i5–1135G7 + Nvidia MX330, Windows 10.
Table 3: Speed comparison for resizing to different sizes without including duration for initialization. Test system: Intel i5–1135G7 + Nvidia MX330, Windows 10.

Image Denoising (Deconvolution)

Highlights:
- Deconvolution with (inverse) Fourier transform
- Array arithmetic

This part will introduce the total variation deconvolution [2] picture denoising approach, which calls for resolving a minimization issue. Total variation deconvolution assume a correction f(x,y) to a noisy image fo(x,y) which satisfy the below regularization:

The minimization problem can be solved using Euler-Lagrange equation:

After discretization, the optimization problem is finally converted into a deconvolution problem from fo(x,y) to f(x,y) with a kernel K:

A deconvolution problem can be solved easily in the Fourier’s domain:

The implementation for a deconvolution operation follows the procedures:

1) Create the convolution kernel K.

2) Perform Fourier transform on the image fo and the kernel K to get FT[fo] and FT[K].

3) Divide the FT[fo] by FT[K] iteratively depends on the number of iterations.

4) Perform inverse Fourier transform on to obtain the final result.

The complete code for the implementation:

Figure 5 shows denoising using codes implemented above with different parameters. Table 4 shows the speed comparison for denoising with images of different sizes and parameters of the algorithm.

Figure 5: Effect of denoising using total variation deconvolution. (Left) original 8-bit grayscale image. (Center) denoise with parameters lambda=2, iteration=3. (Right) denoise with parameters lambda=2, iteration=10.
Table 4: Seed comparison for denoising on images with different sizes and parameters. Test system: Intel i5–1135G7 + Nvidia MX330, Windows 10.

Clustering (K-means)

Highlights:
- Multi-dimensional array manipulation
- Branchless computing

Clustering is the task of grouping and partitioning a set of data based on their similarity. The algorithm for clustering differs in similarity measure and partitioning methodology. The K-means [3] clustering method, one of the simplest in machine learning, is the clustering algorithm that will be studied in this part.

The K-means algorithm involves partitioning N observations into p clusters in which each observation belongs to the cluster with the nearest mean. Given a set of data (x1, x2, …, xn), the algorithm aims to find a set of clusters {S1, S2, …, Sp} with centroids (m1, m2, …, mp) in which the data are grouped into cluster with nearest distance from the respective centroid.

A standard implementation involves an iterative process of assigning each data to the cluster with the nearest centroids and recalculating the means (centroids) for data assigned to each cluster. This method requires an initialization (guess) of the centroids in order to run. The initial guess is critical and there is plenty of study for good results. The initialization is not the focus here and we will use a random initialization for the implementation. In general, the centroids at the (k+1)-th iteration are calculated based on the distance of the data from the centroids in k-th iteration.

To implement a branch-less operation, we introduce a weight term as assignment for the data to the nearest clusters. The term wij equal to 1 if i-th data belongs to j-th cluster and 0 otherwise .This enables us to perform pure arithmetic operations (at least in python). For a loop-less operation, we used the broadcasting properties of the multidimensional arrays to calculate the distance between each pair of data and centroid, as displayed in Figure 6. The overall shape of the array is (k, N, n_dim).

Figure 6: Shape of arrays in computation.

The implementation for a K-means clustering follows the procedures:

1) Initial guess (randomly generate within the data range)

2) Calculate distance and identify closest cluster

3) Assign points to closest cluster using weightage

4) Recalculate means

The complete code for the implementation:

Figure 7 shows clustered results of an unlabeled 2D data. The result is far from perfect due to the limitation of K-means itself. Another unsupervised algorithm called gaussian mixture modelling (GMM) [4] is better suited to this dataset, as shown in the rightmost of the Figure 7. Table 5 show the comparison for implementation of K-means with different libraries. The K-means from Scikit-Learn library also added for comparison. A pure CPU implementation with NumPy costs about 20% more processing time than Scikit-Learn for large datasets.

Figure 7: Clustering results, each cluster is marked with different color. Their respective centroid is marked ‘X’. (Leftmost) unlabeled data. (Center left) K-mean with 2 iterations. (Center right) K-mean with 20 iterations. (Rightmost) GMM with 20 iterations.
Table 5: Speed comparison for K-means with different dataset sizes and iterations. Test system: Intel i5–1135G7 + Nvidia MX330, Windows 10.

There is an interesting results showing that CuPy is much slower than TensorFlow and PyTorch in this test. The culprit lies in the operation along the axis. Current version (11.1.0) of CuPy is not optimized for these operations.

Conclusion

The examples showed all of the libraries discussed can accomplish our task very well. Table 5 shows a comparison between each library to help us decide which is the right one for our projects.

Table 6: Comparison between TensorFlow, PyTorch and CuPy.

In a nutshell, Python programming with GPU optimization can be accomplished effortlessly and without any difficulties. The code is quite similar to NumPy’s. Because of this, it is simpler for us to test our method in NumPy before moving on to TensorFlow, PyTorch, or CuPy for acceleration. We can greatly increase speed without sacrificing flexibility. Avoiding many loops and intricate branching is a good rule of thumb when writing Python code. This is valid for both CPU and GPU computing.

Reference:

[1] https://en.wikipedia.org/wiki/Bilinear_interpolation

[2] https://www.ipol.im/pub/art/2012/g-tvdc/article.pdf

[3] https://en.wikipedia.org/wiki/K-means_clustering

[4] https://en.wikipedia.org/wiki/Mixture_model

--

--