General-purpose Computing on GPU using Python

Reveal the power of the GPU

Kok Wei Chew

Published in

ViTrox-Publication

20 min readNov 3, 2022

Preface

Given the right applications and necessary time, the device that can complete the majority of the general computing tasks is known as a general-purpose computer. Personal computers, like desktops, notebooks, smartphones and tablets, are all examples of general-purpose computers. Specialized embedded computers, deployed in intelligent systems, is the other word that differs with general-purpose computers.

In terms of data throughput, Graphical Processing Units (GPUs) perform better than Central Processing Units (CPUs). A GPU is made up of hundreds of cores that work simultaneously on different pieces of data to complete the same task. Thus, a GPU may accelerate particular processes beyond the capabilities of a CPU by pushing enormous amounts of processed data through a workload. For more information about the difference between CPUs and GPUs, please refer to the link below.

CPU vs. GPU | Best Use Cases For Each | WEKA

WEKA. September 15, 2021 Are you wondering about CPU vs. GPU? We explain how each works and the best use cases for…

www.weka.io

Despite having strong libraries like NumPy, which is based on C, Python has long been criticised for its speed. The trade-off for its adaptability for quick prototyping is this. However, high performance computing with Python is not a challenge given the rise of artificial intelligence and the addition of more GPU-accelerated libraries to Python.

The purpose of this article is to introduce readers to general computing using Nvidia GPUs. This article will not address a method for dealing with the lowest level abstraction that requires controlling the kernel and threads. The management of multidimensional arrays — also known as tensors, matrices, or simply arrays in this tutorial — will be the major focus. As a result, we can concentrate our efforts on the right tasks and let professionals handle the lowest level of optimization.

Introduction

Python has a number of libraries and frameworks that support Nvidia GPU computing. Tensorflow and PyTorch are two of them, both of which are developed for deep learning. Array is a well-established data format supported by them. Another library, called CuPy, created by Seiya Tokui, has an API that is nearly equivalent to NumPy for array operations. Tensorflow, PyTorch and CuPy will be further covered for the demonstration of general-purpose computing later.

Table 1: Libraries in Python support generic computing with Nvidia GPU.

Preparation

We must confirm that the necessary libraries are installed before we can proceed. To take advantage of the GPU’s capability, TensorFlow and PyTorch require the GPU version of the libraries, which must be installed. With the appropriate command at the Anaconda prompt, all three libraries may be installed through PIP. For handling statistical calculations, TensorFlow requires an extra package named tensorflow_probability.

# install tensorflow
pip install tensorflow tensorflow-gpu tensorflow_probability# install pytorch (compile with cuda toolkit 11.3)
pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 torchaudio===0.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html# install cupy
pip install cupy

The modules are ready to be imported into the scripts for our works after the installation is complete.

import numpy
import tensorflow as tf
import torch
import cupyprint("Numpy version:", numpy.__version__)   # 1.23.3
print("Tensorflow version:", tf.__version__) # 2.10.0
print("Pytorch version:", torch.__version__) # 1.11.0+cu113
print("Cupy version:", cupy.__version__)     # 11.1.0

Basic Operations

You may either initialize a multidimensional array directly or convert one from/to a NumPy array. Specifying the data type (single/double precision, integer) in each initialization is a good habit. The reason for this is that some processes do not support automated type casting. Furthermore, automatic type casting incurs additional resources.

########################################
# array declaration
########################################arr_shape = (3,4,5)
# numpy
arr0 = numpy.zeros(arr_shape, dtype=numpy.float32)
arr1 = numpy.ones(arr_shape, dtype=numpy.float32)
arr2 = numpy.random.rand(*arr_shape).astype(numpy.float32)# tensorflow
arr0_tf = tf.zeros(arr_shape, dtype=tf.dtypes.float32)
arr1_tf = tf.ones(arr_shape, dtype=tf.dtypes.float32)
arr2_tf = tf.random.uniform(arr_shape, minval=0, maxval=1, dtype=tf.float32)# pytorch
arr0_torch = torch.zeros(arr_shape, dtype=torch.float32)
arr1_torch = torch.ones(arr_shape, dtype=torch.float32)
arr2_torch = torch.rand(arr_shape, dtype=torch.float32)# cupy
arr0_cupy = cupy.zeros(arr_shape, dtype=cupy.float32)
arr1_cupy = cupy.ones(arr_shape, dtype=cupy.float32)
arr2_cupy = cupy.random.rand(*arr_shape).astype(cupy.float32)########################################
# conversion from/to numpy
########################################arr_shape = (3,4)
arr1 = numpy.ones(arr_shape, dtype=numpy.float32)# tensorflow
arr1_tf = tf.constant(arr1) # or tf.Variable
result_in_numpy = arr1_tf.numpy()# pytorch
arr1_torch = torch.tensor(arr1)
result_in_numpy = arr1_torch.cpu().detach().numpy()# cupy
arr1_cupy = cupy.array(arr1)
result_in_numpy= cupy.asnumpy(arr1_cupy)

If our system has more than one GPU, we may choose which GPU will be used for the calculation. While PyTorch’s default setting is CPU, TensorFlow and CuPy’s default setting is GPU with index 0.

# arrays
arr1 = numpy.ones((30,40,50), dtype=numpy.float32)
arr2 = 2*numpy.ones((30,40,50), dtype=numpy.float32)# tensorflow
device = '/device:GPU:0' # '/CPU:0' or '/device:GPU:0' etc
with tf.device(device):
    arr1_tf = tf.constant(arr1)
    arr2_tf = tf.constant(arr2)
    arr3_tf = arr1_tf+arr2_tf
    result_tf = arr3_tf.numpy()
    print(result_tf)# pytorch
device = 0 # 'cpu' or 0, 1 etc
arr1_torch = torch.tensor(arr1).to(device)
arr2_torch = torch.tensor(arr2).to(device)
arr3_torch = arr1_torch+arr2_torch
result_torch = arr3_torch.cpu().detach().numpy()
print(result_torch)# cupy
device = 0 # 0,1 etc
with cupy.cuda.Device(device):
    arr1_cupy = cupy.array(arr1)
    arr2_cupy = cupy.array(arr2)
    arr3_cupy = arr1_cupy+arr2_cupy
    result_cupy = cupy.asnumpy(arr3_cupy)
    print(result_cupy)

Despite the fact that we code in a synchronous manner, the GPU executes the code in an asynchronous manner. When the user attempts to obtain the data or result as a NumPy data type, both the GPU and the CPU will sync. This suggests that the performance, which is the cycle time, can only be evaluated as a whole.

For an instance, an operation consists of S1-A-B-C-S2 with S1: conversion from NumPy, A,B,C: some computations, and S2: conversion to NumPy. Placing checkpoints between AB and BC to measure the duration for operation B alone will not be working.

Basic arithmetic like exponent/logarithm, trigonometry, statistics, and matrix operations are all provided by the aforementioned libraries, and they are all virtually identical to NumPy in this regard. Even if those functions may be divided up into several sub-modules, we may still access them without difficulty.

########################################
# numpy
########################################
# exponent/logarithm
a1 = numpy.power(arr1, 2.1)
a2 = numpy.exp(arr1)
a3 = numpy.log(arr1)
# trigonometry/hyperbolic
b1 = numpy.sin(arr1)
b2 = numpy.arccos(arr1)
b3 = numpy.sinh(arr1)
b4 = numpy.arctanh(arr1)
# computation along axis
c1 = numpy.sum(arr1, axis=(0,1))
c2 = numpy.var(arr1, axis=(0,1))
c3 = numpy.sort(arr1, axis=2)
# vector/matrix operation
d1 = numpy.tensordot(arr1,arr2, axes=((1,2),(0,1)))
d2 = numpy.einsum('ii', arr1)
# signal process
e1 = numpy.fft.fft(arr1)
e2 = numpy.fft.ifft2(arr1)########################################
# tensorflow
########################################
# exponent/logarithm
a1 = tf.math.pow (arr1, 2.1)
a2 = tf.math.exp (arr1)
a3 = tf.math.log(arr1)
# trigonometry/hyperbolic
b1 = tf.sin(arr1)
b2 = tf.acos(arr1)
b3 = tf.sinh(arr1)
b4 = tf.atanh(arr1)
# operation along axis
c1 = tf.math.reduce_sum(arr1, axis=(0,1))
c2 = tf.math.reduce_variance(arr1, axis=(0,1))
c3 = tf.sort(arr1, axis=2)
# vector/matrix operation
d1 = tf.tensordot(arr1,arr2, axes=((1,2),(0,1)))
d2 = tf.einsum('ii', arr1)
# signal process
e1 = tf.signal.fft(arr1)
e2 = tf.signal.ifft2d(arr1)########################################
# pytorch
########################################
# exponent/logarithm
a1 = torch.pow(arr1, 2.1)
a2 = torch.exp(arr1)
a3 = torch.log(arr1)
# trigonometry/hyperbolic
b1 = torch.sin(arr1)
b2 = torch.arccos(arr1)
b3 = torch.sinh(arr1)
b4 = torch.arccosh(arr1)
# computation along axis
c1 = torch.sum(arr1, dim=(0,1))
c2 = torch.var(arr1, dim=(0,1))
c3 = torch.sort(arr1, dim=2)
# vector/matrix operation
d1 = torch.tensordot(arr1,arr2, dims=[[1,2],[0,1]])
d2 = torch.einsum('ii', arr1)
# signal process
e1 = torch.fft.fft(arr1)
e2 = torch.fft.ifft2(arr1)########################################
# cupy
########################################
# exponent/logarithm
a1 = cupy.power(arr1, 2.1)
a2 = cupy.exp(arr1)
a3 = cupy.log(arr1)
# trigonometry/hyperbolic
b1 = cupy.sin(arr1)
b2 = cupy.arccos(arr1)
b3 = cupy.sinh(arr1)
b4 = cupy.arctanh(arr1)
# computation along axis
c1 = cupy.sum(arr1, axis=(0,1))
c2 = cupy.var(arr1, axis=(0,1))
c3 = cupy.sort(arr1, axis=2)
# vector/matrix operation
d1 = cupy.tensordot(arr1,arr2, axes=((1,2),(0,1)))
d2 = cupy.einsum('ii', arr1)
# signal process
e1 = cupy.fft.fft(arr1)
e2 = cupy.fft.ifft2(arr1)

For arithmetic operations between arrays of different shapes, the broadcasting rules are the same as NumPy. Besides arithmetic operations, we are also constantly dealing with array manipulation like indexing and axes manipulation, like transpose.

########################################
# type casting
######################################### numpy
arr1 = numpy.zeros((3,4,5), dtype=numpy.float32)
arr2 = arr1.astype(numpy.int32)
print(type(arr2), arr2.dtype)# tensorflow
arr1_tf = tf.Variable(arr1)
arr2_tf = tf.cast(arr1_tf, tf.int32)
print(type(arr2_tf), arr2_tf.dtype)# pytorch
arr1_torch = torch.tensor(arr1)
arr2_torch = arr1_torch.type(torch.int32)
print(type(arr2_torch), arr2_torch.dtype)# cupy
arr1_cupy = cupy.array(arr1)
arr2_cupy = arr1_cupy.astype(cupy.int32)
print(type(arr2_cupy), arr2_cupy.dtype)
########################################
# indexing
######################################### numpy
arr1 = numpy.arange(60).reshape((3,4,5)).astype(numpy.float32)
arr2 = arr1[1,:,2:-1]
print(arr2)# tensorflow
arr1_tf = tf.constant(arr1) # or tf.Variable
arr2_tf = arr1_tf[1,:,2:-1]
print(arr2_tf)# pytorch
arr1_torch = torch.tensor(arr1)
arr2_torch = arr1_torch[1,:,2:-1]
print(arr2_torch)# cupyarr1_cupy = cupy.array(arr1)
arr2_cupy = arr1_cupy[1,:,2:-1]
print(arr2_cupy)
########################################
# item assignment
######################################### numpy
arr1 = numpy.zeros((3,4,5), dtype=numpy.float32)
arr2 = numpy.ones((4,5), dtype=numpy.float32)
arr1[1,:,:] = arr2
print(arr1)# tensorflow
arr1_tf = tf.Variable(arr1) # only tf.Variable support item assignment
arr2_tf = tf.Variable(arr2)
arr1_tf[1,:,:].assign(arr2_tf)
print(arr1_tf)# pytorch
arr1_torch = torch.tensor(arr1)
arr2_torch = torch.tensor(arr2)
arr1_torch[1,:,:] = arr2_torch
print(arr1_torch)# cupy
arr1_cupy = cupy.array(arr1)
arr2_cupy = cupy.array(arr2)
arr1_cupy[1,:,:] = arr2_cupy
print(arr1_cupy)
########################################
# broadcasting
########################################

# numpy
arr1 = numpy.zeros((1,1,5), dtype=numpy.float32)
arr2 = numpy.ones((3,1,1), dtype=numpy.float32)
arr3 = arr1 + arr2
print(arr3.shape)# tensorflow
arr1_tf = tf.constant(arr1)
arr2_tf = tf.constant(arr2)
arr3_tf = arr1_tf + arr2_tf
print(arr3_tf.shape)# pytorch
arr1_torch = torch.tensor(arr1)
arr2_torch = torch.tensor(arr2)
arr3_torch = arr1_torch + arr2_torch
print(arr3_torch.shape)# cupy
arr1_cupy = cupy.array(arr1)
arr2_cupy = cupy.array(arr2)
arr3_cupy = arr1_cupy + arr2_cupy
print(arr3_cupy.shape)
########################################
# concatenation
######################################### numpy
arr1 = numpy.zeros((3,4,5), dtype=numpy.float32)
arr2 = numpy.ones((3,4,5), dtype=numpy.float32)
arr3 = numpy.concatenate([arr1,arr2], axis=1)
print(arr3.shape)# tensorflow
arr1_tf = tf.constant(arr1)
arr2_tf = tf.constant(arr2)
arr3_tf = tf.concat([arr1_tf,arr2_tf], axis=1)
print(arr3_tf.shape)# pytorch
arr1_torch = torch.tensor(arr1)
arr2_torch = torch.tensor(arr2)
arr3_torch = torch.concat([arr1_torch,arr2_torch], dim=1)
print(arr3_torch.shape)# cupy
arr1_cupy = cupy.array(arr1)
arr2_cupy = cupy.array(arr2)
arr3_cupy = cupy.concatenate([arr1_cupy,arr2_cupy], axis=1)
print(arr3_cupy.shape)

########################################
# axis manipulation
######################################### numpy
arr1 = numpy.arange(60).reshape((3,4,5)).astype(numpy.float32)
arr2 = numpy.swapaxes(arr1, axis1=0, axis2=1)
print(arr2.shape)# tensorflow
arr1_tf = tf.Variable(arr1)
arr2_tf = tf.experimental.numpy.swapaxes(arr1_tf, axis1=0, axis2=1)
print(arr2_tf.shape)# pytorch
arr1_torch = torch.tensor(arr1)
arr2_torch = torch.swapaxes(arr1_torch, axis0=0, axis1=1)
print(arr2_torch.shape)# cupy
arr1_cupy = cupy.array(arr1)
arr2_cupy = cupy.swapaxes(arr1_cupy, axis1=0, axis2=1)
print(arr2_cupy.shape)

In practically every industry, linear algebra has a wide variety of applications. Let’s start with a simple example of solving a system of linear equations with 3 unknowns. The solution involves finding the inverse matrix of the 3x3 coefficient matrix A and a matrix multiplication to the RHS, b.

# numpy
A = numpy.asarray([[1,1,1],[3,-1,2],[4,2,-3]], dtype=numpy.float32)
b = numpy.asarray([6,7,-1], dtype=numpy.float32)
A_inverse = numpy.linalg.inv(A)
X = numpy.tensordot(A_inverse,b, axes=(1,0))

# tensorflow
A = tf.constant([[1,1,1],[3,-1,2],[4,2,-3]], dtype=tf.dtypes.float32)
b = tf.constant([6,7,-1], dtype=tf.dtypes.float32)
A_inverse = tf.linalg.inv(A)
X = tf.tensordot(A_inverse,b, axes=(1,0))

# pytorch
A = torch.tensor([[1,1,1],[3,-1,2],[4,2,-3]], dtype=torch.float32).to(0)
b = torch.tensor([6,7,-1], dtype=torch.float32).to(0)
A_inverse = torch.linalg.inv(A)
X = torch.tensordot(A_inverse,b, dims=[[1],[0]])

# cupy
A = cupy.asarray([[1,1,1],[3,-1,2],[4,2,-3]], dtype=cupy.float32)
b = cupy.asarray([6,7,-1], dtype=cupy.float32)
A_inverse = cupy.linalg.inv(A)
X = cupy.tensordot(A_inverse,b, axes=(1,0))

The singular value decomposition (SVD) of a matrix is another typical topic in linear algebra. SVD is used in principle components analysis (PCA) to determine the principal components (eigenvectors) and their relative variances (eigenvalues). We may locate the eigens by applying SVD to the data’s covariance matrix, as illustrated in Figure 1.

Figure 1: Principle Component Analysis (PCA).

# numpy
mean = numpy.mean(raw_data, axis=0) # raw_data shape (n_data,n_dim)
covar = numpy.cov(raw_data.T)
U, S, VT = cupy.linalg.svd(covar)# tensorflow (requires tensorflow_probability)
raw_data_tf = tf.constant(raw_data, dtype=tf.dtypes.float32)
mean = tf.reduce_mean(raw_data_tf, axis=0)
covar = tfp.stats.covariance(raw_data_tf, sample_axis=0)
S, U, VT = tf.linalg.svd(covar)# pytorch
raw_data_torch = torch.tensor(raw_data).to(0)
mean = torch.mean(raw_data_torch, dim=0)
covar = torch.cov(raw_data_torch.T)
U, S, VT = torch.linalg.svd(covar)# cupy
raw_data_cupy = cupy.array(raw_data)
mean = cupy.mean(raw_data_cupy, axis=0)
covar = cupy.cov(raw_data_cupy.T)
U, S, VT = cupy.linalg.svd(covar)

Real Word Examples

As arithmetic-intensive tasks are where the GPU shines, this section will use more complex real-world examples to highlight this fact. The examples discussed as follows include image resizing (interpolation), image denoising (deconvolution), and clustering. Although many pre-existing libraries can carry out these duties, it is not advised to recreate the wheel, and beginning from scratch would help us grasp how to manage the codes and algorithms.

Example 1: Image Resizing (Interpolation)

Highlights:
- Coordinate generation
- Type casting
- Array indexing

Image resizing with arbitrary size requires interpolation of the image with the new coordinates based on old coordinates. The single channel (grayscale) image is treated with a function f(x,y) over a d-dimensional space (x,y). The array indices of the original image is used as the reference for interpolation. For example, to resize an image with 81 pixels in a specific direction to 101 pixels, we have the old coordinates labeled as 0, 1, 2, 3 … 80. We need to ‘squeeze’ the 101 data evenly within the range of 80, in which the new coordinate will be 0, 0.8, 1.6 … 80, as shown in Figure 2. The values of image f(x,y) on any points will be calculated based on their relative location of the respective coordinates and their neighbors.

Figure 2: Coordinates for interpolation in 1 direction.

The interpolation technique that will be covered in this part is linear interpolation, sometimes known as bilinear interpolation [1] in the context of 2D, as shown in Figure 3. Spline interpolation of higher order requires solving a system of linear equations for every pixel. This is a good practice for coding and testing the superiority of GPU in compute-intense tasks. However, we will be looking at the linear interpolation only as it is sufficient for our objective.

The implementation for a resizing operation follows the procedures:

1) Create an array for the new coordinates using the linspace() and meshgrid() function.

ny0, nx0 = img.shape
x_new = numpy.linspace(0,nx0-1, new_nx)
y_new = numpy.linspace(0,ny0-1, new_ny)
yy_new, xx_new = numpy.meshgrid(y_new, x_new, indexing='ij')
xi = xx_new.flatten()
yi = yy_new.flatten()

2) Get the neighboring pixels (indices) for the specific coordinate using floor() and ceil() method.

x1 = numpy.floor(xi)
x2 = numpy.ceil(xi)
y1 = numpy.floor(yi)
y2 = numpy.ceil(yi)

3) Calculate the image value at the neighboring pixels.

x1_idx = x1.astype(int)
x2_idx = x2.astype(int)
y1_idx = y1.astype(int)
y2_idx = y2.astype(int)
f11 = img[y1_idx,x1_idx]
f12 = img[y1_idx,x2_idx]
f21 = img[y2_idx,x1_idx]
f22 = img[y2_idx,x2_idx]

4) Get the image value at the specific coordinate using formula for bilinear interpolation.

delta = 1.0e-9 # prevent division by 0
f1x = f11 + (f12-f11) * (xi-x1) / numpy.maximum(x2-x1, delta)
f2x = f21 + (f22-f21) * (xi-x1) / numpy.maximum(x2-x1, delta)
fyx = f1x + (f2x-f1x) * (yi-y1) / numpy.maximum(y2-y1, delta)

The complete code for the implementation:

# numpy
def resize(img, new_nx, new_ny):
    ny0, nx0 = img.shape
    # coordinate for interpolation
    x_new = numpy.linspace(0,nx0-1, new_nx)
    y_new = numpy.linspace(0,ny0-1, new_ny)
    yy_new, xx_new = numpy.meshgrid(y_new, x_new, indexing='ij')
    xi = xx_new.flatten()
    yi = yy_new.flatten()
    # get neighboring indices
    x1 = numpy.floor(xi)
    x1 = numpy.minimum(numpy.maximum(x1,0),nx0-1)
    x2 = numpy.ceil(xi)
    x2 = numpy.minimum(numpy.maximum(x2,0),nx0-1)
    y1 = numpy.floor(yi)
    y1 = numpy.minimum(numpy.maximum(y1,0),ny0-1)
    y2 = numpy.ceil(yi)
    y2 = numpy.minimum(numpy.maximum(y2,0),ny0-1)
    # profile at neighboring index
    x1_idx = x1.astype(int)
    x2_idx = x2.astype(int)
    y1_idx = y1.astype(int)
    y2_idx = y2.astype(int)
    f11 = img[y1_idx,x1_idx]
    f12 = img[y1_idx,x2_idx]
    f21 = img[y2_idx,x1_idx]
    f22 = img[y2_idx,x2_idx]
    # interpolate
    delta = 1.0e-9 # prevent division by 0
    f1x = f11 + (f12-f11) * (xi-x1) / numpy.maximum(x2-x1, delta)
    f2x = f21 + (f22-f21) * (xi-x1) / numpy.maximum(x2-x1, delta)
    fyx = f1x + (f2x-f1x) * (yi-y1) / numpy.maximum(y2-y1, delta)
    return fyx.reshape((new_ny,new_nx))# tensorflow
def resizeTf(img, new_nx, new_ny, device='/device:GPU:0'):
    ny0, nx0 = img.shape
    with tf.device(device):
        img_tf = tf.constant(img, dtype=tf.float32)
        # coordinate for interpolation
        x_new = tf.linspace(0, nx0-1, new_nx)
        y_new = tf.linspace(0, ny0-1, new_ny)
        x_new = tf.cast(x_new, tf.float32)
        y_new = tf.cast(y_new, tf.float32)
        yy_new, xx_new = tf.meshgrid(y_new, x_new, indexing='ij')
        xi = tf.reshape(xx_new, new_nx*new_ny)
        yi = tf.reshape(yy_new, new_nx*new_ny)
        # get neighboring indices
        x1 = tf.math.floor(xi)
        x1 = tf.math.minimum(tf.math.maximum(x1,0),nx0-1)
        x2 = tf.math.ceil(xi)
        x2 = tf.math.minimum(tf.math.maximum(x2,0),nx0-1)
        y1 = tf.math.floor(yi)
        y1 = tf.math.minimum(tf.math.maximum(y1,0),ny0-1)
        y2 = tf.math.ceil(yi)
        y2 = tf.math.minimum(tf.math.maximum(y2,0),ny0-1)
        # profile at neighboring index
        x1_idx = tf.cast(x1, tf.int32)
        x2_idx = tf.cast(x2, tf.int32)
        y1_idx = tf.cast(y1, tf.int32)
        y2_idx = tf.cast(y2, tf.int32)
        fcoor11 = tf.experimental.numpy.swapaxes([y1_idx,x1_idx], axis1=0, axis2=1)
        fcoor12 = tf.experimental.numpy.swapaxes([y1_idx,x2_idx], axis1=0, axis2=1)
        fcoor21 = tf.experimental.numpy.swapaxes([y2_idx,x1_idx], axis1=0, axis2=1)
        fcoor22 = tf.experimental.numpy.swapaxes([y2_idx,x2_idx], axis1=0, axis2=1)
        f11 = tf.gather_nd(img_tf, indices=fcoor11)
        f12 = tf.gather_nd(img_tf, indices=fcoor12)
        f21 = tf.gather_nd(img_tf, indices=fcoor21)
        f22 = tf.gather_nd(img_tf, indices=fcoor22)
        # interpolate
        delta = tf.constant(1.0e-9, dtype=tf.float32)  # prevent division by 0
        f1x = f11 + (f12-f11) * (xi-x1) / tf.math.maximum(x2-x1, delta)
        f2x = f21 + (f22-f21) * (xi-x1) / tf.math.maximum(x2-x1, delta)
        fyx = f1x + (f2x-f1x) * (yi-y1) / tf.math.maximum(y2-y1, delta)
        fyx = tf.reshape(fyx, (new_ny,new_nx))
    return fyx.numpy()# pytorch
def resizeTorch(img, new_nx, new_ny, device=0):
    ny0, nx0 = img.shape
    img_torch = torch.tensor(img, dtype=torch.float32).to(device)
    zero_f32 = torch.tensor(0, dtype=torch.float32).to(device)
    x_max = torch.tensor(nx0-1, dtype=torch.float32).to(device)
    y_max = torch.tensor(ny0-1, dtype=torch.float32).to(device)
    delta = torch.tensor(1.0e-9, dtype=torch.float32).to(device) # prevent division by 0
    # coordinate for interpolation
    x_new = torch.linspace(zero_f32,x_max, new_nx).to(device)
    y_new = torch.linspace(zero_f32,y_max, new_ny).to(device)
    yy_new, xx_new = torch.meshgrid(y_new, x_new, indexing='ij')
    xi = xx_new.flatten()
    yi = yy_new.flatten()
    # get neighboring indices
    x1 = torch.floor(xi)
    x1 = torch.minimum(torch.maximum(x1,zero_f32),x_max)
    x2 = torch.ceil(xi)
    x2 = torch.minimum(torch.maximum(x2,zero_f32),x_max)
    y1 = torch.floor(yi)
    y1 = torch.minimum(torch.maximum(y1,zero_f32),y_max)
    y2 = torch.ceil(yi)
    y2 = torch.minimum(torch.maximum(y2,zero_f32),y_max)
    # profile at neighboring index
    x1_idx = x1.type(torch.long) # indexing support torch.long
    x2_idx = x2.type(torch.long)
    y1_idx = y1.type(torch.long)
    y2_idx = y2.type(torch.long)
    f11 = img_torch[y1_idx,x1_idx]
    f12 = img_torch[y1_idx,x2_idx]
    f21 = img_torch[y2_idx,x1_idx]
    f22 = img_torch[y2_idx,x2_idx]
    # interpolate
    f1x = f11 + (f12-f11) * (xi-x1) / torch.maximum(x2-x1, delta)
    f2x = f21 + (f22-f21) * (xi-x1) / torch.maximum(x2-x1, delta)
    fyx = f1x + (f2x-f1x) * (yi-y1) / torch.maximum(y2-y1, delta)
    fyx = fyx.reshape((new_ny,new_nx))
    return fyx.cpu().detach().numpy()# cupy
def resizeCupy(img, new_nx, new_ny, device=0):
    ny0, nx0 = img.shape
    with cupy.cuda.Device(device):
        img_cupy = cupy.array(img)
        # coordinate for interpolation
        x_new = cupy.linspace(0, nx0-1, new_nx)
        y_new = cupy.linspace(0, ny0-1, new_ny)
        yy_new, xx_new = cupy.meshgrid(y_new, x_new, indexing='ij')
        xi = xx_new.flatten()
        yi = yy_new.flatten()
        # get neighboring indices
        x1 = cupy.floor(xi)
        x1 = cupy.minimum(cupy.maximum(x1,0),nx0-1)
        x2 = cupy.ceil(xi)
        x2 = cupy.minimum(cupy.maximum(x2,0),nx0-1)
        y1 = cupy.floor(yi)
        y1 = cupy.minimum(cupy.maximum(y1,0),ny0-1)
        y2 = cupy.ceil(yi)
        y2 = cupy.minimum(cupy.maximum(y2,0),ny0-1)
        # profile at neighboring index
        x1_idx = x1.astype(int)
        x2_idx = x2.astype(int)
        y1_idx = y1.astype(int)
        y2_idx = y2.astype(int)
        f11 = img_cupy[y1_idx,x1_idx]
        f12 = img_cupy[y1_idx,x2_idx]
        f21 = img_cupy[y2_idx,x1_idx]
        f22 = img_cupy[y2_idx,x2_idx]
        # interpolate
        delta = cupy.float32(1.0e-9)  # prevent division by 0
        f1x = f11 + (f12-f11) * (xi-x1) / cupy.maximum(x2-x1, delta)
        f2x = f21 + (f22-f21) * (xi-x1) / cupy.maximum(x2-x1, delta)
        fyx = f1x + (f2x-f1x) * (yi-y1) / cupy.maximum(y2-y1, delta)
        fyx = fyx.reshape((new_ny, new_nx))
    return cupy.asnumpy(fyx)

Figure 4: The image is resized to arbitrary sizes.

Figure 4 shows image resized with arbitrary sizes. Table 2 and 3 shows the speed comparison for resizing a grayscale image of size 512x512 to different ratios (factors). Due to the higher abstraction of Python language, there is an overhead cost for invoking and initialization of the frameworks in Table 1. In reality, the computation is performed repeatedly for different data (images). Hence, it is reasonable for us to neglect the initialization duration and initialization will not be considered for the rest of comparison in this tutorial.

Table 2: Speed comparison for resizing to different sizes including duration for initialization. Test system: Intel i5–1135G7 + Nvidia MX330, Windows 10.

Table 3: Speed comparison for resizing to different sizes without including duration for initialization. Test system: Intel i5–1135G7 + Nvidia MX330, Windows 10.

Image Denoising (Deconvolution)

Highlights:
- Deconvolution with (inverse) Fourier transform
- Array arithmetic

This part will introduce the total variation deconvolution [2] picture denoising approach, which calls for resolving a minimization issue. Total variation deconvolution assume a correction f(x,y) to a noisy image fo(x,y) which satisfy the below regularization:

The minimization problem can be solved using Euler-Lagrange equation:

After discretization, the optimization problem is finally converted into a deconvolution problem from fo(x,y) to f(x,y) with a kernel K:

A deconvolution problem can be solved easily in the Fourier’s domain:

The implementation for a deconvolution operation follows the procedures:

1) Create the convolution kernel K.

kernel = numpy.zeros(img.shape, dtype=img.dtype)
kernel[0,0] = 1+4*lambd
kernel[0,1] = -lambd
kernel[1,0] = -lambd
kernel[0,-1] = -lambd
kernel[-1,0] = -lambd

2) Perform Fourier transform on the image fo and the kernel K to get FT[fo] and FT[K].

img_fft = numpy.fft.fft2(img)
kernel_fft = numpy.fft.fft2(kernel)

3) Divide the FT[fo] by FT[K] iteratively depends on the number of iterations.

decon_fft = img_fft.copy()
for i in range(iteration):
    decon_fft = decon_fft/kernel_fft

4) Perform inverse Fourier transform on to obtain the final result.

decon_img = numpy.fft.ifft2(decon_fft)
decon_img = numpy.abs(decon_img)

The complete code for the implementation:

# numpy
def tvDecon(img, lambd, iteration):
    # deconvolution kernel
    kernel = numpy.zeros(img.shape, dtype=img.dtype)
    kernel[0,0] = 1+4*lambd
    kernel[0,1] = -lambd
    kernel[1,0] = -lambd
    kernel[0,-1] = -lambd
    kernel[-1,0] = -lambd
    # fourier transform
    img_fft = numpy.fft.fft2(img)
    kernel_fft = numpy.fft.fft2(kernel)
    # deconvolution
    decon_fft = img_fft.copy()
    for i in range(iteration):
        decon_fft = decon_fft/kernel_fft
    # inverse fourier transform
    decon_img = numpy.fft.ifft2(decon_fft)
    decon_img = numpy.abs(decon_img)
    return decon_img# tensorflow
def tvDeconTf(img, lambd, iteration, device='/device:GPU:0'):
    with tf.device(device):
        img_tf = tf.cast(tf.Variable(img), tf.complex64)
        # deconvolution kernel
        kernel = tf.Variable(tf.zeros(img_tf.shape, dtype=img_tf.dtype))
        kernel[0,0].assign(1+4*lambd)
        kernel[0,1].assign(-lambd)
        kernel[1,0].assign(-lambd)
        kernel[0,-1].assign(-lambd)
        kernel[-1,0].assign(-lambd)
        # fourier transform
        img_fft = tf.signal.fft2d(img_tf)
        kernel_fft = tf.signal.fft2d(kernel)
        # deconvolution
        decon_fft = img_fft
        for i in range(iteration):
            decon_fft = decon_fft/kernel_fft
        # inverse fourier transform
        decon_img = tf.signal.ifft2d(decon_fft)
        decon_img = tf.abs(decon_img)
        return decon_img.numpy()# pytorch
def tvDeconTorch(img, lambd, iteration, device=0):
    img_torch = torch.tensor(img).to(device)
    # deconvolution kernel
    kernel = torch.zeros(img_torch.shape, dtype=img_torch.dtype).to(device)
    kernel[0,0] = 1+4*lambd
    kernel[0,1] = -lambd
    kernel[1,0] = -lambd
    kernel[0,-1] = -lambd
    kernel[-1,0] = -lambd
    # fourier transform
    img_fft = torch.fft.fft2(img_torch)
    kernel_fft = torch.fft.fft2(kernel)
    # deconvolution
    decon_fft = img_fft
    for i in range(iteration):
        decon_fft = decon_fft/kernel_fft
    # inverse fourier transform
    decon_img = torch.fft.ifft2(decon_fft)
    decon_img = torch.abs(decon_img)
    return decon_img.cpu().detach().numpy()# cupy
def tvDeconCupy(img, lambd, iteration, device=0):
    with cupy.cuda.Device(device):
        img_cupy = cupy.array(img)
        # deconvolution kernel
        kernel = cupy.zeros(img_cupy.shape, dtype=img_cupy.dtype)
        kernel[0,0] = 1+4*lambd
        kernel[0,1] = -lambd
        kernel[1,0] = -lambd
        kernel[0,-1] = -lambd
        kernel[-1,0] = -lambd
        # fourier transform
        img_fft = cupy.fft.fft2(img_cupy)
        kernel_fft = cupy.fft.fft2(kernel)
        # deconvolution
        decon_fft = img_fft
        for i in range(iteration):
            decon_fft = decon_fft/kernel_fft
        # inverse fourier transform
        decon_img = cupy.fft.ifft2(decon_fft)
        decon_img = cupy.abs(decon_img)
        return cupy.asnumpy(decon_img)

Figure 5 shows denoising using codes implemented above with different parameters. Table 4 shows the speed comparison for denoising with images of different sizes and parameters of the algorithm.

Figure 5: Effect of denoising using total variation deconvolution. (Left) original 8-bit grayscale image. (Center) denoise with parameters lambda=2, iteration=3. (Right) denoise with parameters lambda=2, iteration=10.

Table 4: Seed comparison for denoising on images with different sizes and parameters. Test system: Intel i5–1135G7 + Nvidia MX330, Windows 10.

Clustering (K-means)

Highlights:
- Multi-dimensional array manipulation
- Branchless computing

Clustering is the task of grouping and partitioning a set of data based on their similarity. The algorithm for clustering differs in similarity measure and partitioning methodology. The K-means [3] clustering method, one of the simplest in machine learning, is the clustering algorithm that will be studied in this part.

The K-means algorithm involves partitioning N observations into p clusters in which each observation belongs to the cluster with the nearest mean. Given a set of data (x1, x2, …, xn), the algorithm aims to find a set of clusters {S1, S2, …, Sp} with centroids (m1, m2, …, mp) in which the data are grouped into cluster with nearest distance from the respective centroid.

A standard implementation involves an iterative process of assigning each data to the cluster with the nearest centroids and recalculating the means (centroids) for data assigned to each cluster. This method requires an initialization (guess) of the centroids in order to run. The initial guess is critical and there is plenty of study for good results. The initialization is not the focus here and we will use a random initialization for the implementation. In general, the centroids at the (k+1)-th iteration are calculated based on the distance of the data from the centroids in k-th iteration.

To implement a branch-less operation, we introduce a weight term as assignment for the data to the nearest clusters. The term wij equal to 1 if i-th data belongs to j-th cluster and 0 otherwise .This enables us to perform pure arithmetic operations (at least in python). For a loop-less operation, we used the broadcasting properties of the multidimensional arrays to calculate the distance between each pair of data and centroid, as displayed in Figure 6. The overall shape of the array is (k, N, n_dim).

Figure 6: Shape of arrays in computation.

The implementation for a K-means clustering follows the procedures:

1) Initial guess (randomly generate within the data range)

n_data = data.shape[0]
n_dim = data.shape[1]
init_idx = numpy.arange(n_data)
numpy.random.seed(random_state)
numpy.random.shuffle(init_idx)
mew = data[init_idx[:n_cluster]]

2) Calculate distance and identify closest cluster

dist = numpy.linalg.norm(data_np - mew, axis=2, keepdims=True)
closest_clu = numpy.argmin(dist, axis=0, keepdims=True)

3) Assign points to closest cluster using weightage

closest_clu_idx = closest_clu*n_data+nd_arr_idx
weight = numpy.zeros(dist.shape, dtype=numpy.float32)
numpy.put(weight,closest_clu_idx.ravel(),1)

4) Recalculate means

weight_sum = numpy.sum(weight,axis=1,keepdims=True)
mew = numpy.sum(weight * data_np, axis=1, keepdims=True) / numpy.maximum(1, weight_sum)

The complete code for the implementation:

# numpy
def kmeans(data, n_cluster, iteration, random_state=None):
    n_data = data.shape[0]
    n_dim = data.shape[1]
    # initial guess
    init_idx = numpy.arange(n_data)
    numpy.random.seed(random_state)
    numpy.random.shuffle(init_idx)
    mew_init = data[init_idx[:n_cluster]]
    mew_init = numpy.expand_dims(mew_init, axis=1)
    mew = mew_init.copy()
    data_np = numpy.expand_dims(data, axis=0).astype(numpy.float32)
    label_s = numpy.zeros(n_data,dtype=int)
    nd_arr_idx = numpy.arange(n_data).reshape((1,n_data,1))
    # iteratively recalculate means based on closest distance
    for i in range(iteration):
        # calculate distance
        dist = numpy.linalg.norm(data_np - mew, axis=2, keepdims=True)
        # identify closest cluster
        closest_clu = numpy.argmin(dist, axis=0, keepdims=True)
        label_s = closest_clu.ravel()
        # assign points to closest cluster using weightage
        closest_clu_idx = closest_clu*n_data+nd_arr_idx
        weight = numpy.zeros(dist.shape, dtype=numpy.float32)
        numpy.put(weight,closest_clu_idx.ravel(),1)
        # recalculate means
        weight_sum = numpy.sum(weight,axis=1,keepdims=True)
        mew = numpy.sum(weight * data_np, axis=1, keepdims=True) / numpy.maximum(1, weight_sum)
        # handle cluster with no data points
        zero_weight_loc = numpy.asarray(weight_sum==0, dtype=numpy.float32)
        mew = mew*(1-zero_weight_loc)+zero_weight_loc*mew_init
    return label_s, mew[:,0,:]# tensorflow
def kmeansTf(data, n_cluster, iteration, random_state=None, device='/device:GPU:0'):
    n_data = data.shape[0]
    n_dim = data.shape[1]
    # initial guess
    init_idx = numpy.arange(n_data)
    numpy.random.seed(random_state)
    numpy.random.shuffle(init_idx)
    mew_init = data[init_idx[:n_cluster]]
    mew_init = numpy.expand_dims(mew_init, axis=1)
    mew = mew_init.copy()
    # conversion
    with tf.device(device):
        mew_init = tf.constant(mew_init, dtype=tf.float32)
        mew = tf.Variable(mew, dtype=tf.float32)
        label_s = tf.Variable(tf.zeros(n_data, dtype=tf.int32))
        data_tf = tf.constant(data, dtype=tf.float32)
        data_tf = tf.expand_dims(data_tf, axis=0)
        one_f32 = tf.constant(1.0, dtype=tf.float32)
        ndata_int32 = tf.constant(n_data, dtype=tf.int32)
        nd_arr_idx = tf.range(n_data, dtype=tf.int32)
        nd_arr_idx = tf.expand_dims(nd_arr_idx, axis=1)
        weight_assign = tf.ones(n_data, dtype=tf.float32)
        # iteratively recalculate means based on closest distance
        for i in range(iteration):
            # calculate distance
            dist = tf.norm(data_tf-mew, axis=2, keepdims=True)
            # identify closest cluster
            closest_clu = tf.cast(tf.math.argmin(dist, axis=0), dtype=tf.int32)
            label_s = tf.experimental.numpy.ravel(closest_clu)
            # assign points to closest cluster using weightage
            closest_clu_idx = closest_clu*ndata_int32+nd_arr_idx
            weight = tf.scatter_nd(closest_clu_idx, weight_assign, (n_cluster*n_data,))
            weight = tf.reshape(weight, (n_cluster,n_data,1))
            # recalculate means
            weight_sum = tf.math.reduce_sum(weight,axis=1,keepdims=True)
            mew = tf.math.reduce_sum(weight*data_tf,axis=1,keepdims=True)/tf.math.maximum(one_f32,weight_sum)
            # handle cluster with no data points
            zero_weight_loc = tf.cast(weight_sum==0, dtype=tf.float32)
            mew = mew*(one_f32-zero_weight_loc)+zero_weight_loc*mew_init
        return label_s.numpy(), mew.numpy()[:,0,:]# pytorch
def kmeansTorch(data, n_cluster, iteration, random_state=None, device=0):
    n_data = data.shape[0]
    n_dim = data.shape[1]
    # initial guess
    init_idx = numpy.arange(n_data)
    numpy.random.seed(random_state)
    numpy.random.shuffle(init_idx)
    mew_init = data[init_idx[:n_cluster]]
    mew_init = numpy.expand_dims(mew_init, axis=1)
    mew = mew_init.copy()
    # conversion
    mew_init = torch.tensor(mew_init, dtype=torch.float32).to(device)
    mew = torch.tensor(mew, dtype=torch.float32).to(device)
    label_s = torch.zeros(n_data, dtype=torch.int32).to(device)
    data_torch = torch.tensor(data, dtype=torch.float32).to(device)
    data_torch = torch.unsqueeze(data_torch, dim=0)
    zero_f32 = torch.tensor(0, dtype=torch.float32).to(device)
    one_f32 = torch.tensor(1, dtype=torch.float32).to(device)
    ndata_int32 = torch.tensor(n_data, dtype=torch.int32)
    nd_arr_idx = torch.reshape(torch.arange(n_data), (1,n_data,1)).to(device)
    weight = torch.zeros((n_cluster,n_data,1), dtype=torch.float32).to(device)
    weight_assign = torch.ones((1,n_data,1), dtype=torch.float32).to(device)
    # iteratively recalculate means based on closest distance
    for i in range(iteration):
        # calculate distance
        dist = torch.linalg.norm(data_torch-mew, dim=2, keepdim=True)
        # identify closest cluster
        closest_clu = torch.argmin(dist, dim=0, keepdim=True)
        label_s = torch.flatten(closest_clu)
        # assign points to closest cluster using weightage
        closest_clu_idx = closest_clu*ndata_int32+nd_arr_idx
        weight = weight*zero_f32
        weight = weight.put_(torch.flatten(closest_clu_idx),weight_assign)
        # recalculate means
        weight_sum = torch.sum(weight,dim=1,keepdim=True)
        mew = torch.sum(weight*data_torch,dim=1,keepdim=True)/torch.maximum(one_f32,weight_sum)
        # handle cluster with no data points
        zero_weight_loc = (weight_sum==0).type(torch.float32)
        mew = mew*(one_f32-zero_weight_loc)+zero_weight_loc*mew_init
    return label_s.cpu().detach().numpy(), mew.cpu().detach().numpy()[:,0,:]# cupy
def kmeansCupy(data, n_cluster, iteration, random_state=None, device=0):
    n_data = data.shape[0]
    n_dim = data.shape[1]
    # initial guess
    init_idx = numpy.arange(n_data)
    numpy.random.seed(random_state)
    numpy.random.shuffle(init_idx)
    mew_init = data[init_idx[:n_cluster]]
    mew_init = numpy.expand_dims(mew_init, axis=1)
    mew = mew_init.copy()
    # conversion
    with cupy.cuda.Device(device):
        mew_init = cupy.array(mew_init, dtype=cupy.float32)
        mew = cupy.array(mew, dtype=cupy.float32)
        label_s = cupy.zeros(n_data,dtype=int)
        data_cupy = cupy.asarray(data, dtype=cupy.float32)
        data_cupy = cupy.expand_dims(data_cupy, axis=0)
        zero_f32 = cupy.float32(0.0)
        one_f32 = cupy.float32(1.0)
        ndata_int32 = cupy.int32(n_data)
        nd_arr_idx = cupy.arange(n_data).reshape((1,n_data,1))
        weight = cupy.zeros((n_cluster,n_data,1), dtype=cupy.float32)
        # iteratively recalculate means based on closest distance
        for i in range(iteration):
            # calculate distance
            dist = cupy.linalg.norm(data_cupy-mew, axis=2, keepdims=True)
            # identify closest cluster
            closest_clu = cupy.argmin(dist, axis=0, keepdims=True)
            label_s = closest_clu.ravel()
            # assign points to closest cluster using weightage
            closest_clu_idx = closest_clu*ndata_int32+nd_arr_idx
            weight = weight*zero_f32
            cupy.put(weight,closest_clu_idx.ravel(),1)
            # recalculate means
            weight_sum = cupy.sum(weight,axis=1,keepdims=True)
            mew = cupy.sum(weight*data_cupy,axis=1,keepdims=True)/cupy.maximum(one_f32,weight_sum)
            # handle cluster with no data points
            zero_weight_loc = cupy.asarray(weight_sum==0, dtype=cupy.float32)
            mew = mew*(one_f32-zero_weight_loc)+zero_weight_loc*mew_init
        return cupy.asnumpy(label_s), cupy.asnumpy(mew)[:,0,:]

Figure 7 shows clustered results of an unlabeled 2D data. The result is far from perfect due to the limitation of K-means itself. Another unsupervised algorithm called gaussian mixture modelling (GMM) [4] is better suited to this dataset, as shown in the rightmost of the Figure 7. Table 5 show the comparison for implementation of K-means with different libraries. The K-means from Scikit-Learn library also added for comparison. A pure CPU implementation with NumPy costs about 20% more processing time than Scikit-Learn for large datasets.

Figure 7: Clustering results, each cluster is marked with different color. Their respective centroid is marked ‘X’. (Leftmost) unlabeled data. (Center left) K-mean with 2 iterations. (Center right) K-mean with 20 iterations. (Rightmost) GMM with 20 iterations.

Table 5: Speed comparison for K-means with different dataset sizes and iterations. Test system: Intel i5–1135G7 + Nvidia MX330, Windows 10.

There is an interesting results showing that CuPy is much slower than TensorFlow and PyTorch in this test. The culprit lies in the operation along the axis. Current version (11.1.0) of CuPy is not optimized for these operations.

Conclusion

The examples showed all of the libraries discussed can accomplish our task very well. Table 5 shows a comparison between each library to help us decide which is the right one for our projects.

Table 6: Comparison between TensorFlow, PyTorch and CuPy.

In a nutshell, Python programming with GPU optimization can be accomplished effortlessly and without any difficulties. The code is quite similar to NumPy’s. Because of this, it is simpler for us to test our method in NumPy before moving on to TensorFlow, PyTorch, or CuPy for acceleration. We can greatly increase speed without sacrificing flexibility. Avoiding many loops and intricate branching is a good rule of thumb when writing Python code. This is valid for both CPU and GPU computing.