GPU Acceleration with CUDA

Venkat Ramanan
Nov 15, 2020 · 8 min read

In the last blog we explored how a GPU can accelerate an algorithm. In this blog we are going to accelerate the Mandelbrot set function from that post using the GPU. If you haven't read the last blog, here is the link for it.

Requirements:

Hardware requirements:

  • Any GPU comparable to or newer than a GTX 1050 Ti should work (the code in this blog was tested on a GTX 1050 and a GTX 1650)

Software requirements:

  • Any Linux OS (tested on Ubuntu 20.04 LTS)
  • Python 3
  • PyCUDA
  • Matplotlib
  • NumPy

You can install the above packages with pip. PyCUDA requires the CUDA drivers to be installed; you can refer to the instructions on this page for CUDA driver installation. Before getting into GPU programming we'll first identify the CPU and GPU we are using. Here is the Colab link for this blog.
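For reference, the Python packages can be pulled in with a single pip command (assuming pip3 points at your Python 3 environment; the driver install is a separate step):

pip3 install pycuda matplotlib numpy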

!lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 24
Model name: AMD Ryzen 5 3550H with Radeon Vega Mobile Gfx
Stepping: 1
Frequency boost: enabled
CPU MHz: 1361.948
CPU max MHz: 2100.0000
CPU min MHz: 1400.0000
BogoMIPS: 4192.45
Virtualization: AMD-V
L1d cache: 128 KiB
L1i cache: 256 KiB
L2 cache: 2 MiB
L3 cache: 4 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

The above command shows the CPU we are using. I have an AMD Ryzen 5 3550H, a 64-bit processor. We can also see the maximum and minimum clock speeds and the CPU count (the number of logical cores), which is 8. That covers the general specifications of the CPU; next we can look at the GPU attached to it.

!lspci | grep -e "NVIDIA"

01:00.0 VGA compatible controller: NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10fa (rev a1)

In the above command we list the devices on the PCI bus and use grep to find the NVIDIA GPU. Why list the PCI bus? Because the GPU is connected to the computer through the PCI bus; you can learn more about the PCI bus at this link. We can now see that the GPU attached to this machine is a GTX 1650. Next we can install PyCUDA.

!pip3 install pycuda

Requirement already satisfied: pycuda in /home/cvr/playground/cuda-gpu/Hands-On-GPU-Programming-with-Python-and-CUDA/env/lib/python3.7/site-packages (2020.1)
Requirement already satisfied: pytools>=2011.2 in /home/cvr/playground/cuda-gpu/Hands-On-GPU-Programming-with-Python-and-CUDA/env/lib/python3.7/site-packages (from pycuda) (2020.4.3)
Requirement already satisfied: decorator>=3.2.0 in /home/cvr/playground/cuda-gpu/Hands-On-GPU-Programming-with-Python-and-CUDA/env/lib/python3.7/site-packages (from pycuda) (4.4.2)
Requirement already satisfied: appdirs>=1.4.0 in /home/cvr/playground/cuda-gpu/Hands-On-GPU-Programming-with-Python-and-CUDA/env/lib/python3.7/site-packages (from pycuda) (1.4.4)
Requirement already satisfied: mako in /home/cvr/playground/cuda-gpu/Hands-On-GPU-Programming-with-Python-and-CUDA/env/lib/python3.7/site-packages (from pycuda) (1.1.3)
Requirement already satisfied: six>=1.8.0 in /home/cvr/playground/cuda-gpu/Hands-On-GPU-Programming-with-Python-and-CUDA/env/lib/python3.7/site-packages (from pytools>=2011.2->pycuda) (1.15.0)
Requirement already satisfied: numpy>=1.6.0 in /home/cvr/playground/cuda-gpu/Hands-On-GPU-Programming-with-Python-and-CUDA/env/lib/python3.7/site-packages (from pytools>=2011.2->pycuda) (1.19.3)
Requirement already satisfied: MarkupSafe>=0.9.2 in /home/cvr/playground/cuda-gpu/Hands-On-GPU-Programming-with-Python-and-CUDA/env/lib/python3.7/site-packages (from mako->pycuda) (1.1.1)
WARNING: You are using pip version 20.1.1; however, version 20.2.4 is available.
You should consider upgrading via the '/home/cvr/playground/cuda-gpu/Hands-On-GPU-Programming-with-Python-and-CUDA/env/bin/python3.7 -m pip install --upgrade pip' command.

After installing PyCUDA we can test the installation.

import pycuda.driver as drv # 1
drv.init() # 2
print("No of GPUs attached :", drv.Device.count()) # 3
gpu = drv.Device(0) # 4
print("GPU name :", gpu.name()) # 5
print("GPU memory : ", gpu.total_memory()/1024**2) # 6
gpu_attributes = gpu.get_attributes().items() # 7
for k, v in gpu_attributes:
    print(k, v)
No of GPUs attached : 1
GPU name : GeForce GTX 1650
GPU memory : 3911.875
ASYNC_ENGINE_COUNT 3
CAN_MAP_HOST_MEMORY 1
CLOCK_RATE 1560000
COMPUTE_CAPABILITY_MAJOR 7
COMPUTE_CAPABILITY_MINOR 5
COMPUTE_MODE DEFAULT
CONCURRENT_KERNELS 1
ECC_ENABLED 0
GLOBAL_L1_CACHE_SUPPORTED 1
GLOBAL_MEMORY_BUS_WIDTH 128
GPU_OVERLAP 1
INTEGRATED 0
KERNEL_EXEC_TIMEOUT 1
L2_CACHE_SIZE 1048576
LOCAL_L1_CACHE_SUPPORTED 1
MANAGED_MEMORY 1
MAXIMUM_SURFACE1D_LAYERED_LAYERS 2048
MAXIMUM_SURFACE1D_LAYERED_WIDTH 32768
MAXIMUM_SURFACE1D_WIDTH 32768
MAXIMUM_SURFACE2D_HEIGHT 65536
MAXIMUM_SURFACE2D_LAYERED_HEIGHT 32768
MAXIMUM_SURFACE2D_LAYERED_LAYERS 2048
MAXIMUM_SURFACE2D_LAYERED_WIDTH 32768
MAXIMUM_SURFACE2D_WIDTH 131072
MAXIMUM_SURFACE3D_DEPTH 16384
MAXIMUM_SURFACE3D_HEIGHT 16384
MAXIMUM_SURFACE3D_WIDTH 16384
MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS 2046
MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH 32768
MAXIMUM_SURFACECUBEMAP_WIDTH 32768
MAXIMUM_TEXTURE1D_LAYERED_LAYERS 2048
MAXIMUM_TEXTURE1D_LAYERED_WIDTH 32768
MAXIMUM_TEXTURE1D_LINEAR_WIDTH 268435456
MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH 32768
MAXIMUM_TEXTURE1D_WIDTH 131072
MAXIMUM_TEXTURE2D_ARRAY_HEIGHT 32768
MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES 2048
MAXIMUM_TEXTURE2D_ARRAY_WIDTH 32768
MAXIMUM_TEXTURE2D_GATHER_HEIGHT 32768
MAXIMUM_TEXTURE2D_GATHER_WIDTH 32768
MAXIMUM_TEXTURE2D_HEIGHT 65536
MAXIMUM_TEXTURE2D_LINEAR_HEIGHT 65000
MAXIMUM_TEXTURE2D_LINEAR_PITCH 2097120
MAXIMUM_TEXTURE2D_LINEAR_WIDTH 131072
MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT 32768
MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH 32768
MAXIMUM_TEXTURE2D_WIDTH 131072
MAXIMUM_TEXTURE3D_DEPTH 16384
MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE 32768
MAXIMUM_TEXTURE3D_HEIGHT 16384
MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE 8192
MAXIMUM_TEXTURE3D_WIDTH 16384
MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE 8192
MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS 2046
MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH 32768
MAXIMUM_TEXTURECUBEMAP_WIDTH 32768
MAX_BLOCK_DIM_X 1024
MAX_BLOCK_DIM_Y 1024
MAX_BLOCK_DIM_Z 64
MAX_GRID_DIM_X 2147483647
MAX_GRID_DIM_Y 65535
MAX_GRID_DIM_Z 65535
MAX_PITCH 2147483647
MAX_REGISTERS_PER_BLOCK 65536
MAX_REGISTERS_PER_MULTIPROCESSOR 65536
MAX_SHARED_MEMORY_PER_BLOCK 49152
MAX_SHARED_MEMORY_PER_MULTIPROCESSOR 65536
MAX_THREADS_PER_BLOCK 1024
MAX_THREADS_PER_MULTIPROCESSOR 1024
MEMORY_CLOCK_RATE 4001000
MULTIPROCESSOR_COUNT 16
MULTI_GPU_BOARD 0
MULTI_GPU_BOARD_GROUP_ID 0
PCI_BUS_ID 1
PCI_DEVICE_ID 0
PCI_DOMAIN_ID 0
STREAM_PRIORITIES_SUPPORTED 1
SURFACE_ALIGNMENT 512
TCC_DRIVER 0
TEXTURE_ALIGNMENT 512
TEXTURE_PITCH_ALIGNMENT 32
TOTAL_CONSTANT_MEMORY 65536
UNIFIED_ADDRESSING 1
WARP_SIZE 32

There is a lot of information in the output above, so I'll decode the important parts. First we import PyCUDA and initialise the driver. Then we print the number of GPUs attached to the computer, which is one in this case, followed by the name of the GPU and its memory in MB; here "memory" means the GPU's primary memory, i.e. GPU RAM. Then we print the attributes of the GPU, which include details such as the GPU clock speed, multiprocessor count, compute capability and memory clock rate. Another important figure is the number of cores in the GPU. A GPU divides its individual cores up into larger units known as Streaming Multiprocessors (SMs); our GTX 1650 has 16 SMs and 1024 cores. The number of cores per SM is determined by the compute capability of the GPU; you can explore compute capability further here.
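As a quick illustration, the core count can be derived from the attributes we just printed. The cores-per-SM table below is an assumption drawn from NVIDIA's published architecture specs (PyCUDA itself does not expose a core count), so extend it if your GPU's compute capability is missing:

import pycuda.driver as drv

drv.init()
gpu = drv.Device(0)

# Cores per streaming multiprocessor, keyed by compute capability.
# Assumed values for a few common architectures; extend as needed.
cores_per_sm = {(3, 5): 192, (5, 0): 128, (6, 1): 128, (7, 0): 64, (7, 5): 64}

cc = (gpu.get_attribute(drv.device_attribute.COMPUTE_CAPABILITY_MAJOR),
      gpu.get_attribute(drv.device_attribute.COMPUTE_CAPABILITY_MINOR))
sms = gpu.get_attribute(drv.device_attribute.MULTIPROCESSOR_COUNT)
print("CUDA cores:", sms * cores_per_sm[cc])  # 16 SMs x 64 = 1024 on the GTX 1650

With that context, we'll revisit the Mandelbrot set.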

from time import time # 1
from matplotlib import pyplot as plt # 2
import numpy as np # 3
import pycuda.autoinit # initialises the CUDA driver and creates a context automatically
from pycuda import gpuarray # 4
from pycuda.elementwise import ElementwiseKernel # 5

  • We import Python's time module to measure the run time of the algorithm.
  • In the second line we import matplotlib for plotting the graph.
  • We import numpy in the third line.
  • In line four we import gpuarray, which behaves like a numpy array but lives in GPU memory.
  • In the fifth line we import PyCUDA's ElementwiseKernel. In CUDA a parallel function is called a kernel, and an ElementwiseKernel performs element-by-element operations on arrays, either in place or into an output array. There are many kinds of kernels in CUDA, which we'll explore in upcoming blogs.

mandel_ker = ElementwiseKernel(
    "pycuda::complex<float> *lattice, float *mandelbrot_graph, int max_iters, float upper_bound",
    """
    mandelbrot_graph[i] = 1;

    pycuda::complex<float> c = lattice[i];
    pycuda::complex<float> z(0, 0);

    for (int j = 0; j < max_iters; j++)
    {
        z = z*z + c;

        if (abs(z) > upper_bound)
        {
            mandelbrot_graph[i] = 0;
            break;
        }
    }
    """,
    "mandel_ker")

We have initialised the ElementwiseKernel above.

  • We define the input and output of the kernel, which are lattice and mandelbrot_graph.
  • Then we have max_iters for the number of Mandelbrot iterations, and the upper_bound escape threshold.
  • The ElementwiseKernel constructor takes three arguments here: the kernel's argument signature, the CUDA code itself, and the name of the kernel, which is mandel_ker.
  • Note that the argument signature is declared as a single-line string, while the body is written in a triple-quoted string; triple quotes let us embed multi-line CUDA code inside Python (a toy example follows this list).
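Before we wire this kernel into the full Mandelbrot function, here is a minimal, self-contained ElementwiseKernel sketch (the kernel double_ker and its arrays are invented for illustration) showing the same structure on a trivial operation:

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.elementwise import ElementwiseKernel

# Toy kernel: double every element. The index "i" is supplied by PyCUDA
# and iterated in parallel across GPU threads.
double_ker = ElementwiseKernel(
    "float *in_arr, float *out_arr",
    "out_arr[i] = 2 * in_arr[i];",
    "double_ker")

x = gpuarray.to_gpu(np.arange(8, dtype=np.float32))
y = gpuarray.empty_like(x)
double_ker(x, y)
print(y.get())  # [ 0.  2.  4.  6.  8. 10. 12. 14.]

Now, back to the Mandelbrot implementation: the wrapper below builds the complex lattice on the host and launches the kernel.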

def gpu_mandelbrot(width, height, real_low, real_high, imag_low, imag_high, max_iters, upper_bound):

    # we set up our complex lattice as such
    real_vals = np.matrix(np.linspace(real_low, real_high, width), dtype=np.complex64) # 1
    imag_vals = np.matrix(np.linspace(imag_high, imag_low, height), dtype=np.complex64) * 1j
    mandelbrot_lattice = np.array(real_vals + imag_vals.transpose(), dtype=np.complex64) # 2

    # copy complex lattice to the GPU
    mandelbrot_lattice_gpu = gpuarray.to_gpu(mandelbrot_lattice) # 3

    # allocate an empty array on the GPU
    mandelbrot_graph_gpu = gpuarray.empty(shape=mandelbrot_lattice.shape, dtype=np.float32) # 4

    mandel_ker(mandelbrot_lattice_gpu, mandelbrot_graph_gpu, np.int32(max_iters), np.float32(upper_bound)) # 5

    mandelbrot_graph = mandelbrot_graph_gpu.get() # 6

    return mandelbrot_graph

We have defined the method that calculates the Mandelbrot set by calling the CUDA kernel.

  • We create a lattice (matrix) of complex numbers in (1).
  • Then we convert the lattice to a numpy array in (2).
  • In (3) we copy the data from host (CPU) memory to device (GPU) memory.
  • In (4) we allocate an empty GPU array to hold the kernel's output.
  • In (5) we call mandel_ker with mandelbrot_lattice_gpu, mandelbrot_graph_gpu, max_iters and upper_bound. Note that max_iters and upper_bound are type-cast before being passed in: GPU kernels are written in C, a statically typed language, so the Python values must be cast to np.int32 and np.float32 to match the int and float parameters in the kernel signature. Inside the kernel, lattice and mandelbrot_graph are pointers while max_iters and upper_bound are plain values; arrays can only be passed to a kernel as pointers, whereas scalars may be passed by value. In the kernel code the lattice elements have the pycuda::complex<float> type, which corresponds to numpy.complex64, and the output is initialised to one. Notice that the index "i" in mandelbrot_graph[i] = 1 is never declared: as mentioned earlier, CUDA handles the iteration over "i" for us, parallelising it across GPU threads automatically. That is also why c is assigned from lattice[i], and why this version needs only the single for loop over j (the iterations), instead of the nested loops over every lattice point that the CPU version in the previous blog required.
  • After computing the Mandelbrot set we move the data back from GPU memory to CPU memory with the .get() method in (6); a standalone sketch of this round trip follows this list.
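The host/device round trip and the explicit dtype discipline can be seen in isolation in this minimal sketch (the array names are made up for illustration):

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray

# Explicit dtypes matter: the GPU kernel sees raw C types, not Python objects.
host_arr = np.linspace(0, 1, 4).astype(np.float32)
dev_arr = gpuarray.to_gpu(host_arr)       # host -> device copy
round_trip = dev_arr.get()                # device -> host copy
print(np.allclose(host_arr, round_trip))  # True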
t1 = time()
mandel = gpu_mandelbrot(512,512,-2,2,-2,2,512, 2)
t2 = time()

mandel_time = t2 - t1

t1 = time()
fig = plt.figure(1)
plt.imshow(mandel, extent=(-2, 2, -2, 2))
#plt.savefig('mandelbrot.png', dpi=fig.dpi)
t2 = time()

dump_time = t2 - t1

print ('It took {} seconds to calculate the Mandelbrot graph.'.format(mandel_time))
print ('It took {} seconds to dump the image.'.format(dump_time))
It took 0.2567741870880127 seconds to calculate the Mandelbrot graph.
It took 0.023880720138549805 seconds to dump the image.
[Figure: the rendered Mandelbrot set]

In this script we measure the time taken for the Mandelbrot set calculation and for generating the graph. It took about 0.26 seconds to compute the Mandelbrot set and about 0.02 seconds to draw the image, which is very fast compared to the roughly 14 seconds the CPU computation took in the previous blog.

