How GPU Computing Literally Saved Me at Work

Python + GPU = Power: 2 Days to 20 Seconds

Abhishek Mungoli
May 10, 2019 · 9 min read
Nvidia GPU ~ Image source

We have all heard, in some form or another, the hype around GPUs in recent times. Before going any further, I want to clarify: I am no GPU expert. I have just started my journey into the world of GPUs. But GPUs are so mighty and powerful these days that they can accomplish a whole lot of tasks. I was assigned a task at work that had been running for hours and still showed no progress. The GPU came to my rescue and solved my problem within seconds: a task with an estimated running time of 2 days finished in just 20 seconds.

I will describe the task in detail in the upcoming sections. I will also discuss how and when to use a GPU for similar tasks. So stay tuned; I promise it's going to be worth the read. The basic flow will be: understand the task details, get started with the GPU, and use the GPU to solve the problem at hand. I will be using Python's Numba library along with an Nvidia Volta V100 16GB GPU.

I will also try my level best to keep things simple and intuitive, but please feel free to jump to the references section and research the field further. The main idea of this blog post is to help the audience understand the power of the GPU and develop an intuition for how to utilize it efficiently in their daily work.

1. Task Details

Table 1: Data Item-list with 64-latent features

The task: given a list of items, each represented by a vector of 64 latent features (Table 1), find the top-3 most similar items for every item in the list.

Task complexity

Code to find top-3 similar items given an item
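The embedded gist did not survive this export, so here is a minimal NumPy sketch of the idea, assuming cosine similarity over the latent features (the function name `top3_similar` and the toy data are mine, for illustration):

```python
import numpy as np

def top3_similar(items, idx):
    """Indices of the 3 items most similar to items[idx] (cosine similarity)."""
    target = items[idx]
    # Similarity of every item against the target: O(n*k) work.
    sims = items @ target / (np.linalg.norm(items, axis=1) * np.linalg.norm(target))
    sims[idx] = -np.inf                 # exclude the item itself
    return np.argsort(sims)[::-1][:3]   # indices of the 3 largest similarities

# Toy example: five 2-D "latent" vectors; item 0 is [1, 0].
items = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [-1.0, 0.0]])
print(top3_similar(items, 0))  # items 3 and 1 point almost the same way as item 0
```

One pass over all n items with k features each gives the O(n·k) cost discussed next.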

Finding the top-3 similar items for a single item requires comparing it against all n items across k = 64 features, i.e. O(n·k) work. Finding the top-3 for every item in the list multiplies this by another n, so the final complexity becomes O(n*n*k) = O(n²k).

Code to find top-3 similar items for each item in the list
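Again, the gist itself is missing here; a hedged NumPy sketch of the all-items version (the name `top3_for_all` is mine, cosine similarity assumed) could look like this:

```python
import numpy as np

def top3_for_all(items):
    """Top-3 similar items for every item: O(n²k) time overall."""
    unit = items / np.linalg.norm(items, axis=1, keepdims=True)
    sims = unit @ unit.T               # n x n cosine-similarity matrix
    np.fill_diagonal(sims, -np.inf)    # an item is not its own neighbour
    # For each row, take the indices of the 3 largest similarities.
    return np.argsort(sims, axis=1)[:, ::-1][:, :3]

items = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [-1.0, 0.0]])
top3 = top3_for_all(items)             # shape (5, 3); row i holds item i's top-3
```

The n × n similarity matrix makes the quadratic growth in running time easy to see.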

Sample Run and Time Estimation

Running for n = 10³ items
Output: time-taken for n = 10³ items

Next, I increased the subset to n = 10⁴ items. Since the complexity is O(n²k), the time increases by a factor of 100 (as n increased by a factor of 10). It took around 1700 seconds = 28.33 minutes for the code to run.

Output: time-taken for n = 10⁴ items

Next comes the important part, estimating the time for the actual list of about 10⁵ items. If we do the math, the time complexity will again increase by a factor of 100, since the algorithm’s time complexity is O(n²k).

Estimated Time = 1700 * 100 = 170,000 seconds ≈ 2833 minutes ≈ 47.2 hours ~ 2 days.

Oh Gosh!! That long!!

As you have probably guessed, I was able to do it very fast using the power of the GPU. In fact, you will be shocked to see the gain in time after using the GPU. I will keep the numbers for later and first introduce you to the world of GPUs.

2. CPU versus GPU

Today's CPUs have just a handful of cores, and their basic design and purpose, processing complex computations, has not really changed. Essentially, they are best suited to problems that require parsing through or interpreting complex logic in code.

A graphical processing unit (GPU), on the other hand, has smaller-sized but many more logical cores (arithmetic logic units or ALUs, control units and memory cache) whose basic design is to process a set of simpler and more identical computations in parallel.

Figure: CPU versus GPU ~ Image source

3. When to utilize GPU Computing

Understanding operations from a GPU's perspective: take the operation of searching for a word in a document. This can be done serially, i.e. iterating through every word, or in parallel, scanning rows of the document simultaneously for the particular word.

Understanding an operation from a CPU's perspective: certain operations, like calculating the n-th Fibonacci number, cannot be parallelized, since each number can be found only after the two preceding numbers have been calculated. This is best suited for the CPU.
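A minimal sketch of this sequential dependency:

```python
def fib(n):
    # Each iteration needs the two previous values, so the steps
    # cannot be distributed across parallel workers.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(10))  # 55
```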

Similarly, operations like matrix addition and multiplication can easily be computed on a GPU, since the per-cell operations are largely independent and identical in nature, and can therefore be parallelized.

4. Brief about CUDA


CUDA is Nvidia's parallel computing platform and programming model, which lets general-purpose code run on the GPU. In Python, the Numba library can compile decorated functions into CUDA kernels. I have used the Numba library in my work along with an Nvidia Volta V100 16GB GPU.

Architecture Details ~ Threads, Blocks and Grids

Thread: A CUDA thread is a scheduled chain of instructions running on a CUDA core (in effect, a pipeline). There can be as many as 32 CUDA threads in flight on the same CUDA core (to fill all stages of this pipeline). A thread is an execution of a kernel with a given index; each thread uses its index to access elements in an array, such that the collection of all threads cooperatively processes the entire data set.

Block: This is a group of threads. There's not much you can say about the execution of threads within a block — they could execute concurrently or serially and in no particular order. You can coordinate the threads, somewhat, using the __syncthreads() function, which makes a thread stop at a certain point in the kernel until all the other threads in its block reach the same point.

Grid: This is a group of blocks. There’s no synchronization at all between the blocks.

CUDA Threads, Blocks, and Grids ~ Image Source

But where do threads, blocks, and grids actually get executed? With respect to Nvidia’s G80 GPU chip, it appears the computation is distributed as follows:

Grid → GPU: An entire grid is handled by a single GPU chip.

Block → MP: The GPU chip is organized as a collection of multiprocessors (MPs), with each multiprocessor responsible for handling one or more blocks in a grid. A block is never divided across multiple MPs.

Thread → SP: Each MP is further divided into a number of stream processors (SPs), with each SP handling one or more threads in a block.

The explanation above is adapted from this beautifully written blog post; please refer to it for more detail.

5. A Simple Arrays Addition Program using GPU in Python

Given two arrays ‘a’ and ‘b’ of size ‘n’, we wish to generate an array ‘c’ in which each element is the sum of the same-indexed elements of ‘a’ and ‘b’. The difference here is that, instead of sequential computation, we will perform the computation in parallel on the GPU.

We will launch n threads, one per element. The index at which a particular thread operates can be derived from the formula below:

index = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
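To build intuition for this formula, the mapping can be emulated on the host (the name `flat_index` is mine, not part of Numba):

```python
def flat_index(block_idx, block_dim, thread_idx):
    # Host-side mirror of: cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    return block_idx * block_dim + thread_idx

# Thread 5 of block 2, with 32 threads per block, handles element 69.
print(flat_index(2, 32, 5))  # 69
```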

In the case of a 2-D matrix, the index has a row and a column component, which can be derived as

row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

We will also have to define the number of threads per block, say tpb, and blocks per grid, say bpg. We will use standard values for them, chosen so that bpg * tpb ≥ n and every element gets a thread.

Another important concept: whenever some computation has to be performed on the GPU, the data needs to be brought into the GPU's global memory, and the results can be transferred back to the host after computation. This is done through the cuda.to_device() and copy_to_host() functions provided by Python's Numba library.

Below are both the CPU and GPU implementations of the problem; compare the two to see the difference.

CPU sequential implementation.
GPU parallel implementation
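The embedded gists did not survive this export, so here is a hedged sketch of both versions in Numba. The `cuda.is_available()` guard is my addition so the snippet degrades gracefully on machines without a GPU; the function names are mine:

```python
import numpy as np

try:
    from numba import cuda
    HAVE_GPU = cuda.is_available()
except ImportError:        # Numba not installed; CPU path only
    HAVE_GPU = False

def add_cpu(a, b):
    # Sequential implementation: one element at a time.
    c = np.empty_like(a)
    for i in range(a.size):
        c[i] = a[i] + b[i]
    return c

if HAVE_GPU:
    @cuda.jit
    def add_kernel(a, b, c):
        # Each thread computes exactly one element of c.
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        if i < c.size:                     # guard threads past the array end
            c[i] = a[i] + b[i]

    def add_gpu(a, b):
        d_a = cuda.to_device(a)            # host -> GPU global memory
        d_b = cuda.to_device(b)
        d_c = cuda.device_array_like(a)
        tpb = 32                           # threads per block
        bpg = (a.size + tpb - 1) // tpb    # blocks per grid, covers all n elements
        add_kernel[bpg, tpb](d_a, d_b, d_c)
        return d_c.copy_to_host()          # GPU -> host

a = np.arange(1000, dtype=np.float32)
b = np.ones(1000, dtype=np.float32)
result = add_gpu(a, b) if HAVE_GPU else add_cpu(a, b)
```

Note how the data round-trip (to_device, copy_to_host) brackets the kernel launch, exactly as described above.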

6. Finding the top-3 similar items for each item in the list using GPU

The main idea here: we have n items, and we will launch n threads. Each thread runs independently and in parallel, computing the top-3 similar items for the one item it is responsible for.

GPU implementation ~ Code to find top-3 similar items given an item
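The gist itself is missing from this export; a sketch of the one-thread-per-item approach in Numba could look like the following. The names `top3_kernel` / `top3_gpu` are mine, cosine similarity on pre-normalised vectors is assumed, a CPU reference is included for checking, and the whole thing is guarded so it still runs without a GPU:

```python
import numpy as np

try:
    from numba import cuda
    HAVE_GPU = cuda.is_available()
except ImportError:
    HAVE_GPU = False

def top3_cpu(unit):
    # CPU reference; unit must contain L2-normalised rows.
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)
    return np.argsort(sims, axis=1)[:, ::-1][:, :3]

if HAVE_GPU:
    @cuda.jit
    def top3_kernel(unit, out):
        # One thread per item: thread i scans all n items for item i's top-3.
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        n, k = unit.shape
        if i >= n:
            return
        s1 = -1.0e30; s2 = -1.0e30; s3 = -1.0e30   # running best similarities
        i1 = -1; i2 = -1; i3 = -1                  # ...and their item indices
        for j in range(n):
            if j == i:
                continue
            s = 0.0
            for f in range(k):                     # dot product = cosine (unit rows)
                s += unit[i, f] * unit[j, f]
            if s > s1:
                s3 = s2; i3 = i2
                s2 = s1; i2 = i1
                s1 = s;  i1 = j
            elif s > s2:
                s3 = s2; i3 = i2
                s2 = s;  i2 = j
            elif s > s3:
                s3 = s;  i3 = j
        out[i, 0] = i1
        out[i, 1] = i2
        out[i, 2] = i3

    def top3_gpu(items):
        unit = (items / np.linalg.norm(items, axis=1, keepdims=True)).astype(np.float32)
        d_unit = cuda.to_device(unit)              # data into GPU global memory
        d_out = cuda.device_array((len(items), 3), dtype=np.int32)
        tpb = 64
        bpg = (len(items) + tpb - 1) // tpb
        top3_kernel[bpg, tpb](d_unit, d_out)
        return d_out.copy_to_host()                # results back to the host

items = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [-1.0, 0.0]])
unit = items / np.linalg.norm(items, axis=1, keepdims=True)
top3 = top3_gpu(items) if HAVE_GPU else top3_cpu(unit)
```

Each thread does O(n·k) work, but all n threads run in parallel, which is where the dramatic speedup comes from.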

Time Taken

7. Conclusion

It takes a lot of effort to write a good post with clarity and easy understandability for the audience. I will keep trying to do justice to my work. Follow me on Medium and check out my previous posts. I welcome feedback and constructive criticism. The complete code of the assignment can be obtained from here.

8. Acknowledgment

My Youtube channel for more content:

9. References



Abhishek Mungoli

Written by

Data Scientist at WalmartLabs | Masters CSE IIIT-Hyderabad | NERIST | Insta: simplyspartanx | Youtube:

