Python+GPU = Power, 2 Days to 20 seconds

Abhishek Mungoli
May 10 · 9 min read
Nvidia GPU ~ Image source

We have all heard in some form or another the hype around GPU in recent times. Before going any further, I want to clarify, I am no GPU expert. I have just started my journey in the world of GPU. But, it’s so mighty and powerful these days and can accomplish hell lot of tasks. I was assigned a task at my work which was taking hours to run and still showed no progress. GPU came to my rescue and solved my problem within seconds. I was able to perform a task with an estimated time of 2 days in just 20 seconds.

I will state the task in details in the upcoming sections. I will also discuss how and when to use GPU for any such similar tasks. So, stay tuned and I promise it’s going to be worth the read. The basic flow will be understanding the task details, getting started with GPU and using GPU to solve the problem in hand. I will be using Python’s Numba library along with Nvidia Volta V100 16GB GPU.

I will also try my level best to keep things simple and intuitive but please feel free to jump to the references section and research about the field more. The main idea of this blog post is to help the audience understand the power of GPU and develop an intuition on how to efficiently utilize it in their daily work tasks.

1. Task Details

In the retail domain, finding similar or closest entities is a common task. I was provided with a list of items and each item was represented with k latent attributes. Now my task was to find top-3 similar items for every item in the list. Similarity metric chosen for the task was cosine similarity. This is what my data looked like

Table 1: Data Item-list with 64-latent features

Task complexity

I was provided a list of about 10⁵ items. Finding top-3 similar items for an item on the basis of cosine similarity will require doing a cosine similarity comparison with every other item in the list. It will require n * k operations, where n is the number of items and k attributes per item. We need to do the dot product of that item with every item in the list.

Code to find top-3 similar items given an item

Now, finding top-3 for all items in th list, the complexity gets multiplied by another n. The final complexity becomes O(n*n*k) = O(n²k).

Code to find top-3 similar items for each item in the list

Sample Run and Time Estimation

I ran the code to find top-3 similar items for a subset of n = 10³ items with k = 64. It took around 17 seconds to complete in Python with an average of 3.7 *10⁶ operations per second. The code was pretty optimized with Numpy operations and arrays. Just to note, all these operations are getting performed sequentially in the CPU.

Running for n = 10³ items
Output: time-taken for n = 10³ items

Next, I increased the subset of items to n =10⁴ items. Since the complexity is O(n²k), the time increase by a factor of 100 (as n increased by a factor of 10). It took around 1700 seconds = 28.33 minutes for the code to run.

Output: time-taken for n = 10⁴ items

Next comes the important part, estimating the time for the actual list of about 10⁵ items. If we do the math, the time complexity will again increase by a factor of 100, since the algorithm’s time complexity is O(n²k).

Estimated Time = 1700 * 100 seconds = 2834 minutes = 47.2 hours ~ 2 days.

Oh Gosh!! That long!!

As you would have correctly guessed, I was able to do it very fast using the power of GPU. Actually, one will be shocked to know the gain in time after the use of GPU. I will keep the numbers for later and first get you introduced to the world of GPU.

2. CPU versus GPU

A central processing unit (CPU) is essentially the brain of any computing device, carrying out the instructions of a program by performing control, logical, and input/output (I/O) operations.

Today’s CPUs, however, have just a handful of cores, and its basic design and purpose — to process complex computations — has not really changed. Essentially, they’re mostly applicable for problems that require parsing through or interpreting complex logic in the code.

A graphical processing unit (GPU), on the other hand, has smaller-sized but many more logical cores (arithmetic logic units or ALUs, control units and memory cache) whose basic design is to process a set of simpler and more identical computations in parallel.

Figure: CPU versus GPU ~ Image source

3. When to utilize GPU Computing

CPU is better suited to handle complex linear tasks. Despite the fact the CPU cores are stronger, the GPUs can handle AI, ML & DL tasks much efficiently & quickly. The way GPU handles the workload is by processing similar parallel operations.

Understanding operations from GPUs perspective — Take an operation of searching a word in a document. This can be done serially i.e. iterating through every word or through parallel scanning or rows and scanning for the particular work.

Understanding an operation from a CPUs Perspective — The specific operations like calculating a Fibonacci can’t be made parallel. Since we can only find it only after 2 preceding number which has been calculated. This is best suited for a CPU operation.

Similarly, operations like matrix addition, multiplication can easily be computed using GPU as most of these operations in matrix cells are independent and similar in nature which can be parallelized.

4. Brief about CUDA

CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing — an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). Find out more.


Numba is an open-source JIT compiler that translates a subset of Python and NumPy into fast machine code using LLVM, via the llvmlite Python package. It offers a range of options for parallelizing Python code for CPUs and GPUs, often with only minor code changes. Find out more.

I have used the Numba library in my work along with Nvidia Volta V100 16GB GPU.

Architecture Details ~ Threads, Blocks and Grids

CUDA organizes a parallel computation using the abstractions of threads, blocks, and grids.

Thread: A CUDA thread is a scheduled chain of instructions running/flowing on a CUDA core (actually just a pipeline). There can be as many as 32 CUDA threads in-flight running on same CUDA core (to fill all stages of this pipeline). This is an execution of a kernel with a given index. Each thread uses its index to access elements in array such that the collection of all threads cooperatively processes the entire data set.

Block: This is a group of threads. There’s not much you can say about the execution of threads within a block — they could execute concurrently or serially and in no particular order. You can coordinate the threads, somewhat, using the _syncthreads() function that makes a thread stop at a certain point in the kernel until all the other threads in its block reach the same point.

Grid: This is a group of blocks. There’s no synchronization at all between the blocks.

CUDA Threads, Blocks, and Grids ~ Image Source

But where do threads, blocks, and grids actually get executed? With respect to Nvidia’s G80 GPU chip, it appears the computation is distributed as follows:

Grid → GPU: An entire grid is handled by a single GPU chip.

Block → MP: The GPU chip is organized as a collection of multiprocessors (MPs), with each multiprocessor responsible for handling one or more blocks in a grid. A block is never divided across multiple MPs.

Thread → SP: Each MP is further divided into a number of stream processors (SPs), with each SP handling one or more threads in a block.

I have copied some stuff from this beautifully written blog post. Please find out more.

5. A Simple Arrays Addition Program using GPU in Python

As stated in the beginning, the main idea of this post is to help the audience understand the power of GPU and develop an intuition of how to utilize it in their daily work tasks. Getting started to code in GPU may require a bit of research. With that, let’s move to our first Arrays Addition program.

Given two arrays ‘a’ and ‘b’ of size ‘n’, we wish to generate an array ‘c’ where each element of c is the sum of the same-indexed element from ‘a’ and ‘b’. The difference here will be, instead of sequential computation, we will perform a parallel computation using GPU.

We will fire n threads/kernels. The index at which particular thread is operating can be derived from below formula :

index = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x

In case of a 2-D matrix, the index has a row and column component, which can be derived as

row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x

We will also have to define threads per block, let’s say tpb and blocks per grid, let’s say bpg. We will use standard numbers for them.

Another important concept to note is, whenever some computation has to be performed in GPU, data needs to be brought to GPU’s Global memory and results after computation can be transferred back to the host. This is made possible through cuda.to_device() and copy_to_host() functions provided by Numba library in python.

Below is the code for both CPU and GPU implementation of the problem, scan both the codes to see the difference.

CPU sequential implementation.
GPU parallel implementation

6. Finding the top-3 similar items for each item in the list using GPU

With a good amount of theory and practice, we are back to our original problem of finding the top-3 similar items for each item in the list using GPU computing.

The main idea here is, we have n items and we will fire n threads. Each thread will run independently and parallelly to compute top-3 items for each item in the list. One thread will be responsible for one item.

GPU implementation ~ Code to find top-3 similar items given an item

Time Taken

The total time taken by GPU to find top-3 similar items for each item in the list is 481 ms (0.5 seconds). Extra 20 seconds was needed for the device-to-host and host-to-device copying of data.

7. Conclusion

The CPU estimated time of 2 days was brought down to 20.5 seconds with the use of GPU. This was possible only because of the nature of the task. Finding top-3 similar items to Item ‘A’ is independent of finding top-3 similar items to Item ‘B’. We exploited this fact and used the parallelism provided by GPU to speed up the process. We also understood the type of tasks in which we can utilize the mighty GPU.

It takes a lot of efforts writing a good post with clarity and easy understandability for the audience. I will keep trying to do justice with my work. Follow me up at Medium and check out my previous posts. I welcome feedback and constructive criticism. The complete code of the assignment can be obtained from here.

8. Acknowledgment

This whole effort was executed through MLP (Machine Learning Platform) from Walmart Labs in terms of the infrastructure, GPU used. Thanks to Ayush Kumar for helping in optimize the workflow.

9. References



Using technology, data and design to change the way the world shops. Learn more about us -

Abhishek Mungoli

Written by

Data Scientist @Walmart. Masters in CSE from IIIT-Hyderabad, Bachelors from NERIST. Insta: simplyspartanx


Using technology, data and design to change the way the world shops. Learn more about us -

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade