Running CUDA C/C++ in Jupyter or how to run nvcc in Google CoLab

Not that long ago, Google made its research tool, Colaboratory, publicly available. Besides being a fully functional Jupyter Notebook with TensorFlow and some other ML/DL tools preinstalled, it lets you take advantage of an Nvidia GPU. Yes, an 11.5 GB Nvidia K80 GPU, for free.

Not long ago, Nvidia also announced its Deep Learning Institute, where you can learn the basics of CUDA programming in both Python and C/C++, and after the course you get good recommendations on where to go next. In the era of ultrabooks, though, not many of us have a dedicated graphics card on board, and an Nvidia one is even rarer. There are a few options: once you have more experience, you can write your code offline and then upload it to your Google/AWS/FloydHub GPU instance. When you are just starting out, however, it is useful to compile your code and see the results in an online environment.

I guess I’m done with the introduction. Let’s get our hands dirty and set up our learning environment. First, create a new Notebook.


Please select Python 3 Notebook in the pop-up window.

If you have worked with Jupyter before, the interface will look familiar, if a bit more stylish. If you haven’t, don’t worry. It’s a pretty simple and very powerful tool, which is why it is so popular.

Next, we need to switch our runtime from CPU to GPU. We’re just 2 clicks away.


Change runtime type in Notebook settings under Runtime tab on the upper menu:


And click save.
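With the GPU runtime selected, you can confirm that a GPU is actually attached by querying the driver from a cell (a quick sanity check; `nvidia-smi` ships with the Nvidia driver on Colab’s GPU runtime):

```shell
# Lists the attached GPU (a Tesla K80 on Colab) when the runtime is set to GPU;
# the fallback message means the runtime type was not switched.
nvidia-smi || echo "No GPU attached - check Runtime > Change runtime type"
```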

Although CUDA libraries are available in the TensorFlow environment, Colab does not have NVCC (the Nvidia CUDA Compiler) installed. So that is our next step.

!wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb;
!apt update -qq;
!dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb;
!apt-key add /var/cuda-repo-8-0-local-ga2/*.pub;
!apt-get update -qq;
!apt-get install cuda gcc-5 g++-5 -y -qq;
!ln -s /usr/bin/gcc-5 /usr/local/cuda/bin/gcc;
!ln -s /usr/bin/g++-5 /usr/local/cuda/bin/g++;

You can just copy these lines into a cell in the Notebook. Each line that starts with ! is executed as a shell command.

There is no magic here: we download the CUDA Debian repository package and install CUDA from it. CUDA 8 requires a GCC version lower than 6, so we install GCC 5 and create symlinks so that nvcc picks it up for compilation.

Now you can test your CUDA installation by running

!/usr/local/cuda/bin/nvcc --version

And the output should be something like

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

We are almost done. I created a small extension for running NVCC from Notebook cells. Install it with

!pip install git+git://

Now you need to load the installed extension by running:

%load_ext nvcc_plugin

We are now ready to run CUDA C/C++ code right in the Notebook. For this we need to tell the interpreter explicitly that we want to use the extension, by adding %cu at the beginning of each cell that contains CUDA code.

%cu
#include <iostream>
int main() {
    std::cout << "Hello world\n";
    return 0;
}
To check that everything works, I encourage you to try the matrix multiplication program from the Nvidia course.

%cu
#include <stdio.h>

#define N 64

inline cudaError_t checkCudaErr(cudaError_t err, const char* msg)
{
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime error at %s: %s\n", msg, cudaGetErrorString(err));
  }
  return err;
}

__global__ void matrixMulGPU( int * a, int * b, int * c )
{
  /*
   * Build out this kernel.
   */
  int row = threadIdx.y + blockIdx.y * blockDim.y;
  int col = threadIdx.x + blockIdx.x * blockDim.x;

  int val = 0;
  if (row < N && col < N)
  {
    for (int i = 0; i < N; ++i)
      val += a[row * N + i] * b[i * N + col];
    c[row * N + col] = val;
  }
}

/*
 * This CPU function already works, and will run to create a solution matrix
 * against which to verify your work building out the matrixMulGPU kernel.
 */
void matrixMulCPU( int * a, int * b, int * c )
{
  int val = 0;

  for( int row = 0; row < N; ++row )
    for( int col = 0; col < N; ++col )
    {
      val = 0;
      for ( int k = 0; k < N; ++k )
        val += a[row * N + k] * b[k * N + col];
      c[row * N + col] = val;
    }
}

int main()
{
  int *a, *b, *c_cpu, *c_gpu; // Allocate a solution matrix for both the CPU and the GPU operations

  int size = N * N * sizeof (int); // Number of bytes of an N x N matrix

  // Allocate memory
  cudaMallocManaged (&a, size);
  cudaMallocManaged (&b, size);
  cudaMallocManaged (&c_cpu, size);
  cudaMallocManaged (&c_gpu, size);

  // Initialize memory; create 2D matrices
  for( int row = 0; row < N; ++row )
    for( int col = 0; col < N; ++col )
    {
      a[row*N + col] = row;
      b[row*N + col] = col+2;
      c_cpu[row*N + col] = 0;
      c_gpu[row*N + col] = 0;
    }

  /*
   * Assign `threads_per_block` and `number_of_blocks` 2D values
   * that can be used in matrixMulGPU above.
   */
  dim3 threads_per_block(32, 32, 1);
  dim3 number_of_blocks(N / threads_per_block.x + 1, N / threads_per_block.y + 1, 1);

  matrixMulGPU <<< number_of_blocks, threads_per_block >>> ( a, b, c_gpu );

  checkCudaErr(cudaDeviceSynchronize(), "Synchronization");
  checkCudaErr(cudaGetLastError(), "GPU");

  // Call the CPU version to check our work
  matrixMulCPU( a, b, c_cpu );

  // Compare the two answers to make sure they are equal
  bool error = false;
  for( int row = 0; row < N && !error; ++row )
    for( int col = 0; col < N && !error; ++col )
      if (c_cpu[row * N + col] != c_gpu[row * N + col])
      {
        printf("FOUND ERROR at c[%d][%d]\n", row, col);
        error = true;
      }
  if (!error)
    printf("Success!\n");

  // Free all our allocated memory
  cudaFree(a); cudaFree(b);
  cudaFree( c_cpu ); cudaFree( c_gpu );
}

If everything was set up correctly, you should see Success! in the output of the cell.

All of the above steps should also be valid for your local environment. I didn’t intend to create a plugin for everyone; it’s mostly an accident :) So, feel free to update it.

I hope it will be useful for someone.

Have fun!
