Running CUDA C/C++ in Jupyter, or how to run nvcc in Google Colab

Not long ago Google made its research tool, Colab, publicly available. Besides being a fully functional Jupyter Notebook with TensorFlow and some other ML/DL tools pre-installed, it lets you take advantage of an Nvidia GPU. Yes, an Nvidia K80 GPU with 11.5 GB of memory, for free.

Not long ago Nvidia announced its Deep Learning Institute, where you can learn the basics of CUDA programming in both Python and C/C++. After the course you will get some good recommendations on where to go next. However, in the era of ultrabooks, not many of us have a dedicated graphics card on board, and an Nvidia one is even rarer. There are a few options. Once you have more experience, you can write your code offline and then upload it to your Google/AWS/FloydHub GPU instance. When you are just starting out, though, it is useful to compile your code and see the results right away in an online environment.

That’s enough introduction. Let’s get our hands dirty and set up our learning environment. First, create a new Notebook.

Please select Python 3 Notebook in the pop-up window.

If you have worked with Jupyter before, the interface will look familiar, if a bit more stylish. If you haven’t, don’t worry. It’s a pretty simple and very powerful tool, which is why it is so popular.

Next, we need to switch our runtime from CPU to GPU. We are just two clicks away.

Choose Change runtime type under the Runtime tab in the upper menu, and set the hardware accelerator to GPU in the Notebook settings:

Then click Save.

Although the CUDA libraries are available for the TensorFlow environment, Colab does not have NVCC (the Nvidia CUDA Compiler) installed. So installing it is our next step.
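
Before installing anything, it can be worth confirming that the compiler really is missing. Here is a minimal sketch in Python, the notebook's own language. The helper name `find_nvcc` is made up for this example, and /usr/local/cuda/bin is only assumed to be the install location because it is the path used later in this post.

```python
import os
import shutil


def find_nvcc(extra_dirs=("/usr/local/cuda/bin",)):
    """Return a path to an nvcc executable, or None if none is found."""
    # First look on PATH, the same way the shell would.
    on_path = shutil.which("nvcc")
    if on_path:
        return on_path
    # Then try the extra directories (assumed install locations).
    for directory in extra_dirs:
        candidate = os.path.join(directory, "nvcc")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


print(find_nvcc() or "nvcc not found")
```

On a fresh runtime this should report that nvcc is not found. After the installation it should print a path.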

!apt update -qq;
!dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb;
!apt-key add /var/cuda-repo-8-0-local-ga2/;
!apt-get update -qq;
!apt-get install cuda gcc-5 g++-5 -y -qq;
!ln -s /usr/bin/gcc-5 /usr/local/cuda/bin/gcc;
!ln -s /usr/bin/g++-5 /usr/local/cuda/bin/g++;
!apt install cuda-8.0;

You can copy these lines into a single Notebook cell. Each line that starts with ! is executed as a shell command.
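
If you are curious what the ! prefix amounts to, the sketch below does roughly the same thing from plain Python with the standard `subprocess` module. The helper name `run_shell` is hypothetical, and this is an approximation of the idea, not the notebook's actual implementation.

```python
import subprocess


def run_shell(command):
    """Run a command through the shell and return its standard output as text."""
    completed = subprocess.run(
        command,
        shell=True,                 # let the shell parse the command line
        check=True,                 # raise if the command exits with an error
        stdout=subprocess.PIPE,
        universal_newlines=True,    # decode the output to str
    )
    return completed.stdout


print(run_shell("echo hello from the shell"), end="")
```

For everyday use the ! form is shorter. The function is handy when you want to capture the output and inspect it in Python.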

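There is no magic. We download the CUDA Debian package and install it (the dpkg line expects the downloaded .deb file to be in the current directory). CUDA 8 requires a GCC version lower than 6, so we install GCC 5 and create symlinks in the CUDA bin directory so that nvcc picks it up as its host compiler.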
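
The version rule is easy to express in code. The sketch below assumes you feed it the text printed by `gcc -dumpversion`. The function names and the sample version strings are made up for illustration.

```python
def gcc_major(dumpversion):
    """Extract the major version from `gcc -dumpversion` output such as '5.4.0' or '7'."""
    return int(dumpversion.strip().split(".")[0])


def ok_for_cuda8(dumpversion):
    """CUDA 8 needs a GCC older than version 6."""
    return gcc_major(dumpversion) < 6


# Hypothetical sample values, not output captured from Colab.
for version in ("5.4.0", "7.3.0"):
    print(version, "->", "ok" if ok_for_cuda8(version) else "too new for CUDA 8")
```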

Now you can test your CUDA installation by running

!/usr/local/cuda/bin/nvcc --version

And the output should be something like

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
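
If you want to check the version from code rather than by eye, a small regular expression is enough. This is a sketch: `parse_nvcc_release` is a hypothetical helper, and the sample text is the output quoted above.

```python
import re


def parse_nvcc_release(version_text):
    """Return (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61"""

print(parse_nvcc_release(sample))  # prints (8, 0)
```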

We are almost done. I created a small extension for running NVCC from Notebook cells. Install it with

!pip install git+git://

Now load the installed extension by running:

%load_ext nvcc_plugin

We are ready to run CUDA C/C++ code right in your Notebook.
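
Conceptually, an extension that runs CUDA cells has to do three things: save the cell text to a file, compile it with nvcc, and run the resulting binary. The sketch below illustrates that idea only. It is not the plugin's source, and the function names, the file names cell.cu and cell.out, and the default compiler path are all assumptions made for the example.

```python
import os
import subprocess
import tempfile


def build_commands(workdir, compiler="/usr/local/cuda/bin/nvcc"):
    """Return the source path plus the compile and run command lines."""
    source = os.path.join(workdir, "cell.cu")    # hypothetical file name
    binary = os.path.join(workdir, "cell.out")   # hypothetical file name
    return source, [compiler, source, "-o", binary], [binary]


def run_cuda_cell(cell_text, compiler="/usr/local/cuda/bin/nvcc"):
    """Write the cell to disk, compile it, run it, and return what it printed."""
    with tempfile.TemporaryDirectory() as workdir:
        source, compile_cmd, run_cmd = build_commands(workdir, compiler)
        with open(source, "w") as handle:
            handle.write(cell_text)
        subprocess.run(compile_cmd, check=True)
        done = subprocess.run(
            run_cmd, check=True,
            stdout=subprocess.PIPE, universal_newlines=True,
        )
        return done.stdout
```

Calling `run_cuda_cell` only works once nvcc is installed, which is why nothing is executed here.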

To do this, we need to tell the interpreter explicitly that we want to use the extension by adding %%cu at the beginning of each cell with CUDA code.

%%cu
#include <iostream>

int main() {
    std::cout << "Hello world\n";
    return 0;
}

To check that everything works, I encourage you to try the matrix multiplication program from the Nvidia course.

%%cu
#include <stdio.h>

#define N  64

// Print a message if a CUDA runtime call failed, then pass the status through.
inline cudaError_t checkCudaErr(cudaError_t err, const char* msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA Runtime error at %s: %s\n", msg, cudaGetErrorString(err));
    }
    return err;
}

__global__ void matrixMulGPU( int * a, int * b, int * c )
{
    /*
     * Build out this kernel.
     */
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;

    int val = 0;
    // Threads that fall outside the N x N matrix do nothing.
    if (row < N && col < N) {
        for (int i = 0; i < N; ++i) {
            val += a[row * N + i] * b[i * N + col];
        }

        c[row * N + col] = val;
    }
}

/*
 * This CPU function already works, and will run to create a solution matrix
 * against which to verify your work building out the matrixMulGPU kernel.
 */
void matrixMulCPU( int * a, int * b, int * c )
{
    int val = 0;

    for( int row = 0; row < N; ++row )
        for( int col = 0; col < N; ++col )
        {
            val = 0;
            for ( int k = 0; k < N; ++k )
                val += a[row * N + k] * b[k * N + col];
            c[row * N + col] = val;
        }
}

int main()
{
    int *a, *b, *c_cpu, *c_gpu; // Allocate a solution matrix for both the CPU and the GPU operations

    int size = N * N * sizeof (int); // Number of bytes of an N x N matrix

    // Allocate memory
    cudaMallocManaged (&a, size);
    cudaMallocManaged (&b, size);
    cudaMallocManaged (&c_cpu, size);
    cudaMallocManaged (&c_gpu, size);

    // Initialize memory; create 2D matrices
    for( int row = 0; row < N; ++row )
        for( int col = 0; col < N; ++col )
        {
            a[row*N + col] = row;
            b[row*N + col] = col+2;
            c_cpu[row*N + col] = 0;
            c_gpu[row*N + col] = 0;
        }

    /*
     * Assign `threads_per_block` and `number_of_blocks` 2D values
     * that can be used in matrixMulGPU above.
     */
    dim3 threads_per_block(32, 32, 1);
    dim3 number_of_blocks(N / threads_per_block.x + 1, N / threads_per_block.y + 1, 1);

    matrixMulGPU <<< number_of_blocks, threads_per_block >>> ( a, b, c_gpu );

    checkCudaErr(cudaDeviceSynchronize(), "Synchronization");
    checkCudaErr(cudaGetLastError(), "GPU");

    // Call the CPU version to check our work
    matrixMulCPU( a, b, c_cpu );

    // Compare the two answers to make sure they are equal
    bool error = false;
    for( int row = 0; row < N && !error; ++row )
        for( int col = 0; col < N && !error; ++col )
            if (c_cpu[row * N + col] != c_gpu[row * N + col])
            {
                printf("FOUND ERROR at c[%d][%d]\n", row, col);
                error = true;
            }
    if (!error)
        printf("Success!\n");

    // Free all our allocated memory
    cudaFree(a); cudaFree(b);
    cudaFree( c_cpu ); cudaFree( c_gpu );
}
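
A note on the launch configuration in this listing. The block count is computed as N / 32 + 1, which for N = 64 gives 3 blocks per dimension even though 2 would cover the matrix. The extra threads are harmless because the kernel checks row < N and col < N. The sketch below compares that formula with a ceiling division. Both function names are made up for this example.

```python
def blocks_plus_one(n, threads):
    """Block count as written in the listing: n / threads + 1 with integer division."""
    return n // threads + 1


def blocks_ceil(n, threads):
    """Smallest block count that still covers n elements (ceiling division)."""
    return (n + threads - 1) // threads


n, threads = 64, 32
print(blocks_plus_one(n, threads))  # 3 blocks per dimension, 96 threads for 64 elements
print(blocks_ceil(n, threads))      # 2 blocks per dimension, exactly 64 threads
```

Either choice gives the right answer here. The ceiling form just avoids launching blocks that have no work to do when N is a multiple of the block size.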

If everything was set up correctly, you should see Success! in the output of the cell.
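
If you want to convince yourself that the result is right, and not just that the CPU and GPU agree, note that the program fills a with the row index and b with the column index plus 2. Every term of the dot product is therefore row * (col + 2), and c[row][col] = N * row * (col + 2). Here is a small pure-Python sketch of that check, with hypothetical function names:

```python
def matmul_reference(n):
    """Multiply the same two matrices the CUDA program builds, the slow way."""
    a = [[row for _ in range(n)] for row in range(n)]        # a[row][col] = row
    b = [[col + 2 for col in range(n)] for _ in range(n)]    # b[row][col] = col + 2
    return [
        [sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
        for r in range(n)
    ]


def closed_form(n):
    """Same result from the formula c[row][col] = n * row * (col + 2)."""
    return [[n * r * (c + 2) for c in range(n)] for r in range(n)]


n = 64
print(matmul_reference(n) == closed_form(n))             # True
print(closed_form(n)[1][0], closed_form(n)[63][63])      # 128 262080
```

The largest entry, 262080, fits comfortably in a 32-bit int, so the int matrices in the CUDA program cannot overflow for N = 64.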

All of the above steps should also work in your local environment. I didn’t intend to create a plugin for everyone, it’s mostly an accident :) So feel free to update it.

I hope it will be useful for someone.

Have fun!