Writing a new Tensorflow operation (including C++, CUDA, forward, gradient and grad check)

Jesper Taxbøl
Mar 12, 2018 · 7 min read

Usually I am working with vision problems using openCV and Tensorflow. The seperation between the two has always been enforced hard, which mean that I have used “regular programming”, using knowledge about geometry etc on CPU and passed images to tensorflow for inference. Sometimes an idea arise where I would like to cross that border, and use geometry knowledge in my inference graph.

The past year my ideas in this direction has been abandoned because of complexities encountered during implementation. I usually end up wasting bandwidth and processing power to such a level that my prototype can not move on. But that is a different story.

The point is that I have had an idea for a custom operation that would integrate geometry knowledge in my Tensorflow graph. This article serves as my notes in the process of learning how to do a custom operation for Tensorflow. I have chosen a simple dense layer as an example.

On shoulders of giants

This work is based on work of others. The closest resources are be referred here.

1) Tensorflow has an article covering writing a custom op

It covers writing a C++ implementation, lacks gradient computation and rely on Bazel as buildengine. I have not learned to love Bazel yet.

2) The second article I found that really got me started was this one.

It covers implementing an operation in C++ with both forward and gradient passes implemented, as well as using Cmake as build engine to produce a portable .so module. The module can be used from python based Tensorflow scripts. This project worked out of the box for me and was what really goyt me started.

3) The last article worth mentioned was.

It covers writing an op in C++ as well as implementing it in cuda for GPU acceleration.

Borrowing pieces from these three sources allowed me to implement a simple dense layer operation for both CPU and GPU acceleration.

Full sourcecode for this article is shared here:

Forward pass

The forward pass is easy and intuitive. It follows the classical description where a neuron has a bias and one weight per connection to each preceeding neuron. In this case the operation deals with a whole layer of neurons and not just a single one at a time.

Note that the activation function is left out to keep complexity low. I basically just add it after the custom operation in python.

The forward pass is straightforward.

for (int sample = 0; sample < batch_samples; sample++)
for (int unit = 0; unit < units; unit++)
output_tensor(sample, unit) = 0;
for (int input = 0; input < input_feature_width; input++)
output_tensor(sample, unit) +=
input_tensor(sample, input) *
weights_tensor(input, unit);
output_tensor(sample, unit) += biases_tensor(0, unit);


Implementing the operations gradient is needed for back-propagation to work. It stumped me for a while exactly how to compute the gradient as well as understanding how one could output a gradient in the form of a X dimensional matrix as seen in reference 2.

I fiddled a bit with the mathematics and finally wrote out the equations completely for each component of the vector. It was tedious and took a fair bit of time, but made some sense and got me started. Doing the matrix computations with different sizes for dimensions actually help me fit the pieces together.

Image for post
Image for post

Looking at the inner product code from the 2. article, I got a hint on what the gradient method gets as input and what it should produce in its output. In my understanding the gradient function gets a grad matrix being the same shape as the output of the forward pass. The matrix contains a part of the graph error that the back-propagation algorithm has transported backwards through the graph. The back-propagation method need the gradient operation to decide how to let the error flow further backward where it is ultimately used to adjust trainable values. (weights and biases)

The practical output of the grad operation should be matrices shaped to match all inputs that the operation is given.

Examining each component of each output of the grad output, reveals that the gradient is relatively simple computed. Its basically just expressing how much does the output change relative to the input and multiply with the gradient being given containing the size of the error at that point.

Initially I could, partly through guessing, come up with a hand built expression for each matrix I needed to produce, but I could not seem to express just how I got to the expression.

I was thinking about how this could be done in a more systematic way I came to the insight that the forward pass expresses all the connections between input and output by adding small contributions. If I could only run the forward pass and accumulate the “inverse” operation I might have a systematic approach.

This attempt can be seen here (with the forward pass operation commented out):

//Output matrices are set to zero outside loopfor (int sample = 0; sample < batch_samples; sample++)
for (int unit = 0; unit < units; unit++)
//output_tensor(sample, unit) = 0;
for (int input = 0; input < input_feature_width; input++)
/*output_tensor(sample, unit) +=
input_tensor(sample, input) *
weights_tensor(input, unit);*/

grad_input_tensor(sample, input) +=
weights_tensor(input, unit )*grad_tensor(sample, unit);
grad_weights_tensor(input, unit ) +=
input_tensor(sample, input)*grad_tensor(sample, unit);
//output_tensor(sample, unit) += biases_tensor(0, unit);
grad_biases_tensor(0, unit) += grad_tensor(sample, unit);

It can be seen for the forward pass that

output += input*weights

Having to express the error that need to be fed back to the input goes something like this. Input effects output multiplied by the weights. So the error being fed from output to input should be multiplied by the weights as well when being fed back.

grad_input += weights*grad_output 

Relevant indices need to be added but they can be copied from the forward pass directly.

It feels a bit hacky, but gets the job done and seems to work for my test example.

When I get more time I will dig into how exactly the grad matrix is done inside Tensorflow and how I can get to the same result through plain differentiation. The following part explains why I can satisfy my need with this approach for now.

Verifying the gradient method

Having hand built a function mostly through intuition always leaves some doubt that it is flawed. Gradient checking is a method described in this video by Andrew Ng that can help build confidence in that the gradients are computed correctly.

It basically says that the gradient is equal to the slope of the output of the forward pass. The gradient can be computed numerically by sampling the forward pass at two close points points. This can be compared to the analytically computed slope. If the two estimates are close we believe the gradient is correct.

One thing that stomped me for a while is that it is needed that the output of the graph being tested ends up in a single scalar, as that scalar is being assembled to a size so it can be compared to the input. In the test case I have used reduce_sum to make sure the dense operation ends up in a single scalar.

One insight on the way was that I needed to use float64 precision to get the kind of precision that the video explains as good (Lower than 10e-7). I wonder if I can safely go back to float32 precision.

Implementing this method verified that my intuitive gradient computation method actually seem to work. At least for this case and the random weights I have initialized the graph with. :)


The cuda implementation could be beefed up to be faster by flattening the input to the operation. In this way the gradient operation can be computed in one cuda operation instead of three. But this is a detail I will keep for later.

Currently the build process is a bit ugly, but it works and allow me to move on. I used this command when building.

rm -rf build && mkdir build && cd build && cmake .. && make && cd .. && python dense_tests.py


The code contain a number of tests that try out both GPU and CPU implementations work, as well as doing a gradient check.


I built a small test of a network in main.py where the operation is tested on a simple curve fitting assignment. Its fun to play with different activation functions to see how fitting happens.

python main.py

Extra notes

It seems that tensorflow does smart things to the tensors when registering the operation as CPU of GPU. I wondered for a while on how to keep a tensor on the GPU between operations to save constant uploading and downloading tensors to and from the GPU. It seems that Tensorflow does this for me automatically. Which is nice.


Having built a simple operation in Tensorflow including computing gradients and GPU acceleration has given me the confidence I needed to move on and actually start implementing my custom operation. Work can now begin in this direction.

I hope these notes can help you do something similar.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store