# [ Google ] Continuously Differentiable Exponential Linear Units with Interactive Code [ Manual Back Prop with TF ]

Jonathan T. Barron, a researcher at Google, proposed CELU(), which stands for “Continuously Differentiable Exponential Linear Units”. In short, this new activation function is differentiable everywhere. (The general ELU() is not differentiable everywhere when its alpha value is not set to 1.)

Since I recently covered “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)” (please click here to read the blog post), it naturally makes sense to cover this paper next. Finally, for fun, let’s train our network using different optimization methods.

Case a) Auto Differentiation with ADAM optimizer (MNIST dataset)
Case b) Auto Differentiation with ADAM optimizer (CIFAR10 dataset)
Case c) Manual Back Prop with AMSGrad optimizer (CIFAR10 dataset)
Case d) Dilated Back Prop with AMSGrad optimizer (CIFAR10 dataset)

Continuously Differentiable Exponential Linear Units

Left Image → Equation for CELU()
Right Image → Equation for ELU()

Above is the equation for CELU(); we can already see that it is not that different from the original ELU(). The only difference is that x is divided by alpha inside the exponential when x is smaller than zero. Now let’s see what this activation function looks like.
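Since the equations above are images, here is a minimal NumPy sketch of the two definitions; the only change in CELU() is the division of x by alpha inside the exponential (the alpha value of 2 follows the setting used later in this post):

```python
import numpy as np

def elu(x, alpha=2.0):
    # ELU: the exponential is applied to x directly, so for alpha != 1
    # the derivative jumps at x = 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def celu(x, alpha=2.0):
    # CELU: x is divided by alpha inside the exponential -- the only
    # change, and it makes the derivative continuous at x = 0
    return np.where(x > 0, x, alpha * (np.exp(x / alpha) - 1.0))
```

Note that for alpha = 1 the two functions coincide, which is why the original ELU() is only non-differentiable when alpha is not 1.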

Left Image → How CELU() and its derivative look when graphed
Right Image → How ELU() and its derivative look when graphed

In the rightmost image (the derivative of ELU()) we can observe that the function is not continuous, while the derivative of CELU() is continuous everywhere. Now let’s see how to implement CELU() and its derivative.

First, let’s take a look at the derivative of CELU() with respect to the input x.

When implemented in Python (TensorFlow) it would look something like the above; please note that I have set the alpha value to 2.
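In case the snippet image does not load, a NumPy sketch of CELU() and its derivative (with alpha set to 2, as in this post) can be checked against a finite-difference approximation; the two branches of the derivative both equal 1 at x = 0, which is the whole point of CELU():

```python
import numpy as np

ALPHA = 2.0  # alpha value used in this post

def celu(x, alpha=ALPHA):
    return np.where(x > 0, x, alpha * (np.exp(x / alpha) - 1.0))

def d_celu(x, alpha=ALPHA):
    # derivative: 1 for x > 0, exp(x / alpha) otherwise;
    # both branches meet at exp(0) = 1, so it is continuous at x = 0
    return np.where(x > 0, 1.0, np.exp(x / alpha))

# finite-difference sanity check of the analytic derivative
x = np.linspace(-3.0, 3.0, 61)
h = 1e-6
numeric = (celu(x + h) - celu(x - h)) / (2.0 * h)
```

The numeric and analytic derivatives agree to high precision over the whole range, including at zero.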

Network Architecture

Red Rectangle → Input Image (32*32*3)
Black Rectangle → Convolution with CELU() with / without mean pooling
Orange Rectangle → Softmax for classification

The network that I am going to use is just a seven-layer network with mean pooling operations. Since we are going to use both the MNIST and CIFAR10 data sets, the number of mean pooling layers was adjusted accordingly to fit each image dimension.
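The idea of adjusting the pooling count to the image size can be sketched as below; the specific pool counts are my illustrative assumption (each stride-2 mean pool halves the spatial dimension), not numbers taken from the post’s code:

```python
def spatial_size_after_pools(input_size, num_pools, pool_size=2):
    # each mean-pooling layer with stride 2 halves the spatial dimension
    size = input_size
    for _ in range(num_pools):
        size //= pool_size
    return size

# hypothetical example: CIFAR10 (32x32) with three pools ends at 4x4,
# while MNIST (28x28) with two pools ends at 7x7
```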

Case 1) Results: Auto Differentiation with ADAM optimizer (MNIST dataset)

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

Well, for the MNIST data set it was an easy 95+% accuracy for both test and training images. I normalized the cost over time for both training / testing images to make the graph look prettier.

Final accuracy for both testing and training images was 98+%.

Case 2) Results: Auto Differentiation with ADAM optimizer (CIFAR10 dataset)

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

Now we can observe the model starting to suffer from over-fitting. With the Adam optimizer, the accuracy on test images stagnated at 74 percent.

Final accuracy for testing images was 74% while training accuracy was 99%, indicating over-fitting of our model.

Case 3) Results: Manual Back Prop with AMSGrad optimizer (CIFAR10 dataset)

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

For this experiment AMSGrad did better than the regular Adam optimizer. Although the test image accuracy was not able to pass 80% and the model still suffered from over-fitting, it gave results better than Adam’s by 4 percent.

Final accuracy on test images was 78%, indicating the model is still over-fitted, but that is 4% better than regular Adam without any regularization techniques. Not bad.
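AMSGrad itself is a one-line modification of Adam: it keeps a running maximum of the second-moment estimate so the effective learning rate can never grow back. A minimal NumPy sketch (following the original AMSGrad formulation, without Adam’s bias correction; the function name and state layout are my own) might look like:

```python
import numpy as np

def amsgrad_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # state holds the first moment m, second moment v, and running max v_hat
    m, v, v_hat = state
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    v_hat = np.maximum(v_hat, v)            # the one change from Adam
    w = w - lr * m / (np.sqrt(v_hat) + eps)
    return w, (m, v, v_hat)

# usage: minimize f(w) = w^2, whose gradient is 2w
w = np.array(5.0)
state = (np.zeros(()), np.zeros(()), np.zeros(()))
for _ in range(2000):
    w, state = amsgrad_step(w, 2.0 * w, state, lr=0.05)
```

Because v_hat is monotone non-decreasing, step sizes only shrink, which is the fix for the cases where Adam’s adaptive learning rate misbehaves.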

Case 4) Results: Dilated Back Prop with AMSGrad optimizer (CIFAR10 dataset)

Purple Arrow → Regular Gradient Flow from Back Propagation
Black Curved Arrow → Dilated Back prop for increased Gradient Flow

With that architecture in mind, let’s see how our network performed on the test images.

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

Again it did better than the regular Adam optimizer; however, it was still not able to pass 80% accuracy on the test images.

Final accuracy was 77% on testing images, 1% less than regular back propagation with AMSGrad, but 3% better than regular Adam.

Interactive Code

To view the code in Google Colab you will need a Google account. Also, you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding! For transparency, I have also uploaded all of the logs from training.

Final Words

It was interesting to see how differently each network performed depending on how it was trained. CELU() seems to perform better (at least for me) than the ELU() activation function.

If any errors are found, please email me at jae.duk.seo@gmail.com; if you wish to see the list of all of my writings, please visit my website here.