Neural ODEs (An Intuitive Understanding of the Basics)

Imran us Salam
Published in Red Buffer · Jul 30, 2019 · 10 min read

https://arxiv.org/abs/1806.07366

Neural Ordinary Differential Equations are a new approach, proposed by researchers at the University of Toronto and the Vector Institute, to problems such as time-series modelling. The paper received a best paper award at NeurIPS 2018, and the technique it introduces has opened many doors beyond conventional machine learning approaches.

It seems like this paper, just like GANs (link to my medium post), is going to be the next big thing in machine learning theory.

What We’re Covering

Since this particular topic is highly mathematical, I will try my best to keep things intuitive and simple. The goal of this post is to give everyone an idea of:

What is this approach?
How does this approach work?
Why does it work? (covered in the 2nd blog)

This is a two-part series. This first post covers the basics; the second part will cover the dynamics of a Neural ODE and how it is trained. So stay tuned!

To understand the basics, we are going to cover the background that will help us put the pieces together.

  • Basics of Machine Learning
  • Basics of Neural Networks
  • Basics of Residual Neural Networks
  • Differential Equations
  • Ordinary Differential Equations
  • Partial Differential Equations
  • Ordinary Differential Equation Solvers
  • Euler’s Method

Machine Learning With Optimization

The main idea of machine learning is to solve a mapping problem. In mathematics, we take some input, apply some function to it, and get an output.

For example, consider converting temperature from the Fahrenheit to the Celsius scale. We are given some input x in degrees Fahrenheit, we apply a function f to it, and receive an output y in degrees Celsius.

(x°F − 32) × 5/9 = y°C
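
For instance, an input of x = 212 °F gives (212 − 32) × 5/9 = 100 °C.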

Here we apply this function to our input x and we receive an output y.

But what if we don’t know the function f?
This is where machine learning jumps in. We are given the input x and the output y, and we need to learn the function f that maps x to y.

The picture above shows the difference between the two approaches. That may give the impression that machine learning should always be preferred, but this is not the case. Machine learning in reality only approximates a function f:
it will try to map x to y, but the answer won’t always be exactly correct.

Now the way machine learning works is that it takes the input x, applies a transformation using a linear function (mx + b) to it, and produces an output y’.
Initially this transformation is random, meaning the slope m and the bias b start out as random values.

This y’ is then compared to the real y, which gives us some error value. The error value is then used to change the transformation so that the next y’ it produces has a smaller error with respect to y. At some point we reach a state where the error is significantly low and the output being produced is very close to the real output.

We are then going to assume that we have found the almost correct values of m and b for which x maps to y.

But there’s a catch here. Two actually.

  1. How do we know what values of m and b to change to after we compute the error?
  2. This is a linear function; how do we model the mapping if the relationship between x and y isn’t linear?

To answer the 1st question, we will need to learn what optimization in machine learning is and how it works.

Optimization is the process of using the loss (the error value; we’re going to use the two terms interchangeably) to improve the parameters.
Since m and b are the learnable parameters, they are what we optimize: we check how much changing each of them changes the output y’, and hence the loss.
Does that seem familiar?

Let’s go back to school. Remember the famous quantity that tells us how much one variable changes when we change another?

Yes

The derivative.

So in our case, we have d(L) / d(m) and d(L) / d(b) (these are partial derivatives; we’ll come back to that later).

Basically we want to see how much a change in m and b is going to change the loss. We take the partial derivative because we only want to vary m and b and not x, because x is an input and shouldn’t be changed. The learning rate determines how big of a change we make at each step.

The 2nd question is answered with something called an activation function. An activation function is basically a function that helps model a non-linear or complex mapping between x and y. There are a lot of activation functions which I am not going to cover in this article. But one thing to know is that these functions should be differentiable.

So the whole loop looks like this (a small code sketch follows the list):

  1. y’ = ACTIVATION_FUNCTION( m·x + b )
  2. L = Error(y, y’)
  3. d(L) / d(y’) (how the loss changes with the prediction)
  4. d(L) / d(m) = ( d(L) / d(y’) ) * ( d(y’) / d(m) )
  5. d(L) / d(b) = ( d(L) / d(y’) ) * ( d(y’) / d(b) )
  6. m = m - (learning_rate * ( d(L) / d(m) ))
  7. b = b - (learning_rate * ( d(L) / d(b) ))
  8. REPEAT
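
To make the loop above concrete, here is a minimal sketch in Python (NumPy). It reuses the Fahrenheit-to-Celsius mapping from earlier with an identity activation (the true mapping is linear); the data range, learning rate and number of steps are my own illustrative choices, not from the post.

import numpy as np

# Toy data: learn Fahrenheit -> Celsius, i.e. y = (x - 32) * 5/9
x = np.linspace(0.0, 10.0, 50)
y = (x - 32.0) * 5.0 / 9.0

rng = np.random.default_rng(0)
m, b = rng.normal(), rng.normal()   # random slope and bias (the learnable parameters)
learning_rate = 0.02

for step in range(5000):
    y_pred = m * x + b                       # 1. forward pass (identity activation)
    L = np.mean((y_pred - y) ** 2)           # 2. loss: mean squared error
    dL_dy_pred = 2 * (y_pred - y) / len(x)   # 3. d(L)/d(y')
    dL_dm = np.sum(dL_dy_pred * x)           # 4. chain rule: d(L)/d(m)
    dL_db = np.sum(dL_dy_pred)               # 5. chain rule: d(L)/d(b)
    m -= learning_rate * dL_dm               # 6. update m
    b -= learning_rate * dL_db               # 7. update b

print(m, b)   # approaches 5/9 ≈ 0.556 and -160/9 ≈ -17.78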

Neural Networks

Sometimes, if not most of the time, the raw input x alone is not enough to directly optimize a function f that maps straight to y.

WE NEED MORE FEATURES.

This is where Neural Networks come into action, they can map our input features x to a higher dimension.

The steps involved in optimization are the same. Except for some changes.

y1 = ACTIVATION_FUNCTION( m1·x + b1 )
y2 = ACTIVATION_FUNCTION( m2·y1 + b2 )
y3 = ACTIVATION_FUNCTION( m3·y2 + b3 )
….
y’ = ACTIVATION_FUNCTION( mn·y(n-1) + bn )

The idea is to basically add layers like these.

In the image we can see that the 2nd layer has more features than the input, so we now have more features to work with.

The gradients flow the same way. Again, everything here is differentiable, and that is important. (A code sketch follows the list below.)

  1. y1 = ACTIVATION_FUNCTION( m1·x + b1 )
  2. y2 = ACTIVATION_FUNCTION( m2·y1 + b2 )
  3. y’ = ACTIVATION_FUNCTION( m3·y2 + b3 )
  4. L = Error(y, y’)
  5. d(L) / d(y’) (how the loss changes with the prediction)
  6. d(L) / d(m3) = ( d(L) / d(y’) ) * ( d(y’) / d(m3) )
  7. d(L) / d(b3) = ( d(L) / d(y’) ) * ( d(y’) / d(b3) )
  8. d(L) / d(m2) = ( d(L) / d(y2) ) * ( d(y2) / d(m2) ), where d(L) / d(y2) = ( d(L) / d(y’) ) * ( d(y’) / d(y2) )
  9. d(L) / d(b2) = ( d(L) / d(y2) ) * ( d(y2) / d(b2) )
  10. d(L) / d(m1) = ( d(L) / d(y1) ) * ( d(y1) / d(m1) ), where d(L) / d(y1) = ( d(L) / d(y2) ) * ( d(y2) / d(y1) )
  11. d(L) / d(b1) = ( d(L) / d(y1) ) * ( d(y1) / d(b1) )
  12. m3 = m3 - (learning_rate * ( d(L) / d(m3) ))
  13. b3 = b3 - (learning_rate * ( d(L) / d(b3) ))
  14. m2 = m2 - (learning_rate * ( d(L) / d(m2) ))
  15. b2 = b2 - (learning_rate * ( d(L) / d(b2) ))
  16. m1 = m1 - (learning_rate * ( d(L) / d(m1) ))
  17. b1 = b1 - (learning_rate * ( d(L) / d(b1) ))
  18. REPEAT
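
Here is a minimal NumPy sketch of the steps above for a network with one hidden layer. The sine-fitting task, the hidden width of 8, tanh as the activation, and the hyperparameters are illustrative choices of mine, not from the original post.

import numpy as np

rng = np.random.default_rng(0)

# Toy non-linear mapping: y = sin(x), which a single linear unit cannot fit
x = np.linspace(-3, 3, 64).reshape(-1, 1)   # shape (N, 1)
y = np.sin(x)

# Two layers: x -> 8 hidden features -> 1 output, with tanh as the activation
m1, b1 = rng.normal(0, 0.5, (1, 8)), np.zeros((1, 8))
m2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros((1, 1))
lr = 0.05

for step in range(5000):
    # forward pass
    y1 = np.tanh(x @ m1 + b1)           # layer 1
    y_pred = y1 @ m2 + b2               # layer 2 (linear output)
    L = np.mean((y_pred - y) ** 2)      # loss

    # backward pass: chain rule from the last layer back to the first
    dL_dy_pred = 2 * (y_pred - y) / len(x)
    dL_dm2 = y1.T @ dL_dy_pred
    dL_db2 = dL_dy_pred.sum(axis=0, keepdims=True)
    dL_dy1 = dL_dy_pred @ m2.T
    dL_dz1 = dL_dy1 * (1 - y1 ** 2)     # derivative of tanh
    dL_dm1 = x.T @ dL_dz1
    dL_db1 = dL_dz1.sum(axis=0, keepdims=True)

    # gradient descent updates (the loss L should steadily decrease)
    m2 -= lr * dL_dm2; b2 -= lr * dL_db2
    m1 -= lr * dL_dm1; b1 -= lr * dL_db1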

So we can think of a neural network as one giant function made of functions within functions.

Now, how do we decide the number of layers? The honest answer is that we don’t really know how to.

Usually a small number of layers is enough when we are not modelling a complex problem or mapping, but as the problem gets more complex, the general idea is to increase the number of layers.
However, it’s very difficult to train a neural network with a large number of layers.

There are two specific reasons,

  1. Overfitting
  2. Vanishing Gradients

Overfitting can be explained by the diagram below

Overfitting means the model specializes to the training data only, essentially memorizing it instead of generalizing. The other extreme is underfitting, where the model is too simple and just generalizes everything with some line or plane.

The other problem is vanishing gradients: as we increase the number of layers, the gradients flowing back to the early layers become smaller and smaller. This can cause the model to stop learning or not converge at all.

Residual Neural Networks

A residual Neural Network is exactly the same as a neural network with just one slight modification.

In a conventional neural network, the output of one layer is passed to the next layer. But in a resnet (I will use residual network and resnet interchangeably), we don’t just send the output of layer k to layer k+1. Instead, we pass on the output of layer k added to the input of layer k.

x(k+1) = x(k) + F(x(k))

F is a function that represents one layer: F(x(k)) is the output of layer k, and x(k) is its input. The two are added together before we go to the next layer.
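
As a tiny sketch (my own illustration, not code from the post), a residual block in NumPy is just an ordinary layer plus a skip connection; note that x(k) and F(x(k)) must have the same shape for the addition to make sense.

import numpy as np

def layer(x, m, b):
    # One ordinary layer: F(x) = activation(x·m + b)
    return np.tanh(x @ m + b)

def residual_block(x_k, m, b):
    # Residual connection: x_{k+1} = x_k + F(x_k)
    return x_k + layer(x_k, m, b)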

Derivatives vs Differential Equations

A differential equation is an equation that involves derivatives of a function; the derivatives can be of any order.

Derivative of a Function

Assume we have a function, any function, and we want to find its slope at some point. The derivative of the function at that point gives us exactly that slope.

Hence the derivative is just an operator whose input is a function and whose output is also a function.

Differential Equation

A differential equation is an equation that contains derivatives, and these derivatives tell us something about the original function itself.

So a differential equation describes the relationship between a function and its derivatives.

For example

f’’(x) + 2f’(x) + f(x) = 7

Differential Equations are divided into two main parts.

  1. Ordinary Differential Equations
  2. Partial Differential Equations

Partial Differential Equations

In a partial differential equation, the equation contains partial derivatives with respect to more than one independent variable.

Ordinary Differential Equations

In an ordinary differential equation, all derivatives are taken with respect to a single independent variable.

In the examples above, we only have one independent variable, x.

A simple ODE would be df/dx = f’(x), or equivalently df = f’(x)·dx

Solving an Ordinary Differential Equation

We have an equation involving derivatives, that is, a differential equation, and now we want to find the original function. The methods to do that are called ordinary differential equation solvers.

Now there are several methods of solving an ODE, but they usually involve taking the anti-derivative, or integral, of the function.

Let’s try to solve a simple ODE.

d(F) / d(x) = f(x)

F(x) = ∫ f(x) dx + C

F’(x) = f(x)
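
For example, if dF/dx = 2x, then F(x) = ∫ 2x dx = x² + C, and differentiating indeed gives back F’(x) = 2x. The constant C is why a solver also needs an initial condition to pin down one particular solution.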

Euler’s Method

Now, one way to solve a first-order differential equation is by a method called Euler’s method. Euler’s method is a numerical method, which means it works in discrete time steps.

Remember the goal is to approximate the function from a differential equation.

A very good explanation of Euler’s method is given in this link.

Basically what the method states is that, given some initial point (x0, y0) and the differential equation, we can estimate the original function step by step.

y(t + δ) = y(t) + δ·f(t, y)

This is the formula for Euler’s method. We are not going to see how it is derived, but rather how it works.

Imagine we want to estimate the function that is the blue line. Euler’s method says to start at some initial point A0, draw a tangent using the differential equation, step along it to the point A1, and so forth. (This is shown very clearly in the video link.)
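
Here is a minimal Python sketch of that update rule; the exponential-growth example dy/dt = y is my own choice for illustration.

def euler(f, t0, y0, delta, steps):
    # Approximate y(t) from dy/dt = f(t, y), starting at (t0, y0)
    t, y = t0, y0
    trajectory = [(t, y)]
    for _ in range(steps):
        y = y + delta * f(t, y)   # y(t + δ) = y(t) + δ·f(t, y)
        t = t + delta
        trajectory.append((t, y))
    return trajectory

# Example: dy/dt = y with y(0) = 1, whose exact solution is e^t
approx = euler(lambda t, y: y, t0=0.0, y0=1.0, delta=0.01, steps=100)
print(approx[-1])   # (1.0, ≈2.7048), close to the true value e ≈ 2.71828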

In the residual networks we saw a very similar formula. But in neural networks (residual networks included), we look at each layer as a discrete unit and try to optimize its weights.

Let’s Relax

Too much math for now. Let’s sit back and review what we have so far. We have studied two main things.

Machine Learning Basics

Approximate a function f based on a hypothesis that there is a relationship between x and y.

Some basic Calculus

Approximating a function f from a given ordinary differential equation.

The goal in the above two definitions is the same: we are trying to approximate a function. But in the calculus part we are actually using a differential equation solver (Euler’s method) to get the function back.

So in a Neural ODE, we use Euler’s method (or another ODE solver) on something that looks like a residual network but has just one continuous unit instead of many discrete layers; a rough sketch of this idea follows the list below.
And the way to optimize it is that we use the
1. Change in the loss with respect to the parameters (there is just one set, shared across depth)
2. Change in the loss with respect to the hidden state at time t
3. Changes in the hidden states themselves over time
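
As a rough illustration of the forward idea only (training via the quantities above is what part two will cover, and the actual paper uses a black-box ODE solver rather than the fixed-step Euler loop shown here), a Neural-ODE-style block can be sketched like this:

import numpy as np

def f(h, t, params):
    # The "continuous layer": dh/dt, analogous to F(x(k)) in a resnet.
    # One set of weights is shared across all of "depth" t.
    m, b = params
    return np.tanh(h @ m + b)

def ode_forward(h0, params, t0=0.0, t1=1.0, steps=100):
    # Fixed-step Euler: h(t + δ) = h(t) + δ·f(h, t, params),
    # i.e. a residual network with many tiny steps.
    delta = (t1 - t0) / steps
    h, t = h0, t0
    for _ in range(steps):
        h = h + delta * f(h, t, params)
        t = t + delta
    return h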

The goal of this post was to give a relatively simple understanding of the basics needed to understand Neural ODEs. I will be covering the Neural ODE itself in the second part of this series. So stay tuned, and thank you.

References

https://arxiv.org/abs/1806.07366

https://jontysinai.github.io/jekyll/update/2019/01/18/understanding-neural-odes.html

http://tutorial.math.lamar.edu/Classes/DE/EulersMethod.aspx

https://www.khanacademy.org/math/ap-calculus-bc/bc-differential-equations-new/bc-7-5/v/eulers-method

https://www.youtube.com/watch?v=AD3K8j12EIE&t=1491s

https://rkevingibson.github.io/blog/neural-networks-as-ordinary-differential-equations/
