Gradient Descent, the Dart way

Ilia Gyrdymov
8 min read · Sep 6, 2022


This post is part of a series of articles on Machine Learning in Dart.


In my last article, we got familiar with a concise and elegant solution to the Ordinary Least Squares problem — The Closed Form Solution.

This article will dig into another optimization algorithm — gradient descent. At the end, we will implement the algorithm in the Dart programming language.

Table of contents:

  1. The intuition behind the Gradient Descent
  2. Mathematical foundation of the algorithm
  3. Implementation of the Gradient Descent in Dart programming language

The intuition behind the Gradient Descent

Let’s remember what our goal is: we have to find the minimum of the cost function, which is the squared error in our case. A typical graph of such a function is a U-shaped curve.

The graph looks like a pit, so let’s imagine that we are standing at its edge. It’s completely dark, and we have to find the bottom of the pit.

Let’s take a step back. It’s evident that we are ascending: we can feel it, since ascending is always harder than descending. So the backward direction won’t lead us to the bottom. Let’s change direction and take a few steps forward. Now we can feel that we are descending. After many steps, the slope becomes less steep. Finally, we reach the bottom, but we don’t know it yet. We take a step forward, feel that we are starting to ascend, and step back. Now we know for sure that we are at the bottom of the pit.

We’ve just described the gradient descent algorithm. To make a computer understand it, we should first translate the intuitive description into math and then into code in a programming language.

Mathematical foundation of the algorithm

A step, in the context of math, means finding the derivative of the function at some point. The derivative, in turn, tells us how steep the function is at that point:

  • a large absolute value of the derivative implies that the function is quite steep at that point;
  • a small absolute value means that the function is relatively flat at that point;
  • a value equal to zero means that the point is a local optimum of the function;
  • a positive sign of the derivative means that the function is ascending;
  • a negative sign of the derivative means that the function is descending.

Formally, the minimum is the coordinate of the lowest point of the graph. At every iteration of the gradient descent procedure, the coordinate is updated.

Here we should come up with an update rule.

We need to change the coordinate in the direction of the minimum, but how can we determine the direction? It’s easy: if the derivative at the current point is negative, the function is descending there, and we may move the coordinate to the right.

If the derivative at the point is positive, the function is ascending, and we need to change the direction and move the coordinate to the left.

If the derivative is zero, we are at the minimum.

The following rule describes the process best:

w_new = w_current − f′(w_current)

In the case of a negative derivative, the subtracted term is negative, so the coordinate moves to the right. In the case of a positive derivative, the subtracted term is positive, so the coordinate moves to the left.

For more precise convergence, it’s better to add a coefficient parameter for the derivative:

w_new = w_current − η · f′(w_current)

The coefficient η is called the learning rate.

When should we stop the process? Let’s look at the difference between the new and the current coordinates:

| w_new − w_current |

If the resulting value is too big, we should repeat the process.
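The whole one-dimensional procedure fits in a few lines of plain Dart. In the sketch below, the function f(w) = (w − 5)², its derivative, the learning rate value, and the stopping threshold are my own illustrative choices, not something from the article:

```dart
// A minimal one-dimensional gradient descent sketch.
// We minimise f(w) = (w - 5)^2, whose derivative is f'(w) = 2 * (w - 5).
double derivative(double w) => 2 * (w - 5);

double gradientDescent1D(double start, double learningRate, double minDiff) {
  var w = start;
  while (true) {
    // The update rule: step against the sign of the derivative.
    final wNew = w - learningRate * derivative(w);
    // Stop once the coordinate barely changes between iterations.
    if ((wNew - w).abs() <= minDiff) {
      return wNew;
    }
    w = wNew;
  }
}

void main() {
  final minimum = gradientDescent1D(0, 0.1, 1e-9);
  print(minimum); // converges to a value close to 5
}
```

Each iteration multiplies the distance to the minimum by a constant factor smaller than one, so the steps shrink geometrically until the stopping threshold is hit.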

Our next challenge is finding out what to do with a cost function of many arguments. Recall what the function is:

J(w) = Σᵢ ( yᵢ − xᵢ · w )², where xᵢ · w = Σⱼ xᵢⱼ · wⱼ

where j runs from 1 to M, M is the number of features, i is the index of a row of the feature matrix, and j is the column index.

Let’s find an expression for the derivative value with respect to every argument wⱼ. Since our function is “complex”, meaning it consists of several nested parts, we must apply the chain rule. First, let’s find the derivative of the outer square:

2 · ( yᵢ − xᵢ · w )

Then let’s find the derivative of the inner part of the function with respect to wⱼ:

−xᵢⱼ

And finally, let’s multiply both expressions:

∂Jᵢ/∂wⱼ = −2 · xᵢⱼ · ( yᵢ − xᵢ · w )

Great, now we have a bunch of partial derivatives, so we need to organize them somehow. Here we come to the gradient vector definition: the gradient is a vector consisting of the partial derivatives with respect to every argument:

∇J(w) = [ ∂J/∂w₁, ∂J/∂w₂, …, ∂J/∂w_M ]ᵀ

The update rule stays the same. We just need to rewrite it in vector notation:

w_new = w_current − η · ∇J(w_current)

The detailed gradient for a single row xᵢ of the feature matrix looks like this:

∇Jᵢ(w) = −2 · xᵢᵀ · ( yᵢ − xᵢ · w )

But there are a lot of rows in the matrix; how can we calculate the gradient if every row results in its own gradient? It appears that we may just sum up the gradients of all the rows, which in matrix notation gives:

∇J(w) = Σᵢ ∇Jᵢ(w) = −2 · Xᵀ · ( Y − X · w )

So the rule now looks as follows:

w_new = w_current − η · ( −2 · Xᵀ · ( Y − X · w_current ) )

Implementation of the Gradient Descent in Dart programming language

Let’s try to code the gradient descent procedure in Dart!

I’m going to use the ml_linalg package. The library simplifies work with vectors and matrices. It also helps us do the math efficiently, since the library is SIMD-based under the hood.

The procedure will be organized as a function. Let’s come up with its signature:
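The original snippet is not preserved here, but based on the description that follows, the signature might look like this sketch (the parameter names are my own guesses at what the original code used):

```dart
import 'package:ml_linalg/matrix.dart';

// A sketch of the signature: the function takes the feature matrix X,
// the column matrix Y with labels, and a starting point for the search,
// and returns the found coefficients as a column matrix.
Matrix gradientDescent(
  Matrix x,
  Matrix y,
  Matrix initialCoefficients,
) {
  // The body is assembled step by step in the rest of the article.
  throw UnimplementedError();
}
```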

To perform the gradient descent, we need a feature matrix X, a column matrix Y with labels and a column matrix initialCoefficients. The function’s output is a column matrix of coefficients, which correspond to the minimum point of the cost function.

The gradient descent is an iterative procedure, meaning we must define stopping criteria. The most obvious criterion is the iteration count constraint. The second criterion may be the difference between two contiguous iterations; if it is less than or equal to some limit, we should stop the procedure:
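As a sketch, those two criteria could shape the main loop as follows. The constants 50 and 1e-5 are illustrative, and computeNewCoefficients is a placeholder for the update step derived in the next section:

```dart
import 'package:ml_linalg/matrix.dart';

const iterationLimit = 50;
const minCoefficientDiff = 1e-5;

// A stand-in for the update rule; the real step is derived below.
Matrix computeNewCoefficients(Matrix x, Matrix y, Matrix coefficients) =>
    coefficients;

Matrix gradientDescent(Matrix x, Matrix y, Matrix initialCoefficients) {
  var coefficients = initialCoefficients;
  var coefficientsDiff = double.infinity;

  for (var i = 0;
      i < iterationLimit && coefficientsDiff > minCoefficientDiff;
      i++) {
    final newCoefficients = computeNewCoefficients(x, y, coefficients);

    // The norm collapses the matrix of differences into a single number.
    coefficientsDiff = (newCoefficients - coefficients).norm();
    coefficients = newCoefficients;
  }

  return coefficients;
}
```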

Let’s break it down:

  • We chose 50 as the iteration count constraint. Hopefully, it will be enough to converge.
  • If the difference between the new and the current coefficients is less than minCoefficientDiff, the algorithm has converged, and we should stop the procedure. Mathematically, we are near the optimum of our cost function: since the coefficients changed only a little, the derivative at this point is almost zero.
  • The difference should be a single number, but the subtraction of the new and current coefficients gives us a column matrix; that’s why we used the matrix method norm. This is completely valid, since the smaller the difference between the new and old coefficients, the smaller the norm of that difference. By assigning newCoefficients to coefficients, we let the loop begin a new iteration: the new coefficients are no longer new, they are the current ones.

Let’s code the coefficients update rule. According to the formula which we inferred above for the gradient:

∇Jᵢ(w) = −2 · xᵢᵀ · ( yᵢ − xᵢ · w )

we should take each element from the i-th row of the matrix X one after another and multiply it by the expression in parentheses. The expression, in turn, multiplies the entire i-th row by the current coefficients. It seems we can shorten this using matrix notation: why should we take the elements of the i-th row of X one by one when we may use the entire matrix X? The same applies to the expression in parentheses: we can multiply the whole matrix X by the current coefficients.

The code:
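In ml_linalg terms, the update step might look like this sketch (the function name and the learningRate parameter are my own naming assumptions):

```dart
import 'package:ml_linalg/matrix.dart';

// One update step of the descent: compute the gradient of the squared
// error cost and step against it.
Matrix computeNewCoefficients(
  Matrix x,
  Matrix y,
  Matrix coefficients,
  double learningRate,
) {
  // gradient = -2 * X^T * (Y - X * W); transposing X makes the
  // dimensionalities conform.
  final gradient = x.transpose() * (y - x * coefficients) * -2;

  // The update rule itself.
  return coefficients - gradient * learningRate;
}
```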

To conform to the dimensionality, we must transpose X. The last line is the update rule.

Let’s now collect all the pieces together:
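Assembled, the whole procedure might look like this sketch (the hyperparameter defaults are my assumptions; the article’s original values may differ):

```dart
import 'package:ml_linalg/matrix.dart';

Matrix gradientDescent(
  Matrix x,
  Matrix y,
  Matrix initialCoefficients, {
  double learningRate = 1e-3,
  double minCoefficientDiff = 1e-5,
  int iterationLimit = 50,
}) {
  var coefficients = initialCoefficients;
  var coefficientsDiff = double.infinity;

  for (var i = 0;
      i < iterationLimit && coefficientsDiff > minCoefficientDiff;
      i++) {
    // gradient = -2 * X^T * (Y - X * W)
    final gradient = x.transpose() * (y - x * coefficients) * -2;
    final newCoefficients = coefficients - gradient * learningRate;

    coefficientsDiff = (newCoefficients - coefficients).norm();
    coefficients = newCoefficients;
  }

  return coefficients;
}
```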

Great, let’s test it on a simple example:
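A run on the small dataset behind the equations below might look like this sketch. The zero starting point and the hyperparameter defaults are my assumptions, and the gradientDescent function assembled above is repeated so the snippet runs on its own:

```dart
import 'package:ml_linalg/matrix.dart';

Matrix gradientDescent(
  Matrix x,
  Matrix y,
  Matrix initialCoefficients, {
  double learningRate = 1e-3,
  double minCoefficientDiff = 1e-5,
  int iterationLimit = 50,
}) {
  var coefficients = initialCoefficients;
  var coefficientsDiff = double.infinity;

  for (var i = 0;
      i < iterationLimit && coefficientsDiff > minCoefficientDiff;
      i++) {
    final gradient = x.transpose() * (y - x * coefficients) * -2;
    final newCoefficients = coefficients - gradient * learningRate;

    coefficientsDiff = (newCoefficients - coefficients).norm();
    coefficients = newCoefficients;
  }

  return coefficients;
}

void main() {
  final x = Matrix.fromList([
    [2.0, 2.0],
    [3.0, 3.0],
    [4.0, 4.0],
    [5.0, 5.0],
  ]);
  final y = Matrix.fromList([
    [12.0],
    [18.0],
    [24.0],
    [30.0],
  ]);
  final initialCoefficients = Matrix.fromList([
    [0.0],
    [0.0],
  ]);

  final coefficients = gradientDescent(x, y, initialCoefficients);

  // Both values should come out close to 3.0.
  print('Coefficients: $coefficients');
}
```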

It’s easy to infer the coefficients without the gradient descent:

  • 2 * w1 + 2 * w2 = 12; w1=w2=3
  • 3 * w1 + 3 * w2 = 18; w1=w2=3
  • 4 * w1 + 4 * w2 = 24; w1=w2=3
  • 5 * w1 + 5 * w2 = 30; w1=w2=3

The first feature coefficient is 3, the second feature coefficient is also 3. Let’s run the code from above and look at the output. The output is:

Coefficients: Matrix 2 x 1:
(2.9999842643737793)
(2.9999842643737793)

Both coefficients inferred by the gradient descent are almost 3, as expected (with some round-off error). It seems the code works well.

Summary:

  • The gradient descent is much cheaper to compute than the closed-form solution, since it doesn’t require expensive matrix inversion operations. This makes it possible to use the algorithm on a significant amount of data.
  • The gradient descent is not as accurate as the closed-form solution, since the closed-form solution is the “ideal” one.
  • The gradient descent can be applied to a non-linear problem, unlike the Closed-Form solution.

That’s pretty much it!

Thank you for reading :)

