Linear Regression: Closed-Form solution, the Dart way

Ilia Gyrdymov
Jul 17, 2022


This post is part of a series of articles on Machine Learning in the Dart programming language.

Hi all,

In my previous article, I explained the Ordinary Least Squares Problem. I mentioned that there are several ways to solve it, and one of them is the Closed-Form solution.

Today we’re going to:

  • figure out what a closed-form solution is
  • derive the formula for the closed-form solution of Linear Regression
  • walk through a live example of Linear Regression in the Dart programming language

Let’s begin!

What is the Closed-Form solution?

Let’s look at an elementary example:
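2x = 6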

I suppose everyone is able to find x: it’s 3. If we substitute 2 with a and 6 with b, we can infer a general formula for x:
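x = b / a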

Let’s now look at a more complex example:
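2x^2 - 7x + 3 = 0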

As you may notice, it’s a quadratic equation. So, if we substitute 2 with a, 7 with b and 3 with c, we may use the well-known formula to find all the possible values of x:
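x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}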

Again, it‘s easy to find the answers: 3 and 0.5.

What do these two examples have in common? Right, they both can be expressed by a formula, which means the only possible solution is reached in a finite number of simple mathematical operations.

Let’s now look at another example:
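y = \frac{d(x^2)}{dx}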

Rephrasing it in plain English: we have to differentiate the function x squared.

Recall what differentiation is. It’s a process where we change the argument a little and watch the function’s behaviour to understand how sensitive the function is to that change. We can express this process with a formula:
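y = \lim_{\Delta x \to 0} \frac{(x + \Delta x)^2 - x^2}{\Delta x}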

As you can see, this way it isn’t possible to arrive at the exact value of y in a finite number of steps: we can alter the function’s argument an infinite number of times and still have no final solution.

A case where a problem has a single possible solution which we can reach in a finite number of simple mathematical operations is called a Closed-Form Solution. The first two equations are great examples of such a solution: they are solved in a few simple steps and lead to a definite answer. The last one, obviously, doesn’t have a closed form.

Now we know what the Closed-Form is. Does the Linear Regression problem have such a solution? Yes, it does. To infer the formula, we should recall the Ordinary Least Squares problem; I recommend reading the article about Ordinary Least Squares first.

So, we have an error function to minimise:
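E(w_0, w_1) = (y - (w_0 + w_1 x))^2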

where w_0 and w_1 are the unknown terms we have to find, x is the feature value, and y is the observed outcome. As you remember, the solution to our problem is the minimum possible error. Analytically, it is the minimum point of the error function. Since the function is differentiable, we may simply get an expression for the derivative of the function and set it equal to zero.

Why should we set it equal to zero?

As you remember, the value of the derivative is the slope of the tangent line at a certain point:

In the picture above, we see two tangent lines, one at point A and one at point B. Obviously, the lines have a non-zero slope, which means that the absolute value of the derivative at these points is greater than 0.

If the derivative value is equal to zero, the tangent line at the point has no slope; it is parallel to the X axis:

It means that the point is the minimum of the function.

Of course, the point with such a tangent line may be the maximum as well:

but due to the nature of our problem, our error function has only a global minimum:

The paraboloid-shaped error surface has only one extremum, and it’s a minimum.

Ok, but the function depends on two unknown variables, w_0 and w_1. How can we find a derivative of such a function?

We may simply find an expression for the partial derivative with respect to each unknown term and set the expressions equal to zero:
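\frac{\partial E}{\partial w_0} = -2 (y - (w_0 + w_1 x)) = 0

\frac{\partial E}{\partial w_1} = -2 (y - (w_0 + w_1 x)) x = 0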

In the example above, we derived the expressions just for a single outcome y with a single feature x.

Imagine that we have a lot of features. It means that there can be a lot of x terms. Also, there may not be just a single y; usually, there are many outcome values. Let’s look at the problem from the matrix perspective: there is a feature matrix which contains the x values, a weight matrix which contains the w values, and an outcome matrix which contains the y values.

With this in mind, let’s create a more general expression for the partial derivative:
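\frac{\partial E}{\partial w_j} = -2 (y_i - (w_0 x_{i0} + w_1 x_{i1} + \dots + w_{N-1} x_{i(N-1)})) x_{ij}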

where:

  • j is the index of a feature (or of its weight); it denotes a column in the feature matrix
  • i is the index of an observation; it denotes a row in the feature matrix
  • y_i is the outcome for the i-th observation (the i-th row)
  • x_i_j is a particular feature value (the value in the i-th row and j-th column of the feature matrix)
  • N is the number of features

Great, now we know how to find a derivative with respect to a feature weight based on a single observation, but we may have a lot of observations. According to the Ordinary Least Squares problem, if we have a lot of observations, we have to sum up the error values of all the observations:
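E = \sum_{i=0}^{M-1} (y_i - (w_0 x_{i0} + w_1 x_{i1} + \dots + w_{N-1} x_{i(N-1)}))^2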

where M is the number of observations.

Let’s now take the summation into account in the partial derivative expression:
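\frac{\partial E}{\partial w_j} = -2 \sum_{i=0}^{M-1} (y_i - (w_0 x_{i0} + w_1 x_{i1} + \dots + w_{N-1} x_{i(N-1)})) x_{ij} = 0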

We should apply this formula to our data iteratively, changing the j term according to the pattern j = 0 … N-1, since we have N features and N weights.
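To make this pattern more concrete, here is a small plain-Dart sketch of that computation (the function name and arguments are illustrative, not taken from any library):

// Partial derivatives of the summed squared error with respect to each
// weight w_j, computed by iterating j = 0 ... N-1 over all M observations.
List<double> partialDerivatives(
  List<List<double>> x, // feature matrix, M rows x N columns
  List<double> y,       // outcomes, M values
  List<double> w,       // weights, N values
) {
  final m = x.length;
  final n = w.length;
  final derivatives = List<double>.filled(n, 0.0);

  for (var j = 0; j < n; j++) {
    for (var i = 0; i < m; i++) {
      // prediction for the i-th observation: w_0*x_i0 + ... + w_(N-1)*x_i(N-1)
      var prediction = 0.0;
      for (var k = 0; k < n; k++) {
        prediction += w[k] * x[i][k];
      }
      derivatives[j] += -2 * (y[i] - prediction) * x[i][j];
    }
  }

  return derivatives;
}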

To make things a bit easier, we may express the outcomes, the y values, as a matrix Y of M x 1 dimensionality; the features, the x values, as a matrix X of M x N dimensionality; and the weights, the w values, as a matrix W of N x 1 dimensionality. Let’s rewrite the equation above in matrix notation:
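-2 X^T (Y - XW) = 0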

Let’s check the dimensionality of matrices in the equation:

  • The product of X and W gives us a column matrix of M x 1 dimension, since we multiplied the X matrix of dimension M x N by the W matrix of dimension N x 1. There is a simple rule of thumb: the dimensionality of the output matrix is (number of rows of the first matrix) x (number of columns of the second matrix); that’s why we get an M x 1 matrix as a result.
  • The subtraction of Y and XW is valid since both terms are matrices with the same number of rows and columns.
  • We had to transpose the X matrix to make the dimensions conform: Y - XW is a matrix of M x 1 dimensionality, but X is of size M x N, and we can’t multiply a matrix of dimension M x N by a matrix of dimension M x 1. After the transposition, X has N x M dimensionality, which is perfect, and now we can find the product. As a result, we get an N x 1 matrix.

Great, the dimensionality of our equation conforms. Our target is the W matrix; let’s express it from the equation.

First, let’s get rid of the -2 by dividing both sides of the equation by this value:
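X^T (Y - XW) = 0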

Then, let’s get rid of the parentheses. To do so, we multiply X transposed by Y and subtract the product of X transposed, X and W from it:
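X^T Y - X^T X W = 0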

Now let’s move the first term of the equation to the right-hand side:
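-X^T X W = -X^T Y

or, after multiplying both sides by -1:

X^T X W = X^T Y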

And the final step. Now we may multiply both sides of the equation by:
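(X^T X)^{-1}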

In the matrix world, multiplying a matrix by its inverse (the source matrix to the power of minus one) is the same idea as multiplying a number by its reciprocal, like 5 * 1/5. In the number world, the latter results in 1. In the matrix world, multiplication by the inverse matrix results in an identity matrix, which plays almost the same role as 1 does for numbers. So, assuming X^T X is invertible, after we multiply both sides of the equation by the inverse matrix we have:
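W = (X^T X)^{-1} X^T Y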

We got a formula for a closed-form solution for the Linear Regression task.

Let’s code!

Using the ml_linalg library, we may easily create a program for finding the W matrix according to the formula above.

It’s better to come up with some synthetic data that is easy to validate. Say, the following records:

x1 | x2
-------
2 | 3
4 | 6
6 | 12
8 | 24

will be our feature matrix X, and the following values:

y
--
13
26
48
88

will be our outcome matrix Y.

It’s really easy to find the weights of the features by hand: they are 2 and 3:

2*2 + 3*3 = 13
4*2 + 6*3 = 26
6*2 + 12*3 = 48
8*2 + 24*3 = 88

Let’s apply our formula to the data and check if it works well:
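A minimal sketch of such a program might look like this (assuming ml_linalg’s Matrix provides Matrix.fromList, transpose(), inverse() and the * operator for matrix multiplication; if your version of the library exposes these operations differently, the idea stays the same):

import 'package:ml_linalg/linalg.dart';

void main() {
  // Feature matrix X, 4 observations x 2 features
  final x = Matrix.fromList([
    [2.0, 3.0],
    [4.0, 6.0],
    [6.0, 12.0],
    [8.0, 24.0],
  ]);

  // Outcome matrix Y, 4 x 1
  final y = Matrix.fromList([
    [13.0],
    [26.0],
    [48.0],
    [88.0],
  ]);

  // Closed-form solution: W = (X^T X)^-1 X^T Y
  final w = (x.transpose() * x).inverse() * x.transpose() * y;

  print(w);
}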

The last instruction prints the following:

Matrix 2 x 1:
(2.0000219345092773)
(2.9999969005584717)

The values look valid. It’s exactly what we expected (with some round-off errors, though).

It seems that we performed Linear Regression in just one line of code (the line where w is computed in the snippet above)!

What can stop us from using this simple and elegant solution every time?

Let’s try to analyze the complexity of the algorithm. It has a matrix inversion step, which has cubic complexity, O(n³), in the size of the matrix being inverted. The matrix we invert is X^T X, whose dimension is N x N, where N is the number of features. Imagine that your feature matrix is 100,000 by 1,000,000: then X^T X is a 1,000,000 x 1,000,000 matrix, and the number of operations needed just to invert it is roughly (1,000,000)³! Even the SIMD architecture, which stands behind the ml_linalg library, won’t help us; there are too many operations. That’s why we can’t use the closed-form solution for Linear Regression every time.

There is a different technique to overcome the big data barrier, called Gradient Descent, but it’s too much for today. Let’s leave it for the next article.

And that’s pretty much it. Now you know what a closed-form solution for Linear Regression is!

Cheers :)
