Building a polynomial regression model factory in NodeJS

Daniel Herrero Hernando · Published in guidesmiths · 9 min read · Jun 3, 2020

Last month I had to solve a typical Maths problem: fitting a model to a given set of data. Instead of dealing with spreadsheets as I used to do, I decided to automate the process by building a NodeJS library and publishing it via NPM. Here is how I did it, leveraging functional programming and dynamic computation.

“Artwork” created using the graphs & code in this post.

Why would we need to use this?

Imagine you want to sell your house but you have no idea what the appropriate list price would be. You just have a few references from other houses sold in your neighbourhood in the previous month. These references include the list price and the house area. Your house is not identical to any of those references, but you can still estimate/interpolate your list price by building a polynomial function from the data, with the house area as the independent variable. Then, in this super simple case, you can provide your house area as an input and the function will give you the list price according to the market.

The goal was to have a quick & easy way of calculating n-degree polynomials fitting any given dataset.

What do we need to do?

Given a set of data points, we need to determine the polynomial that best fits those points. To do so, we will go through the following steps:

  • Choose the degree for our polynomial space.
  • Build a system of equations according to the chosen degree.
  • Solve that system and get the solution’s coefficients.
  • Store the coefficients so we can use them for estimating unknown values or plotting the functions.

A brief Maths introduction

Nowadays, machine learning techniques are everywhere. There is no doubt that most of the algorithms involved require a deep knowledge of statistics and mathematical analysis. However, sometimes we can easily analyze, visualize and understand some of them. Fortunately, this is going to be our case here.

One of the most basic techniques is polynomial regression: trying to determine a function like f(x) = a₀ + a₁x + … + aₙxⁿ that is the best fit possible for a given dataset.

How can we determine which is the function that best fits our dataset?

By using a technique called Discrete Least Squares Approximation. Basically, this means that, among all the possible functions defined in that space, ours will be the one that minimizes the distance between itself and our data points. In other words, our goal is to minimize the sum of the squared deviations of f(xᵢ) from each data point’s yᵢ-value.
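In symbols, given data points (x₁, y₁), …, (xₘ, yₘ), the quantity we want to minimize is:

E(a₀, …, aₙ) = Σᵢ (f(xᵢ) − yᵢ)²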

A visual example of the sum of squares concept. Ref here

How can we calculate the distance between our desired function and the dataset?

Here come the matrices and equations that allow us to perform that calculation, telling us which coefficients make our f(x) the best fit possible.

I am not going to deep-dive into the theory behind this approach; if you are interested, there are pretty good explanations here and here, and probably in every Maths book covering interpolation topics.

First, we need to choose a base for our polynomial space, base = {1, x, x², …, xⁿ}, and then build its corresponding Gram’s matrix (we will cover this later). You can see that our base’s length depends on the degree we choose for it. Traditional linear regression means choosing a polynomial degree equal to 1, so our base will be {1, x} and our f(x) will look like f(x) = a₀ + a₁x. The higher the degree, the better our f(x) will be able to fit the given dataset. However, this is not always desirable because it can lead to an overfitting problem (the given data can contain “noise” and we do not want our model to strictly replicate it).

Before moving on, let’s try to clarify what we are talking about. Given two data points, the best fit possible will be a 1-degree polynomial, in plain words, this means building a straight line between the two points. If we are given three data points, the best fit possible will be a 2-degree polynomial and the same idea applies for higher orders.

Left: example of a 1-degree polynomial fitting 2 points. Right: example of a 4-degree polynomial fitting 5 points.

In these cases, where degree = n − 1 with n being the number of points, the distance between our function/model and the dataset will be exactly zero because it matches each data point. When we are dealing with a thousand data points, we could calculate a 999-degree polynomial that matches the given dataset exactly. However, as said before, this won’t lead to a better model because it will reflect the noise and our estimations won’t be trustworthy. Apart from that, computationally, solving such a system would take a considerable amount of time.

That’s why, in most cases, we will be dealing with many data points and our polynomial’s degree will be much smaller. This means that the distance between our model and the dataset won’t be zero, but it will be the best fit possible and it will be useful for estimating unknown values.

The solution

Let’s go through the actual solution to our problem. We need to build a system of equations like this:
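In matrix form, with ⟨·,·⟩ denoting the scalar product defined just below, the system reads:

| ⟨1,1⟩    ⟨1,x⟩   …  ⟨1,xⁿ⟩   |   | a₀ |   | ⟨y,1⟩  |
| ⟨x,1⟩    ⟨x,x⟩   …  ⟨x,xⁿ⟩   | · | a₁ | = | ⟨y,x⟩  |
|   ⋮        ⋮     ⋱     ⋮     |   |  ⋮ |   |   ⋮    |
| ⟨xⁿ,1⟩   ⟨xⁿ,x⟩  …  ⟨xⁿ,xⁿ⟩  |   | aₙ |   | ⟨y,xⁿ⟩ |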

The first matrix is Gram’s matrix. Each of the elements in this matrix is the result of calculating the scalar product between the different combinations of functions in our base {1, x, x², …, xⁿ}. The discrete scalar product used here is this:

⟨f, g⟩ = Σᵢ f(xᵢ) · g(xᵢ)

Where x₀, x₁, … are the x-values of the different data points in our problem, and f and g are the corresponding functions according to their position in the matrix.

The vector (a₀, a₁, …, aₙ) is the one containing the params we are trying to determine for our model f(x) = a₀ + a₁x + … + aₙxⁿ.

The right-hand term contains the calculated scalar products, where, in this case, the role of f(x) is played by our original data points’ y-values.

Now we are ready to start writing the actual code!

The code

Our library will be structured into two folders. The first one will contain the code for building these matrices so we can create the system that we need to solve. The second one will contain the code to solve the system and calculate our model’s parameters.

Here you can find the lines of code I’ve written that let us build the required system using a functional approach.
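A simplified sketch of that builder (the helper names here are illustrative, not necessarily the exact ones in the repo):

// Discrete scalar product: <f, g> = Σ f(xᵢ)·g(xᵢ) over the data x-values
const scalarProduct = (f, g, xValues) =>
  xValues.reduce((acc, x) => acc + f(x) * g(x), 0);

// The base {1, x, x², …, xⁿ} as an array of functions
const buildBase = degree =>
  Array.from({ length: degree + 1 }, (_, i) => x => x ** i);

// Gram’s matrix: m[i][j] = <baseᵢ, baseⱼ>
const buildGramMatrix = (base, xValues) =>
  base.map(f => base.map(g => scalarProduct(f, g, xValues)));

// Right-hand side: b[i] = <y, baseᵢ>, with the y-values playing the role of f(x)
const buildIndependentTerms = (base, points) =>
  base.map(f => points.reduce((acc, [x, y]) => acc + y * f(x), 0));

const buildSystem = (points, degree) => {
  const xValues = points.map(([x]) => x);
  const base = buildBase(degree);
  return {
    matrix: buildGramMatrix(base, xValues),
    terms: buildIndependentTerms(base, points)
  };
};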

You can notice that Gram’s matrix is symmetric (m[i][j] = m[j][i]), so we could speed up its building process by caching half of the values, excluding the ones on the main diagonal. For the sake of simplicity, I didn’t write that part in the gist.

Now let’s take a look at our solver folder. Here we implement an algorithm that solves the system using the Gauss-Jordan method, which basically consists of triangularizing the left-side matrix until we can start back-substituting the different values for our unknown params.

If you are interested in this easy and popular way of solving systems of equations you can read more here.

You can notice the dynamic approach we are using in these steps: each iteration needs the previous iterations’ calculations to work properly. E.g. when calculating the solution’s parameters array, each value depends on the previously calculated ones, because the array is filled with one value per iteration.

It’s worth mentioning that in this systemSolver file I couldn’t make use of Array.map and used standard for loops instead, because we need to work dynamically with the changing rows of the augmented matrix. The map callback receives a reference to the array you are iterating over; however, there is no way to access the values of the array you are building during the iteration process.
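A stripped-down version of that solver, triangularizing and then back-substituting exactly as described above (a sketch, not the library’s literal code):

// Solves matrix·a = terms via the augmented matrix and plain for loops
const solveSystem = (matrix, terms) => {
  const n = terms.length;
  const m = matrix.map((row, i) => [...row, terms[i]]); // augmented matrix

  // Forward elimination: zero out everything below the main diagonal
  for (let pivot = 0; pivot < n; pivot++) {
    for (let row = pivot + 1; row < n; row++) {
      const factor = m[row][pivot] / m[pivot][pivot];
      for (let col = pivot; col <= n; col++) {
        m[row][col] -= factor * m[pivot][col];
      }
    }
  }

  // Back substitution: each coefficient depends on the ones already solved
  const coefficients = new Array(n).fill(0);
  for (let row = n - 1; row >= 0; row--) {
    let sum = m[row][n];
    for (let col = row + 1; col < n; col++) {
      sum -= m[row][col] * coefficients[col];
    }
    coefficients[row] = sum / m[row][row];
  }
  return coefficients;
};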

Finally, we can go through the code in the library’s index file. Here we put all the pieces together in the fit method and we expose our API.
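Putting the two previous pieces together, the factory can be sketched like this (simplified; the real index file also exposes the expression saving/loading helpers described below):

// fit builds and solves one system per degree, caching the coefficients
// that estimate later evaluates
const createModel = () => {
  const coefficients = {}; // degree -> [a₀, a₁, …, aₙ]

  const fit = (points, degrees) =>
    degrees.forEach(degree => {
      const { matrix, terms } = buildSystem(points, degree);
      coefficients[degree] = solveSystem(matrix, terms);
    });

  // Evaluate f(x) = a₀ + a₁x + … + aₙxⁿ for the model of the given degree
  const estimate = (degree, x) =>
    coefficients[degree].reduce((acc, a, i) => acc + a * x ** i, 0);

  return { fit, estimate };
};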

The API — Usage

The previous section shows how we expose a single entry point, the createModel method. Once we have created our model, we can fit it by feeding it with data and specifying the desired degree(s) of our resulting model(s).

The API is designed in a way that allows creating different models, one per degree, at the same time. Instead of having a fixed estimation function, we have an internal object where we store the corresponding coefficients for each degree. This way we can easily compare which degree best suits our problem by calling the estimate function with the different degrees.

Let’s illustrate its usage with a simple example. We are given these data points, stored as an array of [x, y] values:

The 27 points (x,y) we are going to use in this example

We want to find a model that allows us to interpolate or estimate unknown values according to this information.

In this example, we are going to use a low degree (3) and a higher one (20) so we can easily visualize the differences and understand the overfitting problem mentioned in the previous sections.

The code would look as follows:

const { createModel } = require('polynomial-regression');
const { data } = require('./example_dataset');
const model = createModel();
model.fit(data, [3, 20]);
model.saveExpressions('./expressionsForGraphs.json');
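Once fitted, we can compare the estimations of both models for the same x-value (the estimate(degree, x) call shape shown here is the one assumed in the factory sketch above):

const lowDegreeEstimate = model.estimate(3, 12);
const highDegreeEstimate = model.estimate(20, 12);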

Our API allows us to load/save the calculated params for each model in a JSON file. To make our lives easier, I also added an additional method that builds a string for each model-degree, which we can use to plot the function expressions directly on any platform like MATLAB. It saves this info in a JSON file too, like this:

{
  "3": "+1.5062596662599177*x^0+1.425302045761659*x^1 ... ",
  "20": "+66.40442892021944*x^0-204.3486735913864*x^1 ... "
}

With this file, we can easily copy & paste the strings and define a function for each degree. Here is the plot for our 3-degree model:

The plot for our low degree polynomial.

And here is the plot for our 20-degree model:

The plot for our high degree polynomial.

The difference between the two models is obvious. The latter is “closer” to the data points; nevertheless, if we compare it with the smooth 3-degree model, it could be a good example of overfitting. It’s up to us to decide, depending on how confident we are about the amount of noise embedded in our dataset, whether a high-degree polynomial will lead to better estimations.

Animation showing the degree increment effect

The above gif shows the fitting for degrees ranging from 1 to 26. We can notice how some high degrees, like 12 and 18 in our example, make the model distorted and useless.

It’s also important to keep in mind the limitations of floating-point arithmetic like the one used by JavaScript. We can try to minimize some of the rounding errors’ effects by avoiding divisions by small numbers, e.g. by implementing a partial pivoting strategy while solving the matrix equation; however, this is out of the scope of this post.
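Just to sketch that idea, though (it is not implemented in the library): before eliminating each column, we would swap in the row with the largest absolute value in that column.

// Partial pivoting sketch: pick the row with the largest absolute value
// in the pivot column before eliminating, to avoid dividing by tiny numbers
const partialPivot = (m, pivot) => {
  let best = pivot;
  for (let row = pivot + 1; row < m.length; row++) {
    if (Math.abs(m[row][pivot]) > Math.abs(m[best][pivot])) best = row;
  }
  if (best !== pivot) [m[pivot], m[best]] = [m[best], m[pivot]];
};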

Wrapping up

That’s all! We have gone through the basics of this library and shown a brief example of its usage and what we can accomplish with it.

  • We first built a system whose size depends on the degree chosen by the user.
  • Then we solve the system and obtain the coefficients for our model.
  • We store those calculated coefficients in an object that will be used by the estimate function when needed.
  • These three steps are repeated once for each degree provided by the user when calling the fit method.

Resources

  • The full code used in this little project can be found here
  • The library is published and it’s available in the NPM registry here
