A tutorial on how to curve/data fit a set of data points using Least Squares Fitting in GNU Octave

Rich Yap
4 min readJul 2, 2018

--

Given a set P containing data points (x,y), how would you form a mathematical function such that it fits all data points and it models the relationship between the variables?

This problem would be a lot easier if the |P| (the size of set P) = 2 or 3 as we would’ve formed a line or quadratic equation respectively; however in real life, data sets usually contain h̶u̶n̶d̶r̶e̶d̶s̶, t̶h̶o̶u̶s̶a̶n̶d̶s̶,m̶i̶l̶l̶i̶o̶n̶s̶ or even billions of data points.

Thanks to the great minds that have formed the techniques to fit even large amount of data points, we now have MULTIPLE WAYS on our hands to fit data points and use this model in practical applications. (research more about curve fitting to know more 😉)

In this article, I’ll be showing a step-by-step procedure on how to code the method of Least Squares Fitting one of the curve fitting techniques in GNU Octave. (Here’s the Github repository link for reference)

Before diving into the coding part let’s define the
Least Squares Polynomial Curve Fitting Theorem:

Given the n data points (x₁, y₁),(x₂, y₂), …, (xₙ, yₙ), the least squares polynomial of degree m of the form

Pₘ(x) = c₁ + c₂ x +c₃ x² + … + cₘ xᵐ⁻¹ + cₘ₊₁ xᵐ

that fits the n data points is obtained by solving the following linear system

Figure 1: Linear system we’ll solve to get the function that will fit the set of data points

The matrix equation is in the form of:
Ax = b

Procedure:

  1. Install GNU Octave
sudo apt-add-repository ppa:octave/stable
sudo apt-get update
sudo apt-get install octave

2. Be familiar with the data set used. We’ll be using the population table below. Note that the independent variable and the dependent variable is the relative year (time) and population respectively.

3. Following the Least Squares Polynomial Curve Fitting Theorem, setup the corresponding linear system (matrix) of the data set.

In the code above, we build the matrices A and b corresponding to the formula given in figure 1.

Figure 2: Corresponding Matrix Setup of the data set from Figure 1 (note that n = size of | relative # of years | and m=3 <3 is arbitrary>)

Output:

Curve Fitting
1.) Matrices formed in least-squares polynomial curve fitting
A matrix:
A =
5.0000e+00 1.0000e+02 3.0000e+03 1.0000e+05
1.0000e+02 3.0000e+03 1.0000e+05 3.5400e+06
3.0000e+03 1.0000e+05 3.5400e+06 1.3000e+08
1.0000e+05 3.5400e+06 1.3000e+08 4.8900e+09
b matrix:
b =
3.1687e+02
7.8019e+03
2.4968e+05
8.6474e+06

4. Perform the Cholesky decomposition on matrix A and then solve for the x vector in figure 1 (which contains the coefficients/weights of the polynomial curve fitting the data points) through left division <lines 26 & 30>. (If solved manually, you can solve y through forward substitution and x through backward substitution)

Note: We use Cholesky decomposition because matrix A is a Hermitian, positive-definite matrix.

The equations above are the formulas in getting the x vector.

Output:

2.)The following cubic polynomial interpolates the data points:
-0.00027083*x^3 + 0.023229*x^2 + 0.95244*x^1 + 35.805

where:
a_0 = -0.000271
a_1 = 0.023229
a_2 = 0.952440
a_3 = 35.804714
SOLUTION: Cholesky LDLt
L =
1 0 0 0
20 1 0 0
600 40 1 0
20000 1540 60 1
D =Diagonal Matrix 5 0 0 0
0 1000 0 0
0 0 140000 0
0 0 0 14400000
L_transpose = 1 20 600 20000
0 1 40 1540
0 0 1 60
0 0 0 1
Ly = b // below is the solved y
y =
316.87
1464.50
977.00
-3900.00
D(Lt)x = y // below is the solved x
x =
3.5805e+01
9.5244e-01
2.3229e-02
-2.7083e-04
See the first graph containing the cubic polynomial which fits the data points
Figure 3: Graph of the cubic polynomial curve fitting the data points
Population(year) = -0.00027083*x^3 + 0.023229*x^2 + 0.95244*x^1 + 35.805

5. [Optional] Now that we have generated the curve that fits the data points, let’s use this to compare the interpolated population from the years 1975, 1985, 1995, & 2005 (relative years from 1970 = [ 5,15,25,35]) to the real values as seen in the table below.

Link to code : Check lines 108–112

Output:

3.) Interpolation the population in years(5,15,25,35) using the model VS real value:
interpolated_values =
41.114 54.404 69.902 85.983real_value =41.290 54.320 69.830 86.270See the second graph containing the cubic polynomial interpolating the data points
Figure 4: Graph of the polynomial curve with the extra data points for years (1975, 1985, 1995, & 2005)

Observe that the error/difference between the interpolated values using the model and the real value is a matter of decimal point. This is only the case here as we have a particularly small data set (which has data points that aren’t that erratic) and note that this may not apply to other data set.

Take a look at this graph I generated when I used least square fitting to another data set:

Here’s the end of the tutorial. 👌
Thank you!

--

--

Rich Yap

Data Engineer at a Philippines data consulting startup | UPD BS CS 🌻| @richdayandnight on Github and Twitter