# Introduction

I am a believer in Linear Algebra and am always fascinated by its powerful applications. Linear Regression is one of the first applications of Linear Algebra that I learned. Simply put, Linear Regression is the process of modeling the linear relationship between a scalar dependent variable (y) and one or more explanatory, or independent, variables (x) by minimizing the distance between the data and the regression function, a.k.a. the line of best fit in Simple Linear Regression.

# The Problem

But… the world 🌎 is a funny place: purely linear relationships rarely exist, and things like to correlate with each other somehow. As a result, the linearity and no-multicollinearity assumptions behind a successful application of Linear Regression are violated most of the time. R² and MSE (Mean Squared Error) are often quoted as indicators of how good a regression model is. But these metrics, like many others in data modeling, are not perfect. (Looking at you: correlation, ROC, p-value, etc.)

Here is a famous illustration: Anscombe’s Quartet.

The four data sets look very different, yet their descriptive statistics are nearly identical and the “line of best fit” is the same for each. In every case the correlation between x and y is 0.816, the regression line is y = 3.0 + 0.5x, and R² is 0.67. But looking closely at the plots, does the straight line of “best fit” make sense for any case other than Case 1?

• Case 1: A simple linear relationship between x1 and y1 is probably the best assumption and can be modeled easily.
• Case 2: The data looks like part of an upside-down parabola, so the relationship between x2 and y2 is clearly not linear.
• Case 3: There seems to be an outlier acting as a very high-leverage point on the regression line, i.e. the slope of the line is very sensitive to this point. Simply excluding this outlier would yield a perfect line.
• Case 4: Another kind of high-leverage point. Unlike Case 3, excluding this point won’t result in a perfect line, because the remaining y4 values differ while x4 stays at 8, i.e. there is almost no relationship between x4 and y4.

Overall, Cases 3 and 4 call for a better understanding of the data before deciding whether Linear Regression is a good choice to model them. But for Case 2, a “linear” fit may still be possible.
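The claim that all four cases share the same summary numbers can be checked directly. The sketch below uses the published Anscombe (1973) values and a small helper, `simple_fit`, which is a name made up for this illustration; it computes the least-squares line and the correlation for each case.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Cases 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Case 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Case 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Case 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Case 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
               [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def simple_fit(xs, ys):
    """Return (intercept, slope, r) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy / sqrt(sxx * syy)

results = {name: simple_fit(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (b0, b1, r) in results.items():
    print(f"{name}: y = {b0:.2f} + {b1:.2f}x, r = {r:.3f}, R^2 = {r * r:.2f}")
```

The numbers agree to two or three decimals across the four cases, which is exactly why the plots, not the summary statistics, reveal the problem.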

There are many simple tricks, such as the log transformation, that can be applied to data that is not quite linear. More generally, most of these cases fall under the topic of curve fitting for non-linear data. When the data has no random noise, the problem of finding a best-fit curve becomes an interpolation problem.

# The Idea

Curve fitting is the process of constructing a curve, i.e. a mathematical function, that best fits a series of data points. It is often used in interpolation, where samples are drawn from an unknown function and the goal is to find a curve that passes through every sample point.

It is also used in engineering, for example in signal processing or smoothing, where the data were generated by some underlying function with random noise added. There, a simpler “smooth” function is constructed that approximately fits the data and makes sense of the trend or characteristics of the recorded signal.

A polynomial function has the form:

y = a₀ + a₁x + a₂x² + … + aₙxⁿ

The coefficients aₖ are constants and n denotes the degree of the polynomial, i.e. the highest power of x.

When n=0 (a constant):

y = a₀

When n=1 (a straight line, just like y = mx + b):

y = a₀ + a₁x

When n=2 (a degree-2 polynomial, a.k.a. a parabola or quadratic function):

y = a₀ + a₁x + a₂x²

And for a general n (a degree-n polynomial):

y = a₀ + a₁x + a₂x² + … + aₙxⁿ

How is this relevant, you may ask? Let’s look at an example. Let’s generate random points from the function:

y = 2(x-5)³

How can one possibly find a “linear” line to fit this curve? Here is a little math, known as polynomial expansion:

y = 2(x-5)³ = 2x³ - 30x² + 150x - 250

We can see that this is a degree-3 polynomial, since the highest-degree term is x³. But if you look closely, doesn’t this look exactly like a regression line with 3 independent variables?

In this case:

y = a₀ + a₁x₁ + a₂x₂ + a₃x₃, with x₁ = x, x₂ = x² and x₃ = x³

Here we can see that each independent variable xₖ corresponds to one polynomial term xᵏ.

With this in mind, maybe we can just create new independent variables from the x coordinates of the sample points by calculating x, x², x³, … and add them to the Linear Regression model. Instead of 1 independent variable, we now have 3.
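Here is a minimal sketch of that idea in plain Python, assuming noise-free samples of the cubic y = 2(x-5)³ used in this article, taken at integer x values of my own choosing. The helper names (`design_matrix`, `solve`, `fit_polynomial`) are made up for this illustration, and exact fractions are used so that the recovered coefficients are not blurred by rounding. A real project would more likely call a regression routine from a statistics library.

```python
from fractions import Fraction

def design_matrix(xs, degree):
    """One row per sample: [1, x, x^2, ..., x^degree]."""
    return [[Fraction(x) ** k for k in range(degree + 1)] for x in xs]

def solve(a, b):
    """Solve the square system a * coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]   # scale pivot row to 1
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

def fit_polynomial(xs, ys, degree):
    """Ordinary least squares via the normal equations (X^T X) coef = X^T y."""
    cols = list(zip(*design_matrix(xs, degree)))     # columns 1, x, x^2, ...
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * Fraction(y) for p, y in zip(ci, ys)) for ci in cols]
    return solve(xtx, xty)

xs = list(range(0, 11))               # sample x coordinates (an assumption)
ys = [2 * (x - 5) ** 3 for x in xs]   # noise-free samples of y = 2(x-5)^3
coef = fit_polynomial(xs, ys, degree=3)
print([int(c) for c in coef])         # constant term first, then x, x^2, x^3
```

The regression only ever sees three ordinary columns (x, x², x³) plus an intercept; nothing in the fitting step knows that the columns are powers of the same variable.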

When we don’t know the underlying function of the data, we can gradually add higher-degree terms and reassess the fit, but this is almost as good as guessing, and it assumes the data came from a polynomial. Also, there is no limit on the degree of a polynomial, so it would take infinite time to test every one. And as the degree goes higher, the polynomial behaves more and more erratically, which leads to overfitting the data (explained later).

# In Practice

To illustrate this, I built a visualization tool using R Shiny. Let’s try fitting the curve using a degree-0 polynomial (i.e. a constant):

• Degree-0 (y=a)

As expected, the best-fitting line is y=78.4, which is essentially the average of the sample points. Due to the symmetry of the original data, the constant line matches well for values near the zero of the polynomial but poorly at the two tail ends.

• Degree-1 (y=mx+b)

Ah! Linear Regression will try its best to fit this non-linear data with a straight line. That would be fine if the higher-degree coefficients were small relative to the linear term, since a degree-3 polynomial can then look like a straight line over the sampled range. But unfortunately that is not the case here.

The regression line is
y=104.8x-484.8

• Degree-2 (y=ax²+bx+c)

Hmmm… it looks like a slightly bent line. The parabola is given by y=2.08x²+81.8x-477.324, and the coefficient of the degree-1 term (81.8) is much larger than the coefficient of the x² term (2.08), i.e. when x changes, the degree-1 term carries more weight than the x² term, resulting in a very wide parabola.

• Degree-3 (y=ax³+bx²+cx+d)

Not a surprise: when we fit a degree-3 polynomial, it fits the data perfectly. The fitted polynomial is y=2x³-30x²+150x-250, which is exactly what we get when we expand y=2(x-5)³. Obviously, we usually won’t know the underlying function.

“You were lucky that it was a polynomial! What about sine, cosine, log, eˣ? And your example is too perfect!”

Surprisingly, a polynomial of high enough degree can approximate these functions very well over a bounded range. Imagine this: the sine curve is just a bunch of upward and downward parabolas, or one very high-degree polynomial!

But be careful: this method can fit your data well if done right, but the best-fit curve does nothing more than fit or interpolate the data. In some cases it would be terrible at making predictions or at making sense of the real relationship between the variables.

For example, if you are dealing with data that has an unknown periodic behavior, like temperature changes over a day or tides, we are more interested in knowing the period and the maximum amplitude. A fitted polynomial won’t give you correct predictions once you go beyond the range covered by the training data, and this is especially true for periodic data.

But who said that we can’t fit a “linear” line with “variables” built from other known functions of x, such as sin(x), log(x) or eˣ?

This is known as transforming the independent variable (x). The transformations themselves are non-linear, but the model stays linear in its coefficients. Introducing transformed independent variables can improve the linear relationship between x and y.
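As a rough sketch of fitting with “variables” made from known functions, the snippet below fits y ≈ a·sin(x) + b·log(x) by ordinary least squares. The target function, the sample grid and the helper name `fit_two_features` are all assumptions chosen for illustration. The 2×2 normal equations are solved with Cramer's rule, so nothing beyond the standard library is needed.

```python
import math

def fit_two_features(f1, f2, xs, ys):
    """Least-squares a, b for y ~ a*f1(x) + b*f2(x) (no intercept term)."""
    u = [f1(x) for x in xs]              # first transformed column
    v = [f2(x) for x in xs]              # second transformed column
    suu = sum(p * p for p in u)
    svv = sum(q * q for q in v)
    suv = sum(p * q for p, q in zip(u, v))
    suy = sum(p * y for p, y in zip(u, ys))
    svy = sum(q * y for q, y in zip(v, ys))
    det = suu * svv - suv * suv          # determinant of the normal equations
    a = (suy * svv - svy * suv) / det
    b = (svy * suu - suy * suv) / det
    return a, b

xs = [0.5 * i for i in range(1, 41)]                  # 0.5, 1.0, ..., 20.0
ys = [3 * math.sin(x) + 2 * math.log(x) for x in xs]  # hypothetical noise-free data
a, b = fit_two_features(math.sin, math.log, xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")
```

The curve is far from a straight line in x, yet the fit is an ordinary linear regression on the two transformed columns.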

This leads to many other curve fitting techniques, such as using a series expansion (e.g. a Fourier series or Taylor series) with a finite number of terms.
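To make the series idea concrete, here is a small sketch using the standard Taylor expansion of sin(x) around 0, truncated to a finite number of terms. The function name `sin_taylor` and the evaluation grid are my own choices. It prints the worst error on [-π, π] for a few truncation lengths, then evaluates one of the polynomials well outside that interval.

```python
import math

def sin_taylor(x, terms):
    """Sum of the first `terms` non-zero Taylor terms of sin(x) around 0."""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))

grid = [i * math.pi / 50 for i in range(-50, 51)]   # 101 points on [-pi, pi]
for terms in (1, 3, 5, 8):
    worst = max(abs(sin_taylor(x, terms) - math.sin(x)) for x in grid)
    print(f"{terms} terms (degree {2 * terms - 1}): max error = {worst:.2e}")

# Outside the interval, the same polynomial drifts far away from the sine curve.
print(sin_taylor(10.0, 5), math.sin(10.0))
```

Inside the interval the error shrinks quickly as terms are added; at x = 10 the degree-9 polynomial is nowhere near sin(10), which is the forecasting problem described above.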

For fun, let’s see how a polynomial can fit a more complicated curve.

Above is the process of fitting the function x sin(3x) using polynomials from degree 0 to degree 50. You can see how quickly the fit approximates the training data. Each higher-degree term added to the polynomial increases the number of bends, so a polynomial of high enough degree can mimic the periodic pattern of the sine curve. But as mentioned above, it can only be used to fit the given training data and will not be able to provide a valid forecast.

# Drawbacks

• Performance:

In order to fit a degree-n polynomial, you need to compute every power of the input data from degree 0 up to degree n (i.e. given the input vector x, you need to calculate x⁰, x¹, …, xⁿ). Hence, depending on your data and the degree n, the training data can become big very fast, and the values inside can be huge (e.g. 5²⁰ = 9.5367432e+13).
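A quick way to get a feel for this growth is to compute the powers for a single sample. The sample size and degree in the last lines are hypothetical numbers, used only to show how many values would need to be stored.

```python
x = 5
powers = [x ** k for k in range(21)]   # x^0, x^1, ..., x^20 for one sample
print(powers[-1])                      # 95367431640625
print(f"{float(powers[-1]):.7e}")      # the same value in scientific notation

# Hypothetical sizes: every sample needs one value per power, including x^0.
n_samples, degree = 1000, 50
print(n_samples * (degree + 1), "values in the expanded training data")
```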

For data interpolation (low variance), there are much easier methods for polynomial curve fitting that avoid inverting a giant matrix (e.g. Neville’s algorithm). And when you have imperfect data (due to sample variance), there are other, more advanced ML methods that can be used as well.
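For reference, Neville's algorithm can be sketched in a few lines. It evaluates the unique interpolating polynomial at a single point by repeatedly combining neighbouring lower-degree interpolants, so no matrix is built or inverted. The four sample points below are an assumption, taken from the cubic y = 2(x-5)³ used in this article.

```python
def neville(xs, ys, x):
    """Evaluate at x the unique polynomial through the points (xs[i], ys[i])."""
    p = list(ys)                        # p[i] starts as a degree-0 interpolant
    n = len(xs)
    for width in range(1, n):           # merge neighbours into wider interpolants
        for i in range(n - width):
            j = i + width
            p[i] = ((x - xs[j]) * p[i] + (xs[i] - x) * p[i + 1]) / (xs[i] - xs[j])
    return p[0]

xs = [0, 2, 4, 6]                       # four points pin down a cubic exactly
ys = [2 * (x - 5) ** 3 for x in xs]
print(neville(xs, ys, 5))               # value of the interpolant at x = 5
print(neville(xs, ys, 7))               # value at a point beyond the samples
```

Because the data here is noise-free and comes from a cubic, the interpolant reproduces 2(x-5)³ itself. With noisy data it would chase every point, which is why interpolation and regression are treated as different problems.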

• Over-Fitting:

A high-degree polynomial does provide more “flexibility” in terms of the number of “bends” it offers. Like:

This is a degree-20 polynomial used to fit x sin(3x). This property of polynomials allows us to approximate sine waves within the defined range.

It can also cause over-fitting. Here is another example to illustrate this point:

The data is generated from a simple x² curve with random noise added. We compare curve fitting using a degree-2 polynomial (red) and a degree-50 polynomial (blue). The MSE of the degree-2 curve is 2398.269 and the MSE of the degree-50 curve is 1988.317.

The degree-50 polynomial provides extra bends that wiggle through space and can reach more points than a lower-degree polynomial can. It therefore achieves a better MSE, but the resulting model is also far more complicated and over-fits the data set.

One quick way to see whether you may be over-fitting your data is to look at how the MSE changes each time you add a higher-degree term to the fit.

We can see the MSE shrink as higher-degree terms are added. Yet despite adding more and more higher-degree terms, there are stretches where the MSE does not really change at all. Here we can use the elbow method, which tells us that the MSE drops the most when the polynomial degree reaches 2. This makes sense given that the data is generated from a degree-2 polynomial. Hence, instead of going for degree 50, we may as well use degree 2 and sacrifice some accuracy.
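The elbow check can be sketched as follows. This is not the code behind the plots in this article: the data (x² plus Gaussian noise on [-3, 3]), the seed and the helper names are assumptions, and the degrees stop at 5 because solving the normal equations in floating point becomes unreliable for very high degrees.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients (constant term first) via the normal equations."""
    n = degree + 1
    # Row i of the augmented system: sum_j coef_j * sum(x^(i+j)) = sum(x^i * y)
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(x ** i * y for x, y in zip(xs, ys))] for i in range(n)]
    for col in range(n):                          # elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    coef = [0.0] * n
    for i in reversed(range(n)):                  # back substitution
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef

def mse(coef, xs, ys):
    """Mean squared error of the polynomial `coef` on the given points."""
    preds = [sum(c * x ** k for k, c in enumerate(coef)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

random.seed(0)
xs = [i / 5 - 3 for i in range(31)]                 # 31 points on [-3, 3]
ys = [x ** 2 + random.gauss(0, 0.5) for x in xs]    # x^2 plus random noise
errors = [mse(fit_polynomial(xs, ys, d), xs, ys) for d in range(6)]
drops = [errors[d - 1] - errors[d] for d in range(1, 6)]
for d, e in enumerate(errors):
    print(f"degree {d}: MSE = {e:.3f}")
print("largest drop when moving to degree", drops.index(max(drops)) + 1)
```

The training MSE never goes up when a term is added, so the lowest MSE alone will always favour the highest degree; the size of each drop is what points to the elbow.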

# Conclusion

In conclusion, maybe the example above was not generated from a simple x² at all. Maybe it is a tiny part of a sine curve, or of an x²⁰⁰⁰⁰ polynomial. Who knows? The point is that we look for a robust way to model our data, and if we are willing to make some sacrifices, we may understand the world around us better.

“All models are wrong, but some are useful” - many statisticians.

Also when in doubt, plot them out!

# P.S.

The visualization tool is available for you to play with at:

Github:

https://github.com/stevenlio88/Polyfit

Or in R:

`install.packages("shiny"); library(shiny); runGitHub("Polyfit", "stevenlio88")`
