Knowing how to fit the model when you have a curvy data set…
All models are wrong, but some are useful… George Edward Pelham Box
The goal of regression is to build a model to accurately predict unknown cases.
Regression is usually the process of predicting a continuous variable, such as housing prices, salaries of workers or rainfall intensity, using historical data.
Basically, there are just two types of regression, see link from IBM:-
. Simple Regression
. Multiple Regression
Both simple and multiple regression could be linear or non-linear.
The linearity of regression is based on the nature of the relationship between independent and dependent variables.
This article assumes the reader has intermediate knowledge of the concepts of simple and multiple regression. But fret not, for a refresher, check out my previous in-depth articles on simple and multiple linear regression.
We all love the sight of a scatter plot of our independent and dependent variables that shows an almost distinct straight line to fit our model, but in reality, a lot of data sets display varying patterns.
So, if the data set shows a curvy trend, then indeed a linear regression model may be unsuitable. In such situations, we need to employ a non-linear regression model. We shall see a few examples in a minute…
Non-Linear Regression (NLR):
NLR models any relationship between an independent variable X and a dependent variable y in which the data are described by a non-linear function.
Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree k (the maximum power of X).
In fact, many different NLRs exist that may be used to fit whatever the data set looks like, and these can go up to arbitrarily high degrees.
Collectively, we can call these NLRs polynomial regression, as long as the relationship between the independent variable X and the dependent variable y is modelled as an nth-degree polynomial in X. See link from IBM
So What is Polynomial Regression or Non-Linear Regression?
Polynomial regression fits a curved line to your data. A simple example of a polynomial with a degree of 3 can be shown as:-
y = b0 + b1x + b2x^2 + b3x^3
It sure looks like a feature set for a multiple linear regression, right? Yes, it does. Indeed, polynomial regression is a special case of multiple linear regression: treat x, x^2 and x^3 as three separate features and the model is linear in its coefficients. The main question then becomes ‘how do you select your features?’.
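To make the "special case of multiple linear regression" idea concrete, here is a minimal sketch. It assumes synthetic data generated from a made-up cubic (the coefficients and noise level are illustration values, not from any real data set) and uses plain NumPy least squares rather than any particular regression library.

```python
import numpy as np

# Synthetic data from a known cubic. The coefficients are made up for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
true_coefs = np.array([3.0, 1.0, -2.0, 0.5])  # b0, b1, b2, b3
y = true_coefs[0] + true_coefs[1] * x + true_coefs[2] * x**2 + true_coefs[3] * x**3
y_noisy = y + rng.normal(scale=0.1, size=x.size)

# Feature matrix [1, x, x^2, x^3]: each power of x becomes its own column,
# exactly like separate features in a multiple linear regression.
features = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares on those features recovers b0..b3.
fitted_coefs, *_ = np.linalg.lstsq(features, y_noisy, rcond=None)
print(np.round(fitted_coefs, 2))
```

The curve is non-linear in x, but the fit itself is an ordinary linear least-squares problem in the coefficients, which is why the feature choice does all the work.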
Common Types of Non-Linear Regression:
Before we go on, let’s briefly look at linear regression. It is of the equation:-
y = b0 + b1x1
Linear regression models a relationship between a dependent variable y and the independent variable x. This relationship has a degree of 1.
As mentioned earlier, there are many types of non-linear regression, but perhaps the most common are:-
. Cubic
. Quadratic
. Exponential
. Logarithmic
. Sigmoidal / Logistic
Let’s briefly look at these…
1. Cubic:
A cubic function is of the form y_hat = b0 + b1x + b2x^2 + b3x^3, that is, the intercept plus terms in x raised to the first, second and third powers. The terms can be written in either order, from the 3rd power down to the 1st or the reverse.
The graph of this function is not a straight line over the 2D plane.
2. Quadratic:
A quadratic function is of the form y_hat = x^2, that is, variable x multiplied by itself.
3. Exponential:
An exponential function with base c is defined as y_hat = b0 + b1(c^X), that is, the intercept plus the slope multiplied by a constant c raised to the power of variable X.
Exponential might seem a bit confusing, but plotting it is pretty straightforward… Simply apply the numpy.exp() function and pass variable X as its argument in this form:- y_hat = np.exp(X). Then plot variable X on the x-axis and y_hat on the y-axis.
4. Logarithmic:
In a logarithmic function, y_hat is the result of applying a logarithmic map to variable X, y_hat = log(X). It is one of the simplest expressions of a logarithmic function.
5. Sigmoidal / Logistic:
Logistic regression is a variation of linear regression, useful when the observed dependent variable y is categorical. It fits a special S-shaped curve by taking the linear regression and transforming the numeric estimates into a probability score, using the sigmoid function. See link
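The five shapes described above can be sketched numerically in a few lines. This is only an illustration: every coefficient is an arbitrary value chosen for the example, and the helper is_straight_line is a hypothetical name introduced here to check whether a curve has a constant slope.

```python
import numpy as np

# x stays strictly positive so that np.log(x) is defined everywhere.
x = np.linspace(0.1, 5.0, 50)

# All coefficients below are arbitrary illustration values, not fitted ones.
curves = {
    "linear": 3.0 + 2.0 * x,
    "cubic": 3.0 + x + x**2 + x**3,
    "quadratic": x**2,
    "exponential": np.exp(x),
    "logarithmic": np.log(x),
    "sigmoid": 1.0 / (1.0 + np.exp(-2.0 * (x - 2.5))),
}

def is_straight_line(y):
    """Return True when every consecutive slope of y against x is (nearly) the same."""
    slopes = np.diff(y) / np.diff(x)
    return bool(np.allclose(slopes, slopes[0]))

for name, y in curves.items():
    print(f"{name:12s} straight line: {is_straight_line(y)}")
```

To see the shapes rather than just test them, pass x and any of the arrays in curves to plt.plot().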
With many types of regressions to choose from, there is a good chance that one will fit your data set well.
Remember, it is important to pick a regression model that fits the data set the best.
I’m sure you have a few questions and I would generously answer what I think is the most obvious question…
How can I know if a problem is linear or non-linear in an easy way?
To answer the above question, we could do two things.
Visually figure out if the relationship is linear or non-linear. It’s best to plot bivariate plots of output variables with each input variable. See link on bivariate plots on Kaggle
Another easy option is to calculate the correlation coefficient between independent and dependent variables. This could easily be done in pandas by calling the .corr() function on the data set. As a rule of thumb, if the absolute value of the coefficient is 0.7 or higher for all variables, there is a linear tendency and a non-linear model is probably unnecessary.
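As a quick sketch of the correlation check, the snippet below builds a small made-up data frame (the column names and values are assumptions for illustration only) and calls .corr(), which computes the Pearson coefficient by default.

```python
import numpy as np
import pandas as pd

# Made-up demo data: one perfectly linear target and one quadratic target.
x = np.linspace(-5.0, 5.0, 101)
df_demo = pd.DataFrame({
    "x": x,
    "y_linear": 2.0 * x + 1.0,
    "y_quadratic": x**2,  # symmetric about x = 0
})

# Pairwise Pearson correlation between every pair of columns.
corr = df_demo.corr()
print(corr["x"])
```

Here the linear column correlates perfectly with x, while the quadratic column shows a coefficient near zero even though it depends entirely on x. A low coefficient therefore hints at "not linear" rather than "no relationship", which is why it is worth pairing this check with the bivariate plot.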
Okay, Enough Said!! Let’s get our hands dirty with some real live data…
We shall attempt to fit a non-linear model to data points corresponding to China’s GDP from 1960 to 2014. Our data set contains two columns, the first contains the years from 1960 to 2014, the second contains the corresponding Gross Domestic Product (GDP) values for each year.
This is a small data set with 55 rows and 2 columns, but it will suffice.
See link to the data set here in Github.
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data set straight from GitHub
china_gdp = 'https://raw.githubusercontent.com/Blackman9t/Machine_Learning/master/china_gdp.csv'
df = pd.read_csv(china_gdp)
df.head(10)
Next, we need to plot a bivariate graph of the data points. The independent variable X (Year) on the x-axis, and the dependent variable y (Value) on the y-axis.
X_data, y_data = (df['Year'].values, df['Value'].values)
plt.plot(X_data, y_data, 'ro')
plt.suptitle('Graph showing corresponding years and GDP values for China', y=1.02)
Hmmm… This looks kind of familiar. Can you guess which of the NLR charts we explored earlier has a similar curve as the data points above?
If you said Exponential or Logistic… you’re wrong… I’m kidding! Of course, you’re right!
It sure looks like Exponential or Logistic… The GDP growth starts off slow and then from 2005 onward, the growth is very significant, and then it decelerates slightly in the 2010s.
Choosing a model:
The Logistic function could be a good approximation since it has the property of starting slow, increasing growth in the middle and then decreasing again at the end.
Building the model:
Recall the sigmoid equation, y_hat = 1 / (1 + e^(-Beta_1(X - Beta_2))). Beta_1 controls the steepness of the curve, while Beta_2 slides the curve along the x-axis.
Now let’s build our regression model and initialise its parameters.
def sigmoid(X, Beta_1, Beta_2):
    """ This function applies the sigmoid function to param X and
    returns the outcome as a variable called y"""
    y = 1 / (1 + np.exp(-Beta_1*(X-Beta_2)))
    return y
Let’s now test our sigmoid function with some sample values.
beta_1 = 0.10
beta_2 = 1990.0
# logistic_function
y_pred = sigmoid(X_data, beta_1, beta_2)
# Plot initial predictions against data points.
plt.figure(figsize=(8,5))
plt.suptitle('Sample Plot: Sigmoid Function on data points')
# The sigmoid output lies between 0 and 1, so scale it up to the GDP range to make it visible (this scaling factor is an assumption)
plt.plot(X_data, y_pred * max(y_data))
plt.plot(X_data, y_data, 'ro')
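To see in numbers what the two parameters do, here is a small self-contained sketch. The three years and the parameter values are sample inputs picked for illustration, and the sigmoid is redefined so the snippet runs on its own.

```python
import numpy as np

def sigmoid(X, Beta_1, Beta_2):
    """Logistic curve: Beta_1 sets the steepness, Beta_2 the midpoint on the x-axis."""
    return 1 / (1 + np.exp(-Beta_1 * (X - Beta_2)))

years = np.array([1980.0, 1990.0, 2000.0])

base = sigmoid(years, 0.10, 1990.0)     # the sample values used above
steeper = sigmoid(years, 0.50, 1990.0)  # larger Beta_1: sharper rise around the midpoint
shifted = sigmoid(years, 0.10, 2000.0)  # larger Beta_2: same shape, moved to the right

print(np.round(base, 3))
print(np.round(steeper, 3))
print(np.round(shifted, 3))
```

Whatever Beta_1 is, the curve passes through 0.5 exactly where X equals Beta_2, which makes Beta_2 easy to read as the midpoint of the growth phase.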
Normalizing our variables:
At this point, let’s normalize our variables by dividing each by its maximum value, so that neither exceeds 1.
xdata = X_data / max(X_data)
ydata = y_data / max(y_data)
Finding the best parameters:
Our next task is to find the best parameters for the non-linear (logistic) model. We shall use the curve_fit() function from the scipy library, which uses non-linear least squares to fit the sigmoid function we defined above to the data points.
from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, xdata, ydata)
# popt holds our new optimized parameters
# pcov is the estimated covariance of popt
print('beta_1 = %f, beta_2 = %f' % (popt[0], popt[1]))
>>
beta_1 = 690.453017, beta_2 = 0.997207
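These values look odd at first because they live on the normalized axis (years divided by their maximum). A minimal sketch of converting them back to year units follows. It assumes the scale factor is 2014, the last year in the data set, and reuses the two numbers printed above. The conversion comes from rewriting Beta_1 * (X/s - Beta_2) as (Beta_1/s) * (X - Beta_2*s).

```python
import numpy as np

def sigmoid(X, Beta_1, Beta_2):
    return 1 / (1 + np.exp(-Beta_1 * (X - Beta_2)))

# Fitted values reported above, valid on the normalized x-axis.
beta_1_norm, beta_2_norm = 690.453017, 0.997207
x_scale = 2014.0  # assumption: max(X_data), the last year in the data set

# Equivalent parameters on the original year axis.
beta_1_years = beta_1_norm / x_scale  # steepness per year
beta_2_years = beta_2_norm * x_scale  # midpoint expressed as a year

years = np.arange(1960, 2015, dtype=float)
on_normalized_axis = sigmoid(years / x_scale, beta_1_norm, beta_2_norm)
on_year_axis = sigmoid(years, beta_1_years, beta_2_years)

print(round(beta_1_years, 3), round(beta_2_years, 1))
print(np.allclose(on_normalized_axis, on_year_axis))
```

Read this way, the fit says the curve's midpoint falls around 2008, with a steepness of roughly 0.34 per year.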
So now that we have the ideal parameters, which curve_fit() chose in order to minimize the sum of squared differences between each prediction and its corresponding actual value, we shall use them to plot our fitted model.
x = np.linspace(1960, 2015, 55)
# Normalize x with the same factor used for xdata, so both share one scale
x = x / max(X_data)
y = sigmoid(x, *popt)
# Plotting the original data points
plt.plot(xdata, ydata, 'ro', label='data')
# Plotting the fitted prediction line
plt.plot(x, y, linewidth=3.0, label='fit')
plt.ylabel('GDP', color='r', fontsize=18)
plt.xlabel('Year', color='r', fontsize=18)
plt.xticks(color = 'y')
plt.yticks(color = 'y')
plt.legend(loc='best')  # show the 'data' and 'fit' labels set above
plt.show()
As we can see it looks like a pretty good fit, but let’s evaluate our model…
First, let’s split the data into a training and testing data set.
msk = np.random.rand(len(df)) < 0.8
train_x = xdata[msk]
test_x = xdata[~msk]
train_y = ydata[msk]
test_y = ydata[~msk]
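One thing to keep in mind is that a random mask gives a different split, and so slightly different scores, on every run. A minimal sketch of a repeatable version is shown below. The seed value, the stand-in array and the variable names are assumptions for illustration, and it uses NumPy's Generator API instead of np.random.rand.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is repeatable
n_rows = 55                      # same row count as the GDP data set

# Roughly 80% of rows get True (training), the rest False (testing).
mask = rng.random(n_rows) < 0.8

demo_x = np.linspace(0.0, 1.0, n_rows)  # stand-in for xdata
train_demo, test_demo = demo_x[mask], demo_x[~mask]
print(len(train_demo), len(test_demo))
```

Note that the mask only targets 80% on average, so with 55 rows the actual training share can drift a little either way.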
Next, we build the model using the training set to extract the ideal params.
popt, pcov = curve_fit(sigmoid, train_x, train_y)
# Remember popt saves the ideal parameters from curve_fit method
# While pcov stores the covariance
print('Ideal params are: ', popt)
Ideal params are: [670.91888462 0.99708276]
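We have not used pcov so far. One common use is to turn its diagonal into a rough standard error for each parameter. The sketch below shows this on synthetic data so that it runs on its own: the "true" parameters, the noise level and the starting guess p0 are all made-up illustration values.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(X, Beta_1, Beta_2):
    return 1 / (1 + np.exp(-Beta_1 * (X - Beta_2)))

# Synthetic noisy data from a sigmoid with known parameters (illustration only).
rng = np.random.default_rng(1)
x_demo = np.linspace(0.0, 10.0, 100)
true_params = (1.5, 5.0)
y_demo = sigmoid(x_demo, *true_params) + rng.normal(scale=0.02, size=x_demo.size)

# p0 is the starting guess for (Beta_1, Beta_2).
popt_demo, pcov_demo = curve_fit(sigmoid, x_demo, y_demo, p0=[1.0, 4.0])

# The square roots of the diagonal of pcov approximate each parameter's standard error.
std_errors = np.sqrt(np.diag(pcov_demo))
print("params:    ", np.round(popt_demo, 3))
print("std errors:", np.round(std_errors, 3))
```

Small standard errors relative to the parameter values suggest the data pin the curve down well, while large ones are a hint that the chosen shape or the starting guess deserves a second look.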
Now, we make the predictions using the test set
y_hat = sigmoid(test_x, *popt)  # *popt unpacks popt into popt[0] and popt[1]
mean_abs_error = np.mean(np.absolute(y_hat - test_y))
mean_squ_error = np.mean((y_hat - test_y) ** 2)
print("Mean absolute error: %.2f" % mean_abs_error)
print("Mean squared error (MSE): %.2f" % mean_squ_error)
# Next let's check the R2 score, the coefficient of determination
from sklearn.metrics import r2_score
# r2_score expects the true values first, then the predictions
r_score = r2_score(test_y, y_hat)
print("R2-score: %.2f" % r_score)
>>
Mean absolute error: 0.04
Mean squared error (MSE): 0.00
MAE = 0.04; MSE = 0.00; R2-score = 0.95 (95%). Because the split is random, the exact values vary from run to run.
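For anyone curious what these three metrics actually compute, here is a minimal NumPy-only sketch on four made-up values. It also shows why the argument order matters for R2: the score is measured against the spread of whichever array is treated as the truth.

```python
import numpy as np

# Made-up true values and predictions, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

mae = np.mean(np.abs(y_pred - y_true))  # average size of the errors
mse = np.mean((y_pred - y_true) ** 2)   # average squared error

def r2(truth, prediction):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken from `truth`."""
    ss_res = np.sum((truth - prediction) ** 2)
    ss_tot = np.sum((truth - np.mean(truth)) ** 2)
    return 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("R2 (truth first):      ", r2(y_true, y_pred))
print("R2 (arguments swapped):", r2(y_pred, y_true))
```

The two R2 values differ slightly, so it pays to pass the actual values first and the predictions second.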
It takes a good dose of practice, but clearly, as we’ve seen with this small data set, it is possible to fit a non-linear regression line through a curvy data set. Python has an abundance of modules to help us fit a model to predict a continuous or even a categorical variable.
Feel free to go through the notebook on Github for more details, especially on plotting the NLR charts we did earlier.