Understanding Non-Linear Regression

Lawrence Alaso Krukrubo
Nov 6 · 9 min read

Knowing how to fit the model when you have a curvy data set…

All models are wrong, but some are useful… George E. P. Box

The goal of regression is to build a model to accurately predict unknown cases.

Regression is usually the process of predicting a continuous variable, such as housing prices, workers’ salaries or rainfall intensity, using historical data.

Understanding Non-linear Regression

Basically, there are just two types of regression (see the link from IBM):

- Simple Regression

- Multiple Regression

Both simple and multiple regression could be linear or non-linear.

The linearity of regression is based on the nature of the relationship between independent and dependent variables.

This article assumes the reader has intermediate knowledge of the concepts of simple and multiple regression. But fret not, for a refresher, check out my previous in-depth articles on simple and multiple linear regression.

We all love the sight of a scatter plot of our independent and dependent variables that shows an almost distinct straight line to fit our model, but the truth is that, in reality, many data sets display varying patterns.

So, if the data set shows a curvy trend, then indeed a linear regression model may be unsuitable. In such situations, we need to employ a non-linear regression model. We shall see a few examples in a minute…


Non-Linear Regression (NLR):

Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree k (the maximum power of X).

In fact, many different NLRs exist that may be used to fit whatever the data set looks like and these can go on and on to infinite degrees.

Collectively, we can safely call all of these NLRs polynomial regression, as long as the relationship between the independent variable X and the dependent variable y is modelled as an nth-degree polynomial in X. See the link from IBM.

So What is Polynomial Regression or Non-Linear Regression?

y_hat = b0 + b1x + b2x² + b3x³

where b0 is the intercept or bias unit and b1 to b3 are the slopes for each power of the variable x.

It sure looks like a feature set for a multiple linear regression, right? Just like the one below. Yes, it does. Indeed, polynomial regression is a special case of multiple linear regression; the main idea is how you select your features.

y_hat = b0 + b1x1 + b2x2 + b3x3

where b0 is the intercept or bias unit and b1 to b3 are the slopes of each independent variable x1 to x3.
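To make the “how do you select your features?” idea concrete, here is a minimal sketch (not from this article) using scikit-learn’s PolynomialFeatures, which expands a single x into the features x, x² and x³ so that an ordinary multiple linear regression can fit a cubic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# A toy cubic relationship: y = 1 + 2x + 3x^2 + 4x^3
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (1 + 2*x + 3*x**2 + 4*x**3).ravel()

# Expand x into [x, x^2, x^3] -- the "select your features" step
poly = PolynomialFeatures(degree=3, include_bias=False)
X = poly.fit_transform(x)

# An ordinary multiple linear regression on the expanded features
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # recovers roughly 1.0 and [2, 3, 4]
```

The model is linear in its coefficients even though the fitted curve is cubic in x, which is exactly why polynomial regression counts as a special case of multiple linear regression.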

Common Types of Non-Linear Regression:

y = b0 + b1x1

Linear regression models a relationship between a dependent variable y and the independent variable x. This relationship has a degree of 1.

Sample Linear Regression Chart

As mentioned earlier, there are many types of non-linear regression, but perhaps the most common are:

- Cubic

- Quadratic

- Exponential

- Logarithmic

- Sigmoidal / Logistic

Let’s briefly look at these…


1. Cubic:

The graph of a cubic function is not a straight line over the 2D plane. Let’s plot one, but first, take a look at the cubic equation below.

y_hat = b0 + b1x³ + b2x² + b3x
Sample Cubic Regression Chart
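A hedged sketch of how such a cubic chart could be plotted (the coefficients here are arbitrary, chosen only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 200)
# y_hat = b0 + b1*x^3 + b2*x^2 + b3*x, with illustrative coefficients
b0, b1, b2, b3 = 2, 1, 1, 1
y_hat = b0 + b1 * x**3 + b2 * x**2 + b3 * x

plt.plot(x, y_hat)
plt.title('Sample Cubic Regression Chart')
plt.show()
```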

2. Quadratic:

y_hat = x²
Sample Quadratic Regression Chart

3. Exponential:

y_hat = a + b·cˣ

where b ≠ 0, c > 0 and c ≠ 1; x is a real-valued variable, and a, b and c are constants.

Exponential might seem a bit confusing, but plotting it is pretty straightforward… Simply apply the numpy.exp() function and pass the variable X as its argument, in this form: y_hat = np.exp(X). Then plot variable X on the x-axis and y_hat on the y-axis.

Sample Exponential Regression Chart
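Following the description above, a minimal sketch of the exponential chart might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.arange(-5.0, 5.0, 0.1)
# Each unit step in X multiplies y_hat by the same constant factor e
y_hat = np.exp(X)

plt.plot(X, y_hat)
plt.title('Sample Exponential Regression Chart')
plt.show()
```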

4. Logarithmic:

y_hat = log(x)
Sample Logarithmic Regression Chart

5. Sigmoidal / Logistic:

y_hat = 1 / (1 + e^(−β1(x − β2)))

β1 controls the curve’s steepness, while β2 shifts the curve along the x-axis.
Sample Logistic Regression Chart
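To see what β1 and β2 do, here is a small sketch (the parameter values are chosen purely for illustration) comparing two logistic curves:

```python
import numpy as np
import matplotlib.pyplot as plt

def logistic(x, beta_1, beta_2):
    # y_hat = 1 / (1 + e^(-beta_1 * (x - beta_2)))
    return 1.0 / (1.0 + np.exp(-beta_1 * (x - beta_2)))

x = np.linspace(-10, 10, 400)
# Larger beta_1 -> steeper curve; larger beta_2 -> curve shifted right
plt.plot(x, logistic(x, 1.0, 0.0), label='beta_1=1, beta_2=0')
plt.plot(x, logistic(x, 3.0, 2.0), label='beta_1=3, beta_2=2')
plt.legend(loc='best')
plt.title('Sample Logistic Regression Chart')
plt.show()
```

Note that the curve always passes through 0.5 exactly at x = β2, which is why β2 reads as a horizontal shift.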

With many types of regressions to choose from, there is a good chance that one will fit your data set well.

Remember, it is important to pick a regression model that fits the data set the best.

I’m sure you have a few questions, and I’ll answer what I think is the most obvious one…

Question:

How can I know if a problem is linear or non-linear in an easy way?

To answer the above question, we could do two things.

A.

Visually figure out whether the relationship is linear or non-linear. It’s best to plot bivariate plots of the output variable against each input variable. See the link on bivariate plots on Kaggle.

B.

Another easy option is to calculate the correlation coefficient between the independent and dependent variables. This can easily be done in pandas by calling the .corr() method on the data set. If the coefficient is 0.7 or higher for all variables, there is a linear tendency, and a non-linear model would be inappropriate.
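As a hedged sketch of option B (the column names and data here are made up purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    'feature': x,
    'target_linear': 2 * x + rng.normal(scale=0.1, size=100),  # near-linear relation
    'target_quadratic': x ** 2,                                # curvy relation
})

# Pearson correlation between every pair of columns
corr = df.corr()
print(corr)
```

One caveat worth knowing: ‘target_quadratic’ depends perfectly on ‘feature’, yet its correlation coefficient comes out near zero, because Pearson correlation only measures linear association. So a low coefficient is a hint to inspect the bivariate plot, not proof that no relationship exists.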

Okay, enough said! Let’s get our hands dirty with some real-life data…


We shall attempt to fit a non-linear model to data points corresponding to China’s GDP from 1960 to 2014. Our data set contains two columns, the first contains the years from 1960 to 2014, the second contains the corresponding Gross Domestic Product (GDP) values for each year.

This is a small data set with 55 rows and 2 columns, but it will suffice.

See link to the data set here in Github.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

china_gdp = 'https://raw.githubusercontent.com/Blackman9t/Machine_Learning/master/china_gdp.csv'
df = pd.read_csv(china_gdp)
df.head(10)
Displaying the first 10 rows…

Next, we need to plot a bivariate graph of the data points. The independent variable X (Year) on the x-axis, and the dependent variable y (Value) on the y-axis.

plt.figure(figsize=(8,5))
X_data, y_data = (df['Year'].values, df['Value'].values)
plt.plot(X_data, y_data, 'ro')
plt.suptitle('Graph showing corresponding years and GDP values for China', y=1.02)

plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()

Hmmm… This looks kind of familiar. Can you guess which of the NLR charts we explored earlier has a similar curve as the data points above?

If you said Exponential or Logistic… you’re wrong… I’m kidding! Of course, you’re right!

It sure looks like Exponential or Logistic… The GDP growth starts off slow and then from 2005 onward, the growth is very significant, and then it decelerates slightly in the 2010s.

Choosing a model:

From the shape of the data points, the logistic (sigmoid) function looks like a good choice: it starts off with slow growth, accelerates in the middle, and then levels off towards the end.

Building the model:

Now let’s build our regression model and initialise its parameters.

def sigmoid(X, Beta_1, Beta_2):
    """This method applies the sigmoid function to param X and
    returns the outcome as a variable called y.
    """
    y = 1 / (1 + np.exp(-Beta_1*(X-Beta_2)))
    return y

Let’s now test our sigmoid function with some sample values

beta_1 = 0.10
beta_2 = 1990.0
# logistic_function
y_pred = sigmoid(X_data, beta_1, beta_2)
# Plot initial predictions against data points
plt.figure(figsize=(8,5))
plt.suptitle('Sample Plot: Sigmoid Function on data points')
plt.plot(X_data, y_pred*15000000000000.0)
plt.plot(X_data, y_data, 'ro')
plt.show()
The blue line is our sample sigmoid model; the red dots are the data points.

Normalizing our variables:

xdata = X_data / max(X_data)
ydata = y_data / max(y_data)

Finding the best parameters:

from scipy.optimize import curve_fit

popt, pcov = curve_fit(sigmoid, xdata, ydata)
# popt are our new optimized parameters
# pcov represents the covariance
print('beta_1 = %f, beta_2 = %f' % (popt[0],popt[1]))
>>
beta_1 = 690.453017, beta_2 = 0.997207

So now that we have the ideal parameters, thanks to the curve_fit() method, we shall use them to fit our model, in order to minimize the sum of squared differences between each prediction and its corresponding actual value.

x = np.linspace(1960, 2015, 55)
# Normalize x
x = x / max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, popt[0], popt[1])

# Plotting the original data points
plt.plot(xdata, ydata, 'ro', label='data')
# Plotting the fitted prediction line
plt.plot(x, y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP', color='r', fontsize=18)
plt.xlabel('Year', color='r', fontsize=18)
plt.xticks(color = 'y')
plt.yticks(color = 'y')
plt.show()
The fitted model (blue line) on the curvy data points…

As we can see it looks like a pretty good fit, but let’s evaluate our model…

Model evaluation…

First, let’s split the data into a training and testing data set.

msk = np.random.rand(len(df)) < 0.8 
train_x = xdata[msk]
test_x = xdata[~msk]
train_y = ydata[msk]
test_y = ydata[~msk]

Next, we build the model using the training set to extract ideal params

popt, pcov = curve_fit(sigmoid, train_x, train_y)
# Remember popt saves the ideal parameters from curve_fit method
# While pcov stores the covariance
print('Ideal params are: ', popt)
>>
Ideal params are: [670.91888462 0.99708276]
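As a side note not covered in the article, the pcov matrix returned by curve_fit() can give rough standard errors for the fitted parameters. A minimal sketch on synthetic data (the true parameter values 12.0 and 0.5 are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(X, Beta_1, Beta_2):
    return 1 / (1 + np.exp(-Beta_1 * (X - Beta_2)))

# Synthetic data drawn from a known sigmoid, plus a little noise
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 55)
y = sigmoid(x, 12.0, 0.5) + rng.normal(scale=0.01, size=x.size)

# p0 is just a rough initial guess for the optimizer
popt, pcov = curve_fit(sigmoid, x, y, p0=[5.0, 0.4])
perr = np.sqrt(np.diag(pcov))  # approximate standard errors of popt
print('params:', popt)   # close to the true values 12.0 and 0.5
print('errors:', perr)
```

Small standard errors relative to the parameter values are a quick sanity check that the fit is well constrained by the data.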

Now, we make the predictions using the test set

# *popt means unpack popt into popt[0] and popt[1]
y_hat = sigmoid(test_x, *popt)

Evaluation…

mean_abs_error = np.mean(np.absolute(y_hat - test_y))
mean_squ_error = np.mean((y_hat - test_y) ** 2)
print("Mean absolute error: %.2f" % mean_abs_error)
print("Residual sum of squares (MSE): %.2f" % mean_squ_error)

# Next let's check the R2 score, the coefficient of determination
from sklearn.metrics import r2_score
r_score = r2_score(test_y, y_hat)
print("R2-score: %.2f" % r_score)
>>
Mean absolute error: 0.04
Residual sum of squares (MSE): 0.00
R2-score: 0.95

MAE = 0.04; MSE = 0.00; R2-score = 0.95 (95%)

Summary:

Feel free to go through the notebook on Github for more details, especially on plotting the NLR charts we did earlier.

Cheers!

Towards AI

Towards AI is the world’s fastest-growing AI community for learning, programming, building and implementing AI.

Lawrence Alaso Krukrubo

Written by

A lover of wildlife, family and billiards… Data Science|Machine Learning|Deep Learning|Writer@towardsAI|Sales|Project_Mgt|…stupidly curious.
