
Non-Linear Regression Analysis

Samyak Kala · Published in Analytics Vidhya · Mar 2, 2021 · 5 min read


Introduction

If the data shows a curvy trend, then linear regression will not produce very accurate results compared to a non-linear regression because, as the name implies, linear regression presumes that the data is linear. Let’s learn about non-linear regression and work through an example in Python: fitting a non-linear model to the data points corresponding to China’s GDP from 1960 to 2014.

Importing required libraries

# Basic Data Science libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

Though linear regression is very good at solving many problems, it cannot be used for all datasets. First, recall how linear regression models a dataset: it models a linear relation between a dependent variable y and an independent variable x, using a simple equation of degree 1, for example y = 2𝑥 + 3.

x = np.arange(-5.0, 5.0, 0.1)
#You can adjust the slope and intercept to verify the changes in the graph.
y = 2*(x) + 3
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.figure(figsize=(8,6))
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

Non-linear regression models the relationship between an independent variable 𝑥 and a dependent variable 𝑦 with a non-linear function. Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree 𝑘 (the maximum power of 𝑥), such as:

𝑦 = 𝑎𝑥³ + 𝑏𝑥² + 𝑐𝑥 + 𝑑

Non-linear functions can have elements like exponentials, logarithms, fractions, and others. For example:

𝑦=log(𝑥)

Or even more complicated, such as:

𝑦=log(𝑎𝑥³+𝑏𝑥²+𝑐𝑥+𝑑)
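For instance, here is a minimal sketch of what a logarithmic curve looks like (keeping in mind that the logarithm is only defined for positive 𝑥):

# a quick sketch of a logarithmic curve
x = np.arange(0.1, 5.0, 0.1)  # log is only defined for x > 0
y = np.log(x)
plt.figure(figsize=(8,6))
plt.plot(x, y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()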

Let’s take a look at a cubic function’s graph.

x = np.arange(-5.0, 5.0, 0.1)
#You can adjust the slope and intercept to verify the changes in the graph
y = 1*(x**3) + 1*(x**2) + 1*x + 3
y_noise = 20 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()

As you can see, this function contains 𝑥³ and 𝑥² terms. Also, the graph of this function is not a straight line over the 2D plane, so this is a non-linear function.

Non-Linear Regression example

As an example, we’re going to fit a non-linear model to the data points corresponding to China’s GDP from 1960 to 2014. We download a dataset with two columns: the first is a year between 1960 and 2014, and the second is China’s corresponding annual gross domestic product in US dollars for that year.

import numpy as np
import pandas as pd
#downloading dataset
!wget -nv -O china_gdp.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv

df = pd.read_csv("china_gdp.csv")
df.head(10)

Plotting the Dataset

This is what the data points look like. It resembles either a logistic or an exponential function: the growth starts off slow, then from 2005 onward it becomes very significant, and finally it decelerates slightly in the 2010s.

plt.figure(figsize=(8,5))
x_data, y_data = (df["Year"].values, df["Value"].values)
plt.plot(x_data, y_data, 'ro')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()

Choosing a model

From an initial look at the plot, we determine that the logistic function could be a good approximation, since it has the property of starting with slow growth, growing rapidly in the middle, and then slowing down again at the end, as illustrated below:

X = np.arange(-5.0, 5.0, 0.1)
Y = 1.0 / (1.0 + np.exp(-X))
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
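The general form of the sigmoid we will fit (matching the code below) has two parameters:

Ŷ = 1 / (1 + e^(−β₁(x − β₂)))

Here β₁ controls the steepness of the curve, and β₂ slides the curve along the x-axis.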

Building The Model

Now, let’s build our regression model and initialize its parameters.

def sigmoid(x, Beta_1, Beta_2):
    y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))
    return y

# initial guesses for the parameters
beta_1 = 0.10
beta_2 = 1990.0
# logistic function with the initial parameters
Y_pred = sigmoid(x_data, beta_1, beta_2)
# plot the initial prediction against the datapoints
plt.plot(x_data, Y_pred*15000000000000.)
plt.plot(x_data, y_data, 'ro')
plt.show()

Our task here is to find the best parameters for our model. Let’s first normalize our x and y (see the code below).

How do we find the best parameters for our fit line?

We can use curve_fit, which uses non-linear least squares to fit our sigmoid function to the data. It finds optimal values for the parameters so that the sum of the squared residuals of sigmoid(xdata, *popt) - ydata is minimized; popt holds our optimized parameters.

# Lets normalize our data
xdata = x_data/max(x_data)
ydata = y_data/max(y_data)
from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, xdata, ydata)
# Now we plot our resulting regression model.
x = np.linspace(1960, 2015, 55)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
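Since popt holds the optimized parameters, a quick sanity check is to print them:

# display the parameters found by curve_fit
print("beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))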

Calculating the Accuracy of Our Model

# split data into train/test
msk = np.random.rand(len(df)) < 0.8
train_x = xdata[msk]
test_x = xdata[~msk]
train_y = ydata[msk]
test_y = ydata[~msk]
# build the model using train set
popt, pcov = curve_fit(sigmoid, train_x, train_y)
# predict using test set
y_hat = sigmoid(test_x, *popt)
# evaluation
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_hat - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_hat - test_y) ** 2))
from sklearn.metrics import r2_score
print("R2-score: %.2f" % r2_score(y_hat , test_y)

Mean absolute error: 0.05
Residual sum of squares (MSE): 0.00
R2-score: 0.95
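As a quick usage sketch, remember that the model was fit on normalized data, so to estimate GDP for a given year we scale the year down and scale the prediction back up; the year 2015 below is just an illustrative choice:

# illustrative sketch: predict GDP for a chosen year with the fitted model
year = 2015
gdp_norm = sigmoid(year / max(x_data), *popt)  # the model expects normalized years
gdp_pred = gdp_norm * max(y_data)              # rescale back to US dollars
print("Predicted GDP for %d: %.3e USD" % (year, gdp_pred))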

Thanks for Reading

For more such content click here and follow me.
