When we hear the word regression, most of us think only of linear and logistic regression. These two are the most popular models, but there are many other types of regression models that are useful in different cases.
Today, let's discuss spline regression using R. Before going further, you should have a basic understanding of what a spline is.
A spline is a special function defined piecewise by polynomials. The term “spline” is used to refer to a wide class of functions that are used in applications requiring data interpolation and/or smoothing. The data may be either one-dimensional or multi-dimensional.
Spline regression is a non-parametric regression technique. In this technique the dataset is divided into bins at points called knots, and each bin gets its own separate fit. We will go through a simple example to understand how to implement spline regression in R.
I have created some dummy data, as shown below:
Let's plot our data. The plot is shown below:
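The original dummy data is not reproduced here, so the following is only a minimal sketch that simulates a stand-in: 180 days of a hypothetical Sales variable that rises until Day = 90 and falls afterwards. The column name Sales, the slopes, and the noise level are all assumptions made for illustration.

```r
# Hypothetical stand-in data (assumption; not the article's original data)
set.seed(42)
Day <- 1:180
Sales <- ifelse(Day <= 90,
                10 + 0.5 * Day,           # rising part up to Day = 90
                55 - 0.4 * (Day - 90)) +  # falling part after Day = 90
  rnorm(length(Day), sd = 3)              # random noise
dat <- data.frame(Day, Sales)

head(dat)

# Scatter plot of the raw data
plot(Sales ~ Day, data = dat, main = "Dummy data")
```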
Now we will fit a simple linear regression to our data and see how well it works. The equation of the line is y = β0 + β1·x, where β0 is the intercept and β1 is the slope.
Let's fit the regression line to our data:
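As a rough sketch, a straight-line fit in R uses lm(). The data below is a simulated stand-in (the Sales column and its shape are assumptions, not the article's data), so the numbers it prints will differ from the original output.

```r
# Hypothetical stand-in data (assumption; not the article's original data)
set.seed(42)
Day <- 1:180
Sales <- ifelse(Day <= 90, 10 + 0.5 * Day, 55 - 0.4 * (Day - 90)) +
  rnorm(length(Day), sd = 3)
dat <- data.frame(Day, Sales)

# Simple linear regression: Sales = b0 + b1 * Day
fit_lin <- lm(Sales ~ Day, data = dat)
summary(fit_lin)

# Extract the goodness-of-fit measures discussed in the text
r2_lin     <- summary(fit_lin)$r.squared
adj_r2_lin <- summary(fit_lin)$adj.r.squared

# Overlay the fitted line on the data
plot(Sales ~ Day, data = dat)
abline(fit_lin, col = "red", lwd = 2)
```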
We can see that the model fits our data very poorly, and the R-squared and adjusted R-squared values are also very low.
Let's now try polynomial regression with degree 2 and analyze the performance of the model. The equation now becomes y = β0 + β1·x + β2·x^2.
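A degree-2 fit can be sketched in R by adding a squared term to the formula. As before, the data is a simulated stand-in with an assumed Sales column, so treat the output as illustrative only.

```r
# Hypothetical stand-in data (assumption; not the article's original data)
set.seed(42)
Day <- 1:180
Sales <- ifelse(Day <= 90, 10 + 0.5 * Day, 55 - 0.4 * (Day - 90)) +
  rnorm(length(Day), sd = 3)
dat <- data.frame(Day, Sales)

# Straight line, kept for comparison
fit_lin <- lm(Sales ~ Day, data = dat)

# Quadratic polynomial: Sales = b0 + b1 * Day + b2 * Day^2
fit_quad <- lm(Sales ~ Day + I(Day^2), data = dat)
summary(fit_quad)

r2_lin  <- summary(fit_lin)$r.squared
r2_quad <- summary(fit_quad)$r.squared

# Plot the fitted curve over the data
plot(Sales ~ Day, data = dat)
lines(dat$Day, fitted(fit_quad), col = "red", lwd = 2)
```

I() is needed around Day^2 because ^ has a special meaning inside an R formula.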
We can see that it's not a bad fit, but not a great one either. Polynomial regression also has disadvantages: it tends to overfit, and the model becomes more complex as the number of features increases.
The disadvantages of polynomial regression can be overcome by using spline regression. Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines.
Splines provide a way to smoothly interpolate between fixed points, called knots. A separate polynomial is fitted between each pair of adjacent knots. In other words, a spline is a series of polynomial segments strung together, joining at the knots.
In spline regression we divide the dataset into bins. In our case, let us divide the dataset into two bins at around Day = 90: one to the left of the peak that occurs at Day = 90, and the other to its right. Now let's plot each part accordingly.
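The two-bin idea can be sketched by fitting a separate straight line on each side of Day = 90. The data is again a simulated stand-in (assumed Sales column and shape), and fitting each bin with a plain lm() is my assumption about how the two plots were produced.

```r
# Hypothetical stand-in data (assumption; not the article's original data)
set.seed(42)
Day <- 1:180
Sales <- ifelse(Day <= 90, 10 + 0.5 * Day, 55 - 0.4 * (Day - 90)) +
  rnorm(length(Day), sd = 3)
dat <- data.frame(Day, Sales)

# Split the data into two bins at the knot Day = 90
left  <- subset(dat, Day <= 90)
right <- subset(dat, Day > 90)

# Fit a separate straight line in each bin
fit_left  <- lm(Sales ~ Day, data = left)
fit_right <- lm(Sales ~ Day, data = right)

# Plot both fits, with a dashed line marking the knot
plot(Sales ~ Day, data = dat, col = "grey40")
lines(left$Day,  fitted(fit_left),  col = "red",  lwd = 2)
lines(right$Day, fitted(fit_right), col = "blue", lwd = 2)
abline(v = 90, lty = 2)
```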
Now we need to merge the two fits shown above into one equation and perform piecewise regression using a step function. Let's see how it's done. First, we need to create a new categorical variable that is 1 if Day > 90 and 0 otherwise.
Let’s fit the model on this data.
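The exact model formula is not shown here, so the following is one common formulation rather than necessarily the one used originally: a linear spline with a single knot at Day = 90, where the indicator switches on an extra slope term after the knot. The variable name After90 and the simulated data are assumptions.

```r
# Hypothetical stand-in data (assumption; not the article's original data)
set.seed(42)
Day <- 1:180
Sales <- ifelse(Day <= 90, 10 + 0.5 * Day, 55 - 0.4 * (Day - 90)) +
  rnorm(length(Day), sd = 3)
dat <- data.frame(Day, Sales)

# Indicator variable: 1 if Day > 90, else 0
dat$After90 <- ifelse(dat$Day > 90, 1, 0)

# Piecewise linear model:
#   Sales = b0 + b1 * Day + b2 * (Day - 90) * After90
# The slope is b1 up to the knot and b1 + b2 after it.
fit_pw <- lm(Sales ~ Day + I((Day - 90) * After90), data = dat)
summary(fit_pw)

r2_pw <- summary(fit_pw)$r.squared

# Plot the fitted piecewise line
plot(Sales ~ Day, data = dat, col = "grey40")
lines(dat$Day, fitted(fit_pw), col = "red", lwd = 2)
```

Writing the extra term as (Day - 90) * After90 keeps the two segments joined at the knot. Using an interaction such as Day * After90 instead would allow a jump at Day = 90.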
We can see that the R-squared and adjusted R-squared values have increased considerably. Now let's see how our regression line fits.
Now, let's introduce the splines package in R, which includes the function bs() for creating a B-spline term in a regression model. I will introduce a new example using the Clothing dataset from the Ecdat package.
Now let's fit our model. We have to indicate the number and placement of the knots. A common choice is to place them at the 25th, 50th, and 75th percentiles.
We have used the lm() function to build a linear model. However, we also used the bs() function, which is what allows us to create our spline regression model.
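The call pattern looks roughly like the sketch below. To keep it runnable with only the splines package (which ships with R), it uses simulated stand-in data instead of the Clothing data. The variables x and y and the name knot_locs are hypothetical.

```r
library(splines)

# Simulated stand-in data (assumption; not the Clothing data)
set.seed(42)
x <- runif(300, 0, 100)
y <- 10 * sin(x / 15) + 0.1 * x + rnorm(300, sd = 2)
dat <- data.frame(x, y)

# Knots at the 25th, 50th and 75th percentiles of the predictor
knot_locs <- unname(quantile(dat$x, probs = c(0.25, 0.50, 0.75)))

# lm() fits the model; bs() builds the B-spline basis (cubic by default)
fit_bs <- lm(y ~ bs(x, knots = knot_locs), data = dat)
summary(fit_bs)

# One intercept plus (number of knots + degree) basis coefficients
n_coef <- length(coef(fit_bs))
```

With three knots and the default degree of 3, the model has an intercept plus six spline coefficients.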
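Remember that the default spline in R's bs() function is a third-degree (cubic) polynomial. A cubic is the usual choice because the pieces join with continuous first and second derivatives, which makes it hard for the eye to detect the discontinuity at the knots.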
We are now ready to plot our model. The code below graphs the model and includes the regression line (red), confidence interval (green), as well as the location of each knot (blue).
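A plot along these lines can be sketched with predict() and base graphics. The data is a simulated stand-in rather than the Clothing data, and the names x, y, knot_locs and new_x are hypothetical.

```r
library(splines)

# Simulated stand-in data (assumption; not the Clothing data)
set.seed(42)
x <- runif(300, 0, 100)
y <- 10 * sin(x / 15) + 0.1 * x + rnorm(300, sd = 2)
dat <- data.frame(x, y)

knot_locs <- unname(quantile(dat$x, probs = c(0.25, 0.50, 0.75)))
fit_bs <- lm(y ~ bs(x, knots = knot_locs), data = dat)

# Predict on an evenly spaced grid, with a 95% confidence interval
new_x <- data.frame(x = seq(min(dat$x), max(dat$x), length.out = 200))
pred <- predict(fit_bs, newdata = new_x, interval = "confidence")

plot(y ~ x, data = dat, col = "grey60")
lines(new_x$x, pred[, "fit"], col = "red", lwd = 2)    # regression line
lines(new_x$x, pred[, "lwr"], col = "green", lty = 2)  # lower confidence bound
lines(new_x$x, pred[, "upr"], col = "green", lty = 2)  # upper confidence bound
abline(v = knot_locs, col = "blue", lty = 3)           # knot locations
```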
How to choose the number and locations of the knots?
When we fit a spline, where should we place the knots? The regression spline is most flexible in regions that contain a lot of knots, because in those regions the polynomial coefficients can change rapidly.
Hence, one option is to place more knots in places where we feel the function might vary most rapidly, and to place fewer knots where it seems more stable.
While this option can work well, in practice it is common to place knots in a uniform fashion.
One way to do this is to specify the desired degrees of freedom, and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.
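In R, bs() supports this through its df argument. The short sketch below, on simulated data with a hypothetical variable x, checks where the knots end up when only the degrees of freedom are given.

```r
library(splines)

# Simulated skewed predictor (assumption, for illustration only)
set.seed(1)
x <- rexp(500)

# Ask for 6 degrees of freedom; with the default cubic degree and no
# intercept column, this gives 6 - 3 = 3 interior knots
basis <- bs(x, df = 6)

auto_knots <- unname(attr(basis, "knots"))
quartiles  <- unname(quantile(x, probs = c(0.25, 0.50, 0.75)))

auto_knots
quartiles
```

The two printed vectors should agree, showing that the knots were placed at the quartiles of x rather than at equally spaced values.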
How many degrees of freedom should our spline contain?
One option is to try out different numbers of knots and see which produces the best looking curve. A somewhat more objective approach is to use cross-validation.
With this method, we remove a portion of the data (say 10%), fit a spline with a certain number of knots to the remaining data, and then use the spline to make predictions for the held-out portion.
We repeat this process multiple times until each observation has been left out once, and then compute the overall cross-validated RSS (residual sum of squares). This procedure can be repeated for different numbers of knots K. Then the value of K giving the smallest RSS is chosen.
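The procedure above can be sketched as a 10-fold cross-validation loop. This is a minimal illustration on simulated data. The candidate range K = 1 to 8, the fold assignment, and all variable names are assumptions.

```r
library(splines)

# Simulated stand-in data (assumption, for illustration only)
set.seed(123)
n <- 300
x <- runif(n, 0, 100)
y <- 10 * sin(x / 15) + rnorm(n, sd = 2)
dat <- data.frame(x, y)

# Randomly assign each observation to one of 10 folds
k_folds <- 10
fold <- sample(rep(seq_len(k_folds), length.out = n))

# Fix the boundary knots on the full data so that held-out points
# never fall outside the range of the fitted spline
bounds <- range(dat$x)

candidate_K <- 1:8
cv_rss <- sapply(candidate_K, function(K) {
  rss <- 0
  for (f in seq_len(k_folds)) {
    train    <- dat[fold != f, ]
    held_out <- dat[fold == f, ]
    # df = K + 3 gives K interior knots for a cubic spline
    fit  <- lm(y ~ bs(x, df = K + 3, Boundary.knots = bounds), data = train)
    pred <- predict(fit, newdata = held_out)
    rss  <- rss + sum((held_out$y - pred)^2)
  }
  rss
})
names(cv_rss) <- candidate_K

cv_rss
best_K <- candidate_K[which.min(cv_rss)]
best_K
```

Because the folds are random, the chosen K can change with the seed, so it is worth checking whether neighbouring values of K give a similar RSS.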
With that, I would like to conclude this article. In this blog we looked at spline regression and how it works. There is a lot more to study in spline regression, such as smoothing splines and cubic splines. Let's look at these in my next blog.
I hope you were able to follow along and grasp the idea of spline regression.