What are GAMs?

Saurav Jadhav
Published in Analytics Vidhya
4 min read · Jun 25, 2020

Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing nonlinear functions of each of the variables, while maintaining additivity. Let’s see exactly what that means.

Linear models are simple to describe and implement, and they have an advantage over other approaches in terms of interpretation and inference. But they can be limited in predictive power, that is, in how accurately we can predict the output. Suppose we have data consisting of p input features (X1, X2, …, Xp) and an output Y. The corresponding linear model (also known as the multiple linear regression model) for predicting the output is:

Y = β0 + β1X1 + β2X2 +···+ βpXp + Ɛ

where β0, β1, …, βp are the parameters of the model and Ɛ is the error term (the irreducible error). To allow for nonlinear relationships between each feature and the response (output), we replace each linear component βjXj with a (smooth) nonlinear function fj(Xj) of the jth feature. We would then write the model as

Y = β0 + f1(X1) + f2(X2) + f3(X3) + … + fp(Xp) + Ɛ

This is an example of a GAM. It is called an additive model because we calculate a separate fj for each Xj and then add together all of their contributions. Now the question is how to find these nonlinear functions. It turns out there are various methods, but we will look specifically at natural splines, using the following example:

Wage = β0 + f1(year) + f2(age) + f3(education) + Ɛ    (1)
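To make the additive structure concrete, here is a minimal Python sketch of how a prediction is assembled from separate per-feature functions. The helper name predict_additive, the coefficients and the component functions are all made up for illustration; they are not fitted to any data. Note that the linear model is just the special case where each fj(Xj) is βjXj.

```python
import math


def predict_additive(beta0, component_funcs, x):
    """Return beta0 + f1(x1) + ... + fp(xp) (the error term is not predicted)."""
    return beta0 + sum(f(xj) for f, xj in zip(component_funcs, x))


# Hypothetical inputs: two features and an intercept.
x = [3.0, 0.0]
beta0 = 1.0

# Linear model as a special case: f_j(x_j) = beta_j * x_j.
betas = [2.0, -1.0]
linear_funcs = [lambda v, b=b: b * v for b in betas]

# A GAM-style model: each feature gets its own (here arbitrary) nonlinear function.
nonlinear_funcs = [lambda v: v ** 2, math.sin]

linear_prediction = predict_additive(beta0, linear_funcs, x)      # 1 + 2*3 - 1*0
additive_prediction = predict_additive(beta0, nonlinear_funcs, x)  # 1 + 3**2 + sin(0)
print(linear_prediction, additive_prediction)
```

In both cases the contributions are simply summed, which is what keeps the model easy to interpret one feature at a time.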

Before discussing natural splines, it is worth noting that relationships in real-world data are often nonlinear and frequently very complex, so that even a single standard nonlinear function fitted over the whole range may not approximate the relationship well. Instead of fitting a high-degree polynomial over the entire range of a feature, piecewise polynomial regression fits separate low-degree polynomials over different regions of that range. A spline is a piecewise degree-d polynomial whose first d-1 derivatives are continuous at the points where the pieces meet (the knots). A natural spline adds boundary constraints: the function is required to be linear beyond the outermost knots.

To be concrete, in equation (1) we predict wage on the basis of year, age and education. Each function describes the effect of one feature with the others held constant; f2, for example, describes how wage varies with ‘age’ when ‘year’ and ‘education’ are held fixed. We expect wage to increase with age and then fall after retirement, so the relationship is increasing up to a certain age and decreasing afterwards. We can therefore fit one polynomial up to, say, age 60 to capture the increasing part, and another polynomial after 60 to capture the decreasing part. This enables us to flexibly extract the relationship between the feature and the response, and the constraints (continuity of the derivatives) enable us to join the two polynomials smoothly.
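As a rough illustration of these constraints, the sketch below evaluates a natural cubic spline basis (the d = 3 case) built from truncated cubic terms, which is one standard way to construct it. The knot locations and coefficients are hypothetical and chosen only so the arithmetic is easy to follow. The check at the end uses a second difference as a stand-in for curvature: it is nonzero between the knots, and zero beyond the last knot, where a natural spline is forced to be linear.

```python
def natural_cubic_basis(x, knots):
    """Evaluate the K natural cubic spline basis functions at a scalar x.

    `knots` must be sorted and hold at least two values.
    """
    last = knots[-1]

    def pos3(u):
        # Truncated cubic: u^3 when u > 0, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    def d(k):
        return (pos3(x - knots[k]) - pos3(x - last)) / (last - knots[k])

    basis = [1.0, float(x)]
    for k in range(len(knots) - 2):
        basis.append(d(k) - d(len(knots) - 2))
    return basis


def spline_value(x, knots, coefs):
    """Weighted sum of the basis functions, i.e. one fitted f_j(x)."""
    return sum(c * b for c, b in zip(coefs, natural_cubic_basis(x, knots)))


def second_difference(f, x, h=5.0):
    """f(x-h) - 2 f(x) + f(x+h); zero whenever f is linear on [x-h, x+h]."""
    return f(x - h) - 2.0 * f(x) + f(x + h)


age_knots = [25.0, 40.0, 60.0, 75.0]  # hypothetical knot positions
coefs = [0.0, 0.0, 1.0, 0.0]          # hypothetical: only the third basis function


def f_age(a):
    return spline_value(a, age_knots, coefs)


inside = second_difference(f_age, 50.0)   # between knots: curved
outside = second_difference(f_age, 85.0)  # beyond the last knot: linear
print(inside, outside)
```

With K knots this basis has K functions, so a natural spline uses fewer parameters than an unconstrained cubic spline on the same knots.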

Returning to the GAM in equation (1): ‘year’ and ‘age’ are quantitative variables, and ‘education’ is a qualitative variable with five levels, <HS, HS, <Coll, Coll and >Coll, referring to the amount of high school or college education that an individual has completed. We fit the first two functions using natural splines. We fit the third function using a separate constant for each level, via the dummy variable approach: for each level of education we create a separate binary feature taking the value 0 or 1. For example, if a person’s highest education is high school (HS), the ‘HS’ feature is 1 and the features for all other levels are 0.
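The dummy variable approach can be sketched in a few lines of Python. The function names and the per-level constants below are hypothetical, chosen only to show the mechanics. One assumption worth flagging: when the model also has an intercept β0, software usually drops one level as a baseline, whereas this sketch keeps all five indicators to match the description above.

```python
EDUCATION_LEVELS = ["<HS", "HS", "<Coll", "Coll", ">Coll"]


def dummy_encode(level, levels=EDUCATION_LEVELS):
    """Return one 0/1 indicator per level, with a 1 at the observed level."""
    if level not in levels:
        raise ValueError(f"unknown education level: {level!r}")
    return [1 if level == candidate else 0 for candidate in levels]


def f3(level, constants):
    """Step function for education: picks out the constant of the observed level."""
    return sum(c * d for c, d in zip(constants, dummy_encode(level)))


# Hypothetical constants, one per education level (not estimated from data).
level_constants = [0.0, 10.0, 20.0, 30.0, 40.0]

hs_dummies = dummy_encode("HS")
coll_effect = f3("Coll", level_constants)
print(hs_dummies, coll_effect)
```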

Figure 1. Source: An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

Figure 1 shows the result of fitting model (1) by least squares: the fitted relationship between ‘year’ and wage, holding age and education constant. Wage tends to increase slightly with year; this may be due to inflation.

Figure 2. Source: An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

Figure 2 indicates that holding education and year fixed, wage tends to be highest for intermediate values of age, and lowest for the very young and very old.

Figure 3. Source: An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

Figure 3 indicates that holding year and age fixed, wage tends to increase with education: the more educated a person is, the higher their salary, on average.
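Because natural splines and dummy variables are just extra columns in a design matrix, a GAM of this kind can be fitted as one big least squares regression. The sketch below illustrates that idea with NumPy on synthetic data; it does not use the Wage data set from the book, and every number in it (sample size, knots, coefficients, noise level) is an assumption made for the demonstration. The features are rescaled to [0, 1] to keep the arithmetic well behaved, and the first education level is dropped as the baseline because the model has an intercept.

```python
import numpy as np


def ns_columns(x, knots):
    """Natural cubic spline basis columns for x (constant column omitted)."""
    x = np.asarray(x, dtype=float)
    last = knots[-1]

    def pos3(u):
        return np.maximum(u, 0.0) ** 3

    def d(k):
        return (pos3(x - knots[k]) - pos3(x - last)) / (last - knots[k])

    cols = [x] + [d(k) - d(len(knots) - 2) for k in range(len(knots) - 2)]
    return np.column_stack(cols)


# Synthetic data, NOT the Wage data set: all values below are made up.
rng = np.random.default_rng(0)
n = 500
year = rng.uniform(0.0, 1.0, n)   # rescaled to [0, 1]
age = rng.uniform(0.0, 1.0, n)    # rescaled to [0, 1]
edu = rng.integers(0, 5, n)       # 0..4 stands for <HS .. >Coll
wage = 80 + 5 * year - 120 * (age - 0.5) ** 2 + 8 * edu + rng.normal(0, 5, n)

knots = [0.0, 0.33, 0.66, 1.0]    # hypothetical knot positions
edu_dummies = (edu[:, None] == np.arange(1, 5)).astype(float)  # level 0 is the baseline

ones = np.ones((n, 1))
X_gam = np.hstack([ones, ns_columns(year, knots), ns_columns(age, knots), edu_dummies])
X_lin = np.hstack([ones, year[:, None], age[:, None], edu_dummies])


def rmse(X, y):
    """Fit by ordinary least squares and return the training RMSE."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sqrt(np.mean((y - X @ coef) ** 2)))


rmse_lin = rmse(X_lin, wage)
rmse_gam = rmse(X_gam, wage)
print(f"linear RMSE: {rmse_lin:.2f}, GAM RMSE: {rmse_gam:.2f}")
```

The spline columns include the plain linear term, so the additive fit can never have a larger training error than the linear fit. On this synthetic data, where wage rises and then falls with age, the additive fit has the smaller training error.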

The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed.
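A small sketch shows why a purely additive model cannot represent an interaction. For any function of the form f1(x1) + f2(x2), the quantity f(1,1) - f(1,0) - f(0,1) + f(0,0) is exactly zero, because every term cancels. A product term x1·x2 makes it nonzero, so no choice of f1 and f2 can reproduce it. The two example functions below are hypothetical.

```python
def interaction_contrast(f):
    """f(1,1) - f(1,0) - f(0,1) + f(0,0); always 0 for an additive f."""
    return f(1, 1) - f(1, 0) - f(0, 1) + f(0, 0)


def additive(x1, x2):
    # Purely additive: an intercept plus one function per feature.
    return 2.0 + x1 ** 2 + 3.0 * x2


def with_interaction(x1, x2):
    # Same as above plus a product term that depends on both features jointly.
    return 2.0 + x1 ** 2 + 3.0 * x2 + x1 * x2


additive_contrast = interaction_contrast(additive)
interaction_value = interaction_contrast(with_interaction)
print(additive_contrast, interaction_value)
```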

Reference: An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
