How To Select The Best Possible Statistical Model For A Given Dataset?

Girish Bhide
Published in Analytics Vidhya · Oct 9, 2020 · 10 min read


This article is going to be a little long, but I will try to make it simple and interesting. The topic is how to select the best statistical model for analyzing a dataset, but first of all, what is a statistical model?

“A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data. A statistical model represents, often in considerably idealized form, the data-generating process.”

These scientific definitions are always so complicated😪. In simple words, a statistical model is an equation that shows the relation between dependent and independent variables. Depending on the complexity of your problem, this equation can itself get fairly complex.

I guess almost all of us face this kind of situation when we go out for dinner and get confused about what to select from so many delicious dishes 😋

Topics for this article –

Multiple Linear Regression

ANOVA (One-Way and Two-Way)

Generalized Linear Model

Model Selection Parameters (AIC and BIC)

Data Exploration –

While exploring your dataset, the first thing to identify is the data type of the dependent and independent variables: is each one continuous or categorical? After that, check what type of distribution they follow, and in particular whether they are normally distributed. Some models assume that your data is normally distributed, and if it is not, i.e. it does not follow a “Gaussian distribution”, the outcome of the analysis may differ from what you expected.

Plotting plays a super important role in the analysis. Scatter plots give us a clear idea of what kind of relationship our variables have, and whether it is a good idea to go for linear regression or to choose a different model that suits the data better.
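
Here is a minimal sketch of this exploration step in R. The data frame df and its columns are made-up stand-ins; replace them with your own dataset.

# Stand-in data purely for illustration; use your own data frame here
set.seed(0)
df <- data.frame(x1 = rnorm(60), x2 = runif(60))
df$y <- 3 + 2 * df$x1 + rnorm(60)

str(df)                                  # data types: numeric vs factor columns

hist(df$y, main = "Distribution of y")   # rough shape of the target variable
qqnorm(df$y); qqline(df$y)               # visual check against normality
shapiro.test(df$y)                       # formal normality test (small samples)

plot(df$x1, df$y)                        # scatter plot: roughly linear relationship?
pairs(df)                                # all pairwise scatter plots at once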

Let’s take a look at Linear Statistical Models –

Multiple Linear Regression –

I discussed simple linear regression in my previous article. You can find it here or here if you missed it.

Unlike simple linear regression, multiple regression has more than one independent variable. All the assumptions made for simple linear regression also hold for multiple linear regression. Multiple linear regression is preferred when our data has continuous variables. (If some of your variables are factors, use “one-hot encoding” to convert them, as sketched below!)
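
A small sketch of one-hot encoding in base R; the data frame and the fuel_type factor below are hypothetical examples.

df_cars <- data.frame(
  km_per_litre = c(14.2, 12.8, 15.1, 13.5),
  fuel_type    = factor(c("petrol", "diesel", "petrol", "cng"))
)

# model.matrix() expands the factor into 0/1 indicator (dummy) columns
X <- model.matrix(~ fuel_type - 1, data = df_cars)
X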

As the number of independent variables increases, the equation gains more and more terms. So it is important to select the variables that predict your dependent variable most accurately. But how do we know which variables predict better than others? This is where the term “adjusted R squared” comes into play. In simple linear regression, adjusted R squared does not play an important role because that model has only one independent variable.

Suppose, for example, that our dataset has 8 independent variables and one target variable. The equation for multiple linear regression then looks like this…

Y = β0 + β1X1 + β2X2 + … + βnXn + ε
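
In R this model is fitted with lm(). A minimal sketch, reusing the hypothetical data frame df (target y, predictors x1 and x2) from the exploration step above:

fit <- lm(y ~ x1 + x2, data = df)   # or lm(y ~ ., data = df) to use every column
summary(fit)                        # estimated betas, p-values, R-squared, adjusted R-squared
coef(fit)                           # just the fitted coefficients β0, β1, β2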

In simple linear regression, the R squared value is the parameter we use to see how much of the variance our model explains, since we have only one independent variable. But that is not the case with multiple variables: the value of R squared increases with the addition of every new variable. Wow! That’s great, isn’t it? Unfortunately, the answer is no. Take a look at the equation for R squared, which is –

R² = 1 − (SSres / SStotal)

Again, complicated notations😵…. Let me show you the simple meaning of this

R² = 1 − (Var(model residuals) / Var(target))

Now, as we keep adding variables, the variance of the residuals goes down and the R squared value goes up. In each successive linear model (lm), I have added new variables; some of them are significant and some are not. If you take a look at our dataset, there are variables that have little or no relation to our target variable, yet when we add them to the model, R squared still increases. Eventually this leads to “over-fitting” of the model.

(psst… do not forget to scale your data 🤓)

To overcome this problem we use “adjusted R squared”. Adjusted R squared takes the total number of observations as well as the number of independent variables into consideration, so with this metric we can clearly check which parameters are worth keeping in the model. As the code below illustrates, adjusted R squared increases up to a certain point, but when we add variables that do not help in predicting the dependent variable, it starts to fall.
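
A minimal sketch of this behavior, using freshly simulated data in which only x1 and x2 actually drive y while x3 and x4 are pure noise (all names and numbers are invented for illustration):

# Simulated data: only x1 and x2 influence y; x3 and x4 are noise
set.seed(1)
n   <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 0.8 * sim$x2 + rnorm(n)

models <- list(
  lm(y ~ x1,                data = sim),
  lm(y ~ x1 + x2,           data = sim),
  lm(y ~ x1 + x2 + x3,      data = sim),
  lm(y ~ x1 + x2 + x3 + x4, data = sim)
)

# R-squared never decreases; adjusted R-squared is penalized once useless terms are added
data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared)
)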

There is another approach called “Bayesian inference”, but that is a whole other level of statistical modeling. The Bayesian approach is based on probability calculations. So far we have assumed that β and σ² are unknown fixed constants, but the Bayesian approach additionally allows probability distributions to represent our uncertainty about them. Thus β and σ² can be treated as if they were random variables, because we are uncertain about their values.

It’s like the Avengers and the X-Men: both belong to Marvel, but their paths are different. 😛

I will make another article on Bayesian statistics some other day…

Moving on to our next item from the menu card…

ANOVA (ANalysis Of VAriance) –

Whenever we want to compare the performance of two or more groups, we do it with the help of ANOVA. But exactly which type of ANOVA should we use? A one-way model, a two-way model, a one-way model with balance, a two-way model with balance, a cell-means model for unbalanced data, a two-way model with empty cells? (This list is like Bubba telling Forrest his shrimp recipes😁)

It is not possible to cover all the ANOVA methods in one article, but I will try to give a basic idea of them. So, let’s start with one-way ANOVA…

One-Way ANOVA –

Let us say a company has invented two new air filters for its cars. Now the company is interested to see how the newly fitted air filters affect fuel economy. To begin with, cars with the old air filter give a fuel economy of µ km per litre. If air filter 1 is fitted, the fuel economy is expected to increase by τ1 km per litre, and if air filter 2 is fitted it is expected to increase by τ2 km per litre. The model could then be expressed as

Y1 = µ + τ1 + ε1,  Y2 = µ + τ2 + ε2

Where Y1 is the fuel economy with air filter 1 and ε1 is a random error term; the same goes for Y2 and ε2 when air filter 2 is fitted. The company can now estimate the parameters µ, τ1 and τ2, and test the hypothesis that the two air filters differ.

In R you can perform one-way ANOVA with the function “aov()”, as in the sketch below. You can also create a box plot to see the variation in the observations.
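
A minimal sketch of one-way ANOVA in R; the fuel-economy numbers below are made up purely for illustration.

economy <- data.frame(
  km_per_litre = c(14.1, 13.8, 14.5, 15.2, 15.6, 15.0, 16.1, 16.4, 15.9),
  filter       = factor(rep(c("old", "filter1", "filter2"), each = 3))
)

fit_one_way <- aov(km_per_litre ~ filter, data = economy)
summary(fit_one_way)                      # F test for a difference between filters

boxplot(km_per_litre ~ filter, data = economy,
        ylab = "km per litre")            # visualize the variation within each group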

Two-Way ANOVA –

After studying the air-filter observations, the company also wants to see how the filters perform when the car is driven at different elevations. Now we have two independent variables: one is the air filter and the other is the elevation. Since we have two independent variables and want to see their combined effect, we can go for two-way ANOVA. Everything is the same as in one-way ANOVA; just a few parameters get added to the equation. In a simplified form, the equation can be written as –

Y = µ + α + β + γ + ε

Where µ is the grand mean, α is the effect of the filter, β is the effect of the elevation, γ is the interaction between the two factors, and ε is the error term. There are also indexing subscripts i, j and k, which denote the level of the first factor, the level of the second factor, and the replication within the group respectively. (I have not added them to the equation; I will cover them in detail in an article dedicated to ANOVA.)

Running the two-way model in R and printing its summary looks something like this –
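
The sketch below is self-contained and uses invented fuel-economy data; the factor names and numbers are purely illustrative.

# Balanced made-up design: 3 filters x 2 elevations x 3 replicates
economy2 <- expand.grid(
  filter    = factor(c("old", "filter1", "filter2")),
  elevation = factor(c("low", "high")),
  rep       = 1:3
)
set.seed(42)
economy2$km_per_litre <- 14 +
  ifelse(economy2$filter == "filter1", 1.0, 0) +
  ifelse(economy2$filter == "filter2", 1.8, 0) -
  ifelse(economy2$elevation == "high", 0.7, 0) +
  rnorm(nrow(economy2), sd = 0.3)

# filter * elevation expands to filter + elevation + filter:elevation
fit_two_way <- aov(km_per_litre ~ filter * elevation, data = economy2)
summary(fit_two_way)   # main effects of filter and elevation, plus their interaction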

Let’s look at the last item on the menu of linear statistical models…

Generalized Linear Model –

Generalized linear models include the classical linear regression and ANOVA models, as well as logistic regression and other models for non-normal response variables. GLMs have different assumptions from the ones we saw earlier. Some of those assumptions are –

Cases are independently distributed.

The dependent variable Y does not need to be normally distributed.

It does not assume that the dependent and independent variables have a linear relationship; instead, it assumes a linear relationship between the link-transformed mean of the response and the independent variables.

Independent variables can be nonlinear transformations of the original independent variable.

The homogeneity of variance does not need to be satisfied.

And errors need to be independent, but they do not need to be normally distributed.

These assumptions give a lot more flexibility while generating a model for the problem.

Generalized linear models (GLMs) are regression models, and so consist of a random component and a systematic component. The random and systematic components take specific forms for GLMs (Yes, GLMs are shapeshifters of statistics👻), which depend on the answers to the following questions -

What probability distribution is appropriate? The answer determines the random component of the model. The choice of probability distribution may be suggested by the type of dependent variable (binary, count, continuous, and so on).

How are the independent variables related to the mean of the dependent variable? The answer suggests the systematic component of the model.

The equation can be written as –

η = β0 + β1X1 + β2X2 + … + βnXn

And the link and variance functions are –

E(Y) = μ and g(μ) = η

Var(Y) = φ V(μ)

The link function g transforms the mean of the response (for a categorical response variable, the probabilities of its levels) to a continuous scale that is unbounded, and the variance function V(μ) describes how the variance of Y depends on the mean.

The exact form of these functions varies with the type of distribution.
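
A minimal sketch of a GLM in R: logistic regression for a binary response. The data frame and column names are hypothetical, and family = binomial uses the logit link g(μ) = log(μ / (1 − μ)) = η.

set.seed(7)
glm_df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
glm_df$y <- rbinom(50, size = 1, prob = plogis(0.5 + 1.2 * glm_df$x1 - 0.8 * glm_df$x2))

fit_glm <- glm(y ~ x1 + x2, data = glm_df, family = binomial)
summary(fit_glm)

# Other common choices: family = poisson for counts,
# family = Gamma for positive, skewed continuous responses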

By now you have an overview of the major linear statistical models. Next, we will see how to compare them and select the best possible model for the dataset. There are times when you have multiple models built for the same dataset, but which model is performing better than the others? The final topic of discussion will give you a clear idea about it.

Model Selection Parameters (AIC and BIC) –

AIC (Akaike’s Information Criterion) –

Why is it called an information criterion? The answer is simple: because it evaluates the “information loss” of a model. Great! But again, what is information loss? Information loss is an estimate of how far a model’s predictions fall from the truth. If a model predicted the values of the dependent variable perfectly, we could say there is no information loss. So let us look at its equation and get to know it better…

AIC = −2(log-likelihood) + 2k

To start with the equation, let’s see what “Likelihood” is.

Suppose, for example, we have recorded the weights of n people, and we want to see the likelihood of a candidate distribution when a weight of 85 kg is observed. In simple words, the likelihood is the value on the Y-axis (the height of the density curve) at the fixed data points, while the distribution itself can be shifted around. For ease of computation, and to avoid exponential calculations, we take its logarithm.
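
A tiny sketch of this idea, assuming the weights follow a normal distribution; the observation of 85 kg and the candidate means are made up for illustration.

dnorm(85, mean = 80, sd = 10)   # likelihood of 85 kg under a normal with mean 80, sd 10
dnorm(85, mean = 85, sd = 10)   # higher: this candidate distribution fits 85 kg better

# For a whole sample, the log-likelihood is the sum of the log densities
weights <- c(72, 85, 90, 78, 81)
sum(dnorm(weights, mean = 80, sd = 10, log = TRUE))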

Another part of the equation is “2k”, where k is the number of model parameters including the intercept. (A small-sample correction, AICc, is recommended when the sample size divided by k is less than roughly 40.)

To apply AIC in practice, we start with a set of candidate models and then find the models’ corresponding AIC values. There will almost always be information lost due to using a candidate model to represent the “true model” or the process that generated the data.

Let’s say we have fitted 4 models to the same dataset, with AIC values AIC1, AIC2, AIC3 and AIC4. The model with the lowest AIC value is considered the best model, and keeping that model as the standard, the performance of the other models can be evaluated.
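
A minimal sketch of this comparison in R, reusing the simulated data frame sim from the adjusted-R-squared sketch above (only x1 and x2 truly drive y):

m1 <- lm(y ~ x1, data = sim)
m2 <- lm(y ~ x1 + x2, data = sim)
m3 <- lm(y ~ x1 + x2 + x3 + x4, data = sim)

AIC(m1, m2, m3)   # the row with the lowest AIC is the preferred model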

BIC (Bayesian Information Criterion) –

This method is very similar to AIC and also uses the likelihood. The difference is that BIC additionally takes into account the number of data points in the dataset. Let’s take a look at the equation –

BIC = k*log(n) − 2(log-likelihood)

All terms are the same as in AIC; the new parameter n is the number of data points present in the dataset. Model selection also works the same way as with AIC: the model with the lowest BIC is considered the best. To compare two models’ performance, we keep BICmin as a benchmark and look at ΔBIC. When ΔBIC is between 2 and 6, one can say the evidence against the other model is positive; i.e. we have a good argument in favor of our “best model”. If it is between 6 and 10, the evidence for the best model and against the weaker model is strong. A ΔBIC of greater than 10 means the evidence favoring our best model over the alternative is very strong.
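
A short sketch of the BIC comparison, reusing the three candidate models m1, m2 and m3 from the AIC example above:

bics <- BIC(m1, m2, m3)
bics$delta <- bics$BIC - min(bics$BIC)   # ΔBIC relative to the best (lowest-BIC) model
bics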

So that’s it for this article. (Finally complete!🙌) I hope this article will give you a better understanding of the concepts discussed in it.

If you have made it till the end you are awesome! 🥳

Cheers!!!

References -

Probability and Statistics by Morris H. DeGroot and Mark J. Schervish.

Linear Models in Statistics by Alvin C. Rencher and G. Bruce Schaalje.

Generalized Linear Models by Peter K. Dunn and Gordon K. Smyth.
