REGRESSION ANALYSIS

Sameer Maurya
8 min read · Apr 5, 2022

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome’ or ‘response’ variable) and one or more independent variables.

Hello everyone, I will shed some light on Regression Analysis in this blog.

Regression is one of the essential concepts in machine learning and data science. Yet people often underestimate its importance and overlook how powerful it can be.

In this post I will walk through a complete regression workflow, so that it becomes much easier for you if you ever have to perform something similar at work.

Before diving deep into the problem, let us go over a few basics that we must keep in mind while performing regression.

ASSUMPTIONS

  1. The independent variables should not be strongly correlated with each other, i.e., there should be no multicollinearity in the dataset
  2. The error terms (residuals) should be normally distributed
  3. The error terms should be independent, i.e., there should not be any pattern visible in them
  4. The error terms should have constant variance, i.e., homoscedasticity should be present

These are the critical assumptions of regression that we have to keep in mind; once the analysis is done, they are also the assumptions we will need to validate.

DATA DESCRIPTION AND OBJECTIVE

  1. We have a house price prediction dataset (a link to the data will be added at the end of this blog). Using this data, we will build linear regression models with different regularization methods, namely lasso and ridge regression
  2. We will find the most significant features for predicting the price of a house
  3. We will determine the optimal value of lambda (alpha) for ridge and lasso regression
  4. We will fine-tune the models

DATA LOADING AND PREPROCESSING

We will load the data into a Jupyter notebook using pandas. I have written a small function that automatically finds the imbalanced and otherwise useless features in the data frame.
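The exact function is not reproduced here, but a minimal sketch of the idea is shown below; the name find_useless_features and the 50% / 85% thresholds are illustrative assumptions, not the original code.

import pandas as pd

def find_useless_features(df, nan_threshold=0.5, dominance_threshold=0.85):
    """Flag columns that are mostly NaN or dominated by a single value."""
    # Columns where more than nan_threshold of the rows are missing
    mostly_nan = [col for col in df.columns
                  if df[col].isna().mean() > nan_threshold]
    # Columns where one value covers more than dominance_threshold of the rows
    imbalanced = [col for col in df.columns
                  if col not in mostly_nan
                  and df[col].value_counts(normalize=True, dropna=False).iloc[0] > dominance_threshold]
    return mostly_nan, imbalanced

# hypothetical usage:
# mostly_nan_cols, imbalanced_cols = find_useless_features(data)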

After passing our data through this function, we found the following columns to be useless, and we will drop them:

['Alley', 'PoolQC', 'MiscFeature']
['Street', 'LandContour', 'Utilities', 'LandSlope', 'Condition1', 'Condition2', 'RoofMatl', 'ExterCond', 'BsmtCond', 'BsmtFinType2', 'BsmtFinSF2', 'Heating', 'CentralAir', 'Electrical', 'LowQualFinSF', 'BsmtHalfBath', 'KitchenAbvGr', 'Functional', 'GarageQual', 'GarageCond', 'PavedDrive', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SaleType']

Now let us look at the fraction of NaN values per column, first across all remaining columns and then restricted to the categorical columns:

Fence           0.808
FireplaceQu     0.473
LotFrontage     0.177
GarageCond      0.055
GarageType      0.055
GarageYrBlt     0.055
GarageFinish    0.055
GarageQual      0.055
BsmtExposure    0.026
BsmtFinType2    0.026
BsmtFinType1    0.025
BsmtCond        0.025
BsmtQual        0.025
MasVnrArea      0.005
MasVnrType      0.005
Electrical      0.001
YearBuilt       0.000
Exterior2nd     0.000
Exterior1st     0.000
ExterQual       0.000
Now let us look at the categorical columns:

MasVnrType      0.005
BsmtQual        0.025
BsmtCond        0.025
BsmtExposure    0.026
BsmtFinType1    0.025
BsmtFinType2    0.026
Electrical      0.001
FireplaceQu     0.473
GarageType      0.055
GarageFinish    0.055
GarageQual      0.055
GarageCond      0.055
Fence           0.808

We will drop the columns with the highest fraction of NaN values, as they cannot be reliably imputed:

'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'Fence', 'Electrical'

Next we will find the columns dominated by a single repeating value and drop them, since near-constant columns add nothing to the predictions.

For some of the remaining columns, we will fill the NaN values with the mean or median; a sketch of these steps follows.
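A minimal sketch of this cleanup, reusing the find_useless_features helper from above; which columns receive the mean, the median, or a sentinel value is an assumption here, not taken from the original notebook.

# Drop the columns identified above as having too many missing values
high_nan_cols = ['MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                 'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish',
                 'GarageQual', 'GarageCond', 'Fence', 'Electrical']
data = data.drop(columns=high_nan_cols)

# Drop columns dominated by a single repeating value
_, dominated = find_useless_features(data)
data = data.drop(columns=dominated)

# Impute the remaining gaps (mean/median/sentinel choices are illustrative)
data['LotFrontage'] = data['LotFrontage'].fillna(data['LotFrontage'].median())
data['MasVnrArea'] = data['MasVnrArea'].fillna(data['MasVnrArea'].mean())
data['GarageYrBlt'] = data['GarageYrBlt'].fillna(0)  # assumed: 0 marks "no garage"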

FEATURE ENGINEERING

Now we will derive some new features and use them in our analysis.

def process_garage(row):
    # 0 for missing garages (filled with 0) or garages built in 1900-1999, 1 otherwise
    if row == 0 or (1900 <= row < 2000):
        return 0
    return 1

def remodelled_check(row):
    # 0: never remodelled, 1: remodelled after construction, 2: remodel year precedes build year
    if row['YearBuilt'] == row['YearRemodAdd']:
        return 0
    elif row['YearBuilt'] < row['YearRemodAdd']:
        return 1
    else:
        return 2

def getAge(row):
    # Age at the time of sale, counted from the last remodel if the house was remodelled
    if row['YearBuilt'] == row['YearRemodAdd']:
        return row['YrSold'] - row['YearBuilt']
    else:
        return row['YrSold'] - row['YearRemodAdd']

Once we have engineered these features, we will drop the original year columns from the dataset to reduce multicollinearity.

We will also check once again for any column where roughly 85% of the rows share the same value, and drop it, since it will not contribute any significant signal to the final output; a sketch of both steps follows.
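A sketch of these two steps; the derived column names, the exact set of year columns dropped, and the reuse of the find_useless_features helper are illustrative assumptions.

# Apply the helpers to create the new features (column names are assumptions)
data['ModernGarage'] = data['GarageYrBlt'].apply(process_garage)
data['Remodelled'] = data.apply(remodelled_check, axis=1)
data['Age'] = data.apply(getAge, axis=1)

# Drop the raw year columns now that the derived features capture their information
data = data.drop(columns=['YearBuilt', 'YearRemodAdd', 'YrSold', 'GarageYrBlt'])

# Re-run the dominance check and drop anything where ~85% of rows share one value
_, imbalanced = find_useless_features(data, dominance_threshold=0.85)
data = data.drop(columns=imbalanced)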

OUTLIER TREATMENT

Once the NaN and near-constant values have been treated, we will check the data for possible outliers.

We can see that outliers are present in the data, and we need to treat them because linear regression is sensitive to outliers.

# Removing outliers

# Remove values beyond the 98th percentile for LotArea
nn_quartile_LotArea = data['LotArea'].quantile(0.98)
data = data[data["LotArea"] < nn_quartile_LotArea]

# Remove values beyond the 98th percentile for MasVnrArea
nn_quartile_MasVnrArea = data['MasVnrArea'].quantile(0.98)
data = data[data["MasVnrArea"] < nn_quartile_MasVnrArea]

# Remove values beyond the 99th percentile for TotalBsmtSF
nn_quartile_TotalBsmtSF = data['TotalBsmtSF'].quantile(0.99)
data = data[data["TotalBsmtSF"] < nn_quartile_TotalBsmtSF]

# Remove values beyond the 99th percentile for WoodDeckSF
nn_quartile_WoodDeckSF = data['WoodDeckSF'].quantile(0.99)
data = data[data["WoodDeckSF"] < nn_quartile_WoodDeckSF]

# Remove values beyond the 99th percentile for OpenPorchSF
nn_quartile_OpenPorchSF = data['OpenPorchSF'].quantile(0.99)
data = data[data["OpenPorchSF"] < nn_quartile_OpenPorchSF]

We can also see that not all attributes are normally distributed in the dataset. We shall treat them later in the analysis for better prediction.

EDA

Now let us plot the relationship between the sale price and different variables.

But we could not see any significant trend in this, so let's plot the correlation matrix and see if there are any correlations.

From this, we can see that there are strong correlations between several variables, and we will need to treat them before continuing the analysis.

We will drop the most correlated columns, i.e., 'TotRmsAbvGrd' and 'GarageArea', for example as sketched below.
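A sketch of this step, assuming matplotlib and seaborn are available for the correlation heatmap; the figure size and colour map are arbitrary choices.

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric features
corr = data.select_dtypes(include='number').corr()
plt.figure(figsize=(16, 12))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()

# TotRmsAbvGrd and GarageArea duplicate information carried by other size-related
# columns, so we drop them to reduce multicollinearity
data = data.drop(columns=['TotRmsAbvGrd', 'GarageArea'])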

DATA TRANSFORMATION

Now we will be transforming the data so that it can be used by the linear regression model directly

We will convert the ordinal categorical data into numerical form and apply one-hot encoding to the remaining categorical columns; a sketch follows.
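A minimal sketch of this step. The ordinal mapping, the columns it is applied to, and the use of drop_first are illustrative assumptions rather than the exact encoding from the original notebook; it also splits off the target used in the next section.

# Map ordered quality-style categories to integers (ordinal encoding)
quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
data['ExterQual'] = data['ExterQual'].map(quality_map)
data['KitchenQual'] = data['KitchenQual'].map(quality_map)

# One-hot encode the remaining unordered categorical columns
data = pd.get_dummies(data, drop_first=True)

# Separate the target and the predictors
y = data['SalePrice']
X = data.drop(columns=['SalePrice'])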

With that, all the required preprocessing is done, and we can fit the model and carry out the further analysis.

CREATING TRAINING DATA AND TRAINING BASELINE MODEL

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)
LR = LinearRegression()
LR.fit(X_train, y_train)

We will use Recursive Feature Elimination (RFE) to drop the features that are not important and then fit the model again.

from sklearn.feature_selection import RFE

rfe = RFE(LR)
rfe.fit(X_train, y_train)
cols_selected = X_train.columns[rfe.support_]
temp_df = pd.DataFrame(list(zip(X_train.columns, rfe.support_, rfe.ranking_)), columns=['Variable', 'rfe_support', 'rfe_ranking'])
temp_df = temp_df.loc[temp_df['rfe_support'] == True]
temp_df.reset_index(drop=True, inplace=True)

Now we have the list of valuable features that will be used for the further analysis.

X_train=X_train[cols_selected]
X_test = X_test[cols_selected]

Now let us create the models.

RIDGE

We need to find the best alpha value for ridge, so we will use grid search to fine-tune the model and find the parameters that fit best.

params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 
9.0, 10.0, 20, 50, 100, 500, 1000 ]}

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

ridge = Ridge()

# Tuning to find the best alpha
folds = 5
ridge_model_cv = GridSearchCV(estimator=ridge,
                              param_grid=params,
                              scoring='neg_mean_absolute_error',
                              cv=folds,
                              return_train_score=True,
                              verbose=1)
ridge_model_cv.fit(X_train, y_train)

Now let us look at the model performance for different alpha values.

ridge_cv_results = pd.DataFrame(ridge_model_cv.cv_results_)
ridge_cv_results = ridge_cv_results[ridge_cv_results['param_alpha'] <= 500]
ridge_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by=['rank_test_score'])

ridge_model_cv.best_estimator_
# Ridge(alpha=0.0001, copy_X=True, fit_intercept=True, max_iter=None,
#       normalize=False, random_state=None, solver='auto', tol=0.001)

We can see that the best alpha value for ridge comes out to be 0.0001.

Now let us plot the importance of the variables to get a better understanding; a sketch is shown below.
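One way to plot the coefficients of the tuned ridge model, assuming matplotlib is available; the top-20 cutoff is an arbitrary choice for readability, not from the original post.

import matplotlib.pyplot as plt

# Coefficients of the tuned ridge model, sorted by absolute magnitude
best_ridge = ridge_model_cv.best_estimator_
ridge_coefs = pd.Series(best_ridge.coef_, index=X_train.columns)
top_coefs = ridge_coefs.reindex(ridge_coefs.abs().sort_values(ascending=False).index).head(20)
top_coefs.plot(kind='barh', figsize=(8, 8), title='Ridge: most influential features')
plt.show()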

We can see that MSZoning_RL is one of the most influential features driving the model's predictions.

Now let us do some loss (residual) analysis to verify that the model satisfies the regression assumptions.

LOSS ANALYSIS

Let us plot the residuals; a sketch of the plots is below.
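A sketch of the residual plots, assuming matplotlib and seaborn and using the tuned ridge model on the test split; the original post shows the plots but not this exact code.

import matplotlib.pyplot as plt
import seaborn as sns

# Residuals of the tuned ridge model on the test set
y_pred = ridge_model_cv.best_estimator_.predict(X_test)
residuals = y_test - y_pred

# Histogram of residuals: should be roughly normal and centred at zero
sns.histplot(residuals, kde=True)
plt.title('Distribution of residuals')
plt.show()

# Residuals vs predictions: should show no visible pattern (homoscedasticity)
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('Predicted price')
plt.ylabel('Residual')
plt.show()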

We can see that the residuals are approximately normally distributed, so the normality assumption is satisfied.

In the scatter plot of residuals against predictions, no pattern is visible; the residuals look randomly distributed, so there is no obvious bias or heteroscedasticity.

LASSO

Now let us train the lasso regression model and perform the same analysis.

from sklearn.linear_model import Lasso

lasso = Lasso()

# list of alphas
params = {'alpha': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005,
                    0.001, 0.002, 0.003, 0.004, 0.005,
                    0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5,
                    0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0]}

# cross validation

folds = 5
lasso_model_cv = GridSearchCV(estimator=lasso,
                              param_grid=params,
                              scoring='neg_mean_absolute_error',
                              cv=folds,
                              return_train_score=True,
                              verbose=1)

lasso_model_cv.fit(X_train, y_train)

Now that the model is trained, let us look at the best performing parameters.

lasso_cv_results = pd.DataFrame(lasso_model_cv.cv_results_)
lasso_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by=['rank_test_score'])

lasso_model_cv.best_estimator_
# Lasso(alpha=0.0001, copy_X=True, fit_intercept=True, max_iter=1000,
#       normalize=False, positive=False, precompute=False, random_state=None,
#       selection='cyclic', tol=0.0001, warm_start=False)

Here as well, we can see that the best alpha comes out to be 0.0001.

Lasso regression also has one more useful property: unlike ridge regression, lasso can perform feature selection automatically by shrinking some coefficients exactly to zero, so we will make use of this after training the model.

# creating a df to capture the coefficients of the tuned lasso model
best_lasso = lasso_model_cv.best_estimator_
lasso_df = pd.DataFrame({'Features': X_train.columns,
                         'Coefficient': best_lasso.coef_.round(4),
                         'absolute_coefficient': abs(best_lasso.coef_.round(4))})
# keep only the features lasso did not shrink to zero
lasso_df = lasso_df[lasso_df['Coefficient'] != 0.00]
lasso_df.reset_index(drop=True, inplace=True)
lasso_df

We will plot the feature importance as well to get a better visual understanding.

Here as well, we can see that MSZoning_RL has the highest feature importance.

LOSS ANALYSIS

Now let us plot the residuals and see whether they follow our assumptions.

We can see that the residuals are again approximately normally distributed.

Here also, no pattern is visible in the residuals, so we can say there is no obvious bias present.

CONCLUSION

We can see that both models perform almost the same, but lasso performs slightly better than ridge; in addition, lasso's built-in feature selection helps us build a simpler model without losing predictive power. One way to verify this on the held-out test set is sketched below.
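A quick comparison of the two tuned models on the test split; the choice of metrics here (R² and MAE) is an assumption, and the original post does not report these exact numbers.

from sklearn.metrics import mean_absolute_error, r2_score

for name, model in [('Ridge', ridge_model_cv.best_estimator_),
                    ('Lasso', lasso_model_cv.best_estimator_)]:
    preds = model.predict(X_test)
    print(f"{name}: R2 = {r2_score(y_test, preds):.3f}, "
          f"MAE = {mean_absolute_error(y_test, preds):.0f}")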

Sameer Maurya

Senior Data Scientist at ValueFirst, a Twilio company, with more than 3 years of experience in tech. Find more at www.mauryasameer.com