Implementing Linear Regression with Categorical Variables Using Sklearn

Prabhat Pathak · Published in Analytics Vidhya · 6 min read · Jul 16, 2020

Easy steps for implementing linear regression with scikit-learn


Linear regression is one of the simplest and most important machine learning algorithms. For a basic understanding, please refer to my previous blog: https://medium.com/analytics-vidhya/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-6b0fe70b32d7

In this article, we will talk about

  1. Imports
  2. Dataset
  3. Training and test dataset
  4. Predictions
  5. Conclusion

Imports

We will be using pandas, numpy, sklearn, seaborn, and matplotlib.

I’ve also imported the warnings module so the notebook output stays clean:

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Now let's import the dataset:

df = pd.read_excel('Multiple_variable.xlsx', sheet_name='Sheet2')
df.head()

We have Age, YearsExperience, Salary, Gender, Classification, and Job columns.
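If you don't have the Excel file handy, here is a minimal sketch of a DataFrame with the same columns so you can follow along (the values and category labels below are invented for illustration):

import pandas as pd

# Hypothetical rows mirroring the columns used in this article
df = pd.DataFrame({
    'Age': [25, 32, 41, 28, 36],
    'YearsExperience': [1.5, 6.0, 12.0, 3.0, 8.5],
    'Salary': [40000, 65000, 98000, 48000, 75000],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Classification': ['Junior', 'Mid', 'Senior', 'Junior', 'Mid'],
    'Job': ['Analyst', 'Engineer', 'Manager', 'Analyst', 'Engineer'],
})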

Data Pre-Processing

df.shape
df.describe()

From the above output, I will try to see the nature of the dataset. By nature, I mean: does each column follow a normal distribution, or is the data noisy? For this we will look at the trend column by column.

So for that, we will observe the values of Mean, Std, Min, Median (the median represents the 50th percentile, or the middle value of the data), and Max.
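One quick way to check this in code is to compare each numeric column's mean against its median and to look at the skewness; for a roughly normal column the two are close and the skew is near zero (a small sketch, reusing the df loaded above):

numeric = df.select_dtypes(include='number')
print(numeric.mean() - numeric.median())  # large gaps hint at skewed columns
print(numeric.skew())                     # values near 0 suggest roughly symmetric data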

df.dtypes

EDA

sns.pairplot(df,hue='Gender')
Pair plot

From the above graph, notice that Salary is directly proportional to YearsExperience, and to Age as well.

For more detail, let's look at the graph below:

sns.pairplot(df,x_vars=['Age','YearsExperience'],y_vars=['Salary'],hue='Gender')
df.corr()

Correlation and regression analysis are related in the sense that both deal with relationships among variables. The correlation coefficient is a measure of linear association between two variables. The values of the correlation coefficient are always between -1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense, a correlation coefficient of -1 indicates that two variables are perfectly related in a negative linear sense, and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables.
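As a sanity check, you can compute that coefficient for a single pair of columns yourself (a sketch using NumPy, which we imported earlier):

# Pearson correlation between YearsExperience and Salary
r = np.corrcoef(df['YearsExperience'], df['Salary'])[0, 1]
print(r)  # always between -1 and +1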

Here we can see that both Age and YearsExperience are correlated with Salary.

sns.heatmap(df.corr(),annot=True,linewidths=1)
heatmap

If you want to see the relationship between a categorical variable and a numeric one, you can use a box plot.

sns.boxplot(y='Age',x='Gender',data=df)
sns.boxplot(y='Salary',x='Classification',data=df)

After doing all this analysis, we understand that we have some categorical variables as well, so we need to add dummy variables. You might be wondering: what is a dummy variable?

What is a Dummy Variable?

A dummy variable (that is, an indicator variable) is a numeric variable that represents categorical data, such as gender, race, etc.

What are the benefits of a Dummy Variable?

Regression results are easiest to interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents the absence.
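As a small illustration of what this encoding looks like (a sketch with made-up rows; with drop_first=True the first category is dropped, since the remaining column already encodes both levels):

demo = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})
print(pd.get_dummies(demo, drop_first=True))
# leaves a single Gender_Male column: 1/True for Male, 0/False for Female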

So our independent variables would be:

X = df[['Age', 'YearsExperience', 'Gender', 'Classification', 'Job']]

In the above data frame, we have Gender, Classification, and Job as categorical variables, so we need to replace them with dummy variables.

X = pd.get_dummies(data=X, drop_first=True)
X.head()

The above code replaced each categorical column with 0/1 dummy columns, which are easy for the regression model to interpret. drop_first=True drops one level per category, which avoids the dummy-variable trap (perfectly collinear columns).

Y = df['Salary']

Creating train and test datasets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

After splitting the dataset into train and test sets, we will import the linear regression model.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
# print the intercept
print(model.intercept_)

The intercept (often labeled the constant) is the expected mean value of Y when all X=0. In a purely mathematical sense, this definition is correct. Unfortunately, it’s frequently impossible to set all variables to zero because this combination can be an impossible or irrational arrangement.

coeff_parameter = pd.DataFrame(model.coef_,X.columns,columns=['Coefficient'])
coeff_parameter

The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.

  • A positive sign indicates that as the predictor variable increases, the target variable also increases.
  • A negative sign indicates that as the predictor variable increases, the target variable decreases.

predictions = model.predict(X_test)
predictions

Yaay, here are your predicted values.
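To eyeball how close those predictions are, you can line them up against the actual salaries (a quick sketch):

results = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
print(results.head())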

sns.regplot(x=y_test, y=predictions)

The above graph shows our model is predicting well. Let's look at the R-squared value:

import statsmodels.api as sm

# statsmodels does not add an intercept automatically, so add a constant column
# (if your pandas version creates boolean dummies, you may need X_train.astype(float) first)
X_train_Sm = sm.add_constant(X_train)
ls = sm.OLS(y_train, X_train_Sm).fit()
print(ls.summary())

What is Adjusted R-squared?

We use adjusted R-squared to compare the goodness of fit of regression models that contain different numbers of independent variables.

Let’s say you are comparing a model with five independent variables to a model with one variable and the five variable model has a higher R-squared. Is the model with five variables actually a better model, or does it just have more variables? To determine this, just compare the adjusted R-squared values!
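If you want to compute it yourself, adjusted R-squared is just R-squared penalized by the number of predictors. A minimal sketch, assuming the y_test and predictions variables from above (n is the number of observations, p the number of independent variables):

from sklearn.metrics import r2_score

r2 = r2_score(y_test, predictions)
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adj_r2)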

I have tried to touch on most of the concepts here; you can also try calculating some metrics like MSE and RMSE yourself.
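For example, a minimal sketch using scikit-learn's metrics module (again assuming y_test and predictions from above):

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)  # RMSE is just the square root of MSE
print('MSE:', mse)
print('RMSE:', rmse)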

Conclusion

As I said before, linear regression is a really amazing algorithm, and there are lots of things you can do with it. Try out new things!

I hope this article will help you and save a good amount of time. Let me know if you have any suggestions.

HAPPY CODING.

Prabhat Pathak (LinkedIn profile) is an Associate Analyst.

