Linear Regression Modelling — A systematic approach

Jayantakut · Published in Analytics Vidhya · 7 min read · May 16, 2022

1. Linear Regression — an introduction:

Linear Regression is one of the simplest, oldest, and most proven prediction techniques in the Data Science domain, and yet it can be quite sophisticated. It is used to predict a target variable that is continuous in nature, such as sales, inventory, profit, rainfall, etc.

Linear regression is classified as a supervised, parametric learning algorithm.

Supervised learning is an approach where an algorithm is trained on input data (referred to as ‘training data’) that has been labeled for a particular output (often referred to as dependent or target variable). Parametric algorithms assume a particular form for the function and estimate parameters (coefficients).

For example, Linear Regression attempts to fit an equation of the following form:

Y = b0 + b1*x1 + b2*x2 + b3*x3 + … + bn*xn

Further, the following conditions must be met to successfully fit a Linear regression model:

i) The relationship between dependent and independent variables should be linear and additive.

ii) There should be no correlation among the error (residual) terms.

iii) Independent variables should not be correlated.

iv) The error terms should have constant variance (homoscedasticity).

v) The error term must be normally distributed.
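A minimal sketch of how a couple of these conditions can be checked in Python, assuming the residuals of an already-fitted model are available (the residuals array below is only a placeholder):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Placeholder: in practice, residuals = y_true - y_pred from a fitted model
residuals = np.random.normal(size=200)

# Condition (v), normality of errors: Shapiro-Wilk test (p > 0.05 suggests normality)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

# Condition (ii), no correlation among errors: Durbin-Watson (values near 2 are good)
print(f"Durbin-Watson statistic: {durbin_watson(residuals):.2f}")
```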

Two important use cases or objective(s) for Linear regression are as follows:

i. Be able to predict the future, i.e., the user is predominantly interested in the prediction of the target variable.

ii. Be able to interpret the impact of independent variables on the dependent variable, i.e., the user is predominantly interested in using the coefficients of independent variables to improve the target outcome.

Many times, the second objective is more important than the first, as it facilitates planning changes in business strategies and measuring their impact in a very scientific manner. However, the real world is often too complex to fit in a model, so linear regression may not work due to one or more of the following factors:

i. There may be a few unknown factors impacting the target variable.

ii. A few factors may be known, but collecting data for them may not be feasible.

iii. The true relationship between the target and the independent variables may be more complex than the linear, additive form that Linear Regression allows.

Three more aspects need to be mentioned:

i) The term ‘Linear’ is somewhat misleading! It is possible to include terms like x1*x1, x1*x2, etc., as long as the basic form of the equation (viz. y = b0 + b1*x1 + …) does not change. The model must remain linear in its coefficients!

ii) It may be necessary to transform a few select variables, e.g., as sqrt(x) or log(x), so that their distributions become nearly normal. Linear Regression may then yield better results.

iii) Advanced forms of Linear Regression, viz. Lasso and Ridge, help to retain the most important variables. They are based on a very simple but powerful idea: a penalty term is included in the error function so that, during the optimization process, the coefficients of a few (less important) variables are shrunk toward zero (Ridge) or made exactly zero (Lasso).
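For reference, in the same plain notation as the equation above (with λ as the user-chosen penalty strength), the error functions being minimized look roughly like this:

Ridge: Σ(y − ŷ)² + λ * (b1² + b2² + … + bn²)

Lasso: Σ(y − ŷ)² + λ * (|b1| + |b2| + … + |bn|)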

2. Process of fitting a Linear Regression Model:

The step-by-step process involved in implementing a model has been listed below. The first few steps are common for any modelling problem:

2.1 Business Problem: Clearly understand the business problem, the purpose (use case) of building a prediction model, how it will be used in practice, and what business decisions will be taken based on it. Otherwise, one may spend days or weeks accurately solving the wrong problem, which is of no use to anyone!

If it is a pilot project for demonstrating the power of AI/Data Science, then choose (or prioritize) one based on your judgment and on where sufficient good-quality data is available.

Please note that the involvement of business leaders from day one is the most important prerequisite for the success of data science projects. Continuous interaction with the business is also a must.

2.2 Data Sources: Understand all data sources, the information (variables) available, and how they are captured in the source systems. Normally, extraction (part of Extract-Transform-Load, i.e., ETL) is also part of a data science project and can be a project in itself due to the nature of the systems/databases. For example, extracting from an ERP such as SAP may need procurement of specific tools and licenses in some cases. In most scenarios, a copy of the transaction and master data needs to be created, as the Data Science team will generally not be allowed to access operational databases.

2.3 Understanding Data: Understand every datum (variable) in the data, its format (numeric, character, string, etc.), its business context, and how it is being captured, including which variable is the target.

2.4 Choice of an algorithm: Decide whether linear regression should be attempted as the first choice of modelling! The dependent variable should be continuous in nature.

2.5 Hypothesis List: Please list as many hypotheses as possible. They will be very handy when choosing or transforming variables or when imagining/deriving new ones.

2.6 Exploratory Data Analysis (EDA):

2.6.1 Basic Conversion: Here, basic conversions are done, such as parsing date-time columns into proper date-time variables and extracting components such as the hour, day of the week, month, etc., depending on the context of the problem. Similarly, object variables, if any, are converted to numeric or categorical variables. (NB: Conversion to categorical variables can be done later, as it limits the mathematical operations that can be performed during EDA.)
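A minimal pandas sketch of such conversions, assuming a DataFrame df with a raw ‘timestamp’ column (the column names are illustrative):

```python
import pandas as pd

# Illustrative data; in practice df comes from the extracted source tables
df = pd.DataFrame({"timestamp": ["2022-05-16 09:30:00", "2022-05-17 18:45:00"]})

# Parse the raw string column into a proper date-time type
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Extract the components relevant to the problem context
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["month"] = df["timestamp"].dt.month
```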

2.6.2 Univariate Analysis: It involves studying the pattern of each variable individually. The purpose is to understand its distribution (whether skewed or not) and the extent of outliers.

Variables are classified as numeric (integer or float) or categorical (ordinal or nominal).

· For continuous numeric variables, as part of graphical analysis, a ‘histplot’ is drawn to visualize the distribution (whether skewed or not), and a ‘boxplot’ to identify outliers, if any.

· In the case of discrete numeric variables or categorical variables (with a limited number of values), a ‘countplot’ is drawn to analyse the variation across categories, if any.
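A minimal seaborn sketch of these univariate plots, reusing df from the earlier step; the continuous column ‘sales’ is illustrative:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Continuous variable: distribution and skewness (column name is illustrative)
sns.histplot(data=df, x="sales")
plt.show()

# Continuous variable: outliers, if any
sns.boxplot(data=df, x="sales")
plt.show()

# Discrete / categorical variable: frequency across categories
sns.countplot(data=df, x="day_of_week")
plt.show()
```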

2.6.3 Bivariate Analysis: It involves graphical analysis of the relationship between:

· Independent variables (continuous or categorical) against the dependent variable (continuous in the case of Linear Regression), using a barplot, catplot, or lineplot as appropriate. For example, a lineplot is used to show the pattern as an ordered independent variable (such as hour or month) changes, while a barplot or catplot is used to compare the value of the dependent variable across the unique values of the independent variable.

· Similarly, analyze the relationships among independent variables using an appropriate plot (barplot, catplot, lineplot, or scatterplot).
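A minimal sketch of such bivariate plots, again with illustrative column names (‘hour’ and ‘price’ against a continuous target ‘sales’):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pattern of the target as an ordered independent variable changes
sns.lineplot(data=df, x="hour", y="sales")
plt.show()

# Target compared across the unique values of a categorical variable
sns.barplot(data=df, x="day_of_week", y="sales")
plt.show()

# Relationship between two continuous independent variables
sns.scatterplot(data=df, x="price", y="sales")
plt.show()
```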

2.6.4 Statistical tests: Further, the relationships among variables are analyzed statistically as follows:

· Using the Pearson correlation coefficient among continuous numeric variables. If there is a strong correlation between two independent variables, one of them needs to be dropped.

· To assess the relationship between a continuous and a categorical variable, a t-test or z-test is used.

· To assess the relationship between two categorical variables, the chi-square test is used.
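A minimal sketch of these tests with pandas and scipy; the column names (‘price’, ‘sales’, ‘gender’, ‘city’) are illustrative:

```python
import pandas as pd
from scipy import stats

# Pearson correlation among continuous numeric variables
print(df[["price", "sales"]].corr(method="pearson"))

# Continuous vs. binary categorical: two-sample t-test
group_a = df.loc[df["gender"] == "M", "sales"]
group_b = df.loc[df["gender"] == "F", "sales"]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-test p-value: {p_value:.3f}")

# Categorical vs. categorical: chi-square test on a contingency table
contingency = pd.crosstab(df["gender"], df["city"])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"chi-square p-value: {p:.3f}")
```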

2.7 Feature engineering: In this step, features are scaled, transformed, and made ready for actual model fitting. This step is very important for the correct working of the model. (A code sketch follows the list below.)

· Variables (for example, an ID) that do not add value to the modelling phase can be deleted. However, such a field may be deleted later, as it can be required for re-joining subsets of data created during feature engineering.

· Outliers, especially extreme ones, need to be processed appropriately (deletion, if at all, should be the last choice).

· Scaling of variables using MinMaxScaler, StandardScaler, or RobustScaler must be done for algorithms to work well.

· Variables such as city, branch, PIN code, gender, etc. (basically codes, i.e., master data), variables that indicate slab-wise data (height: tall, medium, short, etc.), or even variables such as the ‘hour’ of the day can be encoded in more than one way.

o e.g. ‘hour’ of the day can be defined as

§ Numerical continuous variable

§ Categorical variable (24 values!)

§ Categorical variable by clubbing certain hours of the day depending on the context (such as early morning, morning, afternoon, evening, night, etc.)

· Hence, decisions on the conversion of variables need to be taken based on which encoding best reveals the pattern relevant to the problem.

· Continuous variables that are skewed or not normally distributed need to be transformed using log, sqrt, or another appropriate function. Please do check the respective plot after the conversion is done.

· Deriving new variables is a very crucial part of feature engineering. For example, in the case of transportation problems, when the geo-coordinates of the start and end points are provided, the following options can be considered:

o Using Google APIs or an equivalent (if available), extract the zip/PIN code of the location and get the actual road distance for the shortest route.

o If not available, compute the distance (haversine or Manhattan) between the two points, and so on.

· Finally, generate dummy variables for all categorical variables using the ‘get_dummies’ function. Sometimes this may lead to a ‘dummy variable trap’, where the dummies are collinear, giving absurd results. In that case, delete one variable (the first or the last), i.e., if a categorical variable takes ’n’ values, drop one dummy, leaving ‘n-1’.

· Now we are ready to fit a model!
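A minimal sketch of these feature-engineering steps with pandas and scikit-learn; the column names are illustrative, and the specific choices (log transform, StandardScaler) depend on the data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative frame; in practice this is the dataset cleaned during EDA
df = pd.DataFrame({
    "sales": [120.0, 95.0, 400.0, 60.0],
    "price": [10.0, 12.5, 9.0, 15.0],
    "city": ["Pune", "Mumbai", "Pune", "Delhi"],
})

# Transform a skewed continuous variable (check its plot afterwards)
df["log_sales"] = np.log1p(df["sales"])

# Scale continuous variables so that algorithms work well
scaler = StandardScaler()
df[["price", "log_sales"]] = scaler.fit_transform(df[["price", "log_sales"]])

# Dummy variables; drop_first=True drops one dummy to avoid the trap
df = pd.get_dummies(df, columns=["city"], drop_first=True)
print(df.head())
```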

2.8 Model Fitting: The final step after the above work is very simple (a code sketch follows the checklist below).

· Choose a metric such as ‘Mean Absolute Error’, ‘Mean Squared Error’, etc.

· Split data into Train and Test sets in the desired percentages (e.g., 75/25).

· Develop a simple baseline benchmark model and measure error.

· Fit the regression model using ordinary least squares, or Lasso/Ridge regression.

· Predict on the Test data, measure the Test error, and compare it with the benchmark model’s error. It should be lower for the improved model!

· Check whether conditions for Linear Regression (given above) are met.

· Have a look at the coefficients of the variables and check whether they make sense!

· Model is ready for deployment.
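A minimal end-to-end sketch of this checklist with scikit-learn, assuming the engineered DataFrame df from the previous sketch; the mean-predicting DummyRegressor serves as the baseline benchmark:

```python
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Illustrative: engineered features X and continuous target y
X = df.drop(columns=["sales", "log_sales"])
y = df["log_sales"]

# Split into Train and Test (75/25)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Baseline benchmark: always predict the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))

# Standard linear regression; swap in Lasso(alpha=0.1) or Ridge if needed
model = LinearRegression().fit(X_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"Baseline MAE: {baseline_mae:.3f}, Model MAE: {model_mae:.3f}")
print("Coefficients:", dict(zip(X.columns, model.coef_)))
```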

3. Conclusion:

It is quite easy to implement a Linear Regression model for non-trivial prediction problems. One needs good knowledge of Python and statistics, an understanding of the business domain, facility with symbol manipulation, lots of common sense, and a passion for modelling real-world systems!
