Standard Steps Which Can Be Followed When Performing Machine Learning Modeling

Sai Varun Immidi
Published in The Startup
Jun 9, 2020 · 10 min read

Disclaimer: The steps explained here give a high-level understanding of model building; each can be carried out in much more detail depending on the data in hand. It is assumed that data collection, hypothesis formulation, problem mapping, and the solution approach are already done. Please bear with me, as it is a lengthy article.

Machine learning algorithms can be Supervised or Unsupervised, and for both learning methods some standard modeling steps can be followed as a general heuristic. Before starting with the steps involved in modeling, here is a quick heads-up on the two learning methods. Supervised learning methods have historical, predefined labels, using which a model is built and predictions are made, whereas in Unsupervised learning no predefined labels exist; by performing the modeling itself, we identify the labels associated with the data. Supervised learning methods involve Regression and Classification algorithms, and Unsupervised learning includes Clustering algorithms. With this distinction between Supervised and Unsupervised learning in mind, the steps involved in modeling are:

  • Data Inspection and Understanding the data
  • Data Modifications (If any)
  • Exploratory Data Analysis
  • Data Preprocessing
  • Data Modeling
  • Model Evaluation
  • Model Validation
  • Model Stability Analysis
  • Model Governance

As a general heuristic, the above-mentioned steps can be followed when performing supervised modeling. Let’s now focus on each of the steps:

Data Inspection and Understanding the data: Before applying any modeling technique, it is necessary to understand the data first. In this stage we read the data into dataframes and look at its attributes to get a high-level idea of the data: its shape, the data types of the attributes, a statistical summary of the numerical columns, general info, and the values and columns present. From this we can identify the categorical and numerical columns in the data. Once data inspection and understanding are done, we move on to the next step.
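A minimal inspection sketch using pandas; the toy dataframe and its column names are hypothetical stand-ins for your own dataset:

```python
import pandas as pd

# Hypothetical data; in practice this would come from pd.read_csv() or similar.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000.0, 52000.0, 81000.0, 90000.0],
    "city": ["NY", "SF", "NY", "LA"],
})

print(df.shape)       # (rows, columns)
print(df.dtypes)      # data type of each column (numerical vs categorical)
print(df.describe())  # statistical summary of the numerical columns
df.info()             # non-null counts and memory usage
```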

Data Modification: In this step we correct the data types of any columns that have been wrongly assigned. Some data standardization can also be done if needed, such as converting all values in a column to one unit or removing invalid values.
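A small sketch of such modifications, again on hypothetical columns: a numeric column that was read as strings (with an invalid value) and a date column stored as text:

```python
import pandas as pd

# Hypothetical example: "price" was read as strings, "joined" as plain text.
df = pd.DataFrame({"price": ["10", "20", "bad", "40"],
                   "joined": ["2020-01-01", "2020-02-15", "2020-03-10", "2020-04-05"]})

# Fix the wrongly assigned dtype; invalid values become NaN instead of raising.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Convert the text column to a proper datetime dtype.
df["joined"] = pd.to_datetime(df["joined"])
```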

Exploratory Data Analysis: Also known as EDA. This step takes more time, as we perform Data Cleaning, identify and treat Outliers, and carry out Univariate and Bivariate Analysis. Through EDA we get to know the interactions between the attributes (columns) as well as the spread of data within each column. Hence it is a necessary and most important step to perform before preparing the data for modeling.

Starting with Data Cleaning: in this step we check for missing values and skewness in the columns. In data cleaning one is open to explore every dimension in order to clean accurately; the better the cleaning process, the better the results obtained from the model. Some missing-value treatment methods are:

  • Replace the missing values with the WOE (Weight of Evidence) transformation of the column.
  • Impute them with statistical measures such that no skewness is created in the column.
  • If a column has too many missing values, drop that column.
  • If an additional data source is available for imputing the proper values, use it.

Apart from these, there are some higher-dimensional treatment methods such as MCMC (Markov Chain Monte Carlo) and Expectation Maximization. One thing to remember when dealing with missing values: upon identifying them, it is necessary to understand the cause behind them, such as whether they occurred as MAR (Missing At Random) or MNAR (Missing Not At Random), and choose the imputation method accordingly. If categorical columns have been imputed with the mode, it is always better to check the variation across the classes of the column to see whether any skewness has been created. Also, some columns have no missing values but are skewed in nature. To handle skewed columns, statistical methods such as the log transformation, square-root transformation, or Box-Cox transformation can be used; these significantly reduce the skew coefficient. But one should remember to transform the values back to the original scale when making predictions. Skewness must be handled before using a variable for modeling, as skewed variables will deviate the predictions the model makes.
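The imputation and transform-then-reverse steps above can be sketched as follows; the income column and its extreme value are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed column with a missing value.
df = pd.DataFrame({"income": [40000.0, np.nan, 81000.0, 1_000_000.0, 52000.0]})

# Median imputation is robust to skew (the mean would be pulled by 1,000,000).
df["income"] = df["income"].fillna(df["income"].median())

# Log transform to reduce right skew; log1p handles zeros safely.
df["income_log"] = np.log1p(df["income"])

# Reverse the transform when reporting predictions back in original units.
recovered = np.expm1(df["income_log"])
```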

Once missing values and skewness in the columns are treated, we move forward to Outlier Analysis. Here we check for outliers in the numerical columns using box plots, by analyzing the quantile variation of the variable, or by statistical methods such as Z-score analysis and the IQR method. Upon identifying the outliers, some possible treatment techniques are:

  • Treat outliers as missing values and impute them with a missing-value treatment.
  • Cap extreme outliers at the closest percentile value.
  • If extreme outliers are not necessary for the analysis, drop them.

When treating outliers, one should take both quantitative and qualitative considerations into account.
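A sketch of the IQR method combined with capping, on a hypothetical series:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 300])  # 300 is an obvious outlier

# IQR fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) extreme values at the fences instead of dropping the rows.
capped = s.clip(lower=lower, upper=upper)
```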

We then move on to Univariate and Bivariate analysis, starting with the former. Univariate means analyzing one single variable; it tells us how the data varies within that variable, for instance by plotting distribution plots. Univariate analysis is performed for both categorical and numerical variables. Bivariate means analyzing one variable in relation to another, which helps us understand the interaction between variables. The possible bivariate combinations are: numerical to numerical, categorical to numerical, and categorical to categorical. In bivariate analysis we also compute pairwise correlations to check which variables are strongly correlated with each other; among a strongly correlated pair, one variable can be dropped to avoid correlation between the predictor variables. Pairwise correlations alone, however, will not identify multicollinearity, since a variable can be correlated with a combination of several others. To address this, we take the help of VIF scores during modeling to eliminate variables with high multicollinearity. Before computing pairwise correlations, we first create dummy variables for the categorical variables and then check the correlations to remove strongly correlated pairs.
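The pairwise-correlation check and VIF scores can be sketched as below on synthetic data (here x2 is nearly a copy of x1). This uses the standard fact that, for a correlation matrix R, each variable's VIF is the corresponding diagonal entry of R⁻¹:

```python
import numpy as np
import pandas as pd

# Synthetic frame: x2 is almost a multiple of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({"x1": x1,
                   "x2": 2 * x1 + rng.normal(scale=0.01, size=100),
                   "x3": rng.normal(size=100)})

corr = df.corr()

# Flag pairs with |correlation| above a chosen threshold (0.8 here).
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.8]

# VIF of each variable = diagonal of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=corr.columns)
```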

The steps up to Exploratory Data Analysis are the same for both Supervised and Unsupervised Learning methods.

Upon completing the long process of EDA, we move on to Data Preprocessing, in which we prepare the data as required for modeling. The steps involved are: creating dummy variables, splitting the data into train and test sets, and scaling the train data. We start with dummy variable creation, in which a categorical variable with n classes is encoded as n-1 dummy variables. As mentioned earlier, pairwise correlations can be found after dummy variables are created; plotting a heat map shows the pairwise correlations between the numerical variables, from which we can drop one variable of each strongly correlated pair. This activity of dropping variables is confined to Supervised learning methods; in Unsupervised learning we just scale the variables, perform some feature engineering, and then perform modeling. We then split the data into train and test sets in some ratio such as 70:30 or 80:20, and scale the train set using a technique like standardization or normalization as per the requirement. The test set is transformed at prediction time using the scaler object that was fitted on the train set.
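A minimal sketch of these three steps on a hypothetical frame. In practice one would typically use sklearn's `train_test_split` and `StandardScaler`; here plain pandas is used to make the key point visible, namely that the scaling statistics come from the train set only:

```python
import pandas as pd

df = pd.DataFrame({"income": [40.0, 52.0, 81.0, 90.0, 60.0, 75.0],
                   "city": ["NY", "SF", "NY", "LA", "SF", "LA"]})

# n-1 dummy variables per categorical column (drop_first removes redundancy).
df = pd.get_dummies(df, columns=["city"], drop_first=True)

# 70:30 train-test split: shuffle, then slice.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
split = int(len(df) * 0.7)
train, test = df.iloc[:split].copy(), df.iloc[split:].copy()

# Standardize using statistics from the TRAIN set only, then apply them to
# test; this mirrors fitting a scaler object on train and reusing it on test.
mu, sigma = train["income"].mean(), train["income"].std()
train["income"] = (train["income"] - mu) / sigma
test["income"] = (test["income"] - mu) / sigma
```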

Once the data is preprocessed as per the modeling requirement, we move on to Modeling. In this stage some points are confined to Supervised learning methods and some to Unsupervised learning methods. In Supervised modeling with many feature variables we work in two steps: first we perform Automated Feature Selection, followed by Manual Feature Elimination using the p-values and VIF values of the features. With a limited number of variables we can instead perform one of the following: forward selection, backward selection, or stepwise selection using a criterion such as AIC (Akaike Information Criterion). In real projects we will usually have many predictor variables. One of the simplest automated feature selection processes is RFE (Recursive Feature Elimination), which returns the top N features, selected based on their coefficient values. From these top N features we can further perform manual elimination in order to obtain a lighter model that is free from overfitting issues. There is always a bias-variance trade-off in modeling, hence we need to build a balanced model that manages both. At every stage of manual feature elimination we keep track of a model metric, such as R-squared or adjusted R-squared for Regression algorithms and accuracy for Classification algorithms; the reason for tracking metrics is to make sure we are not losing any explanatory features along the way. Once we reach a final stable model, with all coefficients significant and metrics in a normal range, we can move forward to Model Evaluation.
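The idea behind RFE can be sketched with plain numpy: repeatedly fit a linear model and drop the feature with the smallest absolute coefficient until N features remain. This is a toy illustration on synthetic data (only features 0 and 1 carry real signal); in practice one would use a library implementation such as sklearn's `RFE`:

```python
import numpy as np

# Synthetic data: 5 candidate features, but y depends only on features 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

features = list(range(X.shape[1]))
while len(features) > 2:  # keep the top N = 2 features
    # Fit OLS on the surviving features.
    coef, *_ = np.linalg.lstsq(X[:, features], y, rcond=None)
    # Eliminate the feature with the smallest absolute coefficient.
    weakest = features[int(np.argmin(np.abs(coef)))]
    features.remove(weakest)

print(features)  # the two surviving features
```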

Once we attain a final stable model, we perform Model Evaluation using metrics that determine the goodness of fit of the model. These metrics are evaluated on both the train and test data sets: Model Evaluation checks the train set, whereas Model Validation checks the validation or test set. Model Evaluation also varies with the type of algorithm. When performing a Regression technique, we check R-squared, adjusted R-squared, the F-statistic, and the p-value of the F-statistic to determine the goodness of fit and whether the overall model is significant. In the case of Classification, we evaluate using either a sensitivity-specificity view or a precision-recall view, because accuracy alone is not sufficient: accuracy accounts for both positives and negatives, but the business requirement often demands more sensitivity or more specificity, and the same holds for precision versus recall. In Classification we can also plot the ROC (Receiver Operating Characteristic) curve, which shows the trade-off between TPR and FPR. A good classification model has a high TPR and low FPR, with the curve almost touching the Y-axis (the TPR axis), so that the AUC (Area Under the Curve) is close to 1. To quantify model quality from the ROC curve we check the AUC value: the closer it is to 1.0, the better the model. In the case of Unsupervised learning such as Clustering, there are no specific evaluation metrics; the clusters formed should simply be interpretable and distinct from each other.
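These classification metrics can be computed from scratch on a tiny hypothetical example. The AUC here uses its probabilistic definition (the chance that a random positive scores above a random negative), which matches the area under the ROC curve:

```python
import numpy as np

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])
y_pred = (y_score >= 0.5).astype(int)  # threshold at 0.5

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = tp / (tp + fn)  # recall / true positive rate
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)

# AUC = P(random positive scores higher than random negative).
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
auc = np.mean([p > n for p in pos for n in neg])
```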

Once Model Evaluation is done, we move on to Model Validation. Validation techniques also vary between Supervised and Unsupervised learning algorithms. For Supervised learning, some techniques using which the model can be better validated are: in-sample validation, K-fold cross-validation, and out-of-time validation. After performing one of these, we make predictions on the test set using the model and evaluate them with the same evaluation metrics. This tells us whether the model has identified general patterns in the train set or has overfitted or underfitted it. If the evaluation metrics on the test set are in the same range as those on the train set, it can be said that the model has captured only general patterns in the data and thus performs decently on the test set as well. In Unsupervised learning, the predictions made and clusters formed on the test set should be interpretable and distinct; apart from this there are not many specific validation techniques for Unsupervised methods.
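A minimal K-fold cross-validation sketch (K = 5), using a trivial mean predictor as the stand-in "model" so the fold mechanics stay visible; in practice sklearn's `cross_val_score` wraps this same pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, size=50)  # hypothetical target values

# Split the indices into K = 5 folds.
folds = np.array_split(np.arange(50), 5)

scores = []
for k in range(5):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    pred = y[train_idx].mean()                         # "fit" on train folds
    scores.append(np.mean((y[test_idx] - pred) ** 2))  # MSE on held-out fold

print(np.mean(scores))  # average validation error across the 5 folds
```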

Once Model Validation is done, we move on to Model Stability Analysis, which is more specific to Supervised learning methods. Sometimes stability is even more important than the predictive power of the model, since the model is used for critical business decisions. In stability analysis we check for Performance Stability and Variable Stability. A model has Performance Stability if its predictive power on the train and test sets is close enough. Variable Stability is divided into two categories: Variable Distribution Stability and Variable Predictive Stability. A model has Variable Distribution Stability if each variable's distribution is almost the same across the train and test sets; to quantify this, we measure the PSI (Population Stability Index) between the two sets. To check Variable Predictive Stability, we analyze the WOE values of the variables in both sets, the reason being that WOE captures the predictive nature of a variable; this, too, is quantified by the PSI associated with the WOE values.
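A sketch of the PSI calculation on synthetic data, using decile bins from the expected (train) sample. The thresholds in the comments (below 0.1 stable, above 0.25 shifted) are a commonly used rule of thumb, not a universal standard:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one variable."""
    # Bin edges from the expected sample's quantiles (decile bins by default).
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], actual.min()) - 1e-9   # cover out-of-range values
    edges[-1] = max(edges[-1], actual.max()) + 1e-9
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)              # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return np.sum((a_pct - e_pct) * np.log(a_pct / e_pct))

rng = np.random.default_rng(0)
train = rng.normal(size=5000)
test_same = rng.normal(size=5000)            # same distribution -> PSI near 0
test_shift = rng.normal(loc=1.0, size=5000)  # shifted distribution -> high PSI
```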

Once Model Stability is checked, we move on to Model Governance, or Model Tracking. This activity is performed once the model is in production. Different automated tools are available to track model performance depending on the algorithm, and governance practices vary with organizational standards, so this can be explored in much more detail. On a general note, in Model Tracking we keep track of all model metrics and of the predictions the model makes on unseen data. If any re-calibration is needed during this process, it should be done; and if the model's predictions deviate on more than one occasion, there is a great necessity of rebuilding the model.

These are some of the steps which can be followed on a general note when performing any of the modeling techniques.

