CRISP-DM Phase 4: Modeling
This is part 5 of a 7-part series summarizing openSAP’s 6-week Getting Started with Data Science (Edition 2021) course by Stuart Clarke. Part 4 is here.
Part 4 Recap
In the fourth part of this series, I explained why preparing data is a crucial part of a data science project, briefly discussed how to prepare data, and provided additional resources, with code, for some of the most common data preparation methods in the field.
There are six phases of CRISP-DM, each with its own tasks and outputs:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
In this article, we will focus on the fourth phase, which is Data Modeling. After all the cleaning, formatting, feature engineering (if necessary), and feature selection, we will now feed the data to the chosen model. But how does one select a model to use?
Note: I will only briefly discuss how to choose a model and which models are available. I will not cover each model and how it works under the hood. To learn more about the different types of models and how they work, you can enroll here: openSAP’s 6-week Getting Started with Data Science (Edition 2021) course.
How to choose a model?
IT DEPENDS. You read that right, it depends. It all depends on what the goal of your task or project is, and this should already have been identified in the Business Understanding phase of CRISP-DM.
Steps in choosing a model
- Determine size of training data — if you have a small dataset, with few observations and a high number of features, choose high bias/low variance algorithms (Linear Regression, Naïve Bayes, Linear SVM). If your dataset is large, with a high number of observations relative to the number of features, choose low bias/high variance algorithms (KNN, Decision Trees).
- Accuracy and/or interpretability of the output — if your goal is inference, choose restrictive models, as they are more interpretable (Linear Regression, Least Squares). If your goal is higher accuracy, choose flexible models (Bagging, Boosting, SVM).
- Speed or training time — always remember that higher accuracy and larger datasets mean longer training time. Examples of algorithms that are easy to implement and fast to run: Naïve Bayes, Linear and Logistic Regression. Examples of algorithms that need more time to train: SVM, Neural Networks, and Random Forests.
- Linearity — first check the linearity of your data by fitting a linear model or running a logistic regression, then inspect the residual errors (see the sketch after this list). Higher errors mean that the data is not linear and needs a more complex algorithm to fit. If the data is linear, you can choose: Linear Regression, Logistic Regression, Support Vector Machines. If non-linear: Kernel SVM, Random Forest, Neural Nets.
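To make the linearity check concrete, here is a minimal sketch. The library (scikit-learn) and the synthetic dataset from make_regression are my own illustrative choices, standing in for your prepared data, not something the course prescribes:

```python
# A minimal sketch of the linearity check: fit a simple linear model
# and inspect the residual errors on held-out data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data; replace with your own prepared dataset.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linear = LinearRegression().fit(X_train, y_train)
residuals = y_test - linear.predict(X_test)

print(f"RMSE: {mean_squared_error(y_test, linear.predict(X_test)) ** 0.5:.2f}")
print(f"Mean residual: {residuals.mean():.2f}")
# Large or clearly structured residuals suggest the relationship is not
# linear, pointing you toward Kernel SVM, Random Forest, or Neural Nets.
```

Plotting the residuals against the predictions is an even quicker visual check: a shapeless cloud suggests the data is roughly linear, while a visible curve suggests it is not.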
Parametric vs. Non-Parametric Machine Learning Models
Parametric Machine Learning Algorithms
Parametric ML Algorithms are algorithms that simplify the mapping function to a known form. They are often called the “Linear ML Algorithms”.
Examples of Parametric ML Algorithms
- Logistic Regression
- Linear Discriminant Analysis
- Perceptron
- Naïve Bayes
- Simple Neural Networks
Benefits of Parametric ML Algorithms
- Simpler — the methods are easy to understand and the results are easy to interpret
- Speed — very fast to learn from the data provided
- Less data — it does not require as much training data
Limitations of Parametric ML Algorithms
- Limited Complexity — suited only to simpler problems
- Poor Fit — the methods are unlikely to match the underlying mapping function
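To illustrate the “known form” idea, here is a short sketch, again assuming scikit-learn as the library: a logistic regression learns a fixed, small set of parameters no matter how many rows it is trained on.

```python
# A parametric model has a fixed number of learned parameters:
# logistic regression learns one coefficient per feature plus an intercept.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic placeholder data with 4 features.
X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

print(clf.coef_)       # shape (1, 4): one weight per feature
print(clf.intercept_)  # one bias term
# The entire learned function is sigmoid(X @ coef_.T + intercept_),
# which is why these models are fast, compact, and easy to interpret.
```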
Non-Parametric Machine Learning Algorithms
Non-Parametric ML Algorithms are algorithms that do not make assumptions about the form of the mapping function. They are a good choice when you have a lot of data, no prior knowledge, and you don’t want to worry too much about choosing exactly the right features.
Examples of Non-Parametric ML Algorithms
- K-Nearest Neighbors (KNN)
- Decision Trees like CART
- Support Vector Machines (SVM)
Benefits of Non-Parametric ML Algorithms
- Flexibility — capable of fitting a large number of functional forms
- Power — they make no assumptions about the underlying function
- Performance — able to produce higher-performance models for prediction
Limitations of Non-Parametric ML Algorithms
- Needs more data — requires a large training dataset
- Slower processing — they often have far more parameters, which means training time is much longer
- Overfitting — there is a higher risk of overfitting the training data, and it is harder to explain why specific predictions were made
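The trade-off between the two families is easiest to see on deliberately non-linear data. The sketch below is illustrative only, using scikit-learn's make_moons to pit a parametric logistic regression against a non-parametric KNN:

```python
# Comparing a parametric and a non-parametric model on curved,
# non-linear data where a straight decision boundary cannot do well.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=1000, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

parametric = LogisticRegression().fit(X_train, y_train)
non_parametric = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print(f"Logistic regression accuracy: {parametric.score(X_test, y_test):.2f}")
print(f"KNN accuracy:                 {non_parametric.score(X_test, y_test):.2f}")
# On this curved boundary KNN typically scores noticeably higher, but it
# must keep every training point in memory, echoing the lists above.
```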
The Breakdown
In the course, Stuart has broken the fourth phase down in detail: Data Modeling consists of four tasks, each with its projected outcome or output.
Simply put, the Data Modeling phase’s goal is to:
- Select Modeling Technique by selecting the actual modeling technique to be used. This should already have been identified in the Business Understanding phase. Do not forget to document the modeling technique to be used, along with any alternative models to be tried.
- Generate Test Design by generating a procedure to test the model’s quality and validity. Here, you will be able to describe the intended plan for training, testing, and evaluating the models.
- Build Model by running the model on the prepared dataset. Once the model has been run, list the parameters, their chosen values, and the rationale for each setting; you do not want to rerun a model again and again just because you forgot which parameters were used.
- Assess Model by interpreting the models according to domain knowledge and data science criteria. In this stage, you will need to summarize the results of the generated models and rank their quality relative to each other. Assess model performance metrics, graphs, and the confusion matrix (a minimal sketch of all four tasks follows this list).
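As a rough illustration of how the four tasks line up in code, here is a minimal sketch assuming scikit-learn; the model, parameter values, and split are arbitrary choices of mine, not the course's prescription:

```python
# A minimal end-to-end sketch: a fixed test design, a documented
# model build, and a simple assessment of the results.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Test design: a reproducible 80/20 hold-out split on placeholder data.
X, y = make_classification(n_samples=2000, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Build model: record the parameter settings so the run can be reproduced.
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=7)
model.fit(X_train, y_train)
print("Parameters used:", model.get_params())

# Assess model: performance metrics and the confusion matrix.
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```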
In the next part, we will talk about the fifth phase, the Evaluation Phase. If you are working on a data science project for your company, or even a personal project, try to apply the above steps where applicable. Again, different data science projects have different sets of requirements; the CRISP-DM methodology just serves as a template to ensure you have considered all of the different aspects specific to your project.