The CRISP-DM methodology

Lucas Medeiros
9 min read · May 26, 2021


Since I started working on data science projects, I have found that dividing each problem’s solution into a sequence of well-defined steps makes every project much simpler to solve, because in each new project I already have a reliable plan to follow. This sequence is also known as the CRISP-DM cycle, short for Cross-Industry Standard Process for Data Mining. From the book “Data Science for Business”:

Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages. The Cross Industry Standard Process for Data Mining, abbreviated CRISP-DM (CRISP-DM Project, 2000), is one codification of this process. Keeping such a process in mind provides a framework to structure our thinking about data analytics problems.

Cross-Industry Standard Process for Data Mining (CRISP-DM project, 2000).

The CRISP-DM model was first presented in the Journal of Data Warehousing, in the article ‘The CRISP-DM Model: The New Blueprint for Data Mining’. As defined in that text:

CRISP-DM organizes the data mining process into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. These phases help organizations understand the data mining process and provide a road map to follow while planning and carrying out a data mining project. This article explores all six phases, including the tasks involved with each phase.

Since then, a lot has happened in the field of data science. For instance, the term “data science” itself became popular, in contrast with “data mining”, which is so established that it appears in the name of the method itself: CRISP-DM. Nowadays, it could just as well be called “CRISP-DS”. But other, more significant changes have happened, such as the number of tools created and disseminated for the practice of solving data science problems.

This has made many details of the CRISP method much more visible: on a website such as Kaggle we can find thousands of open datasets, with projects and discussions on encoding categorical features, on different types of cross-validation, on how to divide datasets into train, validation, and test sets, on data shift problems in imbalanced classification, and so on. Considering this new context, I will present below the steps I consider useful when going through the CRISP cycle.

1. Context and problem/solution planning

Data science problems should be understood in the context of a business model one is trying to improve. A business model can be described as the way a company creates, delivers, and captures value for the customer. The most common business models, and the most useful ones for a data scientist to understand, are:

  • E-commerce: profit comes from selling products online.
  • Software as a Service (SaaS): profit comes from monthly or annual subscription payments for the use of a software service.
  • Services: profit comes from selling a project or service.
  • Mobile App: profit comes from in-app buys.
  • Media Site: profit can come from the number of clicks or views on a given ad marketed by the website.
  • Market Place: profit comes from fees charged at each transaction facilitated by the company.

Once the problem is defined inside one of these business models, the following sequence of steps might be ideal to plan for a data science solution:

  • Identifying the final solutions and products that might completely solve the problem, whether that is a machine learning model, an insights report, an ETL pipeline feeding a dashboard, etc.
  • Going through all the intermediary work necessary to achieve the solutions.
  • Defining both the raw data and the tools (programming languages, visualization applications, etc) necessary to do the work.

Once the end, middle, and beginning of the project are well conceptualized, it is time to start the hands-on work. But first, as a reminder of the project plan, a good practice is to create a mind map connecting the data and the business features, which can easily be made on websites such as ‘coggle.com’ and ‘miro.com’ (for example, a mind map for a sales prediction project at a pharmaceutical company).

2. Data collection, cleaning, and descriptive analysis

This is the beginning of the process of making sense of the data and creating as much value as we can with it. The objective is to understand each variable and make basic adjustments such as filling in missing (NaN) values and correcting data types. In this step, a few basic descriptions are made:

  • the shape of the data frame is checked.
  • the column names are checked and renamed or corrected where needed.
  • the number of missing values in each column (feature) is counted, along with the percentage of missing values relative to the total number of rows, and the reason why those values are missing. Understanding the subject matter and the nature of the phenomenon is important in order to decide whether to exclude or fill in these values.
  • the type of data contained in each column is checked and corrected.
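
A minimal sketch of this step with pandas could look like the following (the file name and column names are placeholder assumptions, not taken from any specific project):

```python
import pandas as pd

# Hypothetical raw sales data; file name and columns are placeholders
df = pd.read_csv("sales_raw.csv")

# Shape and column names
print(df.shape)
print(df.columns.tolist())

# Standardize column names to snake_case
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Missing values: absolute count and percentage per column
missing = df.isna().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({"missing": missing, "missing_pct": missing_pct}))

# Example corrections: parse dates and fill a numeric gap with the median
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["competition_distance"] = df["competition_distance"].fillna(
    df["competition_distance"].median()
)

# Check the resulting data types
print(df.dtypes)
```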

3. Feature Engineering

Now each feature in the data frame is hopefully understood, well named, free of missing values, and stored with its correct data type. To expand and better represent the information already at hand, the best practice is to perform feature engineering, meaning the creation of new features from the ones that already exist.

This is the step of the machine learning pipeline that often contains most of the creative aspects of the problem’s solution. At first sight, it may seem that feature engineering cannot add much information to what is already contained in the dataset. But even after only a few projects, it became clear to me that this step is decisive for the later results of our models. Also, any excess features created here will not be a problem, since a feature selection step will be performed further down the pipeline. And at that point, it is often the case that features created here turn out to be the ones with the highest predictive importance.
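
Continuing with the same df from the previous sketch, a few hypothetical derived features could look like this (column names such as ‘competition_open_since’ are assumptions for illustration):

```python
# Date-based features derived from the existing 'date' column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df["is_weekend"] = (df["date"].dt.dayofweek >= 5).astype(int)

# Time elapsed (in approximate months) since a competitor opened nearby;
# assumes 'competition_open_since' is already a datetime column
df["competition_time_month"] = (
    (df["date"] - df["competition_open_since"]).dt.days // 30
)
```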

4. Filter

All of the available data is used in the descriptive analysis and in the feature engineering process; placing a filter at this stage of the pipeline guarantees that only valuable information is carried forward into insight creation and modeling. Most of the filtering is done by ‘dropping’ (excluding) rows with data that is not relevant to the problem.
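
As a small illustration (the filter conditions below are assumptions; in a real project they come from the business context):

```python
# Keep only rows that are relevant for modeling, e.g. days when the store
# was open and actually registered sales
df = df[(df["open"] == 1) & (df["sales"] > 0)]

# Drop an auxiliary column that carries no further information
df = df.drop(columns=["open"])
```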

5. EDA

The exploratory data analysis (EDA) is where the dataset becomes visible and easy to present, both to the data team and to any other audience to which the business problem might be presented.

The main objectives of an exploratory data analysis are:

  • to develop an understanding of the business problem
  • to generate and validate new insights for the business team (data-driven decision making)
  • to evaluate the impact of features on a given phenomenon

To evaluate the impact of each feature on the phenomenon, univariate, bivariate, and multivariate analyses are performed on all features. The univariate analysis consists of:

  • data distribution
  • min, max, and range
  • Q1 and Q3
  • skewness and kurtosis
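
A minimal univariate summary of a numeric column could be obtained like this (the column name is an assumption):

```python
import matplotlib.pyplot as plt

num = df["sales"]
print(num.describe())                  # count, mean, std, min, Q1, median, Q3, max
print("range:", num.max() - num.min())
print("skew:", num.skew())
print("kurtosis:", num.kurtosis())

num.hist(bins=50)                      # distribution plot
plt.show()
```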

Bivariate analysis is performed according to the nature of each feature, as summarized below:

Metrics in bivariate analysis.

And multivariate analysis is done with a correlation matrix, using one of the following correlation types:

  • Pearson
  • Spearman
  • Cramér’s V
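
Pearson and Spearman matrices are available directly in pandas; Cramér’s V, used for pairs of categorical features, has to be computed separately (for example from scipy.stats.chi2_contingency). A short sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson (linear) correlation between numeric features
num_cols = df.select_dtypes(include="number")
sns.heatmap(num_cols.corr(method="pearson"), cmap="coolwarm")
plt.show()

# Spearman is a drop-in alternative for monotonic, non-linear relationships
spearman = num_cols.corr(method="spearman")
```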

6. Data Preparation

When preparing the data for an ML model to learn, the scale of each feature has to be adjusted so that the arbitrary units in which each feature is measured do not interfere with the learning process. This is done with one of the following methods (a short sketch follows the list):

  • Normalization: used when the feature has a distribution close to a Gaussian; the values are rescaled so that the center becomes 0 and the standard deviation becomes 1.
  • Min-Max Scaler: the values are rescaled to a range between 0 and 1.
  • Robust Scaler: the values are rescaled using the median and the interquartile range (Q1 to Q3), which makes this method more robust to the influence of outliers.
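
With scikit-learn, each method is a one-liner (the column names and the choice of scaler per column are illustrative assumptions):

```python
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# 'Normalization' as described above: mean 0, standard deviation 1
df[["year"]] = StandardScaler().fit_transform(df[["year"]])

# Min-Max for a feature without a Gaussian shape or heavy outliers
df[["promo_time_week"]] = MinMaxScaler().fit_transform(df[["promo_time_week"]])

# Robust scaling for a feature with many outliers
df[["competition_distance"]] = RobustScaler().fit_transform(df[["competition_distance"]])
```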

Also, categorical features have to be converted into a numerical representation, which is the encoding process. There are many types of categorical encoding, and each situation may call for a specific one, according to the reasoning of the data scientist. The main types (a short example follows the list) are:

  • One Hot Encoding (Dummy Encoding)
  • Label Encoding
  • Ordinal Encoding
  • Target Encoding
  • Frequency Encoding
  • Embedding Encoding
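
A few of these encodings, sketched with pandas and scikit-learn (column names and category values are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One-hot (dummy) encoding for a nominal feature with few categories
df = pd.get_dummies(df, columns=["state_holiday"])

# Label encoding for a nominal feature (fine for tree-based models)
df["store_type"] = LabelEncoder().fit_transform(df["store_type"])

# Ordinal encoding when the categories carry a natural order
assortment_order = {"basic": 1, "extra": 2, "extended": 3}
df["assortment"] = df["assortment"].map(assortment_order)

# Frequency encoding: replace each category by its relative frequency
freq = df["city"].value_counts(normalize=True)
df["city"] = df["city"].map(freq)
```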

And finally, some features might need a transformation of their nature so that patterns in the data can be learned more precisely, as is the case with time features that are cyclical (weeks, months, etc.). The types of transformation performed in data preparation are (a sine-cosine example follows the list):

  • Logarithm Transformation
  • Box-Cox Transformation
  • Cube-Root Transformation
  • Square-Root Transformation
  • Sine-Cosine Transformation
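
The sine-cosine pair is what makes a cyclical feature “wrap around”, while a log transformation is a common choice for a right-skewed target. A minimal sketch (column names are assumptions):

```python
import numpy as np

# Cyclical encoding: maps month (1-12) onto a circle, so December ends up
# close to January instead of far away
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)

# Log transformation for a right-skewed target such as sales
df["sales"] = np.log1p(df["sales"])
```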

7. Feature Selection

The idea behind feature selection is Occam’s Razor: the simplest explanation for a given phenomenon should prevail over any other. With this in mind, the types of feature selection that I most often apply to the dataset are:

  • Importance of each feature, given by applying a tree-based ML algorithm, like Random Forest or Extra Trees.
  • Boruta: features do not compete among themselves. Instead, each feature competes against a randomized (shuffled) copy of itself.
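
A sketch of feature selection with Boruta (the boruta package is installed separately with ‘pip install Boruta’; the target column name is an assumption, and all remaining columns are assumed to be numeric at this point):

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

# Boruta expects numpy arrays, not DataFrames
X = df.drop(columns=["sales"]).values
y = df["sales"].values

rf = RandomForestRegressor(n_jobs=-1)
selector = BorutaPy(rf, n_estimators="auto", random_state=42)
selector.fit(X, y)

# Columns flagged as relevant by Boruta
selected = df.drop(columns=["sales"]).columns[selector.support_].tolist()
print(selected)
```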

8. ML Models

This is where the data is split into training and test sets, different models are fitted to the training set, and their predictions are evaluated on the test set. It is always recommended to perform cross-validation, so that the performance estimates are as close as possible to reality and the best-fitting model can be chosen for the next steps.
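
A sketch of this comparison with scikit-learn follows; plain k-fold cross-validation is shown for simplicity, although for time-indexed sales data a TimeSeriesSplit would be more faithful to reality:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, n_jobs=-1),
}

for name, model in candidates.items():
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
    )
    print(f"{name}: RMSE = {-scores.mean():.2f} +/- {scores.std():.2f}")
```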

9. Metrics Evaluation

Once the best algorithms are selected (at this step there can still be more than one), it is time to evaluate the performance of each model on a test set. The evaluation metrics change according to the type of problem at hand. Regression is often evaluated by RMSE. Classification metrics vary depending on how balanced the response variable is.

In clustering approaches, the metrics are often defined by the practical evaluation of a given cluster against the dataset as a whole. As the cluster definitions are refined, their performance in describing the phenomenon tends to improve.

A great tool to use in metrics evaluation is the Yellowbrick library, which provides good out-of-the-box visualizations.
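
A minimal evaluation of a regression model on the held-out test set, continuing with the variables from the previous sketches, might look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

best_model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}")
```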

10. Hyperparameter Fine-Tuning

After finding the best-performing model, it is time to tune its hyperparameters so that a slightly better performance might be squeezed out of it. Each type of model has its own parameters; for a ‘Random Forest’, for example, these are the main parameters that, when tuned correctly, can generate an increase in performance:

  • max_features
  • n_estimators
  • max_depth
  • min_samples_split

There are three main strategies to investigate which combination of these parameters leads to the best-fitting model (a Random Search sketch follows the list):

  • Random Search: a set of parameter combinations is drawn at random and tested, and the combination with the best performance is chosen. This is a fairly quick process, but the chosen combination tends to be far from the best possible one, since the number of combinations tested is small and they are picked at random.
  • Grid Search: from a list of parameters, each one with a range of values more likely to perform best, this process creates a list with all parameter combinations, and each combination is tested so that at the end the one with the best performance is chosen.
  • Bayesian Search: uses the results of previously tested combinations to decide, through a probabilistic (Bayesian) model, which combination of parameters to try next.
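
A sketch of a Random Search over the Random Forest parameters listed above, using scikit-learn’s RandomizedSearchCV (the value ranges are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20, 40],
    "max_features": ["sqrt", 0.5, 1.0],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1),
    param_distributions=param_distributions,
    n_iter=20,                  # number of random combinations to try
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```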

11. Business Predictions

Now, with the model ready to generate value inside the company, it is time to use it by simulating its predictive power and translating the results into business terms.

12. Model Deployment

Deploying a model is the final step in the CRISP cycle. As stated in the original article ‘The CRISP-DM Model: The New Blueprint for Data Mining’ (Journal of Data Warehousing, 2000):

‘Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise’.

This step requires the data scientist to condense the entire process undertaken during the cycle to solve the problem at hand. There are three main steps needed to deliver a model to production (a minimal API sketch follows the list). These are:

  • create a handler API (for example, a Handler_API.py script) to receive requests and return predictions.
  • create a model object (the serialized trained model).
  • create a data preparation class containing all of the transformations applied to the data up to the model’s training.
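
A minimal sketch of such a handler using Flask (the framework choice, file names, and the data preparation object with a .transform() method are assumptions, not a prescription):

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# The serialized trained model and data preparation object are loaded once at startup
model = pickle.load(open("model.pkl", "rb"))
preparation = pickle.load(open("data_preparation.pkl", "rb"))

@app.route("/predict", methods=["POST"])
def predict():
    raw = pd.DataFrame(request.get_json())
    prepared = preparation.transform(raw)         # same transformations used in training
    raw["prediction"] = model.predict(prepared)
    return jsonify(raw.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```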

With all of these steps defined, it is time to iterate over crisp01, crisp02, crisp03, and so on, and see how much value we can extract from our data by simply moving one step at a time.
