Process Models for Data Science Projects: CRISP-DM and KDD
Two of the most widely used process models for data science projects are CRISP-DM (Cross-Industry Standard Process for Data Mining) and KDD (Knowledge Discovery in Databases). This article briefly describes the stages of both models and identifies the differences between them.
CRISP-DM Stages
- Business Understanding
The Business Understanding stage focuses on identifying the requirements of the project. Understanding the requirements and setting goals accordingly shapes the entire project, which makes this stage important in almost any project. If it is skipped, the project can end with a result very different from the one desired.
- Data Understanding
In the Data Understanding stage, the initial data is collected according to the needs identified in the previous stage. The collected data is then examined and its properties are described, the data is explored in more depth, and finally its quality is assessed to determine how clean it is.
- Data Preparation
At this stage, the final datasets are prepared before moving on to modeling. The following steps are carried out:
📌 The datasets to be used are selected, and the reasons for choosing them are recorded
📌 The datasets are cleaned again so that they are ready for modeling
📌 New useful attributes are derived within the datasets
📌 New datasets are created by combining data collected from multiple sources
📌 Data is re-formatted according to business needs
- Modeling
Models are built with different techniques, using the information obtained in the previous stages. The models are then tested and evaluated, and this cycle repeats until a model produces the desired result.
- Evaluation
At this stage, it is decided which model best meets the needs, and it is checked whether the requirements set in the Business Understanding stage are met.
- Deployment
If all previous steps have been completed and the model has been successful, the decision is made to deploy it. A deployment plan is developed, maintenance plans are made for the post-project phase, final reports document the whole process, and the project is reviewed to note what went well and what could be improved.
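To make the flow between the stages above more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy records, the `REQUIRED_ACCURACY` criterion, the function names and the threshold "model" are all hypothetical and not part of CRISP-DM itself. A real project would also evaluate on held-out data rather than on the data used to pick the model.

```python
from statistics import mean

# Hypothetical raw records: monthly spend and whether the customer churned.
RAW = [
    {"spend": 20.0, "churned": 1},
    {"spend": 25.0, "churned": 1},
    {"spend": None, "churned": 0},  # missing value, to be found during Data Understanding
    {"spend": 80.0, "churned": 0},
    {"spend": 95.0, "churned": 0},
    {"spend": 30.0, "churned": 1},
    {"spend": 70.0, "churned": 0},
    {"spend": 40.0, "churned": 0},
]

# Business Understanding: an assumed success criterion agreed up front.
REQUIRED_ACCURACY = 0.8


def understand(records):
    """Data Understanding: report the size of the data and how much is missing."""
    missing = sum(1 for r in records if r["spend"] is None)
    return {"rows": len(records), "missing_spend": missing}


def prepare(records):
    """Data Preparation: drop incomplete rows and derive a new attribute."""
    clean = [dict(r) for r in records if r["spend"] is not None]
    avg = mean(r["spend"] for r in clean)
    for r in clean:
        r["below_average"] = r["spend"] < avg  # derived attribute (illustrative only)
    return clean


def accuracy(threshold, records):
    """Share of records where the rule 'spend < threshold' matches the churn label."""
    hits = sum((r["spend"] < threshold) == bool(r["churned"]) for r in records)
    return hits / len(records)


def model(records, candidates):
    """Modeling: try several candidate thresholds and keep the most accurate one."""
    return max(candidates, key=lambda t: accuracy(t, records))


def evaluate(threshold, records):
    """Evaluation: check the chosen model against the business requirement."""
    return accuracy(threshold, records) >= REQUIRED_ACCURACY


report = understand(RAW)
prepared = prepare(RAW)
best = model(prepared, candidates=[25.0, 35.0, 50.0, 75.0])
deploy = evaluate(best, prepared)  # Deployment goes ahead only if this is True
print(report, best, deploy)
```

With this toy data, one row is reported as missing, the threshold 35.0 is selected, and the evaluation passes, so the sketch would proceed to deployment.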
KDD Stages
- Selection
Creating target datasets from the available data or database
- Pre-processing
Cleaning and improving the target datasets by dealing with faulty or missing data
- Transformation
Converting the pre-processed data into a form that can be used for mining
- Data Mining
Searching the transformed data for patterns that are relevant to the project goal
- Interpretation/Evaluation
Interpreting, evaluating, documenting and visualizing the cleaned, transformed data and the patterns found in it, so that people can understand the output more easily
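The five KDD stages above can be read as a chain of functions, each consuming the output of the previous one. The following is a minimal sketch under stated assumptions: the `DATABASE` rows, the function names and the "most frequent item" pattern are hypothetical and chosen only to keep the example small.

```python
from collections import Counter

# Hypothetical source "database": one row per purchase.
DATABASE = [
    {"customer": "a", "item": "Milk ", "price": "1.20"},
    {"customer": "a", "item": "bread", "price": "2.00"},
    {"customer": "b", "item": "eggs", "price": None},  # faulty row: price is missing
    {"customer": "b", "item": "MILK", "price": "1.30"},
    {"customer": "c", "item": "milk", "price": "1.25"},
    {"customer": "c", "item": "eggs", "price": "3.00"},
]


def select(db):
    """Selection: build the target dataset from the fields relevant to the goal."""
    return [{"item": r["item"], "price": r["price"]} for r in db]


def preprocess(rows):
    """Pre-processing: drop rows with missing values."""
    return [r for r in rows if r["price"] is not None]


def transform(rows):
    """Transformation: normalize item names and convert prices to numbers."""
    return [
        {"item": r["item"].strip().lower(), "price": float(r["price"])}
        for r in rows
    ]


def mine(rows):
    """Data Mining: a very simple pattern, the purchase count per item."""
    return Counter(r["item"] for r in rows)


def interpret(pattern):
    """Interpretation/Evaluation: turn the pattern into a readable summary."""
    item, count = pattern.most_common(1)[0]
    return f"Most frequently bought item: {item} ({count} purchases)"


summary = interpret(mine(transform(preprocess(select(DATABASE)))))
print(summary)
```

For this toy data the summary names milk with 3 purchases; the three spellings of "milk" only count as the same item because the Transformation step normalizes them first.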
CRISP-DM vs KDD
📌 CRISP-DM covers KDD's Selection and Pre-processing stages within its Data Understanding stage.
📌 CRISP-DM stages are reversible. When an error is made, it is possible to go back to an earlier stage, correct the error and make changes without completing the entire cycle.
📌 CRISP-DM differs from KDD in its Business Understanding phase. By including this phase, CRISP-DM covers all the steps of building a reliable data science project.
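The point about going back to an earlier CRISP-DM stage can be illustrated with a small sketch. The `run` function and the `checks` mapping are hypothetical helpers, not part of any CRISP-DM tooling; the assumed scenario is that the first Evaluation fails and sends the project back to Data Preparation instead of restarting from the beginning.

```python
STAGES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]


def run(checks, max_steps=20):
    """Walk through the stages; a check may name an earlier stage to return to."""
    history = []
    i = 0
    while i < len(STAGES) and len(history) < max_steps:
        stage = STAGES[i]
        history.append(stage)
        check = checks.get(stage)
        # Each check receives how many times its stage has been visited so far.
        target = check(history.count(stage)) if check else None
        i = STAGES.index(target) if target else i + 1
    return history


# Assumed scenario: the first Evaluation fails, the second one passes.
checks = {
    "Evaluation": lambda visits: "Data Preparation" if visits == 1 else None,
}

history = run(checks)
for step, stage in enumerate(history, start=1):
    print(step, stage)
```

In this run, Data Preparation, Modeling and Evaluation are each visited twice, while Business Understanding and Data Understanding are not repeated.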