CRISP-DM framework: A foundational data mining process model

Avikumar Talaviya
7 min read · Oct 30, 2023


Learn about what the CRISP-DM framework is really all about and gain an in-depth theoretical understanding of each phase of the data mining process

Photo by Joshua Sortino on Unsplash

Introduction

CRISP-DM is a widely used framework for data mining that outlines a structured approach to planning, executing, and evaluating data mining projects. It provides a step-by-step process that can be adapted to various business domains and data mining techniques, making it a valuable tool for both beginners and experienced practitioners.

In this article, we will go through an overview of the CRISP-DM framework and the key steps involved in the process. We’ll also explore how this framework can help you achieve your data mining objectives more effectively and efficiently, and highlight some of the challenges that you may encounter when applying it. So, let’s get started and learn how to use CRISP-DM to take your data mining projects to the next level!

Table of contents:

  1. The history behind the CRISP-DM process model
  2. CRISP-DM methodology: Explained
  3. What is the data mining context?
  4. Conclusion

The history behind the CRISP-DM process model

The CRISP-DM process model was first conceived in late 1996 by three organizations that were early leaders in the young and immature data mining market: DaimlerChrysler (then Daimler-Benz), SPSS (then ISL), and NCR. They saw the need for such a process model because they had already started or established data mining services of their own, in some cases as early as 1990.

Around the same time, in the late 1990s, early market interest in data mining was showing signs of exploding into widespread uptake. This created the need for a standard process model that could harness the value of data efficiently and was mature enough to be adopted as one of an organization’s essential business processes.

In 1997, veterans from the three organizations mentioned above formed a consortium, coined the acronym CRISP-DM (CRoss-Industry Standard Process for Data Mining), and started gathering input from a wide range of industry practitioners and from others with a vested interest in data mining, such as data warehousing vendors and management consultancies.

After this, they held a one-day workshop in Amsterdam to bring interested parties together, share ideas, and openly discuss how to take CRISP-DM forward. The workshop went beyond their expectations: twice as many people turned up as expected, and there was an overwhelming consensus that the industry needed a standard process for data mining.

Over the next two and a half years, they developed and refined the CRISP-DM process model through live trials on large-scale data mining projects at Mercedes-Benz and at their insurance sector partner, OHRA. By mid-1999, they had produced a good-quality draft of the process model. The model was also validated in practice by SPSS’s and NCR’s professional services groups, which had already adopted it on their own projects.

With growing demand for such a model and sufficient evidence that it worked in practice, the CRISP-DM model was ready to be published and distributed by mid-2000, and widespread adoption across industries followed.

CRISP-DM process model (source: datascience-pm.com)

CRISP-DM methodology: Explained

The figure above gives a visual depiction of the CRISP-DM process model, which provides an overview of the life cycle of a data mining project. It contains the phases of a project, their respective tasks, and the relationships between them. It is not possible to identify the relationships between tasks simply by reading the description of each task, because those relationships depend on the specific problem and the interests of the users.

The life cycle of a data mining project consists of six phases, as the figure above shows. The sequence of the phases is not rigid; moving back and forth between phases is almost always required when applying the process model in real-world projects. The outcome of each phase determines which phase, or which particular task of a phase, has to be performed next, because the tasks are interdependent in many cases. For example, if our predictive model is less accurate than expected, we may go back to the data understanding and data preparation steps and then rebuild the model to improve its accuracy.
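To make the back-and-forth concrete, here is a minimal sketch of that example in Python. The `walk_phases` function, the 0.80 accuracy target, and the sample scores are all assumptions made up for illustration; CRISP-DM itself does not prescribe a target or a fixed rule for when to step back.

```python
from enum import Enum
from typing import Iterable, List


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def walk_phases(eval_accuracies: Iterable[float], target: float = 0.80) -> List[Phase]:
    """Return the order in which phases are visited (hypothetical helper).

    Each pass through evaluation consumes one accuracy score. A score below
    `target` sends the project back to data understanding; otherwise the
    project moves on to deployment.
    """
    scores = iter(eval_accuracies)
    visited = []
    phase = Phase.BUSINESS_UNDERSTANDING
    while True:
        visited.append(phase)
        if phase is Phase.DEPLOYMENT:
            return visited
        if phase is Phase.EVALUATION and next(scores) < target:
            phase = Phase.DATA_UNDERSTANDING  # model fell short: revisit the data
        else:
            phase = Phase(phase.value + 1)    # otherwise continue to the next phase


# First model scores 0.72 (below target), the rebuilt model scores 0.85.
path = walk_phases([0.72, 0.85])
print(" -> ".join(p.name for p in path))
```

With these two scores the project visits ten phases in total: it reaches evaluation once, loops back through data understanding, data preparation, and modeling, and only then proceeds to deployment.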

The outer circle in the CRISP-DM diagram symbolizes the cyclical nature of data mining itself. Data mining does not end once a solution is deployed. The lessons learned during the process and from the deployed solution can trigger new, often more focused business questions or new objectives. We will now look briefly at each phase in the CRISP-DM process model.

1. Business Problem Understanding

This initial phase focuses on understanding the project goals, objectives, and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan with a specific set of tasks and desired outcomes to achieve project-level objectives.

2. Data Understanding

The data understanding phase starts with initial data collection and proceeds with activities such as feature description, preliminary data analysis, and exploratory data analysis. These activities help you become familiar with the data, identify data quality problems such as missing values and inconsistent data entries, and detect interesting subsets of the data that let you form hypotheses about hidden information.
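As a rough illustration of the data quality checks mentioned above, the sketch below counts missing values per column and flags entries that differ only by case or stray whitespace. The records and the helper names (`profile`, `inconsistent_spellings`) are hypothetical; in a real project you would more likely use a dataframe library for this.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "country": "US"},
    {"age": None, "country": "us"},
    {"age": 51, "country": "DE"},
    {"age": 29, "country": None},
]


def profile(rows):
    """Count missing values and distinct raw values for every column."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": Counter(present),
        }
    return report


def inconsistent_spellings(counter):
    """Group string values that differ only by case or surrounding whitespace."""
    groups = {}
    for value in counter:
        if isinstance(value, str):
            groups.setdefault(value.strip().casefold(), set()).add(value)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


report = profile(records)
print(report["age"]["missing"], report["country"]["missing"])
print(inconsistent_spellings(report["country"]["distinct"]))
```

Here both columns have one missing value, and "US" and "us" are reported as two spellings of the same country, which is exactly the kind of finding that feeds into the data preparation phase.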

3. Data Preparation

The data preparation phase covers all activities needed to construct the final dataset, the one fed into the modeling tools, from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table and record selection, feature selection, and feature engineering, as well as transforming and cleaning the data for the modeling tools and techniques.
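The sketch below shows what a few of these preparation tasks might look like on a toy dataset: filling a missing value, cleaning an inconsistent category, and scaling a numeric feature. The column names, the median imputation, and the min-max scaling are illustrative assumptions, not steps that CRISP-DM mandates.

```python
from statistics import median

# Hypothetical raw records with a missing age and an untidy country code.
raw = [
    {"age": 34, "country": "US", "income": 52000},
    {"age": None, "country": "us ", "income": 61000},
    {"age": 51, "country": "DE", "income": 48000},
]


def prepare(rows):
    """Return cleaned rows ready for a modeling tool (illustrative only)."""
    known_ages = [r["age"] for r in rows if r["age"] is not None]
    fill_age = median(known_ages)              # cleaning: impute missing ages
    incomes = [r["income"] for r in rows]
    lo, hi = min(incomes), max(incomes)
    prepared = []
    for r in rows:
        prepared.append({
            "age": r["age"] if r["age"] is not None else fill_age,
            # cleaning: normalize case and whitespace in the category
            "country": r["country"].strip().upper(),
            # transformation: min-max scale income into the range [0, 1]
            "income_scaled": (r["income"] - lo) / (hi - lo) if hi > lo else 0.0,
        })
    return prepared


prepared = prepare(raw)
for row in prepared:
    print(row)
```

One design note: this sketch computes the median and the min/max over all rows for brevity. In a real project those statistics would normally be computed on the training split only, so that nothing from the held-out data leaks into the model.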

4. Modeling

In this phase, various modeling techniques are selected and applied depending on the problem statement, and their parameters are calibrated to find the best modeling performance. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements for the format of the data they need. Therefore, going back to the data preparation phase is often necessary at this stage.
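Parameter calibration can be sketched with a deliberately tiny model: a classifier that predicts "positive" whenever a single score reaches a threshold, where the threshold is the parameter we calibrate. The model, the candidate grid, and the validation data below are all made up for illustration; real projects would calibrate real algorithms with proper tooling.

```python
def accuracy(threshold, data):
    """Share of (x, label) pairs where `x >= threshold` matches the label."""
    hits = sum((x >= threshold) == label for x, label in data)
    return hits / len(data)


def calibrate(candidates, validation):
    """Simple grid search: return the threshold with the best validation accuracy."""
    return max(candidates, key=lambda t: accuracy(t, validation))


# Hypothetical validation set of (score, true label) pairs.
validation = [(0.2, False), (0.4, False), (0.55, True), (0.7, True), (0.9, True)]

best = calibrate([0.3, 0.5, 0.8], validation)
print(best, accuracy(best, validation))
```

On this data the threshold 0.5 classifies all five examples correctly, while 0.3 and 0.8 get four and three right respectively, so the search settles on 0.5.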

5. Evaluation

The evaluation phase is one of the most crucial phases of any data science project life cycle. By this stage in your project, you will have built one or more machine learning models. The models might perform well on your training data, but you need to test and evaluate them on unseen data to confirm that they achieve the project’s objectives. Appropriate evaluation metrics are chosen and measured at this stage. At the end of this phase, a decision on using the data mining results should be reached.
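The gap between training performance and performance on unseen data is easy to see in a small sketch. Everything here is an assumption for illustration: the threshold model, the two toy datasets, and the 0.80 accuracy target used for the final go/no-go decision.

```python
def predict(x, threshold=0.5):
    """Hypothetical model: predict positive when the score reaches the threshold."""
    return x >= threshold


def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    tn = sum((not t) and (not p) for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy (score, true label) pairs: data the model was tuned on vs. held-out data.
train = [(0.2, False), (0.4, False), (0.55, True), (0.7, True), (0.9, True)]
test = [(0.1, False), (0.45, True), (0.6, True), (0.65, False), (0.8, True)]

train_metrics = evaluate([y for _, y in train], [predict(x) for x, _ in train])
test_metrics = evaluate([y for _, y in test], [predict(x) for x, _ in test])

TARGET_ACCURACY = 0.80  # assumed business requirement
ready_to_deploy = test_metrics["accuracy"] >= TARGET_ACCURACY

print(train_metrics)
print(test_metrics)
print("deploy" if ready_to_deploy else "go back and iterate")
```

The model is perfect on the data it was tuned on but reaches only 60% accuracy on the held-out examples, so under the assumed target the decision at the end of this phase would be to iterate rather than deploy.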

6. Deployment

The creation of the model is generally not the end of the project. Even if the purpose of the model is to analyze the data and increase understanding of it, the knowledge or insight gained from the modeling needs to be presented in a way that the end users of the model can use. End users could be operational-level staff, business executives, or customers. Deployment often involves applying live models within an organization’s decision-making processes, for example, real-time personalization of web pages, a product recommender system, or scoring of marketing leads. Depending on the requirements, the deployment phase can be as simple as generating a dashboard or as complex as implementing a repeatable data mining process across the enterprise.

What is the data mining context?

We learned about the standard data mining process model, which by design can be applied across industries and use cases. Each phase contains a set of generic tasks, plus some specialized tasks that depend on the problem and the project requirements. It is also essential to understand that the subtasks of the standard CRISP-DM process model vary widely with the context of the project. For example, if you are working on a project in the healthcare industry, you need to gain background knowledge about the problems, use cases, and end-user requirements in that specific context.

The data mining context drives the mapping between the generic tasks of each phase and the specialized tasks carried out during the CRISP-DM process. Let’s look at the four dimensions of the data mining context:

  • The application domain is the specific area in which the data mining project takes place.
  • The data mining problem type describes the specific objectives and goals that a data mining project deals with.
  • The technical aspect covers specific issues in data mining that describe different challenges that usually occur during the data mining lifecycle.
  • The tools and technique dimension specifies which data mining tools and/or techniques are applied during the data mining project.
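One way to picture how these four dimensions drive the mapping from generic to specialized tasks is the small sketch below. The `DataMiningContext` class, the rule table, and the example task names are hypothetical and exist only to illustrate the idea; CRISP-DM does not define such a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataMiningContext:
    application_domain: str     # e.g. "healthcare"
    problem_type: str           # e.g. "classification"
    technical_aspect: str       # e.g. "missing values"
    tools_and_techniques: str   # e.g. "decision tree"


# Hypothetical rules: (generic task, context dimension, value, specialized task).
RULES = [
    ("clean data", "technical_aspect", "missing values", "impute missing lab results"),
    ("select modeling technique", "problem_type", "classification", "shortlist classifiers"),
    ("select modeling technique", "tools_and_techniques", "decision tree", "tune tree depth"),
]


def specialize(generic_task, ctx):
    """Return the specialized tasks that apply to a generic task in this context."""
    return [
        specialized
        for task, dimension, value, specialized in RULES
        if task == generic_task and getattr(ctx, dimension) == value
    ]


ctx = DataMiningContext("healthcare", "classification", "missing values", "decision tree")
print(specialize("clean data", ctx))
print(specialize("select modeling technique", ctx))
```

The same generic task, such as "select modeling technique", would map to different specialized tasks if the problem type or the tools changed, which is the point of describing the context explicitly.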

Conclusion

In conclusion, the CRISP-DM framework is an invaluable tool for anyone looking to undertake a data mining project. Its structured approach to planning, executing, and evaluating such projects provides a clear roadmap for success. By following the CRISP-DM process, data miners can ensure that their projects are well-defined, well-executed, and well-documented.

As we have seen, the CRISP-DM methodology consists of six phases, each of which plays a vital role in the overall process: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Together, these phases provide a comprehensive and iterative process for data mining that can be adapted to various contexts.

It’s worth noting that the official CRISP-DM process model has seen little formal revision since it was published. Because data mining techniques and technologies continue to evolve, practitioners have instead adapted and extended the framework to keep pace with these changes. Nevertheless, the core principles of the CRISP-DM methodology remain as relevant today as they were when the framework was first developed.

In summary, if you’re looking to undertake a data mining project, the CRISP-DM framework is an excellent starting point. By following its structured approach, you can increase your chances of success and avoid many of the pitfalls that can arise in data mining projects. So why not give it a try and see how it can help you to achieve your data mining goals?
