CRISP-DM in AI/ML

Michael Hannecke
Bluetuple.ai
Jul 19, 2024 · 7 min read
CRISP-DM — how Dall-E looks at it

Introduction

In a typical machine learning or GenAI project, countless parameters and steps need to be taken into account. This makes it all the more important to have a structured process that guides us through the entire model lifecycle and ensures that the right information and options for action are considered at every step.
This is where the CRISP-DM process comes in handy: a methodology originally developed in the data mining world that is also ideal for structuring ML/AI projects.

The Cross-Industry Standard Process for Data Mining (CRISP-DM) isn’t just another acronym in the AI world. It’s a proven methodology that can dramatically improve your ML project outcomes. Here’s why you should care:

  1. Boost Success Rates: CRISP-DM can increase your project success rate by providing a clear, structured approach from start to finish.
  2. Save Time and Resources: By following a standardized process, you’ll avoid common pitfalls that often lead to wasted effort and budget overruns.
  3. Bridge the Business-Tech Gap: CRISP-DM helps translate vague business objectives into concrete data science tasks, ensuring your ML projects deliver real value.
  4. Enhance Scalability: Whether you’re working on a small prototype or a company-wide initiative, CRISP-DM scales to fit your needs.
  5. Improve Collaboration: With a shared framework, teams can communicate more effectively, reducing misunderstandings and silos.

In this article, we’ll break down the six phases of CRISP-DM, showing you exactly how to implement this powerful methodology in your own projects. By the time you finish reading, you’ll have a practical toolkit to tackle any data challenge with confidence and precision.

Eager to transform your ML projects from hit-or-miss experiments into strategic, results-driven initiatives? Let’s dive into the world of CRISP-DM and unlock your data’s true potential.

The CRISP-DM Process

CRISP-DM (created with napkin)

The CRISP-DM (Cross-Industry Standard Process for Data Mining) process model is a robust and established methodology for planning, organizing, and implementing machine learning (ML) projects. I’ll outline the six phases of the CRISP-DM process model, providing a structured approach to solving business problems using data-driven techniques.

6 Phases of the CRISP-DM Process Model

Let’s dig a bit into each of the six phases:

Phase 1: Business Understanding

The first phase focuses on understanding the problem and defining the project goals.

Phase 1: Business-Understanding (created with napkin)
  • Identifying the business objectives: This step is all about figuring out what the company wants to achieve. For example, a streaming service might want to reduce customer churn by 20% in the next quarter.
  • Assessing the situation: We have to take stock of what resources we have, what constraints we’re dealing with, and any assumptions we’re making. In our streaming service example, we might assess our current customer data, available computing power, and budget constraints.
  • Determining the data mining goals: This is where we translate business goals into specific data mining objectives. For our streaming service, the goal might be to build a model that can predict which customers are likely to cancel their subscription in the next month.
  • Producing a project plan: Finally — as the last sub step in the initial business understanding phase, we create a roadmap for the project, including timelines, resource allocation, and key milestones. This might include setting deadlines for data collection, model development, and implementation.
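The outputs of these four sub-steps can be captured as a lightweight, reviewable artifact that the team signs off on before any data work starts. The sketch below is only illustrative: every field name, threshold, and number is an assumption based on the streaming-service example, not part of CRISP-DM itself.

```python
# A minimal sketch of recording Phase 1 outputs as a structured artifact.
# All values are hypothetical, taken from the streaming-service example.
business_understanding = {
    "business_objective": "Reduce customer churn by 20% next quarter",
    "data_mining_goal": "Predict which customers will cancel within 30 days",
    "success_criteria": {"min_roc_auc": 0.75},       # illustrative threshold
    "constraints": {"budget_eur": 50_000, "deadline_weeks": 12},
    "milestones": ["data collection", "model development", "implementation"],
}

def is_plan_complete(plan: dict) -> bool:
    """Check that the plan covers every Phase 1 sub-step."""
    required = {"business_objective", "data_mining_goal",
                "success_criteria", "constraints", "milestones"}
    return required <= plan.keys()

print(is_plan_complete(business_understanding))  # True
```

Keeping the plan in a machine-readable form makes it easy to check later (in the evaluation phase) whether the model actually met the success criteria agreed here.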

Phase 2: Data Understanding

This phase starts with initial data collection and proceeds with exploring and evaluating the data needed to solve the problem.

Phase 2: Data Understanding (created with napkin)
  • Collecting initial data: This involves gathering all relevant data from various sources. Our streaming service might collect user viewing history, account information, and customer support interactions.
  • Describing the data: Here, we examine the data’s properties, such as the number of records, field types, and value distributions. We might discover that we have 1 million user records with 50 different attributes per user.
  • Exploring the data: This step involves diving deeper into the data to uncover insights. We might find correlations between viewing habits and subscription length, or identify patterns in when users tend to cancel.
  • Verifying the data quality: We check for issues like missing values, outliers, or inconsistencies. For instance, we might discover that 5% of our user records are missing age information, which we’ll need to address.
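Describing, exploring, and quality-checking the data maps directly onto a few lines of pandas. The tiny synthetic frame below stands in for the streaming-service data; the column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative synthetic slice of the streaming-service user data.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [34, np.nan, 27, 45, np.nan],
    "watch_hours_month": [12.5, 40.0, 3.2, 22.1, 55.7],
    "churned": [0, 1, 0, 0, 1],
})

# Describing the data: record counts, field types, value distributions.
print(users.shape)
print(users.dtypes)
print(users["watch_hours_month"].describe())

# Verifying data quality: share of missing values per column.
missing_rate = users.isna().mean()
print(missing_rate)  # in this toy sample, 40% of ages are missing
```

In a real project the same checks run over millions of records; the point is that every quality issue surfaced here (like the missing ages) becomes an explicit task for the data preparation phase.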

Phase 3: Data Preparation

In the data preparation phase, the data is prepared for use in modeling.

Phase 3: Data Preparation (created with napkin)
  • Selecting the data: We choose which data to use based on its relevance to our goals and its quality. We might decide to focus on the last year of viewing history rather than the entire dataset.
  • Cleaning the data: This step addresses quality issues identified earlier. We might impute missing age values or remove outliers that could skew our results.
  • Constructing the data: Here, we derive new attributes that might be useful. For our streaming service, we could create a “binge-watching score” based on how often a user watches multiple episodes in one sitting.
  • Integrating the data: If we’re using multiple data sources, we combine them into a cohesive dataset. We might merge viewing history data with customer support logs.
  • Formatting the data: Finally, we organise the data in a way that’s suitable for our chosen modelling techniques. This could involve normalising numerical values or encoding categorical variables.
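The cleaning, construction, and formatting sub-steps above can be sketched in a few lines of pandas. The "binge-watching score" definition and the column names are assumptions made up for the streaming-service example.

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    "age": [34, np.nan, 27, 45, np.nan],
    "plan": ["basic", "premium", "basic", "family", "premium"],
    "episodes_per_session": [1.2, 4.8, 0.9, 2.5, 6.1],
})

# Cleaning: impute missing ages with the median.
users["age"] = users["age"].fillna(users["age"].median())

# Constructing: a hypothetical "binge-watching score" — 1 if a user
# averages three or more episodes per sitting (threshold is illustrative).
users["binge_score"] = (users["episodes_per_session"] >= 3).astype(int)

# Formatting: one-hot encode the categorical subscription plan.
users = pd.get_dummies(users, columns=["plan"])
print(users.columns.tolist())
```

After this step the dataset is fully numeric and ready for the modeling phase.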

Phase 4: Modeling

In the modeling phase, several ML techniques for solving the problem are selected, and candidate models are built and compared.

Phase 4: Modelling Phase (created with napkin)
  • Selecting modeling techniques: We choose appropriate algorithms based on our problem and data. For predicting churn, we might consider logistic regression, random forests, and gradient boosting machines.
  • Designing test scenarios: We create a plan for how we’ll train and evaluate our models. This could involve splitting our data into training and testing sets, and deciding on evaluation metrics like accuracy and ROC AUC.
  • Building models: We implement and train our chosen models. This might involve experimenting with different hyperparameters to optimize performance.
  • Assessing models: We evaluate how well each model performs on our test data. We might find that the gradient boosting machine outperforms the other models in predicting churn.

Phase 5: Evaluation

The evaluation phase assesses how well the models perform against the business objectives and selects the best one.

Phase 5: Evaluation (created with napkin)
  • Evaluating results: We assess whether our model meets the business objectives we set out in the first phase. Does our churn prediction model actually help us identify at-risk customers accurately enough to be useful?
  • Reviewing the process: We look back at each step of the process to see if we missed anything or if there are areas for improvement. Maybe we realize we should have included data on promotional offers in our model.
  • Determining the next steps: Based on our evaluation, we decide whether to move forward with deployment, iterate on our current approach, or start over with a different strategy.
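The "determine next steps" decision can be made explicit by comparing the model's metric against the success criterion from Phase 1. The thresholds below are hypothetical; the point is that the go/no-go rule is agreed in advance rather than decided ad hoc.

```python
def next_step(roc_auc: float, min_roc_auc: float = 0.75) -> str:
    """Map an evaluation result to a decision (thresholds are illustrative)."""
    if roc_auc >= min_roc_auc:
        return "deploy"
    if roc_auc >= min_roc_auc - 0.05:   # close enough to be worth iterating
        return "iterate"
    return "revisit business understanding"

print(next_step(0.82))  # deploy
print(next_step(0.71))  # iterate
print(next_step(0.60))  # revisit business understanding
```

Note how the third outcome loops back to Phase 1 — CRISP-DM is explicitly iterative, not a one-way pipeline.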

Phase 6: Deployment

The deployment phase implements the selected model in production.

Phase 6: Deployment
  • Planning deployment: We create a strategy for integrating our model into the business’s operations. This might involve working with the IT team to set up an automated system that flags at-risk customers for the retention team.
  • Monitoring and maintenance: We set up systems to track the model’s performance over time and update it as needed. We might schedule monthly reviews to check if the model’s accuracy is holding up as user behavior evolves.
  • Producing a final report: We document the entire project, including our methodology, findings, and recommendations. This report serves as a reference for stakeholders and future projects.
  • Reviewing the project: Finally, we assess the project as a whole, identifying what went well and what could be improved for future ML initiatives.
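The monitoring sub-step can be as simple as tracking the live metric against the baseline measured at evaluation time and alerting when it degrades. All numbers below are invented for illustration.

```python
# Hedged sketch: monthly performance monitoring with a degradation alert.
BASELINE_AUC = 0.82   # illustrative value from the evaluation phase
ALERT_DROP = 0.05     # flag for retraining if AUC falls this far below baseline

monthly_auc = {"2024-05": 0.81, "2024-06": 0.79, "2024-07": 0.74}

def needs_retraining(history: dict, baseline: float, drop: float) -> list:
    """Return the months whose AUC degraded past the alert threshold."""
    return [month for month, auc in history.items() if auc < baseline - drop]

print(needs_retraining(monthly_auc, BASELINE_AUC, ALERT_DROP))  # ['2024-07']
```

A real deployment would wire this into scheduled jobs and dashboards, but even this simple rule turns "monitor the model" from an intention into a concrete, testable check.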

Conclusion

As we’ve seen, the CRISP-DM Process Model provides a comprehensive framework for tackling AI/ML projects. Its six phases — from Business Understanding to Deployment — offer a structured approach that can significantly improve the chances of project success.

By following this methodology, you can significantly improve the chances of success in your ML initiatives, bridge the gap between business goals and technical implementation, and create a repeatable process for future projects.

Remember, while CRISP-DM provides a valuable framework, its true power lies in how you adapt and apply it to your unique challenges. As you embark on your next ML project, let CRISP-DM be your guide to turning data into actionable insights and driving real business value.

If you have read this far, thank you! You are a hero (and a Nerd ❤)! I try to keep my readers up to date with interesting happenings in the AI world, so please 🔔 clap and follow.
