Project Management in Data Science using CRISP-DM

Mukesh Kumar
Published in Accredian
8 min read · Apr 21, 2022

Preface

When you and your team prepare to execute a project, especially a data science project, a well-defined project lifecycle plays a crucial role. Fuzzy, half-thought-out planning can lead to disaster. That’s why companies look for a workflow that lets them execute their projects efficiently and effectively. In addition, a clear picture of the overall flow helps team members divide the work among themselves with ease.

Although there are many frameworks for executing a data science project, CRISP-DM is one of the most popular in the industry.

Phases of CRISP-DM Lifecycle

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It has six phases that together describe how a data science project is executed.

At a high level, these phases are:

  1. Business Understanding: What does the business require?
  2. Data Understanding: What data do we have or need? Is it clean?
  3. Data Preparation: How do we organize the data for modeling?
  4. Modeling: What modeling strategies should we use?
  5. Evaluation: Which model(s) best suit the business objectives?
  6. Deployment: How do stakeholders access the results?

Next, I will walk you through these phases one by one.

Phase—I: Business Understanding

In this first phase of the lifecycle, we mainly focus on understanding the problem statement, objectives, and project requirements. We document these requirements and develop flowcharts or project diagrams to form an initial picture of the project. This phase is crucial for a data science project because failing to comprehend the goals and objectives can lead to a disaster that costs investors and professionals heavily.

We investigate the business objectives and study the customer needs thoroughly. Then, based on that research and analysis, we define the business success criteria. Next, we assess the available resources, the requirements, and the risks and costs associated with the project. In addition, we should form a mental picture, from a technical perspective, of how the project will be executed from start to finish. Finally, we produce a detailed project plan by selecting appropriate technologies (currently available or provided) for each individual phase.

Phase — II: Data Understanding

In the second phase of the lifecycle, we focus on identifying, collecting, and examining the datasets that may help achieve the project goals.

We load the data into our chosen tool and examine its properties, such as format, size, and reliability. We also visualize the characteristics of the features in the data. Finally, if we encounter any data integrity issues, we document them.
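As a minimal sketch of this step, the snippet below profiles a tiny in-memory dataset using only Python’s standard library. The column names (`income`, `loan_amount`, `delinquent`) and all values are hypothetical, loosely modeled on the loan delinquency example used later in this article. In a real project you would more likely load the data with a dedicated data analysis library.

```python
import csv
import io

# Hypothetical sample data; an empty field stands for a missing value.
RAW_CSV = """customer_id,income,loan_amount,delinquent
1,52000,12000,0
2,,8000,1
3,61000,15000,0
4,48000,,0
5,75000,20000,1
"""

def profile(csv_text):
    """Return the row count, column names, and missing-value count per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value == "":
                missing[column] += 1
    return {"rows": len(rows), "columns": list(rows[0]), "missing": missing}

report = profile(RAW_CSV)
print(report["rows"], "rows; columns:", report["columns"])
for column, count in report["missing"].items():
    if count:
        # At this stage we only document integrity issues; fixing them
        # belongs to the data preparation phase.
        print(f"integrity issue: {count} missing value(s) in '{column}'")
```

The point of the sketch is the habit rather than the tooling: record what is wrong with the data before changing anything.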

Phase — III: Data Preparation

In the third phase of the project lifecycle, we prepare the datasets for modeling, a step also known as data munging. As a rule of thumb, data scientists and analysts spend most of their time (a figure of roughly 80% is often cited) preparing datasets for analysis and modeling.

We determine which datasets to take into account and which ones to discard. Based on this decision, we proceed with data cleaning. Professionals often view it as the lengthiest part of the task, and it matters because models follow the garbage-in, garbage-out principle: poor input data yields poor results. We generally analyze the behavior of the data and perform corrections and imputations if required. We may also derive new features from the existing ones when they are helpful for modeling. If the chosen datasets are insufficient, we may integrate data from different sources. Afterward, we may need to pre-process the data again if it is not yet ready for the next phase.
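The imputation and feature-creation steps described above could look roughly like the following sketch. The records, the choice of median imputation, and the derived `loan_to_income` feature are all illustrative assumptions, not a prescribed recipe.

```python
from statistics import median

# Hypothetical records; None marks a missing value found during data understanding.
records = [
    {"income": 52000, "loan_amount": 12000},
    {"income": None, "loan_amount": 8000},
    {"income": 61000, "loan_amount": 15000},
    {"income": 48000, "loan_amount": None},
    {"income": 75000, "loan_amount": 20000},
]

def impute_median(rows, column):
    """Return new rows with missing values in `column` replaced by the observed median."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def add_loan_to_income(rows):
    """Derive a new feature from two existing ones."""
    return [{**r, "loan_to_income": r["loan_amount"] / r["income"]} for r in rows]

# Each step returns new rows, so the raw records stay untouched for auditing.
prepared = impute_median(records, "income")
prepared = impute_median(prepared, "loan_amount")
prepared = add_loan_to_income(prepared)

for row in prepared:
    print(row)
```

Keeping each preparation step as a small function makes it easier to rerun the whole pipeline when a later phase sends you back to this one.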

Phase — IV: Modeling

In the fourth phase of the lifecycle, we develop and assess several models based on the type of problem. It is often regarded as the most exciting and quickest phase of the project because fitting a model can take just two or three lines of code. We perform several iterations to determine the best model(s) for our requirements.

We begin by choosing the modeling methods best suited to the problem statement. Next, we split the data into training, validation, and test sets and develop the models, with hyperparameter optimization if required. Finally, once the models are ready, we select the best ones based on the predefined success criteria and the test design.
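To make the split and the hyperparameter search concrete, here is a small dependency-free sketch. The synthetic data, the 60/20/20 split, the toy k-nearest-neighbors classifier, and the candidate values of `k` are all assumptions chosen for illustration. A real project would normally rely on a machine learning library instead.

```python
import random

random.seed(42)  # fixed seed so the split is reproducible

# Synthetic, hypothetical data: one feature (loan-to-income ratio) and a binary label.
data = []
for _ in range(300):
    ratio = random.uniform(0.0, 1.0)
    # Higher ratios are more likely to be delinquent; the noise keeps the task non-trivial.
    label = 1 if ratio + random.gauss(0, 0.15) > 0.6 else 0
    data.append((ratio, label))

random.shuffle(data)
train, validation, test = data[:180], data[180:240], data[240:]

def knn_predict(train_rows, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train_rows, rows, k):
    """Fraction of rows whose label the k-NN model predicts correctly."""
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in rows)
    return hits / len(rows)

# Hyperparameter search: pick k on the validation set, never on the test set.
candidates = [1, 3, 5, 9, 15]
best_k = max(candidates, key=lambda k: accuracy(train, validation, k))

# The test set is used once, for the final estimate.
print("best k:", best_k)
print("test accuracy:", round(accuracy(train, test, best_k), 3))
```

The design choice worth noting is the separation of roles: the training set fits the model, the validation set chooses between candidates, and the test set is touched only at the end.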

Phase — V: Evaluation

The evaluation phase is broader than the technical assessment performed during modeling: here we judge the developed models against the needs of the business. We look for the model that best fits those needs while keeping an eye on current and future business objectives.

We assess whether the model satisfies the business success criteria and decide whether it should be approved for business use. We also review the work to check whether we overlooked anything; if so, we summarize the findings and correct them. Based on the results of the previous phases, we then decide whether to proceed to deployment, run another iteration, or initiate new projects.
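One way to picture the check against business success criteria is the sketch below. The thresholds in `CRITERIA`, the labels, and the predictions are hypothetical. In practice the criteria come out of the business understanding phase and usually involve more than two metrics.

```python
# Hypothetical business success criteria agreed on during business understanding.
CRITERIA = {"recall": 0.80, "precision": 0.60}

def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

def decide(metrics, criteria):
    """Return ('deploy', []) if every criterion is met, else ('iterate', unmet names)."""
    unmet = [name for name, target in criteria.items() if metrics[name] < target]
    return ("iterate", unmet) if unmet else ("deploy", unmet)

# Hypothetical held-out labels and model predictions.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

metrics = precision_recall(y_true, y_pred)
decision, unmet = decide(metrics, CRITERIA)
print(metrics, decision, unmet)
```

Writing the criteria down as data, rather than leaving them implicit, makes the choice between deploying and iterating easier to explain to stakeholders.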

Phase—VI: Deployment

In the final phase of the lifecycle, we deploy the model that best suits the business needs to production. A model is useless if its results are inaccessible to the customers. The complexity of this phase varies because it involves planning, monitoring, maintenance, and reporting.

We devise and document a detailed deployment strategy, along with monitoring and maintenance plans, to avoid operational issues. The team members create a detailed project summary as a final presentation for the stakeholders and, if applicable, describe actionable insights for the future.
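A monitoring plan can start from something as simple as the check sketched below, which compares the share of positive predictions in production with the share seen at validation time. The function names, the sample predictions, and the 0.10 tolerance are illustrative assumptions. A real plan would track several signals and set thresholds with the business.

```python
def positive_rate(predictions):
    """Share of predictions that are positive (label 1)."""
    return sum(predictions) / len(predictions)

def needs_review(baseline_preds, live_preds, tolerance=0.10):
    """Flag the model when the live positive rate drifts beyond the tolerance.

    The 0.10 tolerance is an illustrative assumption, not a recommended value.
    """
    drift = abs(positive_rate(live_preds) - positive_rate(baseline_preds))
    return drift > tolerance, drift

baseline = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # 20% positive at validation time
this_week = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # 50% positive in production

flag, drift = needs_review(baseline, this_week)
print(f"drift={drift:.2f}, review needed: {flag}")
```

A flag like this does not say what went wrong. It only tells the team that the maintenance plan should kick in.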

Comparison of CRISP-DM with Agile & Waterfall

Waterfall Methodology vs CRISP-DM

From one point of view, many people consider CRISP-DM a strict waterfall process because of its reporting provisions, some of which are excessive for the majority of projects. Moreover, the business understanding phase calls for a project plan with a detailed strategy for each phase, which is characteristic of traditional waterfall approaches. If CRISP-DM is followed rigidly without frequent updates, the process will closely resemble waterfall. For example, consider a project to forecast loan delinquency with two aspects, delinquent and non-delinquent. We can use CRISP-DM as a waterfall process by slicing the work horizontally.

The team works horizontally, completing each phase for every deliverable aspect before moving on to the next phase. Team members seldom loop back to an earlier layer, and only if required. At the end of the project, the team delivers everything to the stakeholders in one big-bang release.

Agile Methodology vs CRISP-DM

From another angle, the sequence of phases is not strict (we can move back and forth), which implicitly endorses agile principles and practices. Moreover, the outcome of one phase determines the actions for the next, and if you adopt rapid iterations, the approach will closely resemble an agile process. Consider the same loan delinquency example as above. We can use CRISP-DM with agile practices by slicing the work vertically.

The team members can work on delivering one aspect at a time by slicing the work vertically. This way, the team can ship numerous smaller vertical releases and request feedback often as the project advances toward completion.

In short, it is good to use agile practices whenever possible because stakeholders get results sooner and can provide feedback that you use to adapt your project plans. It also lets you estimate the performance of your models earlier and use that to make better decisions.
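The difference between the two slicing strategies in this section can be sketched as two orderings of the same work items. The function names are hypothetical, and treating each (phase, aspect) pair as one unit of work is a simplifying assumption.

```python
PHASES = [
    "Business Understanding", "Data Understanding", "Data Preparation",
    "Modeling", "Evaluation", "Deployment",
]
ASPECTS = ["delinquent", "non-delinquent"]  # the two aspects from the loan example

def horizontal_slices(phases, aspects):
    """Waterfall-style: finish a phase for every aspect before starting the next phase."""
    return [(phase, aspect) for phase in phases for aspect in aspects]

def vertical_slices(phases, aspects):
    """Agile-style: take one aspect through all phases, release it, then start the next."""
    return [(phase, aspect) for aspect in aspects for phase in phases]

def first_release(work_order):
    """1-based step at which any aspect first reaches the Deployment phase."""
    for step, (phase, _aspect) in enumerate(work_order, start=1):
        if phase == "Deployment":
            return step
    return None

horizontal = horizontal_slices(PHASES, ASPECTS)
vertical = vertical_slices(PHASES, ASPECTS)
print("horizontal: first release at step", first_release(horizontal))
print("vertical:   first release at step", first_release(vertical))
```

Both orderings contain the same twelve work items. With these inputs the vertical order reaches its first deployment at step 6, while the horizontal order gets there only at step 11, which is why vertical slicing gives stakeholders something to react to much earlier.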

Pros/Cons of using CRISP-DM Lifecycle

Pros

  • The workflow is generalizable, and it provides robust and clear guidance for the project activities.
  • It encourages a common-sense way of working when iterating over the project phases.
  • The workflow can be adopted without much training, changes in organizational roles, or controversy.
  • A solid start with a proper business understanding helps steer the project execution.
  • In the final phase, we can address the essential issues that arose, share actionable insights, and hand the project over to maintenance and operations.
  • The workflow implementation is flexible and offers many of the benefits of agile principles.
  • We can start a project with little knowledge and gain a deeper (empirical) understanding by iterating over the earlier phases.

Cons

  • Some people believe that CRISP-DM suffers from the same flaws as the waterfall process and hinders quick iteration.
  • Each phase of the project is documentation-heavy, which slows down the delivery of objectives and outcomes.
  • It does not address problems that modern projects face, and it implicitly assumes a small team, so it offers little guidance for coordinating larger ones.
  • It is hard to handle big data projects using the CRISP-DM methodology because of big data’s four Vs (Volume, Variety, Velocity, and Veracity).

Alternative to CRISP-DM Lifecycle

You may find several other frameworks to use as alternatives to CRISP-DM. They also help extract knowledge from raw data and allow you to iterate over the entire process to refine the results if required. These frameworks are:

All these alternatives are broadly similar and share the same objective: solving business-oriented problems and gaining knowledge.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and are looking for a way to fill these gaps, check out the certification programs offered by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science because it covers the foundations plus machine learning algorithms (basic to advanced).

And that’s it. I hope you liked this traditional data science framework and learned something valuable.

Follow me for more forthcoming articles related to Python, R, Data Science, Machine Learning, and Artificial Intelligence.

If you found this read helpful, hit the Clap 👏. Your encouragement will inspire me to keep going and develop more valuable content.

Mukesh Kumar
Accredian

Data Scientist, having a robust math background, skilled in predictive modeling, data processing, and mining strategies to solve challenging business problems.