CRISP-DS: Cyclic Methodology for Data Science Projects

Learn an excellent tool for planning and executing your next Data Science projects.

Duan Cleypaul
9 min read · Oct 31, 2021

Data Science has undoubtedly become a popular term in recent years. The market is hot: job offers fill the inboxes of those already employed and signal hope for those seeking a career in the field.

Data is great, exciting, and new. But let’s face it: when you have 300+ tables with over 50,000,000 rows each, you start to feel overwhelmed by so many possibilities.

Source: https://studypoints.blogspot.com/2016/04/define-concept-guidance-in-detail.html

So how do I start? Which table? Which columns? What am I looking for? What am I trying to solve? Even the questions are many. So let’s calm down and see a way to make sense of this mess.

Data Science as a Science

Let’s think of Data Science as, in fact, a science. If you look at Wikipedia, science has one major characteristic: methodology. You must have a set of systematic steps to test, prove, and register your findings.

If you’re confused and don’t know where to begin, don’t worry. There are a lot of methodologies and frameworks that help you develop a Data Science project. In this post, I’d like to share with you my favorite one, which guides nearly 100% of my DS projects: CRISP-DM.

CRISP-DM/DS

CRISP-DM (Cross-Industry Standard Process for Data Mining), or CRISP-DS (for Data Science), as I like to call it, is a cyclic methodology that helps you organize your thoughts and code in a logical, straightforward way.

Source: CRISP-DM 1.0

Disclaimer: the methodology was conceived in the 90s with the term “Data Mining”, but I took the liberty of using “Data Science” instead, since it better suits our purposes.

The original CRISP-DM has 6 phases, as shown above. But as the demand for data rises, you can make some changes to better fit your reality. For example, check the adaptations below:

Some of the existing phases were divided, some combined, and others kept the same. Since we’re going to cycle through all these phases repeatedly, I’d like to explain each one and why the adaptations helped me better organize my code and projects.

CRISP-DS (adapted)

After the abovementioned changes, behold the magic of the adapted CRISP-DS:

The main idea is to cycle over the phases so that each pass delivers some value to the business. Once you deliver the first batch of value, you can iterate again, find new insights, adjust parameters, improve accuracy, and provide more value. Then you repeat again, and again, and again, as long as it makes sense.

This method is so good that I’m using it to organize things unrelated to Data Science or projects. But more on that later. Stick with me until the end, and I’ll explain it.

Now let’s see how we can use this method in our Data Science projects.

1. Business Problem

Source: irishtimes.com

This is your first contact with the business area. They might give you a lot of information, but don’t panic yet!

If you’ve worked on at least one Data Science project, you might have noticed one thing: people don’t usually tell you what the issue is; they tell you what they think the solution to their problem would be. That is why we divided the original Business Understanding phase of CRISP-DM into two parts.

That might be a deal-breaker for some people, but it’s very common not to know at first what the problem is. Don’t take this as a bad thing, though. Think of it as an opportunity.

When someone tells you what they want to have instead of what they want to solve, they are really telling you their expectations.

So what do you do in this case? You listen! Listen to their expectations so you can put everyone on the same page when defining the scope of the problem.

That’s your job for now. Identify the expectations and differentiate them from the real problem.

2. Business Understanding

Source: http://www.researchpointindia.com/

This phase is the journey from “Businessland” to “Dataville”, where we translate the case from a business problem to a data problem.

The business problem must be identified and defined in this phase. With that in mind, use and abuse research. Google is your best friend here.

Remember that you might be working on a project in a field of expertise different from yours. In that case, Do Some Research: look for statistics, posts, videos, and any material that might give you a bit of context on the subject you are studying. Here are a few questions that might help:

  • What is the business model of the company?
  • How do they earn/lose money?
  • What are the main aspects that push the business forward?
  • Does any other company have a similar problem?
  • How did they solve it?
  • How does the problem behave in the company’s region? What about nationally? And globally?

After you have some background knowledge, it’s time to Raise Hypotheses that can be tested throughout the EDA phase and/or validated with model predictions. Create a mindmap, design structures, be creative, and do anything that might help you create hypotheses. They will guide your analysis once you have a list of things to prove.

Wanna know more about Raising Hypotheses? Follow my page for upcoming posts!

3. Data Extraction

Ok! Now you are ready to see some data!

Source: tenor.com

In this phase, you’ll identify the available data, check how to access it, and retrieve it.

But wait! What’s the hurry? It’s essential to check if you have enough processing power before frustrating yourself. That’s why we added a specific phase into CRISP-DS just for collecting the data.

Try to ask yourself: Can I process this data on my machine? Will I need cloud processing? Do I have the resources to process the gazillion rows of data? Imagine how much time you could save if you identified here that the project could not go on because of limitations in processing. While the company deals with that limitation, you could jump to another project or dive deeper into the Business Understanding phase and be productive.
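Those questions can be answered with a quick back-of-the-envelope calculation before you pull anything. Here is a minimal sketch in plain Python. The function names, the 8 bytes per value, the 20 columns, the 16 GB of RAM, and the 3x headroom factor are all assumptions for illustration, not measurements.

```python
def estimate_table_gb(n_rows, n_columns, bytes_per_value=8):
    """Rough in-memory size of one table in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_columns * bytes_per_value / 1e9


def fits_in_memory(required_gb, available_gb, headroom=3.0):
    """True if the data plus working copies should fit in RAM.

    `headroom` is an assumed multiplier for the temporary copies
    created by joins, casts, and model training.
    """
    return required_gb * headroom <= available_gb


# One table like those from the introduction: 50,000,000 rows.
# The 20 columns and 16 GB of RAM are made-up numbers.
table_gb = estimate_table_gb(n_rows=50_000_000, n_columns=20)
print(f"Estimated size: {table_gb:.1f} GB")
print("Fits on my machine:", fits_in_memory(table_gb, available_gb=16))
```

If the answer is no, that is your cue to sample the data, ask for cloud resources, or raise the limitation with the business area before going further.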

After you have successfully extracted the data and identified that you could process it, you can consider collecting additional data (e.g., web scraping, public information, etc.). There’s no such thing as too much data.

4. Data Cleaning

Source: finereport.com

As obvious as it might be for Senior Data Scientists, it’s important to say that the data will most likely need some cleaning.

Go ahead and establish standards and/or format the data in a way that is simple to read/understand and makes sense for the problem. Here are some usual tasks:

  • Casting
  • Lower / Upper case
  • Treat missing values
  • Derive new features from the existing ones (feature engineering)
  • Define the final dataset for the EDA phase
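To make the tasks above concrete, here is a small dependency-free sketch. In a real project you would probably reach for a dataframe library. The rows, the column names, the median fill for missing ages, and the `is_high_spender` threshold are all hypothetical choices made for the example.

```python
from statistics import median

# Hypothetical raw extract: everything arrives as text.
raw_rows = [
    {"customer": " Alice ", "city": "LISBON", "age": "34", "spend": "120.50"},
    {"customer": "bob", "city": "lisbon", "age": "", "spend": "80"},
    {"customer": "CAROL", "city": "Porto", "age": "41", "spend": "200.00"},
]


def clean_rows(rows, high_spend_threshold=100.0):
    cleaned = []
    for row in rows:
        cleaned.append({
            # Lower / upper case: one standard spelling for text fields.
            "customer": row["customer"].strip().lower(),
            "city": row["city"].strip().lower(),
            # Casting: text to numbers, None marks a missing age.
            "age": int(row["age"]) if row["age"].strip() else None,
            "spend": float(row["spend"]),
        })

    # Treat missing values: fill missing ages with the median (an assumption).
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    fill_age = median(known_ages)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill_age
        # Feature engineering: derive a new column from an existing one.
        r["is_high_spender"] = r["spend"] >= high_spend_threshold
    return cleaned


# The final dataset that goes into the EDA phase.
eda_dataset = clean_rows(raw_rows)
for row in eda_dataset:
    print(row)
```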

It’s fantastic that we have this phase in the CRISP-DS cycle because every time I have an idea of a new feature to derive from existing ones, or a treatment I forgot to do, I can always concentrate these tasks in one specific section. That way, my code becomes more organized and intuitive.

5. EDA

Source: istockphoto.com

EDA stands for Exploratory Data Analysis. This is, in my opinion, the most important (and probably the longest) phase of the whole CRISP cycle. I’ll post some more details on this topic on my page soon.

Here you will understand what each piece of information describes and how they relate to each other. It’s your chance to dive deeper into business knowledge and validate hypotheses.

Here are some aspects that you can explore in your data:

  • Perform some analysis: Descriptive, Univariate, Bivariate, and Multivariate;
  • Validate hypotheses from the Business Understanding phase;
  • Make a list of the main insights you’ve got from the data. You probably found something in the data that the business area doesn’t know.
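As a tiny illustration of the first two items, the sketch below runs a univariate summary and a bivariate check against one hypothesis. The data, the hypothesis ("customers who visit more often spend more"), and the 0.5 cut-off are invented for the example, and a correlation only suggests a relationship, it does not prove one.

```python
from math import sqrt
from statistics import mean, median, stdev


def describe(values):
    """Univariate / descriptive summary of one numeric column."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Bivariate analysis: Pearson correlation between two columns."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical columns from the cleaned dataset.
visits = [1, 2, 3, 4, 5]
spend = [2, 4, 5, 4, 5]

summary = describe(spend)
r = pearson(visits, spend)

# Hypothesis from Business Understanding; 0.5 is an arbitrary cut-off.
hypothesis_supported = r >= 0.5
print(summary)
print(f"correlation = {r:.2f}, hypothesis supported: {hypothesis_supported}")
```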

6. Modeling

The Modeling phase is where you prep the data for the application of Machine Learning Algorithms.

Source: quora.com

Some people like numbers, and some others don’t. Machine Learning Algorithms love numbers. Not only that, but they hate strings! Let’s make their lives easier.

Apply scalers, encoders, and any other techniques needed, so the data fits nicely into the ML algorithms. Also, check the features you found irrelevant during the EDA phase and leave them out of the dataset that goes into the algorithms.
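To show what scalers and encoders actually do, here is a hand-rolled sketch of min-max scaling and one-hot encoding. Machine learning libraries ship ready-made versions of both, so treat this as an illustration only. The function names and sample values are made up, and in practice you would compute the scaling bounds on the training data alone.

```python
def min_max_scale(values):
    """Scaler: squeeze a numeric column into the 0..1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Encoder: turn a string column into one 0/1 column per category."""
    categories = sorted(set(labels))
    encoded = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, encoded


# Hypothetical columns that survived the EDA phase.
ages = [20, 35, 50]
cities = ["porto", "lisbon", "porto"]

scaled_ages = min_max_scale(ages)
city_columns, city_vectors = one_hot_encode(cities)

# Final numeric matrix: one row per customer, no strings left.
features = [[age] + vector for age, vector in zip(scaled_ages, city_vectors)]
print(city_columns)
print(features)
```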

7. Machine Learning Algorithms

“You put the data in the algorithms and light the thing up.”

The main idea here is to apply cross-validation techniques, insert the data into the algorithms, and compare the results.

In the first cycle, simply apply the data to the algorithms and compare the results. You can tune hyperparameters, add cross-validation, and slowly improve their outputs as you iterate through the cycle.
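Here is one way the "apply and compare" step could look with k-fold cross-validation, written without any ML library so every step is visible. The two "algorithms" (a mean baseline and a one-feature linear fit), the toy data, and the choice of mean absolute error are assumptions for the example. The folds are contiguous and unshuffled to keep the code short.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_mean_baseline(xs, ys):
    """Always predicts the average target seen in training."""
    avg = mean(ys)
    return lambda x: avg


def fit_simple_linear(xs, ys):
    """Ordinary least squares with a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def cross_val_mae(fit, xs, ys, k=3):
    """Average mean absolute error over the k held-out folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [abs(model(xs[i]) - ys[i]) for i in test_idx]
        fold_errors.append(mean(errors))
    return mean(fold_errors)


# Toy data, invented for the example.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

scores = {
    "mean_baseline": cross_val_mae(fit_mean_baseline, xs, ys),
    "simple_linear": cross_val_mae(fit_simple_linear, xs, ys),
}
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```

Keeping a trivial baseline in the comparison tells you whether a fancier algorithm is adding anything at all.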

Source: entrementes.com.br

REMEMBER: one of the reasons we use the CRISP-DS methodology is to increase the frequency with which we deliver valuable results to the business area. You can (and you will) improve the quality in future cycles.

8. Evaluation

After applying the data to the algorithms, you must select the appropriate metrics to evaluate the model and choose the best one.

Source: https://adamatti.github.io/blog/metricas/2018/07/28/metrics.html

Don’t forget that even though we translated the problem into a data problem, it is still a business problem. Find out how the business area tells apart the good solutions from the goofy ones and use it as part of your set of metrics.

If your model is accurate but doesn’t meet the business criteria, it’s NOT useful yet!

You can’t complete the first cycle without setting the metrics that evaluate your solution. Select the best metrics for the problem, know how to interpret the results, and check after each cycle whether the changes you made improved the results.
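A short sketch of why a business metric belongs next to the technical one. The scenario (flagging customers about to churn), the predictions, and the costs of 500 per missed churner and 50 per unnecessary retention offer are all hypothetical.

```python
def accuracy(y_true, y_pred):
    """Technical metric: share of correct predictions."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def business_cost(y_true, y_pred, cost_false_negative, cost_false_positive):
    """Business metric: money lost to each kind of mistake."""
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return (false_negatives * cost_false_negative
            + false_positives * cost_false_positive)


# 1 = customer churned, 0 = customer stayed (made-up labels).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = {
    "model_a": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # cautious, misses churners
    "model_b": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # eager, over-flags
}

for name, y_pred in predictions.items():
    acc = accuracy(y_true, y_pred)
    cost = business_cost(y_true, y_pred,
                         cost_false_negative=500, cost_false_positive=50)
    print(f"{name}: accuracy={acc:.2f}, business cost={cost}")
```

With these invented numbers, model_a wins on accuracy while model_b is far cheaper for the business, which is exactly the situation where accuracy alone would mislead you.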

You will see how fast and intuitive it will be to improve your solution after going through a whole cycle once.

9. Deployment

Yay! You did it! In this phase, you are ready to deliver significant value to the company. At this moment, you will plan the deployment, make the solution available for use, and report what you did and learned.

I also prepared some excellent tips for this phase:

  • Plan, plan, plan! Then follow the plan;
  • Report what you did in a way that is useful for the business and for other Data Scientists that might lead this project in the future;
  • Register your experiences. What you fail at now is an excellent lesson for your future self.

After that, repeat the cycle up until you die. I’m just kidding. But do it as long as you and the business area feel that there are relevant improvements to be made.

What now?

Now it’s your turn! Go ahead and apply this cycle to your current or next DS project.

OOPS! I almost forgot: remember when I said we could use CRISP-DS for things unrelated to data or projects? That’s because I used it to build this article.

Source: https://www.mememaker.net/meme/i-cant-believe-it/

Check this link where I uploaded the five cycles I went through to write this post. It took me about 45 to 60 minutes in each cycle, and I used five simple .txt files. At the end of the first one, I already had my post, but I wanted some improvements, so I added more cycles until I thought, “Ok, I can post this.”

If you liked this post, check out some other exciting things on Data Science, show me some support, and follow my page. Comments are always welcome!
