CRISP-DM Phase 2: Data Understanding

Zipporah Luna
Published in Analytics Vidhya
3 min read · Aug 5, 2021

This is part 3 of a 7-part series summarizing openSAP’s 6-week course Getting Started with Data Science (Edition 2021) by Stuart Clarke. Part 2 is here.

Part 2 Recap

In the second part of this series, I explained why it is important to have a business understanding before starting a data science project: it will be one of the main drivers of your project.

CRISP-DM has six phases, each with particular tasks and outputs:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

In this third part of the series, we will focus on the second phase, Data Understanding. Once you have a business understanding, you need to understand what data you have, where to get the data, what is in your data, and whether your data is of good quality.

Why do you need to understand your data?

In the Business Understanding phase, the fourth task, Produce Project Plan, covers describing the resources needed. At that point you should already identify what data you have, where to get it, and what tools to use to get it; knowing how much data is available is also crucial. Understanding your data from the start helps your data science project make sense.

Imagine you are tasked with predicting house prices and you do not know where to get the data. What approach will you take? What if the data has only 50 records, and most of them are commercial spaces? Would that data be appropriate to use, and would it be enough?
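To make that last question concrete, here is a rough sketch of a size and relevance check you could run before any modeling. Everything in it is made up for illustration: the records, the `property_type` field, the `assess_fitness` helper, and both thresholds. A real project would take its thresholds from its own business understanding.

```python
from collections import Counter


def assess_fitness(records, relevant_type, min_records, min_relevant_share):
    """Summarize whether a dataset looks large and relevant enough to use."""
    total = len(records)
    type_counts = Counter(r["property_type"] for r in records)
    relevant = type_counts[relevant_type]
    share = relevant / total if total else 0.0
    return {
        "total_records": total,
        "relevant_records": relevant,
        "relevant_share": share,
        "enough_records": total >= min_records,
        "relevant_enough": share >= min_relevant_share,
    }


# Hypothetical data: 50 listings, 40 of them commercial spaces.
listings = (
    [{"property_type": "commercial", "price": 500_000}] * 40
    + [{"property_type": "house", "price": 250_000}] * 10
)

# Assumed thresholds: at least 1,000 records, at least 80% of them houses.
report = assess_fitness(
    listings, relevant_type="house", min_records=1_000, min_relevant_share=0.8
)
for key, value in report.items():
    print(f"{key}: {value}")
```

With this toy data, both flags come out `False`: there are too few records, and only a fifth of them are houses.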

Always remember: before diving into a data science project, first know and understand WHAT data to acquire, WHERE to acquire it, HOW to acquire it, and HOW MUCH data to acquire.

The Breakdown

In the course, Stuart has broken down the tasks and outputs of the second phase in detail (see below).

[Figure: process flow of the Data Understanding phase, showing its tasks and their outputs]

In the process flow above, Data Understanding is broken down into four tasks, each with its expected output.

Simply put, the goals of the Data Understanding phase are to:

  • Collect Initial Data by acquiring the data listed in the project’s resources, along with access to it. This also means keeping a checklist of the datasets you have acquired, their locations, and the methods used to acquire them, and recording any problems encountered and their solutions so that other users or project members are aware of them.
  • Describe Data by examining the properties of the data acquired and providing a description report covering the format of the data, the quantity of data, and the records and fields in each table or dataset.
  • Explore Data by using data science questions that can be quickly answered through querying, visualization, and reporting. At this stage, you will be able to form your initial hypotheses and assess their impact on the project.
  • Verify Data Quality by examining whether the data is complete and whether it has errors or missing values. If there are missing values, work out what percentage of the overall data they represent.
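The Describe Data and Verify Data Quality tasks can be sketched in a few lines of plain Python. This is a minimal illustration, not the method taught in the course: the sample rows, the field names, the helper functions, and the choice to treat `None` and blank strings as missing are all assumptions.

```python
def is_missing(value):
    """Assumption: None and blank strings count as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def describe(rows):
    """Describe Data: record count, field names, and observed value types."""
    fields = sorted({name for row in rows for name in row})
    types = {
        f: sorted({type(row[f]).__name__ for row in rows
                   if f in row and not is_missing(row[f])})
        for f in fields
    }
    return {"records": len(rows), "fields": fields, "types": types}


def missing_report(rows):
    """Verify Data Quality: percent of missing values per field and overall."""
    if not rows:
        return {}, 0.0
    fields = sorted({name for row in rows for name in row})
    per_field = {}
    total_missing = 0
    for f in fields:
        n_missing = sum(1 for row in rows if is_missing(row.get(f)))
        total_missing += n_missing
        per_field[f] = 100.0 * n_missing / len(rows)
    overall = 100.0 * total_missing / (len(rows) * len(fields))
    return per_field, overall


# Hypothetical house listings with a few gaps.
rows = [
    {"id": 1, "area_sqm": 120.0, "city": "Manila", "price": 250000},
    {"id": 2, "area_sqm": None, "city": "Cebu", "price": 180000},
    {"id": 3, "area_sqm": 95.5, "city": "", "price": None},
    {"id": 4, "area_sqm": 60.0, "city": "Davao", "price": 90000},
]

summary = describe(rows)
per_field, overall = missing_report(rows)
print(summary)
print(per_field)
print(f"overall missing: {overall:.2f}%")
```

On these four rows, three of the sixteen cells are missing, so the overall figure is 18.75%. Counting a field that is absent from a row as missing is a design choice; a stricter report could track absent and empty values separately.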

In the next part, we will talk about the third phase, Data Preparation. If you are working on a data science project for your company, or even on a personal project, try to apply the steps above where applicable. Again, different data science projects have different sets of requirements. The CRISP-DM methodology simply serves as a template to ensure you have considered all of the different aspects specific to your project.

Zipporah Luna
Analytics Vidhya

Data Analyst | Markets & Competitor Insights | Market Researcher | Football Enthusiast