Learn Data Science using CRISP-DM Framework

Durgesh Anand
Published in Analytics Vidhya
5 min read · Jul 18, 2020

If you’re interested in the exciting world of data science but don’t know where to start, the CRISP-DM framework is here to help.

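CRISP-DM (Cross-Industry Standard Process for Data Mining) breaks a data project into six steps. Let's go through them one by one and see how each step maps to a stage of learning data science.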

Step 1 Business Understanding

The first stage of the framework is to develop a business understanding. This involves two steps:

  1. Determine the business objective - This comes first because, until the business objectives are finalized, you cannot know which data to collect or how to work on it.
  2. Identify the goal of the data analysis - Decide what the analysis should deliver, because without a clear goal the project cannot show its value.

In the same way, learners should pick the business domain in which they plan to work as data scientists and then aim to become a subject-matter expert (SME) in that domain. Without domain knowledge they cannot meet the requirements of Step 1 and will eventually face difficulties in the later steps. Once we know the business, we can apply our data science skills to draw more value out of its data.

Some trending domains: Healthcare, Fintech, Real Estate, E-commerce, EduTech etc.

Step 2 Data Understanding

This stage comprises two key steps that help you understand the available data and identify any new data relevant to the business problem.

  • Describe data - Once you have identified the data set, describe its contents and explore it to better understand the data and its business implications. One way to do this is to create a data dictionary that lists the variables (e.g. sectors, company names), their types, the number of records, and the types of analysis planned.

To describe data, a learner needs to know one programming language (R, Python, or SAS) and Excel, so that they can do a first-level analysis and then create the data dictionary.
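As a rough illustration of what a data dictionary might contain, here is a minimal sketch using only Python's standard library. The sample data, the column names, and the helpers `infer_type` and `build_data_dictionary` are all hypothetical and exist only for this example. A real project would more likely read a file and might use a library such as pandas.

```python
import csv
import io

# Hypothetical sample data; in practice this would be read from a file.
RAW_CSV = """company,sector,revenue
Acme,Retail,120.5
Globex,Energy,
Initech,Tech,98.0
"""

def infer_type(values):
    """Return 'numeric' if every non-empty value parses as a float, else 'text'."""
    non_empty = [v for v in values if v != ""]
    try:
        for v in non_empty:
            float(v)
        return "numeric"
    except ValueError:
        return "text"

def build_data_dictionary(csv_text):
    """Summarize each column: inferred type, record count, missing and distinct values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    dictionary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        dictionary[column] = {
            "type": infer_type(values),
            "records": len(values),
            "missing": sum(1 for v in values if v == ""),
            "distinct": len({v for v in values if v != ""}),
        }
    return dictionary

data_dictionary = build_data_dictionary(RAW_CSV)
for name, info in data_dictionary.items():
    print(name, info)
```

Even a small summary like this shows which columns are numeric and where values are missing, which feeds directly into the data preparation step.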

  • Explore data - To explore the data, you can plot simple graphs in Excel, R, or Python, e.g. to see trends or to get a visual representation that surfaces useful insights.

To explore data, learners need to know exploratory data analysis (EDA), which relies on a working knowledge of statistics.
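A first pass at EDA often means computing a few descriptive statistics and looking at a quick plot. The sketch below uses made-up monthly sales figures and Python's built-in `statistics` module. The text bar chart is only a stand-in for a real chart in Excel, R, or a Python plotting library, and the "more than two standard deviations from the mean" rule is one simple convention for flagging unusual values, assumed here for illustration.

```python
import statistics

# Hypothetical monthly sales figures, used only for illustration.
monthly_sales = [120, 135, 128, 150, 162, 158, 171, 400]

# Descriptive statistics give a first numeric summary of the data.
summary = {
    "count": len(monthly_sales),
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "stdev": statistics.stdev(monthly_sales),  # sample standard deviation
    "min": min(monthly_sales),
    "max": max(monthly_sales),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")

# A crude text bar chart: one '#' per 20 units of sales.
for month, value in enumerate(monthly_sales, start=1):
    print(f"month {month}: {'#' * (value // 20)} {value}")

# Flag values more than two standard deviations away from the mean.
outliers = [
    v for v in monthly_sales
    if abs(v - summary["mean"]) > 2 * summary["stdev"]
]
print("possible outliers:", outliers)
```

The gap between the mean and the median here already hints that one month is unusual, and the bar chart makes the same point visually.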

Step 3 Data Preparation

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”

Data preparation is the most important and time-consuming step. Here the data is made ready for modelling through preprocessing such as data transformation and aggregation. We can also create new attributes from existing ones. These are called derived attributes, e.g. deriving age from date of birth (DOB).
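The age-from-DOB example can be sketched in a few lines of Python. The function name `age_on`, the sample customers, and the fixed reference date are assumptions made for this illustration.

```python
from datetime import date

def age_on(dob, as_of):
    """Return the age in whole years on the date `as_of`."""
    years = as_of.year - dob.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years

# Hypothetical customer records with a date-of-birth attribute.
customers = [
    {"name": "A", "dob": date(1990, 7, 18)},
    {"name": "B", "dob": date(2000, 12, 31)},
]

# A fixed reference date keeps the derived attribute reproducible.
as_of = date(2020, 7, 18)
for customer in customers:
    customer["age"] = age_on(customer["dob"], as_of)  # the derived attribute

print([(c["name"], c["age"]) for c in customers])
```

Using a fixed reference date instead of today's date is a deliberate choice: the derived value then stays the same every time the preparation step is re-run.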

Data preparation involves several steps, including the following:

  • Filling missing data
  • Removing data
  • Transforming data
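The three steps above could look like the following sketch on a tiny, made-up data set. The specific choices here are assumptions for illustration, not the only options: duplicates are the data being removed, missing incomes are filled with the median, and the transformation is a text clean-up plus min-max scaling.

```python
import statistics

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "income": 52000.0, "city": "Pune"},
    {"id": 2, "income": None, "city": "Delhi"},
    {"id": 2, "income": None, "city": "Delhi"},  # exact duplicate of the row above
    {"id": 3, "income": 61000.0, "city": " pune"},
    {"id": 4, "income": 48000.0, "city": "Mumbai"},
]

# Removing data: drop exact duplicate rows, keeping the first occurrence.
seen = set()
deduped = []
for row in records:
    key = (row["id"], row["income"], row["city"])
    if key not in seen:
        seen.add(key)
        deduped.append(dict(row))  # copy so the raw records stay untouched

# Filling missing data: impute missing incomes with the median of known incomes.
known = [row["income"] for row in deduped if row["income"] is not None]
median_income = statistics.median(known)
for row in deduped:
    if row["income"] is None:
        row["income"] = median_income

# Transforming data: normalize city names and min-max scale income to [0, 1].
low = min(row["income"] for row in deduped)
high = max(row["income"] for row in deduped)
for row in deduped:
    row["city"] = row["city"].strip().title()
    row["income_scaled"] = (row["income"] - low) / (high - low)

cleaned = deduped
for row in cleaned:
    print(row)
```

Which imputation or scaling method is appropriate depends on the data and on the model that will consume it, so treat these as placeholders.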

Step 4 Data Modelling

Modelling is the heart of data analytics. Modelling activity in the CRISP-DM framework involves two major tasks. The first task is to understand the problem domain and select the appropriate family of models that is suitable for solving the problem at hand. The second task is to select appropriate algorithms for creating the model from the chosen family of models.

It is performed in the following manner:

  1. The first step is to select a modelling technique.
  2. Next, a test scenario is generated for validating the model’s quality.
  3. After that, one or more models are built.
  4. All the models are then assessed to make sure that they are in line with the business objectives.
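To make these four steps concrete, here is a minimal sketch in plain Python. It assumes a regression problem on synthetic data (advertising spend versus sales, both invented for this example), uses a hold-out test set as the test scenario, and compares a mean baseline against a least-squares line by test RMSE. A real project would typically use a library such as scikit-learn and judge the models against actual business criteria.

```python
import math
import random

# Hypothetical data: spend (x) vs. sales (y), roughly y = 3x + 10 plus noise.
random.seed(0)
data = [(float(x), 3.0 * x + 10.0 + random.gauss(0, 2.0)) for x in range(1, 41)]

# Step 2: the test scenario. Hold out 25% of the rows for validation.
random.shuffle(data)
split = int(len(data) * 0.75)
train, test = data[:split], data[split:]

# Steps 1 and 3: build candidate models from the chosen family (regression).
def fit_mean_baseline(rows):
    """Always predict the mean of the training targets."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_linear(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def rmse(model, rows):
    """Root mean square error of the model on the given rows."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))

# Step 4: assess every model on the held-out data and pick the best one.
models = {"baseline": fit_mean_baseline(train), "linear": fit_linear(train)}
scores = {name: rmse(model, test) for name, model in models.items()}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

Keeping a trivial baseline in the comparison is a useful habit: if a more complex model cannot beat it on held-out data, it is unlikely to serve the business goal.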

For data modelling, learners have to study the different types of supervised and unsupervised machine learning algorithms along with practical use cases, which will help them apply a suitable modelling technique to their business problem.

Step 5 Model Evaluation

The predictive models can be tested to assess their effectiveness in solving the problem. This is the fifth stage of the framework: model evaluation. Together, modelling and evaluation form an iterative process in which the models are tweaked until satisfactory evaluation results are obtained.

For model evaluation, learners need to know the common evaluation techniques. Some of the statistical techniques used are confidence intervals, the confusion matrix, gain and lift charts, the chi-square test, the ROC curve, root mean square error (RMSE), and R².
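Three of these measures are simple enough to compute by hand, which is a good way to learn what they mean. The sketch below uses small made-up label and prediction lists. The function names are chosen for this example and are not taken from any library.

```python
import math

def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Classification example (hypothetical labels and predictions).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("confusion matrix:", (tp, fp, fn, tn), "accuracy:", accuracy)

# Regression example (hypothetical true values and predictions).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
print("RMSE:", rmse(y_true, y_pred), "R^2:", r_squared(y_true, y_pred))
```

The confusion matrix applies to classification models, while RMSE and R² apply to regression models, so which ones you report depends on the type of problem.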

Step 6 Model Deployment

“The goal is to turn data into information, and information into insight.”

This is the last stage of the framework, where the model is translated into a business strategy. Business data is fed into the model and the model results are used to inform business decisions on an ongoing basis. However, the CRISP-DM framework does not end with model deployment, because CRISP-DM is an iterative process. For example, your data understanding can enhance your business understanding. Similarly, if the model does not perform well in evaluation, you will have to go back to the data preparation stage and then develop the model again.

To deploy a model, learners need to know deployment tools such as Flask, AWS, GCP, and Kubernetes, along with DevOps techniques.
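Whatever tool is used, the core of a deployed model is usually a small function that accepts a request, runs the model, and returns a response. The sketch below shows only that logic with the standard library, so that it runs without installing anything. The model parameters, the field name `x`, and the functions `predict` and `handle_request` are hypothetical. In a web framework such as Flask, a function like this would sit behind a route.

```python
import json

# Hypothetical fitted parameters, e.g. exported from a linear regression.
MODEL = {"slope": 3.0, "intercept": 10.0}

def predict(x):
    """Apply the (hypothetical) trained model to a single input value."""
    return MODEL["slope"] * x + MODEL["intercept"]

def handle_request(body):
    """Turn a JSON request body into a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(body)
        x = float(payload["x"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-numeric value.
        message = {"error": "body must be JSON with a numeric field x"}
        return 400, json.dumps(message)
    return 200, json.dumps({"prediction": predict(x)})

print(handle_request('{"x": 2}'))
print(handle_request("not json"))
```

Keeping the prediction logic separate from the web layer makes it easier to test the model on its own and to move it between hosting options later.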

Wrapping Up

The six steps of the CRISP-DM framework map to learning stages in data science as follows:

  1. Business Understanding - Learn the business domain.
  2. Data Understanding - Learn statistics (descriptive and inferential) and data visualization.
  3. Data Preparation - Learn mathematics for ML, data transformation, and aggregation techniques.
  4. Data Modelling - Learn the different types of supervised and unsupervised machine learning algorithms.
  5. Model Evaluation - Learn confidence intervals, the confusion matrix, gain and lift charts, chi-square, the ROC curve, root mean square error, R², etc.
  6. Model Deployment - Learn Flask, AWS, GCP, Kubernetes, and DevOps techniques.
