CRISP-DM and How Different Data Engineering Roles are Part of It

Mete Can Akar
Plumbers Of Data Science
4 min read · Dec 25, 2022

In this article, I will talk about the Cross Industry Standard Process for Data Mining (CRISP-DM), the standard approach widely used in the industry for managing data science projects. I will also explain how different data engineering roles fit into the CRISP-DM approach.

Source: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining

The roles I mention here are based on my understanding of the industry, and they can be named differently depending on the organization. So don’t be angry with me if your organization uses different terminology.

There are mainly two different data engineering roles within CRISP-DM:

  • The first one is the cloud/platform data engineer, who is more focused on managing the platform. They mostly handle security, configuration, and resource management topics of the platform. Some of these data engineers also govern the data on an on-premise or cloud platform and make it available to other product teams through ETL processes. They mostly work in teams other than the product teams, within the same company. In the following sections, I will call this type of data engineer “DE1” as an abbreviation.
  • The second one is a data engineer who works on a data/AI product. This role can be referred to as a data/software engineer for AI applications, a data engineer for data products, or a product data engineer. These are the data engineers who work closely with the data scientists within a data product team. Be aware that, depending on their focus and the company size, they can sometimes also be called MLOps Engineers or ML Engineers. In the following sections, I will call these data engineers “DE2” as an abbreviation.

Now let’s take a look at how these data engineering roles and other roles fit into CRISP-DM.

  1. Data: Even though Data is not a process step in the CRISP-DM diagram, it is at the heart of it, so it is worth mentioning. Data is ideally extracted, transformed, and then loaded by a DE1 into a data storage, e.g., a data warehouse; a minimal ETL sketch follows this list. However, if you are working at a start-up, this can also be a task of a data scientist.
  2. Business Understanding: Ideally, the whole team except the DE1 should have a very good understanding of what the business problem is about. However, the focus is normally more on the people who directly interact with the product and the business unit. Depending on the organization, this person can be a business analyst/product owner, but a data scientist/data analyst can also take over this role.
  3. Data Understanding: Again, ideally the whole team should have a solid understanding of the data. In this step, a data scientist and a DE2 might need to sit down together with a DE1 and request more data if needed. It is also useful if the data product team builds some validation checks to determine whether the data is usable; a small validation sketch follows this list. If the data is not usable, the DE1 should revise their ETL pipeline.
  4. Data Preparation: Now that the data is located somewhere, e.g., a data warehouse, it is time to apply the transformations needed for model training. These data preprocessing steps might include missing value handling, normalization, etc.; a short preprocessing sketch follows this list. At this step, a DE2 or a data scientist can be responsible.
  5. Modeling: As the name suggests, this step is all about creating the model and training it on a training set; a minimal modeling sketch follows this list. Here, mostly the data scientists are responsible, but if the product is not an early-stage PoC, a DE2 should also work with them to design and create a maintainable/reliable data product.
  6. Evaluation: At this step, ideally, the whole team works together, including the people from the business unit. If there are any problems with the product, they must be discussed. If the performance of the data product is not sufficient, the team must head back to the Business Understanding step and try to figure out what has gone wrong; a brief evaluation sketch follows this list. Be aware that some disagreements might arise at this step.
  7. Deployment: At this step, a DE2 is responsible. They make sure that the data product can run on multiple machines, whether on an on-premise cluster or in the cloud. Orchestration and scheduling are configured, for example with Apache Airflow; a minimal DAG sketch follows this list. Monitoring tools (e.g., Splunk) are set up to alert the team if the model performance decreases for some reason (e.g., models trained pre-COVID did not work well during COVID) or if any other problem occurs, so that the team can investigate further. Even if there is no explicit problem with the product, a DE2 should always monitor (automatically) the running costs of the data product, especially if it is running in the cloud.
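
To make the Data step more concrete, here is a minimal ETL sketch in Python: extract from a source system, apply a light transformation, and load into a warehouse table. The source URL, connection string, and table names are hypothetical placeholders, not a recommendation for a specific stack.

```python
import pandas as pd
from sqlalchemy import create_engine

def run_etl() -> None:
    # Extract: read raw orders from a (hypothetical) source export
    raw = pd.read_csv("https://example.com/exports/orders.csv")

    # Transform: normalize column names and parse timestamps
    raw.columns = [c.strip().lower() for c in raw.columns]
    raw["order_date"] = pd.to_datetime(raw["order_date"])

    # Load: append into a (hypothetical) warehouse table
    engine = create_engine("postgresql://user:password@warehouse:5432/analytics")
    raw.to_sql("orders", engine, schema="staging", if_exists="append", index=False)

if __name__ == "__main__":
    run_etl()
```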
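
For the Data Understanding step, a usability check can be as simple as the sketch below. The column names and the 5% missing-value threshold are hypothetical examples; in practice, the team would agree on them together with the DE1.

```python
import pandas as pd

def is_usable(df: pd.DataFrame) -> bool:
    """Run a few basic validation checks and report the results."""
    has_amount = "amount" in df.columns
    checks = {
        "has_rows": len(df) > 0,
        "required_columns": {"customer_id", "order_date", "amount"}.issubset(df.columns),
        "few_missing_amounts": has_amount and df["amount"].isna().mean() < 0.05,
        "no_negative_amounts": has_amount and bool((df["amount"].dropna() >= 0).all()),
    }
    for name, passed in checks.items():
        print(f"{name}: {'OK' if passed else 'FAILED'}")
    return all(checks.values())
```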
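
For Data Preparation, a minimal scikit-learn sketch of the two preprocessing steps mentioned above, missing value handling and normalization, could look like this. The tiny in-memory DataFrame is a hypothetical stand-in for data pulled from the warehouse.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical sample standing in for data pulled from the warehouse
df = pd.DataFrame({
    "age": [34, np.nan, 51],
    "income": [42_000, 58_000, np.nan],
})

preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # missing value handling
    ("scale", StandardScaler()),                   # normalization (zero mean, unit variance)
])

X = preprocess.fit_transform(df)
print(X)
```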
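
For the Modeling step, here is a minimal training sketch. The dataset is synthetic; in a real product, the features would come out of the Data Preparation step, and the model choice would be up to the data scientists.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for prepared features and labels
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
```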
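
For the Evaluation step, continuing the modeling sketch directly above, the team might compare a hold-out metric against an acceptance threshold agreed with the business unit. Both the metric and the 0.85 threshold are hypothetical.

```python
from sklearn.metrics import accuracy_score

# Score the model on the held-out test set from the modeling sketch above
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")

BUSINESS_THRESHOLD = 0.85  # hypothetical acceptance criterion
if test_accuracy < BUSINESS_THRESHOLD:
    print("Performance insufficient: head back to Business Understanding.")
```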
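
Finally, for the Deployment step, this is a minimal Apache Airflow sketch of how a DE2 might schedule a daily retrain-and-score run. The DAG id and the task callables are hypothetical placeholders; monitoring and cost alerting would be set up separately with the tools mentioned above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def retrain_model():
    ...  # placeholder: run the modeling step

def score_new_data():
    ...  # placeholder: apply the model to new data

with DAG(
    dag_id="data_product_pipeline",   # hypothetical DAG id
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    retrain = PythonOperator(task_id="retrain_model", python_callable=retrain_model)
    score = PythonOperator(task_id="score_new_data", python_callable=score_new_data)
    retrain >> score  # retrain first, then score the new data
```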


Senior Data Engineer with DS/ML background. Follow me on https://www.linkedin.com/in/metecanakar/. Opinions are my own and not the views of my employer.