Published in Data & Smart Services

Carsten Sandtner · Nov 20, 2020

Standard compliant data projects

A tale about applying standards to data projects, and whether that is necessary.

Photo by Philipp Mandler on Unsplash

CRISP-DM

In 1996, a consortium of five companies introduced a process model for data mining projects. Back then, no one talked about data science yet. The model is called CRISP-DM (CRoss-Industry Standard Process for Data Mining). It splits a project into six phases, and the idea is to iterate over these phases; a minimal code sketch of such an iteration follows the list below.

By Kenneth Jensen — Own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610
  1. Business Understanding
    Understand the business objectives and requirements and translate them into a concrete goal for the data project.
  2. Data Understanding
    Collect and understand the existing data. In this phase you may identify problems with your data or with its quality.
  3. Data Preparation
    A self-explanatory phase: prepare and clean your existing data for your desired models and the goals of your project.
  4. Modelling
    Create your models and optimize their parameters. Usually more than one model is created in this step.
  5. Evaluation
    In this step you evaluate which model fits your current goal and requirements best. It is necessary to check against your initial goals to be sure the requirements are met.
  6. Deployment
    In this final step, you "deploy" your results. This could mean a presentation or a deliverable system using your model; it depends on your goals.
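The following is a minimal, hypothetical sketch of one pass through these phases using scikit-learn. The dataset, the candidate models, and the metric are illustrative assumptions, not part of CRISP-DM itself; in a real project each phase would be far more involved and would be repeated several times.

```python
# Hypothetical walk through the CRISP-DM phases on a toy dataset.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1./2. Business & Data Understanding: load the data and inspect it for quality issues.
data = load_breast_cancer(as_frame=True)
print(data.frame.describe())      # value ranges, possible outliers
print(data.frame.isna().sum())    # missing values

# 3. Data Preparation: separate features and target, hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# 4. Modelling: usually more than one candidate model is built.
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for model in candidates.values():
    model.fit(X_train, y_train)

# 5. Evaluation: compare the models against the project goal (here simply accuracy).
scores = {name: accuracy_score(y_test, m.predict(X_test)) for name, m in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores)

# 6. Deployment: persist the chosen model for the delivering system or report.
joblib.dump(candidates[best_name], "best_model.joblib")
```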

IBM revised the process

In 2015, IBM published their idea of a standard process for data mining and predictive analytics: ASUM-DM (Analytics Solutions Unified Method for Data Mining/Predictive Analytics).

Analytics Solutions Unified Method (ASUM) Process Model.
  1. Design
    Define the components, development environments, and resources needed to complete the task.
  2. Configure & Build
    The needed components are implemented and tested step by step. In this step you develop your models and test them.
  3. Deploy
    Integrate the developed components into your final environment.
  4. Operate and Optimize
    Continuous optimization is important and can lead to new requirements.

IBM DataFirst Method

Based on the IBM Cloud Garage Method, the IBM DataFirst Method targets IT transformation: getting infrastructure, processes, and employees ready for AI.

IBM DataFirst Method Process Model. Source: IBM Corporation

Which method should I use?

This is a question no one can answer for you; it really depends on your project. In my opinion, you should use what your project needs. As in Scrum, there are parts of the framework you should not dismiss, e.g. defining your goals and analyzing your data. The key to a successful data science project is good and reliable data, and most of the time will be spent on analyzing and exploring it. You should definitely iterate, and don't hesitate to go back to an earlier step in your process when needed. Revalidate often: business goals or the data itself may change, and a better model could help. This is why a data-based project is never finished. You always have to revalidate your models, requirements, and data sources.
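As a small illustration of "revalidate often", here is a hypothetical sketch that re-scores a previously deployed model on freshly collected, labelled data and flags it for retraining when performance drops. The file names, the "target" column, and the threshold are assumptions for the example only.

```python
# Hypothetical periodic revalidation of a deployed model.
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

ACCEPTABLE_ACCURACY = 0.90  # assumed business requirement

model = joblib.load("best_model.joblib")                 # model persisted at deployment
fresh = pd.read_csv("fresh_labelled_data.csv")           # newly collected, labelled records
X_new, y_new = fresh.drop(columns=["target"]), fresh["target"]

score = accuracy_score(y_new, model.predict(X_new))
if score < ACCEPTABLE_ACCURACY:
    print(f"Accuracy dropped to {score:.2f} - go back to data preparation and modelling.")
else:
    print(f"Model still meets the requirements ({score:.2f}).")
```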