Standard-compliant data projects

A tale about applying standards to data projects, and whether it is necessary.

Carsten Sandtner
Nov 20, 2020 · 4 min read
Photo by Philipp Mandler on Unsplash

Well before the current hype about data and data science, the industry was already dealing with data-related projects, and the need for standardization grew. About 23 years ago, a first attempt at a standard was defined. Data projects can get complex, so it is necessary to establish a process for tackling them. Let’s take a look at some common standards and how they could help to structure your data science projects.

CRISP-DM

By Kenneth Jensen - Own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610

The inner process is sequential. Nevertheless, it is often necessary to switch back and forth between different phases. Let’s take a look at the six steps.

  1. Business Understanding
    At the beginning you define goals and requirements. What do you want to achieve with your project?
  2. Data Understanding
    Collect and understand existing data. In this phase you can identify problems with your data or its quality.
  3. Data Preparation
    A self-explanatory phase: prepare and clean your existing data for the models you want to build and the goals of your project.
  4. Modelling
    Create your models and optimize their parameters. Usually, more than one model is created in this step.
  5. Evaluation
    In this step you evaluate which model fits your current goals and requirements best. Check the result against your initial goals to make sure it meets the requirements.
  6. Deployment
    In this final step, you "deploy" your results. This could mean a presentation or a deliverable system that uses your model. It depends on your goals.
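
The six phases, and the switching back and forth between them, can be sketched as a small transition table. This is only an illustration: the names `Phase`, `ALLOWED` and `is_valid_path` are made up for this sketch, and the allowed moves are an assumption read off the arrows of the commonly reproduced CRISP-DM diagram, not something the standard prescribes as code.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELLING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Allowed moves between phases (assumption: taken from the arrows in the
# CRISP-DM diagram; adapt the table to your own process).
ALLOWED = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELLING},
    Phase.MODELLING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    # Outer circle: after deployment the whole process starts over.
    Phase.DEPLOYMENT: {Phase.BUSINESS_UNDERSTANDING},
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed move."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))


# Going back from Data Understanding to Business Understanding is fine ...
revisit_goals = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.BUSINESS_UNDERSTANDING,
]
# ... but jumping straight from the goals to deployment is not.
shortcut = [Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT]

print(is_valid_path(revisit_goals))  # True
print(is_valid_path(shortcut))       # False
```

A table like this is mostly useful as a checklist: if your project log contains a jump that is not in it, that is a hint that a phase was skipped.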

All phases should start over once your first model is deployed: data changes, goals are adjusted, and so on. This is what the circle around the diagram tries to visualize. You can also see that some phases feed back into others, e.g. Business Understanding ↔ Data Understanding. This means you should revisit your goals if you find out that your data cannot help to fulfil them. Keep in mind that iterative development in the form of Scrum was first presented by Ken Schwaber in 1995, and the first book came out around 2001. CRISP-DM is surprisingly close to Scrum, but at its core it is still sequential and V-Model-like; only the process as a whole iterates. That brought IBM onto the scene, 19(!) years later.

IBM revised the process

Analytics Solutions Unified Method (ASUM) Process Model.

IBM proposed a process adapted to modern requirements. It has five phases, supported by a continuous project management stream. Unlike in CRISP-DM, the phases are not strictly chronological: depending on your application, you can run through them several times or go back to earlier phases. The process is based on CRISP-DM, extended with tasks and activities on infrastructure, operations, project, and deployment, and it adds templates and guidelines for all tasks.

  1. Analyze
    As in CRISP-DM, you define your goals and requirements first.
  2. Design
    Define the components, development environments and resources needed to complete the task.
  3. Configure & Build
    The needed components are gradually implemented and tested. In this step you develop models and test them.
  4. Deploy
    Integrate the developed components in your final environment.
  5. Operate and Optimize
    Continuous optimization is important and can lead to new requirements.

Like Scrum, ASUM is more of a framework than a fixed process or standard.

IBM has another idea of how to define a process for data mining and modelling:

IBM DataFirst Method

Again, this is a five-step process with self-explanatory phases, relying heavily on IBM's Cloud Garage Method.

IBM DataFirst Method Process Model. Source: IBM Corporation

Which method should I use?

It’s always good to know which methods and standards are available. Applying them to every project without adaptation won’t work, though. This lesson was already learned in the era of the V-Model. Always keep in mind: each project is unique and has never been seen before.

Data & Smart Services

Collection of data driven articles and cases