Standard compliant data projects
A tale about applying standards to data projects, and whether it is necessary.
Even before the current hype around data and data science, the industry was faced with data-related projects, and the need for standardization grew. About 23 years ago, a first attempt at a standard was defined. Data projects can get complex, and it is necessary to establish a process for tackling them. Let’s take a look at some common standards and how they can help structure your data science projects.
In 1996, a consortium of five companies introduced a process model for data mining projects. Back then, no one talked about data science. The model is called CRISP-DM: CRoss-Industry Standard Process for Data Mining. It splits a project into six phases, and the idea is to iterate through them.
The inner process is sequential. Nevertheless, it is often necessary to switch back and forth between different phases. Let’s take a look at the six steps.
- Business Understanding
At the beginning you define goals and requirements. What do you want to achieve with your project?
- Data Understanding
Collect and understand the existing data. In this phase you may identify problems with your data or its quality.
- Data Preparation
A self-explanatory phase: prepare and clean your existing data for your desired models and project goals.
- Modeling
Create your models and optimize their parameters. Usually more than one model is created in this step.
- Evaluation
In this step you evaluate which model fits your current goals and requirements best. Check against your initial goals to be sure the requirements are met.
- Deployment
In this final step, you "deploy" your results. That could mean a presentation or a deliverable system using your model; it depends on your goals.
All phases start over once your first model is deployed: data changes, goals get adjusted, and so on. This is what the circle around the model’s representation tries to visualize. You can also see some phases feeding back into others, e.g. Business Understanding ↔ Data Understanding: you should revalidate if you find out your data cannot help fulfil your goals. Keep in mind that iterative development like Scrum was first described by Ken Schwaber in 1995, and the first book came out around 2001. CRISP-DM is surprisingly close to Scrum, but it is still sequential/V-Model-like at its core; only the whole process iterates. That brings IBM onto the scene, 19(!) years later.
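The six phases and their outer iteration can be sketched as a few lines of plain Python. The function and callback names below are illustrative placeholders, not part of any real CRISP-DM tooling; the `goals_met` callback stands in for the evaluation check that decides whether to deploy or start another cycle.

```python
# A toy sketch of the CRISP-DM cycle. Phase names follow the standard;
# everything else is an illustrative assumption.
PHASES = ["business_understanding", "data_understanding", "data_preparation",
          "modeling", "evaluation", "deployment"]

def run_crisp_dm(goals_met, max_cycles=3):
    """Iterate the first five phases; deploy only once evaluation says
    the business goals and requirements are met."""
    history = []
    for cycle in range(1, max_cycles + 1):
        history.extend(PHASES[:-1])   # everything up to and including evaluation
        if goals_met(cycle):          # requirements matched -> deploy and stop
            history.append("deployment")
            return history
    return history                    # no deployable model after max_cycles

# Example: the model is only good enough in the second cycle.
trace = run_crisp_dm(lambda cycle: cycle == 2)
print(trace.count("modeling"), trace[-1])   # prints: 2 deployment
```

Note that even this toy version makes the point from above explicit: the inner sequence is fixed, and only the cycle as a whole repeats.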
IBM revised the process
In 2015, IBM published its idea for a standard data mining and predictive analytics process: ASUM-DM (Analytics Solutions Unified Method for Data Mining/Predictive Analytics).
IBM proposed a process adapted to modern requirements. It has five phases supported by a continuous project management stream. The phases are not strictly chronological as in CRISP-DM; they can be run through several times, or you can go back to earlier phases, depending on your application. ASUM-DM is based on CRISP-DM, extended with tasks and activities for infrastructure, operations, project management, and deployment, and it adds templates and guidelines for all tasks.
- Analyze
As in CRISP-DM, you define your goals and requirements first.
- Design
Define the components, development environments, and resources needed to complete the task.
- Configure & Build
The needed components are implemented and tested incrementally. In this step you develop models and test them.
- Deploy
Integrate the developed components into your final environment.
- Operate and Optimize
Continuous optimization is important and can lead to new requirements.
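The non-linear flow between the five phases could be sketched as a small transition map. The exact set of allowed transitions below is my own illustrative reading of "phases can repeat and feed back", not an official IBM specification:

```python
# A toy sketch of ASUM-DM's non-linear phase flow. The transition map
# is an illustrative assumption, not IBM's official definition.
TRANSITIONS = {
    "analyze":          {"design"},
    "design":           {"configure_build", "analyze"},
    "configure_build":  {"deploy", "design"},
    "deploy":           {"operate_optimize", "configure_build"},
    "operate_optimize": {"analyze"},  # optimization can raise new requirements
}

def is_valid_step(current, nxt):
    """Check whether moving from `current` to `nxt` is allowed."""
    return nxt in TRANSITIONS[current]

assert is_valid_step("design", "analyze")            # stepping back is allowed
assert is_valid_step("operate_optimize", "analyze")  # new requirements restart
assert not is_valid_step("analyze", "deploy")        # but no skipping ahead
```

The contrast with the CRISP-DM loop is the point: here, backward edges are part of the model itself rather than exceptions to a fixed sequence.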
Like Scrum, ASUM is more of a framework than a fixed process or standard.
IBM has another idea for defining a data mining and modelling process
IBM DataFirst Method
Based on IBM Cloud Garage, their DataFirst Method targets IT transformation: getting infrastructure, processes, and employees ready for AI.
Again, a five-step process with self-explanatory phases, relying heavily on the Cloud Garage Method.
Which method should I use?
A question no one can answer for you: it really depends on your project. In my opinion, you should use what your project needs. As in Scrum, there are parts of the framework you should not dismiss, e.g. defining your goals and analyzing your data. The key to a successful data science project is good and reliable data, and most of the time will be spent on analyzing and exploring it. You should definitely iterate, and don’t hesitate to go back in your process when needed. Revalidate often: business goals or data may change, and a better model could help. This is why a data-based project is never finished. You always have to revalidate models, requirements, and data sources.
It’s always good to know which methods and standards are available, but applying them to every project without adaptation won’t work. That lesson was already learned in the era of the V-Model. Always keep in mind: each project is unique and has never been seen before.