Data Science workflow

Sourabh Potnis
3 min read · Feb 17, 2022

Photo by Faris Mohammed on Unsplash

In the previous post, we discussed the current Data Science landscape.

Data Science is about solving business problems to deliver value.

Modelling is at the core of Data Science/Machine Learning, but it is just one part of the puzzle. To solve the entire puzzle successfully, it is equally important to know and execute what happens before and after Modelling, i.e. Business/Data Understanding, MLOps and Monitoring.

In this post, we will discuss the Data Science workflow and its steps, which are iterative in nature rather than a one-way waterfall.

Data Science workflow

Traditionally, CRISP-DM (CRoss Industry Standard Process for Data Mining) and SEMMA (Sample-Explore-Modify-Model-Assess) have been used as standard frameworks for Machine Learning and Data Science. With the Data Science landscape and its problems getting bigger and more complex day by day, there is a need to tweak and evolve these frameworks. Along with Modelling, equal importance needs to be given to Business/Data Understanding, MLOps, Monitoring and an iterative feedback and improvement process.

Business understanding and problem definition —

As a data scientist, you need to understand the needs and pain points of the client/stakeholders, which can be competitive, organizational, financial and/or operational in nature. Apply structured, hypothesis-driven thinking to frame the business problem and convert it into a machine learning problem.

Data understanding and exploration —

Once the business problem and the ML problem are defined, the next step is to identify and gather the data, understand its schema and meaning, and then explore it.

Data preparation —

Data preparation includes joining data from multiple sources, cleaning the data (handling missing or incorrect values and outliers), normalizing or standardizing it, handling imbalanced data, feature engineering, etc.
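
To make a few of these steps concrete, here is a minimal sketch in plain Python. It assumes a single numeric column held in a list; the `ages` values, the clipping bounds and the helper names are hypothetical, chosen only for illustration. In a real project you would more likely use a dataframe library, and the right imputation and outlier strategy depends on the data.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, lower, upper):
    """Clip every value into the closed range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical column with one missing value and one implausible outlier.
ages = [25, None, 40, 31, 120, 28]

imputed = impute_median(ages)
clipped = clip_outliers(imputed, lower=0, upper=100)
cleaned = standardize(clipped)
```

The order matters here: imputing before clipping keeps the median based on observed values, and standardizing last means the scaled column reflects the cleaned data rather than the raw outlier.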

Modelling —

Based on the problem at hand, we create a machine learning model that is descriptive, predictive or prescriptive in nature, using supervised, unsupervised or reinforcement learning.

Evaluation —

All the candidate models need to be evaluated against the chosen evaluation criteria and fine-tuned before a final model is selected.
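
As an illustration, the sketch below compares two candidate classifiers on precision, recall and F1, and picks the one with the higher F1. The labels, the predictions and the model names (`model_a`, `model_b`) are made up for the example, and F1 is only one possible criterion; the right metric depends on the business problem defined earlier.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class of a classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical ground truth and predictions from two candidate models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
candidates = {
    "model_a": [1, 0, 1, 0, 0, 1, 1, 0],
    "model_b": [1, 1, 1, 1, 0, 1, 1, 1],
}

scores = {name: precision_recall_f1(y_true, preds)
          for name, preds in candidates.items()}

# Select the candidate with the highest F1 (index 2 of each score tuple).
best = max(scores, key=lambda name: scores[name][2])
```

Note that `model_b` has perfect recall but lower precision, so which model is "best" changes if the business cares more about catching every positive than about avoiding false alarms.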

Deployment/MLOps —

Business value is generated only when the model is deployed in production and users are actively using it.

Monitoring —

Once the model is deployed, you need to monitor the data, the pipelines and the model itself.
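
One common way to monitor the data is to check whether a feature's distribution in production has drifted away from what the model saw in training. The sketch below uses the Population Stability Index (PSI) for this. It is only one possible approach; the equal-width binning, the sample values and the alert threshold are assumptions made for illustration.

```python
import math


def population_stability_index(expected, actual, bins=4):
    """PSI between a reference sample (expected) and a new sample (actual).

    Bins are equal-width over the range of the reference sample; values
    outside that range are counted in the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values at training time and in production.
training = [1, 2, 3, 4, 5, 6, 7, 8]
production = [6, 7, 8, 8, 7, 9, 8, 7]

# Assumed threshold: 0.25 is a frequently quoted rule of thumb, not a standard.
ALERT_THRESHOLD = 0.25

psi = population_stability_index(training, production)
drift_detected = psi > ALERT_THRESHOLD
```

A check like this would typically run on a schedule for each important feature, alongside pipeline health checks and tracking of the model's own performance metrics.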

Feedback —

You need to integrate feedback from the client/stakeholders iteratively at each step of the workflow. This ensures that we hear all the stakeholders and do not just solve what we think is the problem.

Audit, Fairness, Privacy, Compliance, Regulations —

These aspects are very important, and you should consider them at each step of the workflow, as they affect your choice of model/solution and deployment.

Presentation —

When presenting the problem definition, ideas, solution, analysis, model and insights, show the business impact/value using charts and numbers wherever possible. Your analysis and insights should be actionable and easy to understand.

In the next chapter, we will take a deep dive into business understanding and defining a problem.
