Data Scientists & the Six Sigma Philosophy

Karthik Vadhri · Published in Intuition Matters · Feb 14, 2022 · 5 min read

Data-driven decision systems not only radically transform businesses but also help target new markets, address customer pain points, boost revenue, and much more. This narrative has made data science the hottest job: the big data and business analytics market was valued at US$193.14 billion in 2019 and is projected to reach US$420.98 billion by 2027, growing at a CAGR of 10.9% from 2020 to 2027.

Photo by Myriam Jessier on Unsplash

There is another narrative that has been gaining traction over the last couple of years. Multiple reports published by Gartner, CIO, and Venturebeat.ai have mentioned that about 85% of data science/big data projects fail.

The reason for this could be the fundamental gap in what is expected from data science teams across industries. Data science is perceived as a magic wand that has solutions to all problems, and investing big budgets in the magic wand is seen as the solution!

Data Science is an enabling function; it can only give the operational/business units insights on what to do, based on the as-is process and the data available.

Data Science is not a magic wand that has solutions to all business problems.

Below are a few reasons why data science projects fail:

  • No clear definition of the problem to be solved.
  • An unrepresentative sample of data being considered.
  • Lack of data transparency.
  • Labelling issues — problems with how data is labelled at the source.
  • Spurious correlations — apparent relationships among data points that may simply be due to wrong entries.
  • Lack of relevant data.

And the list goes on.

Bob Violino, in a CIO article, has outlined 8 primary reasons why data science projects fail. Check out his piece, 8 reasons why data science projects fail, for an elaborate explanation.

Common to all the reasons listed above is a lack of understanding of the as-is process. This is where the Six Sigma philosophy can come to the rescue: it helps ask the right questions, ensures the right data is being used, and ensures there are no gaps in the data collection process.

Consider a scenario where a group of data scientists is trying to predict daily sales of a B2B product for the next quarter based on the historic data available. The forecasting model predicted a spike in sales every Friday. This seemed counter-intuitive, and upon deeper interrogation it was learnt that the sales team has a habit of entering the entire week's sales data on Friday, which they treat as a free day, preferring to close the week without any backlogs.
In this scenario, the algorithm is not at fault for picking up the Friday spike; such gaps in the process create spurious patterns and lead to erroneous results.
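A minimal sketch of how such a day-of-week artifact could surface during exploratory analysis. The `order_date` and `sales` columns and the numbers are synthetic assumptions for illustration, not data from the scenario above:

```python
import numpy as np
import pandas as pd

# Synthetic daily B2B sales: flat Monday-Thursday, inflated on Fridays because
# the sales team batches the whole week's entries on that day (hypothetical data).
rng = np.random.default_rng(42)
dates = pd.date_range("2021-10-01", "2021-12-31", freq="D")
sales = rng.normal(loc=100, scale=10, size=len(dates))
sales[dates.dayofweek == 4] *= 4  # artificial Friday spike
df = pd.DataFrame({"order_date": dates, "sales": sales})

# Average sales by day of week: a large, systematic gap between Friday and the
# rest is a prompt to question the data-entry process, not to model the spike.
by_dow = (
    df.assign(day=df["order_date"].dt.day_name())
      .groupby("day")["sales"]
      .mean()
      .sort_values(ascending=False)
)
print(by_dow)
```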

Six Sigma — the process-oriented philosophy that holds answers to a lot of questions about why data science use cases aren't successful.

The practice of Six Sigma goes back to the 1970s and before, when companies found success in improving process quality, and it has evolved over the decades. According to ASQ, the definition goes as follows: "Lean Six Sigma is a fact-based, data-driven philosophy of improvement that values defect prevention over defect detection. It drives customer satisfaction and bottom-line results by reducing variation, waste, and cycle time, while promoting the use of work standardization and flow, thereby creating a competitive advantage. It applies anywhere variation and waste exist, and every employee should be involved."
Six Sigma is focused on addressing the causes that drive a particular outcome, and tries to identify the relation between the output Y and the inputs X through quantitative & qualitative approaches:

Y=f(X) is the fundamental philosophy of Six Sigma.
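In data science terms, estimating f is exactly what model fitting does. A minimal, illustrative sketch on synthetic data — the two input features and their coefficients are invented for the example, not prescribed by Six Sigma:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Y = f(X): estimate how the inputs drive the outcome.
# Synthetic example: two process inputs with known (invented) effects plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # e.g. machine temperature, cycle time (hypothetical)
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

model = LinearRegression().fit(X, y)
print("estimated f(X) coefficients:", model.coef_)   # roughly [3.0, -1.5]
print("intercept:", model.intercept_)
```

The same Y = f(X) framing works whether the estimator is a regression, a tree ensemble, or a qualitative root-cause analysis; the point is that both disciplines reason from inputs to outcome.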

Six Sigma is a set of tools and methodologies for creating continuous process improvement. Irrespective of the industry, it can be adopted to improve a company's overall operational efficiency.
Six Sigma is focused on improving the quality of a process through one of the following:
  • Reducing defects
  • Reducing variation
  • Delighting the customer
  • Reducing cost
  • Reducing cycle time

Isn’t it true that, in most organizations, most of the use cases data science teams work on fall into one of these categories?

Six Sigma uses the DMAIC (Define, Measure, Analyze, Improve, Control) implementation framework — a data-driven quality strategy used to improve/optimize an existing process and achieve fewer than 3.4 defects per million opportunities.

A summary view of the DMAIC methodology, referenced from "Quality, a Key Value Driver in Value Based Management" — scientific figure on ResearchGate, by Piero Mella.
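For context, the 3.4 figure above comes from the DPMO (defects per million opportunities) calculation. A small illustrative sketch, with made-up counts, of how DPMO is computed and compared against the Six Sigma target:

```python
def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """Defects per million opportunities."""
    return defects / (units * opportunities_per_unit) * 1_000_000

# Hypothetical process: 12 defects found across 2,000 units,
# each unit having 5 opportunities for a defect.
score = dpmo(defects=12, units=2_000, opportunities_per_unit=5)
print(f"DPMO: {score:.1f}")                           # 1200.0
print("meets the Six Sigma target (< 3.4 DPMO):", score < 3.4)
```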

Six Sigma — A consolidated & comprehensive approach to address core quality issues through a set of tools & techniques

Six Sigma and its similarities with Data Science

Image: how each phase of the Six Sigma DMAIC cycle maps to the data science life cycle.

The Six Sigma methodology is analogous to the standard data science life cycle in a lot of aspects.
The Define phase, in which the core problem is identified, is largely equivalent to the problem definition phase in data science.
The Measure phase, where the interest is in characterizing the current process, is similar to the Exploratory Data Analysis step of the data science life cycle.
The Analyze phase, where the current process is analyzed, is similar to the data cleaning & modelling phase of the data science life cycle.
The Improve phase, in which solutions are implemented, corresponds to the model building phase. It is important to note that a lot of Six Sigma projects fail at this stage, just like the model deployment phase of the data science life cycle.
The Control phase, for monitoring the implemented improvements and ensuring corrective actions are taken, is equivalent to the deployment & operationalization phase.
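One concrete parallel: the control charts used in the Six Sigma Control phase map naturally onto monitoring a deployed model. A hedged sketch, with synthetic prediction errors and assumed 3-sigma limits, of flagging drift after deployment:

```python
import numpy as np

# Hypothetical daily prediction errors from a deployed forecasting model.
rng = np.random.default_rng(7)
errors = rng.normal(loc=0.0, scale=2.0, size=60)
errors[45:] += 8.0  # simulated drift starting on day 45

# Control limits derived from a stable baseline window, as a control chart would use.
baseline = errors[:30]
center = baseline.mean()
sigma = baseline.std(ddof=1)
upper, lower = center + 3 * sigma, center - 3 * sigma

# Days breaching the limits would trigger corrective action, e.g. retraining
# the model or fixing an upstream data issue.
out_of_control = np.where((errors > upper) | (errors < lower))[0]
print("days breaching the 3-sigma limits:", out_of_control)
```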

Six Sigma is a philosophy that creates intuition for data scientists.

Below are a few fundamentals of Six Sigma that are already employed, under different terminology, across the data science life cycle.

Image: Six Sigma concepts used interchangeably across data science.

As outlined in my previous article on closed-world thinking in data science, data cleaning & pre-processing happen at every stage of the life cycle. Along the same lines, this article by Maciek Lasota outlines how Six Sigma can help with data quality improvements, a crucial element of the data science process.

Six Sigma is not the solution for all data science use cases; it is a thought process that:

  • Ensures the right problem is being solved
  • Guarantees that there are no flaws in the data collection process
  • Facilitates the right questions being asked, before jumping into creating a solution.

Six Sigma helps put the analysis in perspective

About Intuition Matters:
Intuitive understanding can help everything else snap into place. Learning becomes difficult when we emphasize definitions over understanding. The modern definition is the most advanced step of thought, not necessarily the starting point. Intuition Matters in everything, and it matters the most!
