DMAIC’s importance in data projects

DMAIC, one of the core tools of the Six Sigma methodology, has direct application to data projects.

DMAIC is an acronym for Define, Measure, Analyze, Improve and Control. It refers to an improvement cycle that leverages data to generate incremental process improvements.


“person raising left hand” by julio casado on Unsplash

The definition step in a data project is the most important one: it sets the scope of the opportunity and the relevant metrics that we wish to impact.

It is an area where data or quantitative product managers can lead the way, defining which metrics we want to impact and which metrics we care about not impacting negatively. It is important to make explicit which tradeoffs we are willing to make. For instance, in an e-commerce setup, trying to increase the overall conversion rate might decrease average order value (AOV). Up to what level are we willing to sacrifice AOV for a higher conversion rate? These questions are usually best answered upfront, as the answers make explicit what a true opportunity represents.
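One way to make the conversion versus AOV tradeoff explicit is to compute a break-even point. The sketch below assumes that revenue per visitor (conversion rate times AOV) is the quantity we do not want to decrease; that objective, the function names and the example figures are illustrative assumptions, not part of the original discussion.

```python
def break_even_aov(baseline_cr, baseline_aov, new_cr):
    """AOV at which revenue per visitor is unchanged after a conversion change.

    Assumes revenue per visitor = conversion rate * average order value.
    """
    if new_cr <= 0:
        raise ValueError("new_cr must be positive")
    return baseline_cr * baseline_aov / new_cr


def max_acceptable_aov_drop(baseline_cr, baseline_aov, new_cr):
    """Largest relative AOV decrease that still keeps revenue per visitor flat."""
    return 1 - break_even_aov(baseline_cr, baseline_aov, new_cr) / baseline_aov


# Hypothetical figures: conversion moves from 2.0% to 2.5%, baseline AOV is 50.
floor = break_even_aov(0.02, 50.0, 0.025)
drop = max_acceptable_aov_drop(0.02, 50.0, 0.025)
print(f"Break-even AOV: {floor:.2f}, acceptable AOV drop: {drop:.1%}")
```

With these made-up numbers, an AOV decline of up to roughly 20% would still leave revenue per visitor unchanged. Agreeing on such a threshold during the definition step gives the later steps a concrete guardrail.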

It is also where product work can help identify where potential high-level opportunities lie, through benchmarks or opportunities identified elsewhere. These can serve to establish a goal and a business case for the project.

Part of the definition role is also to define and scope how we plan to impact the different metrics. It is very easy to get lost in data discovery when facing large and varied datasets. Having a definition of how we plan to generate impact narrows the scope of the discovery effort and creates focus. For instance, to increase ad revenue for an advertising platform, one might make the explicit choice to show more relevant ads out of the existing inventory rather than to pursue a project impacting inventory acquisition.

Certain methods traditional to Six Sigma can help in the definition phase, such as Voice of the Customer (VOC) collection or a Critical to Quality (CTQ) tree.


“selective focus photography of tape measure” by Siora Photography on Unsplash

The measurement process normally relies on collecting and surfacing the information needed to improve the process or product, and on establishing baselines. Depending on the specific situation, the data collection and surfacing strategy can involve the creation of data pipelines, aggregates (cubes), dashboards, surveys …

A data collection strategy is core to this process. In this step we need to make sure that we gather the information needed both for our core metrics and for the analysis step, such as potential factors impacting our metrics. The data collection plan involves identifying sources of data, the minimum amount of data required, the data collection effort, operational measures that can serve as proxies for certain metrics, the sampling rate … Master of Project Academy goes into more detail on how to create a full-fledged data collection plan. For a lot of data projects this goes into more detail than necessary, but it can serve as an exhaustive example to guide your data collection process and requirements.
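As an illustration of the "minimum amount of data required" item in the plan above, here is a minimal sketch that estimates how many observations per group are needed to detect a change in a rate such as conversion. It uses the standard normal-approximation formula for comparing two proportions; the function name, the default significance level and power, and the example rates are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_baseline, p_target, alpha=0.05, power=0.8):
    """Approximate observations per group to detect p_baseline -> p_target.

    Normal approximation for a two-sided test on two proportions.
    """
    if p_baseline == p_target:
        raise ValueError("p_baseline and p_target must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical example: detecting a move from 2.0% to 2.5% conversion.
print(sample_size_per_group(0.02, 0.025))
```

Smaller expected effects require far more data, which is worth knowing before committing to a collection effort or a sampling rate.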

It should also be accompanied by a data surfacing strategy. One of the biggest challenges in data projects is ensuring data quality. The measurement process should try to surface the data to as many people as possible and to have as many feedback loops as possible. Measurement errors have a drastic impact on analyses, decisions, model predictions and objectives. In the process of surfacing data, a lot of issues in the data collection will come to light and will need to be cleansed. Bertil Hatt, in his post What does bad data look like, for instance highlighted a number of data quality issues that can be tackled as part of a surfacing effort.
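A small summary of basic quality checks is one thing that can be surfaced alongside the data itself. The sketch below counts missing values, duplicate keys and out-of-range values over a list of records. The record layout, field names and valid range are hypothetical, and a real project would likely cover more issue types than these three.

```python
def data_quality_report(records, required_fields, key_field, ranges):
    """Count missing values, duplicate keys and out-of-range values."""
    missing = {field: 0 for field in required_fields}
    out_of_range = {field: 0 for field in ranges}
    seen_keys = set()
    duplicate_keys = 0
    for record in records:
        for field in required_fields:
            if record.get(field) is None:
                missing[field] += 1
        key = record.get(key_field)
        if key is not None:
            if key in seen_keys:
                duplicate_keys += 1
            seen_keys.add(key)
        for field, (low, high) in ranges.items():
            value = record.get(field)
            if value is not None and not (low <= value <= high):
                out_of_range[field] += 1
    return {
        "rows": len(records),
        "missing": missing,
        "duplicate_keys": duplicate_keys,
        "out_of_range": out_of_range,
    }


# Hypothetical order records with a few deliberate problems.
orders = [
    {"order_id": 1, "amount": 25.0, "country": "FR"},
    {"order_id": 2, "amount": None, "country": "DE"},
    {"order_id": 2, "amount": 40.0, "country": None},
    {"order_id": 3, "amount": -5.0, "country": "FR"},
]
report = data_quality_report(
    orders,
    required_fields=["order_id", "amount", "country"],
    key_field="order_id",
    ranges={"amount": (0, 10000)},
)
print(report)
```

Publishing such counts on a dashboard gives the people closest to the data an easy way to spot and report collection problems.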

In a lot of data science projects, the value of this step is underestimated. A lot of impact can come from more traditional analytics projects that leverage data that has previously been collected and surfaced. Having analytical employees dedicated to the particular domain drastically speeds up the measurement process, as they can source the data quickly and pinpoint the potential pitfalls in the collected datasets.

Depending on the particularities of the team, this is an area that might require a lot of cross-functional work from product, user research, data engineering, data science, analysts and domain experts.


“white spring notebook” by rawpixel on Unsplash

The analysis step should focus on identifying relationships between the factors and on trying to understand the causal relationships between the variables and our core metrics. It is used to “validate” or “refute” hypotheses generated during the definition phase.

Deep dives into different dimensions to identify sub-segments behaving in different ways, correlation analyses, propensity models, pre-post analyses … are the general tools of the trade of the analysis step. The setup of the analysis can be quite varied, ranging from testing new predictors to see if they generate an uplift in a predictive model’s accuracy, to seeing if watching a cat video on your website makes you more prone to purchasing, to more traditional Six Sigma techniques such as the five whys.

Quantitative vs. Qualitative Methods by Matt Lavoie

The type of analysis carried out during this step need not be purely quantitative but can also be of a qualitative nature. As noted in @indeed-data-science’s blog post, being able to mix both generally enhances the breadth of the analysis and can help correlate certain factors and direct the quantitative analysis process.

The analysis step is meant to identify the leverage points among our factors with respect to our core metrics and to help define the next steps for improvement. Through the analysis we should get a better understanding of what to prioritize during the improvement process, as well as a benchmark for the scale of impact we can have by improving a certain factor by a certain amount and for the feasibility of doing so.


“man in green long-sleeved shirt doing a push-up on gray concrete pavement” by Sammy O. on Unsplash

The improvement process should be data driven and informed by the analysis step. The type of improvement should be guided by the scope of the project and can range from making UI changes based on an analysis to impact conversion rate, to changing pricing rules on the website to decrease negative-margin sales, to sending emails to specific customer segments to improve the relevancy of communication, to putting new predictive models live to increase video engagement.

What has been identified as a potential improvement during the analysis step should be piloted and evaluated. The aim during that step should be to impact the underlying factors or root causes of the process to drive lasting improvement. The focus of the improvement process is on the simplest and easiest solution to the problem.

Experimentation frameworks such as experimental design, typically popularized as A/B testing, can be used to find potential solutions and improvements without fully implementing them. Machine learning processes also benefit from experimentation frameworks, where A/B testing in production is usually referred to as an online test.

Focusing on the simplest and easiest solution and using an experimental framework allows for a plan, do, check, act continuous improvement cycle to test different solutions or variations, validating our understanding of the underlying factors as we go along.
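The "check" part of such a cycle can be sketched as a two-proportion z-test comparing the conversion rate of a control group with that of a variant. This is one common way to evaluate an A/B test, not the only one; the function name and the counts in the example are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    Returns the z statistic and the p-value, using a pooled standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical pilot: 200 of 10,000 visitors convert in control, 250 of 10,000 in the variant.
z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the variant's conversion rate differs from control by more than chance alone would explain. Whether the size of the change justifies a full rollout remains a judgment against the tradeoffs set during the definition step.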


“person holding airplane control panel” by Chris Leipelt on Unsplash

Introducing a certain level of control ensures that we are not faced with regressions and that the improvements made are lasting. Implementing A/B test holdouts, weekly business reviews (WBR) and variance analyses helps put in place the tools and processes to monitor and track these potential regressions.
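One simple way to monitor for regressions is a control-chart style check: derive limits from a baseline period and flag later observations that fall outside them. The sketch below uses the mean plus or minus three sample standard deviations, a common convention that is assumed here rather than taken from the text above; the metric values are invented.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper limits: baseline mean +/- `sigmas` sample standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(values, limits):
    """Indices of observations that fall outside the control limits."""
    low, high = limits
    return [i for i, value in enumerate(values) if value < low or value > high]


# Hypothetical daily conversion rates (in %) before and after a release.
baseline = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
recent = [2.05, 1.95, 1.4, 2.1]

limits = control_limits(baseline)
print(out_of_control(recent, limits))
```

Flagged points are a prompt for investigation in a weekly business review, not proof of a regression on their own.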

Going through a data-driven journey is an iterative process. Don’t assume that it is feasible to go straight to advanced analytics without having gone through a DMAIC loop a few times and having installed some control process on the data. Data that is not under a control process can be of poor quality, yielding models with little to no business value.

DMAIC has been used to improve processes from manufacturing to UX. In data projects, one of the areas where DMAIC particularly shines is improving data quality throughout the different iterations of the process, yielding drastic improvements to the output of the analytical processes.

It is therefore essential to first establish the processes needed to control the data. This usually involves a lot of data engineering work for data collection, and analyst work to build reports and dashboards and embed the collected data into the business.