An actionable and structured framework for data science projects

Abhiram Prasad
MiQ Tech and Analytics
10 min read · Jun 10, 2021

More often than not, big data and data science projects that would have taken months to develop hardly make it to the production phase and get shelved in the experimentation phase. According to VentureBeat and Gartner, 87% of data science projects and 85% of big data projects never make it to production. What is more frightening is that, according to a recent Gartner report, only 15% to 20% of data science projects get completed. Of those projects that did complete, CEOs say that only about 8% generate value.

There might be multiple reasons why a data science-driven project does not make it to the end. Some of the most common problems include:

  1. Identifying and solving the wrong problem: Some projects start with an extremely fuzzy idea of the goal; sometimes the goal is not realistic, and at other times it adds hardly any business value. To mitigate this risk, a significant amount of time needs to be spent in the discovery phase, asking the right questions and diving deep to truly understand the underlying problem. Any project can be broadly divided into three phases: discovery, implementation, and measurement. Spending 80–90% of the time on implementation while neglecting discovery and measurement will eventually fail to add measurable value to the business. Ideally, around 25–30% of the time should go to problem discovery, 40% to implementation, and the remaining 30% to measurement and maintenance. The discovery phase involves identifying and prioritizing the right problem using a prioritization framework. Spending a good amount of time in discovery prevents a team from becoming a feature-building factory and ensures high-impact products are developed. Likewise, effort in the measurement phase lets us quantify the impact and value the product added to the business in terms of the success metrics, and use that feedback to further improve the product.
  2. Not having the right data: Data is the fuel of any data science project. Not having accurate, reliable, and unbiased data can severely hamper a project's success. Thorough data sanity checks, along with research on what data is available and how it is obtained, collected, and stored, help mitigate the risk of incorrect or inaccurate data (a minimal sketch of such checks follows this list).
  3. No strict framework: When the data science team lacks discipline, crucial steps like business problem understanding, success metric definition, and extensive EDA get skipped, and data scientists are tempted to jump straight to the modeling stage. A standardized, structured framework for executing a data science project increases the chances of the project delivering business value.
  4. Lacking constant, actionable feedback from the stakeholders: When there is no regular communication with stakeholders during the business understanding phase, and no feedback is collected on the project's progress, the project can drift away from what the business actually needs.
  5. Overcomplicating problems: More often than not, data scientists are tempted to build overly complex models when a simple one would do just as well. Even when a simple solution is not quite as good, it is important to weigh the trade-off: complex models are often not interpretable, which means we would not have a clear idea of when the model will perform well versus when it will fall short. This can be avoided by asking the right questions during the model selection and training phase.
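As a minimal illustration of the sanity checks mentioned in point 2, the sketch below runs a few basic checks with pandas on a hypothetical dataset; the file name and the column names (event_date, spend) are assumptions for illustration, not part of any specific framework.

```python
import pandas as pd

# Hypothetical dataset; the file and column names are placeholders for illustration only.
df = pd.read_csv("events.csv", parse_dates=["event_date"])

# Basic shape and completeness checks before any modelling work starts.
print(df.shape)                                        # expected row/column counts
print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column
print(df.duplicated().sum())                           # exact duplicate rows

# Range and freshness checks: violations hint at collection or pipeline issues.
assert (df["spend"] >= 0).all(), "negative spend values found"
assert df["event_date"].max() >= pd.Timestamp.today() - pd.Timedelta(days=7), \
    "data looks stale (no events in the last week)"
```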

How does a structured framework help in building successful data science projects?

80–85% of projects fail before completion, and there is a further drop-off when the implemented project does not deliver value

Having a well-defined framework that all data science teams working on multiple projects follow helps the teams work more efficiently, brings in greater visibility, and ensures that every project has gone through certain stages before being brought into production. For the data science teams in our organization, we came up with a framework with the following objectives in mind:

  1. Bring in higher efficiency and improve the success rate: Sometimes data scientists are tempted to jump straight into implementation, only to later wonder how to tie the effort back to business value. The time and effort spent may go to waste, decreasing efficiency. A framework that ensures and monitors that every data science project passes through the stages from business understanding to validation and maintenance improves the chances of success, since all the required questions are answered and steps are taken before moving on to the next stage.
  2. Greater visibility across teams and the leadership: The framework improves cross-team visibility, allowing data scientists, product managers, and product analysts to get a clear (and, if someone is interested in working with that team, in-depth) idea of what the other teams are working on and the key results and learnings in each stage, which might help them in their own projects. The leadership team also gets a clear picture of which stage each project is in, how impactful a project has been, which projects can be shelved, and so on.
  3. Troubleshooting: It often becomes difficult to troubleshoot why a model's performance is not up to the mark, or why the intended business value is not being driven, once the project has reached implementation. Because the framework keeps track of what happened and the key decisions taken in each stage, it becomes easier to troubleshoot later and attribute poor performance to a specific step. For instance, when the intended business value is not being delivered, the cause may turn out to be an unclear success metric definition during the business understanding phase.
  4. Greater confidence in the results and easier validation: It can be difficult to attribute business impact to a project. When a project has gone through a fixed and validated framework, we can have greater confidence in the results, and since all the success metrics were clearly defined up front, validation becomes easier.

With the above objectives in mind, we needed to develop a standardized framework that all data science projects can use. It consists of multiple stages, and each stage contains a set of questions that the people working in that stage should keep in mind and key in once they are done with that particular stage. The main purpose of the questions is not to be prescriptive (that cannot be generalized); instead, the goal is to record the key steps and decisions taken, the reasoning behind them, and the key results obtained in each stage.

This really amazing flowchart shared in the Microsoft build documentation shows the data science life cycle beautifully.

Data Science Lifecycle

Within our organization, we have created a framework on Flock and JIRA that consists of all the stages a data science project has to go through; each stage has a questionnaire that the people responsible for that stage key in at its completion (in terms of development). The lifecycle of a DS project is broadly divided into six phases:

1. Business and requirement understanding: This phase falls under the discovery stage, which is extremely crucial. More often than not this stage is overlooked, or not enough time is spent here. Not having a clear understanding of the requirements essentially means not knowing the purpose of the project, which in turn means not knowing how to use the results to add business value. This leaves a very high chance of the project getting shelved after the experimentation phase. Hence we have proposed four sub-stages under this stage:

  • Define the business problem: Completing this stage ensures that every project has a clear business problem definition, an explanation of why it is a problem, the success metrics, the key stakeholders, and where it stands in terms of priority (using a prioritization framework like RICE; see the sketch after this list).
  • The decision to use a DS/Non-DS approach: By the end of this stage we know what kind of project it is going to be: whether the problem can be solved with MIS, detective analysis, or dashboarding, or whether it is a forecasting or predictive problem. To make this decision we need answers on the expected scale of impact, whether we have the data sources, and a broad-level approach to the problem.
  • Converting the business problem to a clear data problem: This is the stage where a business problem is converted into a clear data problem. Most business problems can be expressed as some kind of mathematical formulation where we clearly know our input and output metrics; for example, "reduce churn" could become "predict the probability that a customer churns in the next 30 days, given their recent activity".
  • Hypothesis building: In this stage, we come up with exhaustive hypothesis statements that can be tested. By the end of this stage, we will have either rejected or failed to reject the hypotheses we made (a small illustration of testing one such hypothesis is sketched below).
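The RICE prioritization mentioned in the first sub-stage reduces to a simple formula, (Reach × Impact × Confidence) ÷ Effort. The sketch below shows one possible way to rank candidate problems with it; the candidate names and scores are made up purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    reach: float       # e.g. users or campaigns affected per quarter
    impact: float      # relative impact score (e.g. 0.25 / 0.5 / 1 / 2 / 3)
    confidence: float  # 0.0 - 1.0
    effort: float      # person-months

    @property
    def rice(self) -> float:
        # Standard RICE score: (Reach * Impact * Confidence) / Effort
        return self.reach * self.impact * self.confidence / self.effort

# Illustrative candidates only; the numbers are not from the article.
backlog = [
    Candidate("Bid optimisation model", reach=500, impact=2, confidence=0.8, effort=3),
    Candidate("Reporting dashboard", reach=200, impact=1, confidence=0.9, effort=1),
    Candidate("Churn prediction", reach=150, impact=3, confidence=0.5, effort=4),
]

# Rank the backlog by RICE score, highest first.
for c in sorted(backlog, key=lambda c: c.rice, reverse=True):
    print(f"{c.name}: RICE = {c.rice:.1f}")
```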
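For the hypothesis-building sub-stage, a hypothesis is only useful if it is concrete enough to test. As a hedged illustration, the snippet below frames a made-up hypothesis ("capped campaigns convert better than uncapped ones") as a two-sample t-test on simulated numbers; in a real project the two samples would come from your own data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated conversion rates for two campaign groups (illustrative only).
capped = rng.normal(loc=0.031, scale=0.01, size=200)
uncapped = rng.normal(loc=0.028, scale=0.01, size=200)

# H0: the mean conversion rate is the same for both groups.
t_stat, p_value = stats.ttest_ind(capped, uncapped, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the difference is statistically significant.")
else:
    print("Fail to reject H0.")
```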

2. Data and strategy: This phase falls under the implementation stage and covers data strategy, exploratory data analysis, and feature transformation and preparation. In this phase, we build a deep and extensive understanding of our data and make it ready for the model building phase.

  • Data strategy: In this stage, we identify whether we have all the required data sets, form a rough idea of what our training and test datasets will look like, and decide the frequency of the training and testing pipeline.
  • Initial data exploration: This stage involves carrying out steps such as variable identification, univariate and bivariate analysis, and missing value and outlier treatment (a minimal sketch follows this list). Here we also get an idea of the data volume we are dealing with, and come up with insights that validate some of our initial hypotheses. The questions are framed so that we have a record of how each of the above steps was carried out.
  • Data preparation and transformation: In this stage, we keep a record of how the data pre-processing was done: deriving significant features, reducing dimensionality, feature selection, scaling, and transformation to bring the data into a form consumable by the ML algorithms (one possible way to capture this is sketched below).
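A minimal sketch of the initial data exploration steps above, using pandas. The input file, the impressions column used for the outlier check, and the conversions target are assumptions for illustration only.

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical input file

# Variable identification: types and cardinality.
print(df.dtypes)
print(df.nunique().sort_values())

# Univariate analysis: distribution summaries for numeric columns.
print(df.describe())

# Missing values: share per column, highest first.
print(df.isna().mean().sort_values(ascending=False))

# Simple outlier check using the IQR rule on one numeric column (column name assumed).
q1, q3 = df["impressions"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["impressions"] < q1 - 1.5 * iqr) | (df["impressions"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers in 'impressions'")

# Bivariate analysis: correlation of numeric features with the target (target name assumed).
print(df.corr(numeric_only=True)["conversions"].sort_values(ascending=False))
```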
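One convenient way to record the data preparation and transformation steps is as a scikit-learn pipeline, so that exactly the same transformations are applied to training, test, and scoring data. The sketch below is one possible layout; the feature lists are assumed, not prescribed by the framework.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assumed feature lists; in practice these come out of the EDA stage.
numeric_features = ["impressions", "clicks", "spend"]
categorical_features = ["device", "geo"]

preprocess = ColumnTransformer([
    # Numeric: fill missing values with the median, then scale.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    # Categorical: fill missing values with the most frequent value, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

# `preprocess` can later be combined with an estimator in a single Pipeline so that the
# recorded transformations are reproducible across train, test, and scoring data.
```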

3. Model Building: This is the core of any data science project, where we train and tune the model to improve a chosen evaluation metric.

  • Choosing the right evaluation metric: In this stage, we identify all the possible evaluation metrics for the project and choose the one that aligns with the business objective and the success metric we are trying to improve. By this stage, we also have an idea of which factors drive better values of the evaluation metric.
  • Model selection and hyperparameter tuning: This stage keeps a record of how model selection was done, how cross-validation was performed, whether the test dataset was used anywhere to select the model, and how train and test performance compare on the chosen metric. It also records how hyperparameter tuning was done, the criteria for the optimal hyperparameters, whether the tuning can be automated, and so on (see the sketch below). Keeping a record of this stage primarily helps later in troubleshooting, to narrow down the reason for unexpected model performance.
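A hedged sketch of how the evaluation metric, cross-validation, and hyperparameter tuning described above could be recorded in code with scikit-learn. The synthetic data, model choice, metric, and parameter grid are all assumptions used only to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the prepared feature matrix.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold the test set out completely; it must not influence model selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
    scoring="roc_auc",   # the evaluation metric chosen to match the business objective
    cv=5,                # 5-fold cross-validation on the training data only
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("CV score:", round(search.best_score_, 3))
# The held-out test set is touched only once, in the testing phase, to report final performance.
```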

4. Testing: In the testing phase we test the tuned and trained model on a completely new and unseen dataset. This gives us an idea of how the model is going to perform on real-world data. The testing phase is divided into two stages: model testing and historical data testing.

  • Model Testing: This stage ensures that the trained model is tested with a completely unseen test dataset, not a validation set. We also answer questions such as: what was the difference in performance between the train and test datasets, was there any evidence of overfitting, and did the test dataset go through the same processing as the training data?
  • Historical data testing: In this stage, we test the model on multiple sets drawn from historical data. We ensure that the data sets chosen are not biased, and compare the performance against what was obtained in model testing. Reasons for cases where performance is below average are also identified (a small sketch of both testing stages follows).
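Continuing the hypothetical scikit-learn example from the model building phase, the sketch below covers both testing stages: comparing train versus unseen test performance to spot overfitting, and scoring several data slices that stand in for historical periods. The 0.05 gap threshold and the random slicing are illustrative assumptions, not part of the framework.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# `search`, `X_train`, `y_train`, `X_test`, `y_test` come from the model building sketch above.
best_model = search.best_estimator_

# Model testing: compare train vs completely unseen test performance to spot overfitting.
train_auc = roc_auc_score(y_train, best_model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"train AUC = {train_auc:.3f}, test AUC = {test_auc:.3f}")
if train_auc - test_auc > 0.05:
    print("Large train/test gap: possible overfitting, revisit model selection.")

# Historical data testing: score several slices (here random subsets of the test set stand in
# for monthly historical snapshots) and flag periods where performance drops noticeably.
rng = np.random.default_rng(0)
for period in range(1, 4):
    idx = rng.choice(len(X_test), size=200, replace=False)
    auc = roc_auc_score(y_test[idx], best_model.predict_proba(X_test[idx])[:, 1])
    print(f"historical slice {period}: AUC = {auc:.3f}")
```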

5. Scoping and developing an activation and validation strategy: In this stage, we scope out an activation and validation strategy. We answer some of the most important questions, such as how we will use the model results (which should also have been discussed in the business understanding stage). We also scope how we will measure the business value on the agreed success metrics, the buffer period after predictions before validation, how we will automate the training and testing pipeline, and so on.

6. Project Review: This is the project-end questionnaire, with questions such as: what challenges were left unaddressed or need further exploration, what are the next steps, and did the project receive sign-off from all its stakeholders? This is a very important step, as it not only shows to what extent the project was successful but also allows us to plan the next steps.

In summary, having a structured framework used across all the data science teams will ideally increase the success rate of data science projects, improve visibility across teams as everyone gets to know what each team is working on, and make troubleshooting much easier. It also gives us consolidated, structured documentation and a record of each project.
