What you need to know before embarking on a Data Science Project

Sakib Mondal
Walmart Global Tech Blog
6 min read · Jun 2, 2021

Data Science/Artificial Intelligence has become the buzzword for all kinds of business today. It has made inroads into systems of every kind, software based and hardware based alike. However, there is still a lot of confusion about the right strategy for building an AI/Data Science based system. In this article, we will cover a few aspects. Firstly, what are the guiding principles when building AI based systems? Secondly, what are the pitfalls of the current process of building ML based solutions? Finally, what should a good platform architecture for an AI based system look like?

Guiding Principles:

Data Science solutions to business problems share a few steps with other solution approaches, and add further steps specific to Data Science. The following is the list of things to consider for a Data Science based solution:

Steps for Building ML Systems
1. Problem Definition

This is the step where we take a hard look at the business problem to be solved. One should not think of solution approaches at this point. Rather, one should ask questions about what the underlying business problem is. Is this really a problem? If so, what would we expect from a solution? In other words, we should look at the following aspects:

  • Crisp definition of the problem
  • Outline the assumptions about the problem and related data
  • Motivation behind solving the problem
  • Will the solution have a significant benefit (does it make sense from the ROI angle)?
  • Who will be the stakeholders? How do they see the future solution?

2. Data Preparation

At the end of the problem definition, you should ideally have a very good idea of what data is meaningful for the solution. Do we have the required data? How do we identify data sources?

  • Data Identification on outputs: Identify what the output of a solution will be. Do we currently have data from the existing system so that we can evaluate the performance of the new solution?
  • Data Identification on inputs: Identify the data that will help in the solution. It is very important to avoid situations where we unknowingly use the output, or variables derived from the output, as input. Keep the business context in mind to check whether the input data will actually be available at the time the solution is applied (e.g., if one is using the previous day’s sales to predict today’s sales, one should check whether the previous day’s sales data will already be consolidated when the solution runs).
  • Exploratory Data Analysis: Once data is available on both the possible inputs and outputs of a solution, do exploratory data analysis to understand the data.
  • Data Preparation: Depending on the type of data needed, data transformation may be required. The transformations may depend on the modality of the data (text, image, etc.). Based on the data analysis, one should envisage features. Features can also be constructed keeping in mind machine learning principles like dimensionality reduction. Other aspects are missing data imputation and normalization. Keep in mind the type of the data elements (numeric vs categorical); the transformations will differ accordingly.
  • Training/Test Data: A thoughtful approach should be taken to ensure that the data used to build the solution represents the universe to a good extent, so a proper data sampling strategy like stratified sampling should be used. Though we do not want to bias decisions based on the output data, many times it becomes inevitable to take the output data into account (particularly for imbalanced output classes) to make the model perform well on under-represented output classes.

Like the training set, the test set should be generated without bias, so that it represents the actual distribution. The solution should be evaluated on the test set to check that it generalizes beyond the dataset it was built on. Training and test set generation depend a lot on the problem context. For example, for a streaming solution, the training set should be based on instances prior to a timestamp, while the test set can only be based on instances after that timestamp.
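The time-based split and stratified sampling described above can be sketched in a few lines of Python. This is a toy illustration; the record layout, field names and fractions are made up for the example:

```python
import random
from collections import defaultdict

def time_split(records, timestamp_key, cutoff):
    """Split records so that training only sees data strictly before the cutoff."""
    train = [r for r in records if r[timestamp_key] < cutoff]
    test = [r for r in records if r[timestamp_key] >= cutoff]
    return train, test

def stratified_sample(records, label_key, fraction, seed=42):
    """Sample the same fraction from every output class so rare classes are kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[r[label_key]].append(r)
    sample = []
    for group in by_class.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per class
        sample.extend(rng.sample(group, k))
    return sample

# Toy daily records: the day index stands in for a real timestamp.
records = [{"day": d, "label": "high" if d % 5 == 0 else "low"} for d in range(100)]
train, test = time_split(records, "day", cutoff=80)
subset = stratified_sample(train, "label", fraction=0.5)
```

Note that the test set contains only days at or after the cutoff, so the model is never evaluated on data it could have seen during training.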

3. Solution Approach

When thinking about the solution approach, again think of the system as a transformation unit: given the input and output combinations, what is the best transformation strategy? By this time, one has a good idea of the problem definition, and with it one will be able to map the transformation block to one or more solution types. It may be a simple heuristic, a rule based system, an optimization problem, a classification problem or a regression problem. Yes, it is possible that the problem can be solved in multiple ways. It is good to prepare a list of candidates to be pruned later.

A good solution approach depends a lot on the business context, the type of data, data availability and other requirements. For example, if it is a problem where we do not know the ground truth, or a cold start problem with no data, unsupervised methods will most probably be a good fit. If we have enough data about the problem as well as the expected output, it is a supervised problem; this again can be a classification or a regression problem depending on the type of output. Let's also think about the learning and application requirements. If it is a problem where we can wait long for a solution but the solution quality should be really good (e.g., most planning problems), it is recommended that we build a large scale model with a lot of training. On the other hand, if the requirement is to apply the model in an online or streaming context (e.g., click prediction), the model should be thin. Again, training for such models can be on offline data or on online data; the choice depends a lot on how quickly the data goes stale.
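As a toy illustration, the first few of these questions can be encoded as a simple checklist. The categories and return strings here are illustrative, not an exhaustive taxonomy:

```python
def suggest_problem_family(has_labeled_output, output_type=None):
    """Toy checklist mapping data availability to a candidate solution family.

    This only sketches the reasoning in the text; it is not a substitute
    for judgment about the business context.
    """
    if not has_labeled_output:
        # No ground truth available: cold start or unlabeled data.
        return "unsupervised (e.g. clustering, anomaly detection)"
    if output_type == "categorical":
        return "supervised classification"
    if output_type == "numeric":
        return "supervised regression"
    return "supervised (inspect the output type to choose classification vs regression)"
```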

The process of deciding the best solution strategy is a combination of art and science. If one is lucky, one may have seen similar problem instances in the past and already know what type of solution approach worked well last time. If the problem is something new, it makes sense to think from first principles: one should try to build a rough mathematical model of the salient objective of the problem. This will help in the selection of candidate solution approaches.

4. Experimentation

Experimentation involves algorithm selection and parameter tuning. The final set of applicable candidate strategies is selected in the previous stage. Each of these strategies has its own set of parameters (called hyperparameters, as these are different from the parameters the model learns from data). To select the best model and the associated hyperparameters, the solution should be evaluated across these combinations. There are various search approaches, like Bayesian, random and grid search.
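Grid search, the simplest of these approaches, can be sketched in plain Python. The `validation_score` function below is a made-up stand-in; in practice it would train a model with the given hyperparameters and return its score on a validation set:

```python
from itertools import product

def grid_search(score_fn, grid):
    """Exhaustively evaluate every hyperparameter combination and keep the best."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical validation score that happens to peak at depth=4, learning_rate=0.1.
def validation_score(depth, learning_rate):
    return -((depth - 4) ** 2) - 10 * (learning_rate - 0.1) ** 2

grid = {"depth": [2, 4, 8], "learning_rate": [0.01, 0.1, 0.5]}
best_params, best_score = grid_search(validation_score, grid)
```

Random and Bayesian search follow the same evaluate-and-compare loop, but choose which combinations to try differently, which matters when the grid is too large to enumerate.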

There is an orthogonal aspect of whether one wants to select just one algorithm or an ensemble of algorithms. Since the performance of a specific algorithm depends a lot on the input data, and different algorithms may perform differently on different datasets, an apt strategy is to use an ensemble approach.
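A minimal form of ensembling is majority voting over the predictions of several models. A sketch, with hypothetical model outputs standing in for real classifiers:

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model predictions by majority vote, one label per example."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical models predicting labels for the same four examples.
model_a = ["spam", "ham",  "spam", "ham"]
model_b = ["spam", "spam", "spam", "ham"]
model_c = ["ham",  "ham",  "spam", "ham"]
ensemble = majority_vote([model_a, model_b, model_c])
```

Even when each individual model makes some mistakes, the vote can correct them as long as the models err on different examples, which is the intuition behind using an ensemble.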

5. Final Delivery

The final solution should meet the SLA not only on solution quality metrics (accuracy, F-score, standard deviations and so on), but also on computation metrics (response time, amount of data handled). Deployment considerations are part of the problem definition, and these should be met in the final delivery. Besides these, here are a few more important aspects with respect to delivery:

  • Documentation on the project: Combination of confluence page, presentation, user manuals and GitHub repo.
  • Architecture document.
  • Quality metrics and dashboards to track system performance over time.
  • Proper guidelines on how to use the system, on control and monitoring of the system, and steps to retrain the system.
  • Current limitations and future extensions

In the next article, we will take a look at the pitfalls of the current process of building ML based systems, and propose a reference architecture to mitigate them.



Sakib is a Distinguished Data Scientist at Walmart. He has a keen interest in applying ML and optimization based techniques in practice.