Setting up Machine Learning projects for success

A tried-and-tested guide to framing data science projects

Chris Hughes
Data Science at Microsoft
10 min read · May 11, 2021


A person looking thoughtfully at papers stuck to a wall (Photo by Pexels on Pixabay)

As interest in Machine Learning continues to grow, there is an ever-increasing number of organisations wishing to undertake data science–based initiatives. However, despite the best intentions, the backing of all the right people, and heaps of enthusiasm, it isn’t always clear where to start.

In Microsoft’s Commercial Software Engineering (CSE) organisation, we work side by side with our customers, on a per-project basis, to collaboratively innovate custom solutions to their challenging business problems. Under this model, members of the CSE Applied ML team have helped to deliver many successful data science projects from inception to production, often starting from vague or unclear requirements and high levels of uncertainty. In these engagements, it has been our job to push for clarity, to formulate the problem in a way that sets the project up for success, and to ensure that we build a solution that truly solves the problem and delivers value. However, the approach to gathering requirements and framing the problem is largely applicable regardless of whether you are a consultant aiding a client or part of a team starting a new project within your own organisation.

This guide outlines the approach that I personally use, which is inspired by my previous experience and learnings, as well as those shared by my team. Some of the material presented has been aggregated from existing sources and based on the work of others; it is explicitly highlighted where this is the case. I hope that others find it useful!

Understanding the problem domain

Generally, before defining a project scope for a data science investigation, we must first understand the problem domain:

  • What is the problem?
  • Why does the problem need to be solved?
  • Does the problem require a Machine Learning solution?
  • How would a potential solution be used?

Establishing this understanding, however, can prove to be difficult — especially for those unfamiliar with the problem domain. To ease this process, we can approach problems in a structured way by taking the following steps:

  • Identify a measurable problem and define it in business terms. The objective should be clear, and we should have a good understanding of the areas that we can control to influence it. Here, we should be as specific as possible.
  • Decide how the performance of a solution should be measured and identify whether this is possible within the restrictions of the problem. We need to make sure this aligns with the business objective and that we have identified the data required to evaluate the solution. Note: the data required to evaluate a solution may differ from the data needed to create a solution.
  • Thinking about the solution as a black box, detail the function that a solution to this problem should perform to fulfil the objective and verify that the relevant data is available to solve the problem. One way of approaching this is by thinking about how a subject matter expert could solve the problem manually and the data that would be required; if a human subject matter expert is unable to solve the problem given the available data, this is indicative that additional information is required and/or more data needs to be collected!
  • Based on the available data, define specific hypothesis statements, which can be proved or disproved, to guide the exploration of the data science team. Where possible, each hypothesis statement should have clearly defined success criteria (e.g., “with an accuracy of over 60 percent”); a minimal example of checking such a criterion is sketched after this list. This is not always feasible, however, especially for projects where no solution to the problem currently exists; in these cases, the measure of success could be based on input from a subject matter expert or stakeholder who verifies that the results meet their expectations.
  • Document all the above information to ensure alignment among stakeholders and establish a clear understanding of the problem to be solved. Try to ensure that as much relevant domain knowledge is captured as possible, and that the features present in available data — and the way that the data was collected — are clearly explained, such that they can be understood by a non–subject matter expert.
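
To make success criteria concrete, the sketch below shows one way a hypothesis such as “we can classify records with an accuracy of over 60 percent” might be checked, comparing a model against both the stated threshold and a naive baseline. The dataset, model choice, and threshold here are illustrative assumptions, not a prescription.

```python
# A minimal sketch of verifying a hypothesis with a quantitative success
# criterion. The dataset, model, and 60 percent threshold are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A naive baseline tells us whether the model adds value beyond always
# predicting the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))

SUCCESS_THRESHOLD = 0.6  # taken from the hypothesis statement
print(f"baseline: {baseline_acc:.3f}, model: {model_acc:.3f}")
print("hypothesis supported:", model_acc > SUCCESS_THRESHOLD and model_acc > baseline_acc)
```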

In order to break down a problem in this way, however, we require a lot of information! We need to understand the business problem, performance expectations, the available data, and much more. In addition, if we are planning to actually build a solution, we also need to understand a bit about the team that will be working on it and who would take responsibility for maintaining the end product in the event that the project is successful!

We have found that an effective way to capture this information is to start each project by conducting one or more envisioning sessions, in which the data scientists, software engineers, and any other relevant project stakeholders work alongside subject matter experts to formulate the problem so that there is a shared understanding of the problem domain, a clear goal, and a predefined approach to evaluating a potential solution. This ensures alignment and consistent expectations among all parties.

If you are part of a team that works close to the problem area and you feel that everyone already has a good understanding of the problem, conducting additional meetings to discuss it may seem like overkill. In my experience, however, it is well worth taking the time to ensure total alignment and understanding among all key stakeholders at this stage. A single, well-directed conversation now can prevent a lot of headaches and misunderstandings further down the line!

Guidance on conducting a successful envisioning session is presented below.

Envisioning goals and guidance

The main goals of the envisioning process are the following:

  • Establish a clear understanding of the problem domain and the underlying business objective.
  • Define how a potential solution would be used and how its performance should be measured.
  • Determine the data that is available to solve the problem.
  • Understand the capabilities and working practices of the data science team.

During envisioning, the following points may prove useful for guiding the discussion. The discussion points have been logically separated into themes, and numbers are used to suggest a rough ordering, but this is only a guideline; often the discussion moves organically among these areas, and this is not a problem as long as all points are eventually covered. Depending on the complexity of the project, multiple sessions may be required to elicit all of the required information. Many of these points are taken directly from, or adapted from, Aurélien Géron’s Machine Learning project checklist [1] and Fast.ai’s Data Project checklist [2].

Photo by Glenn Carstens-Peters on Unsplash

Problem framing

  1. Define the objective in business terms.
  2. How will the solution be used?
  3. What are the current solutions/workarounds (if any)? What work has been done in this area so far? Does this solution need to fit into an existing system?
  4. How should performance be measured?
  5. Is the performance measure aligned with the business objective? (A sketch illustrating this point follows the checklist.)
  6. What would be the minimum performance needed to reach the business objective?
  7. Are there any known constraints that would have to be taken into account? (e.g., computation times, non-functional requirements)
  8. Frame this problem (supervised/unsupervised, online/offline, and so on).
  9. Is human expertise available?
  10. How would you solve the problem manually?
  11. Are there any restrictions on the type of approaches that can be used? (e.g., does the solution need to be completely explainable?)
  12. List the assumptions you or others have made so far. Verify these assumptions if possible.
  13. Define some initial hypothesis statements to be explored.
  14. Highlight and discuss any responsible AI concerns if appropriate.
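
To illustrate points 4 through 6, the sketch below contrasts plain accuracy with a hypothetical cost-based measure in which missing a positive case is assumed to be five times as costly as a false alarm. All numbers and costs are made up for illustration; the point is simply that a generic metric can look healthy while the errors being made are precisely the expensive ones.

```python
# A minimal sketch of checking whether a performance measure reflects the
# business objective. The labels, predictions, and costs are illustrative.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])  # a model that misses positives

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
FN_COST, FP_COST = 5.0, 1.0  # assumed business cost of each type of error

accuracy = accuracy_score(y_true, y_pred)
business_cost = FN_COST * fn + FP_COST * fp

print(f"accuracy: {accuracy:.2f}")        # 0.80, which looks reasonable
print(f"business cost: {business_cost}")  # 10.0: both errors are the expensive kind
```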

Data exploration

  1. Understand and document the features, location, and availability of the data.
  2. What order of magnitude is the current data (e.g., GB, TB)? Is this all relevant?
  3. How does the organisation decide when to collect additional data or purchase external data? Are there any examples of this?
  4. What data has been used so far to analyse recent data-driven projects? What has been found to be most useful? What was not useful? How was this judged?
  5. What additional internal data may provide insights useful for data-driven decision making for proposed projects? What external data could be useful?
  6. What are the possible constraints or challenges in accessing or incorporating this data?
  7. How was the data collected? Are there any obvious biases because of how the data was collected?
  8. What changes to data collection, coding, integration, and so on, have occurred in the last two years that may impact the interpretation or availability of the collected data?

Workflow

  1. What data science skills exist in the organisation?
  2. How many data scientists/engineers would be available to work on this project? In what capacity would these resources be available (full-time, part-time, etc.)?
  3. What do the team’s current workflow practices look like? Do they work on the cloud/on-prem? In notebooks/IDE? Is version control used?
  4. How are data, experiments, and models currently tracked?
  5. Does the team employ an Agile methodology? How is work tracked?
  6. Are there any ML solutions currently running in production? Who is responsible for maintaining these solutions?
  7. Who would be responsible for maintaining a solution produced during this project?
  8. Are there any restrictions on tooling that must/cannot be used?

Of course, once an understanding of the problem domain has been established, it may be necessary to break down the overall problem into smaller, meaningful chunks of work to maintain team focus and keep the project scope realistic within the given time frame.

Example: A recommendation engine problem

This example was, in part, inspired by Designing great data products, by Jeremy Howard, Margit Zwemer, and Mike Loukides [3].

To illustrate how the above process can be applied to a tangible problem domain, consider, as an example, that we are looking at implementing a recommendation engine for a clothing retailer.

Often, the objective may be presented simply, in a form such as “to improve sales.” However, whilst this is ultimately the main goal, we would benefit from being more specific here. Suppose that we were to deploy a solution in November and then observed a December sales surge; how would we distinguish how much of this was a result of the new recommendation engine, as opposed to the fact that December is a peak buying season?

A better objective, in this case, would be “to drive additional sales by presenting the customer with items that they would not otherwise have purchased without the recommendation.” To influence this outcome, the areas that we can control are the choice of items that are presented to each customer, and the order in which they are displayed, simultaneously considering factors such as how frequently these should change, seasonality, and so on.

The data required to evaluate a potential solution in this case would be which recommendations resulted in new sales, along with an estimate of each customer’s likelihood of purchasing a specific item without a recommendation. Note that, whilst this data could also be used to build a recommendation engine, it is unlikely to be available before a recommendation system has been implemented, so we would probably have to use an alternative data source to build the model.
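
Once a candidate system exists, one common way to gather this evaluation data is a randomised experiment: a treatment group of customers sees recommendations while a control group does not, and the difference in conversion rates estimates the additional sales. The sketch below assumes simple per-group purchase counts; the numbers and the use of statsmodels are purely illustrative.

```python
# A minimal sketch of estimating whether recommendations drive additional
# sales via an A/B test. All counts are illustrative.
from statsmodels.stats.proportion import proportions_ztest

purchases = [530, 480]    # customers who bought something [treatment, control]
customers = [5000, 5000]  # customers assigned to each group

treatment_rate = purchases[0] / customers[0]
control_rate = purchases[1] / customers[1]
uplift = treatment_rate - control_rate  # sales attributable to recommendations

# Test whether the treatment conversion rate is significantly larger.
z_stat, p_value = proportions_ztest(purchases, customers, alternative="larger")
print(f"uplift: {uplift:.2%}, p-value: {p_value:.3f}")
```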

We can get an initial idea of how to approach a solution to this problem by considering how a subject matter expert would solve it. A personal stylist, for example, is likely to recommend items based on one or more of the following:

  • Generally popular items
  • Items similar to those liked/purchased by the customer
  • Items that were liked/purchased by similar customers
  • Items that are complementary to those owned by the customer

Whilst this list is by no means exhaustive, it provides a good indication of the data that is likely to be useful to us:

  • Item sales data
  • Customer purchase histories
  • Customer demographics
  • Item descriptions and tags
  • Previous outfits, or sets, which have been curated by the stylist

We would then be able to use this data to explore:

  • A method of measuring similarity among items (sketched in the example below)
  • A method of measuring similarity among customers
  • A method of measuring how complementary items are, relative to one another
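
As an illustration of the first of these, similarity among items could be estimated from their descriptions. Below is a minimal sketch using TF-IDF vectors and cosine similarity over a small, made-up catalogue; a real solution would need richer features and validation by a stylist.

```python
# A minimal sketch of measuring item similarity from text descriptions.
# The catalogue below is a made-up illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "slim fit blue denim jeans",
    "relaxed fit black denim jeans",
    "floral print summer dress",
    "blue cotton t-shirt",
]

vectors = TfidfVectorizer().fit_transform(descriptions)
similarity = cosine_similarity(vectors)  # similarity[i, j] in [0, 1]

# Find the item most similar to item 0, excluding the item itself.
scores = similarity[0].copy()
scores[0] = -1
print("most similar to item 0:", int(np.argmax(scores)))
```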

These can then be used to create and rank recommendations. Depending on the project scope and available data, one or more of these areas could be formulated into hypotheses to be explored by the data science team. Some examples of such hypothesis statements could be:

  • From the descriptions of each item, we can determine a measure of similarity among different items to a degree of accuracy that is specified by a stylist.
  • Based on the behaviour of customers with similar purchasing histories, we are able to predict certain items that a customer is likely to purchase, with a certainty that is greater than random choice (a minimal sketch of this approach follows the list).
  • Using sets of items that have previously been sold together, we can formulate rules around the features that determine whether items are complementary or not, and which can be verified by a stylist.
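
As a sketch of how the second of these hypotheses might be explored, the following uses a tiny, made-up purchase matrix to score unbought items for a customer based on the behaviour of similar customers. A real investigation would, of course, require far more data and a proper evaluation against the stated success criterion.

```python
# A minimal sketch of scoring items for a customer using the purchase
# histories of similar customers. Rows are customers, columns are items;
# the matrix is a made-up illustration.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

purchases = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
])

customer_similarity = cosine_similarity(purchases)

# Score items for customer 0 as similarity-weighted popularity among
# all customers, then exclude items already purchased.
target = 0
scores = customer_similarity[target] @ purchases
scores[purchases[target] == 1] = -1
print("recommended item for customer 0:", int(np.argmax(scores)))
```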

These statements can now be used as suitable starting points to guide exploration, and provide a clear initial direction for the project.

References

Many of the ideas presented here, and much more, were inspired by and can be found in the following resources, all of which I highly recommend.

[1]: Aurélien Géron’s Machine Learning project checklist

[2]: Fast.ai’s Data Project checklist

[3]: Designing great data products, by Jeremy Howard, Margit Zwemer, and Mike Loukides

The majority of this material was previously published as part of the CSE playbook, which is available on GitHub and consists of a collection of recommendations and best practices aggregated by our team over many engagements.

Special thanks to my awesome colleagues Bianca Furtuna, Daniel Mouritsen, Omri Mendels, and Tess Ferrandez for their great feedback and insightful contributions!

Chris Hughes is on LinkedIn.
