Planning Machine Learning Projects

Published in Cogita Blog

Jun 20, 2024

Today, I’d like to introduce you to the process of planning Machine Learning projects as practiced in my company, COGITA. If you’re involved in ML projects, whether as an analyst or a Data Scientist, you’re likely either participating in this process or working directly with its outcomes. So stick around till the end of this article, and you’ll likely find something relevant to your work.

Agile Approach to ML Projects Instead of Detailed Planning

First and foremost, when planning ML work, remember the agile philosophy of project management. In this light, the aim of the planning process should be a preliminary sketch and an outline of the main phases rather than a detailed plan of all stages. Decisions about details should instead be made on an ongoing basis, for example, at the end of each two-week sprint.

Often, though, you need to define the project’s timeframe fairly rigidly. What to do then?

There are several ways to handle this. Firstly, it’s valuable to have access to historical projects from either your own or your team’s portfolio. When starting a new project, you can quickly find projects with similar complexity or from the same domain and see how long they took.
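As a rough illustration of such a lookup, here is a minimal sketch in Python. The `PastProject` fields, the 1-5 complexity score, and the portfolio entries are all made up for this example; they are not taken from any real project.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PastProject:
    name: str
    domain: str
    complexity: int       # hypothetical 1-5 score assigned by the team
    duration_days: float  # total logged effort in man-days


def similar_projects(history, domain, complexity, tolerance=1):
    """Return past projects from the same domain or with close complexity."""
    return [
        p for p in history
        if p.domain == domain or abs(p.complexity - complexity) <= tolerance
    ]


def duration_range(projects):
    """Summarize durations as (min, median, max); None if nothing matches."""
    if not projects:
        return None
    days = [p.duration_days for p in projects]
    return min(days), median(days), max(days)


# Made-up portfolio, for illustration only.
history = [
    PastProject("churn-model", "banking", 2, 40.0),
    PastProject("credit-scoring", "banking", 4, 95.0),
    PastProject("shelf-detection", "retail", 3, 60.0),
    PastProject("demand-forecast", "retail", 5, 120.0),
]

matches = similar_projects(history, domain="banking", complexity=3)
print(duration_range(matches))
```

Reporting a range instead of a single number keeps the estimate honest: a handful of past projects rarely justifies more precision than that.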

The second approach involves setting a time limit but not defining expected metrics (similar to what I discussed in this blog about limiting scope with a set timeframe). It’s very difficult to predict how long it will take to achieve, for example, 90% model accuracy. But it’s quite easy to plan to spend X time on initial training and Y time on adjustments to achieve the highest quality.
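The idea of fixing the time and letting the metric float can be sketched as a time-boxed search over candidate settings. This is only an illustration: `time_boxed_search` and the toy `evaluate` function are hypothetical names, and `evaluate` stands in for "train a model with this setting and return a validation score".

```python
import time


def time_boxed_search(candidates, evaluate, budget_seconds, clock=time.monotonic):
    """Evaluate candidates until the time budget runs out; keep the best one.

    The stopping rule is the budget, not a target metric.
    """
    deadline = clock() + budget_seconds
    best_candidate, best_score = None, float("-inf")
    for candidate in candidates:
        if clock() >= deadline:
            break  # out of time: stop with whatever we have so far
        score = evaluate(candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score


# Hypothetical stand-in for a real training run that returns validation accuracy.
def evaluate(learning_rate):
    return 1.0 - abs(learning_rate - 0.1)


best, score = time_boxed_search([0.001, 0.01, 0.1, 1.0], evaluate, budget_seconds=5.0)
print(best, round(score, 3))
```

The budget is only checked between candidates, so a single long training run can still overshoot it; in practice you would size each run well below the time box.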

The third approach is to estimate only the Proof-of-Concept (PoC) stage precisely and to provide only rough time ranges for the full algorithm-building stage. The PoC stage allows for preliminary data analysis and running initial models. Afterwards, you’ll know where you stand and how far you are from the desired outcome (remember to start with benchmarks).
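One common way to get a benchmark at the PoC stage is a trivial baseline, such as always predicting the most frequent class. The sketch below assumes a classification task; the labels and the PoC accuracy are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == most_common_label)
    return hits / len(test_labels)


# Made-up labels, for illustration only.
train = ["ok", "ok", "ok", "default", "ok", "default"]
test = ["ok", "default", "ok", "ok"]

baseline = majority_baseline_accuracy(train, test)
poc_model_accuracy = 0.80  # hypothetical PoC result
print(f"baseline={baseline:.2f}, gain over baseline={poc_model_accuracy - baseline:.2f}")
```

If the first real model barely beats such a baseline, that is a useful signal about how far the project is from the desired outcome.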

Start with “Why?”

When embarking on a Machine Learning project, it’s important to first understand the problem you’re trying to solve. It’s worth asking simple questions: What is the intended model supposed to do? What need does it address? Will it fully automate a function, automate it partially, or only provide recommendations while a human makes the final decisions?

Consider the advantage this solution will have over the current state. Why was the decision made to apply Machine Learning to this problem?

By keeping the end result in mind, it’ll be easier for you to make decisions and plan work.

Gather Detailed Requirements

Now, think about the users of your model. They could be people from another department in your company, employees of your client (e.g., bank, store, hospital, etc.), or individual users of your ML-utilizing application.

Ask them why what they’re currently using isn’t sufficient. How do they envision the desired solution? Consult on the form of interaction with the model, the shape of the user interface, etc.

Try to gather all requirements regarding the model — what level of effectiveness is ideal, and what is acceptable? How long is an acceptable wait time for the model’s response?
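Writing these answers down in a checkable form makes later discussions easier. Below is a minimal sketch, assuming accuracy as the effectiveness measure and milliseconds for response time; the `ModelRequirements` class, its field names, and the thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRequirements:
    ideal_accuracy: float
    acceptable_accuracy: float
    max_latency_ms: float


def assess(req, accuracy, latency_ms):
    """Classify a measured result as 'ideal', 'acceptable', or 'insufficient'."""
    if latency_ms > req.max_latency_ms or accuracy < req.acceptable_accuracy:
        return "insufficient"
    if accuracy >= req.ideal_accuracy:
        return "ideal"
    return "acceptable"


# Hypothetical thresholds agreed with the users.
req = ModelRequirements(ideal_accuracy=0.95, acceptable_accuracy=0.85, max_latency_ms=500)
print(assess(req, accuracy=0.90, latency_ms=120))
```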

Then, investigate constraints: will these people want to use your model in every situation? Pay attention to trust in the model. If, for example, the model decides whether to grant a loan, users will likely require a justification for each decision beyond a YES/NO answer. Ask what kind of justification would be sufficient.

Our Intuition and Quick Hypothesis Validation

When designing any Machine Learning solution, consciously or not, you rely on your intuition about data, how the world works, and the capabilities of algorithms. Firstly, it’s valuable for this intuition to be based on as broad a range of information as possible. That’s why I always ask for a data sample before planning work on models and try to talk to the target users of the model to understand their experiences, needs, and requirements. I also conduct research on current approaches to similar problems and algorithms used. I’ve described these actions more extensively here.

Then, it’s worth documenting the assumptions (hypotheses) on which our intuition is based. The goal of the PoC stage should be to verify these hypotheses as quickly as possible. Initial work should really answer two questions: how wrong we were initially, and what we should change in the further plan to achieve the goal.

Communication: People Replaced by Models

During the planning stage of the project, you must also plan for communication. If you’re creating an algorithm for an external client, you should ensure that they’re available to address your doubts and to plan the scope of sprints together with you.

You need to be particularly careful when collaborating with people whose work is ultimately to be replaced by AI models. It’s worth familiarizing yourself with the increasingly popular approach of collaborative AI. In this approach, the goal is to facilitate and accelerate human work using AI models, not to replace it. Keep this in mind when designing AI solutions.

Divide Work into Stages

A good division of the project into stages is one in which each stage delivers a self-contained whole that, ideally, is deployable and immediately adds value for the user. An example could be implementing a heuristic model or a model that works on a subset of the data. This way, the end user sees progress after each stage and is more willing to invest further in the project.

Let me give you an example. If the target model is to determine a customer’s creditworthiness, the first step could be clustering customers into groups with similar creditworthiness. The next step could be predicting a range of values. Only the final model would provide a specific value.

The second aspect is that the end of each stage should leave room for choosing different paths. This aligns with the agile philosophy: completing the first stage does not commit you to proceeding with the second.

Example of Time Estimation

In my company, we use Clockify to log working hours. This allows us to say fairly accurately how much time we spent on specific tasks. Moreover, by comparing this time with the initial estimate, we keep improving our intuition and our ability to estimate project size. I recommend this approach to everyone!
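The comparison itself is simple arithmetic and can be scripted. The sketch below assumes that estimates and logged time are already available as plain dictionaries of man-days per task; it does not use any time-tracker API, and the task names and numbers are hypothetical, not the figures from the project described below.

```python
def overrun_report(estimated, actual):
    """Compare estimated and logged effort (in man-days) per task.

    Returns per-task overrun ratios, effort logged on unplanned tasks,
    and the overall overrun ratio (0.5 means 50% more than estimated).
    """
    per_task = {
        task: (actual.get(task, 0.0) - est) / est
        for task, est in estimated.items()
    }
    unplanned = {t: d for t, d in actual.items() if t not in estimated}
    total_est = sum(estimated.values())
    total_act = sum(actual.values())
    return per_task, unplanned, (total_act - total_est) / total_est


# Hypothetical numbers, for illustration only.
estimated = {"data preparation": 10.0, "model training": 20.0, "evaluation": 10.0}
actual = {"data preparation": 8.0, "model training": 30.0, "evaluation": 12.0, "bug fixing": 6.0}

per_task, unplanned, overall = overrun_report(estimated, actual)
for task, ratio in per_task.items():
    print(f"{task}: {ratio:+.0%}")
print(f"unplanned: {unplanned}")
print(f"overall: {overall:+.0%}")
```

Listing unplanned tasks separately is a deliberate choice: work that never appeared in the estimate is often where most of the overrun hides.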

I’ll provide a real example from one of the projects in a previous company, where the goal was to detect products and read their names and prices.

Here’s the initial estimation:

And here are the actual logged hours:

We see that the project took about 50% more time than estimated. Detection turned out to be much easier than expected, whereas OCR for reading names and prices performed so poorly that we tried several other approaches (heuristics and NLP models). Moreover, 38.5 man-days (MD), almost 30% of the time, were spent on analyzing the correctness of the solution and fixing errors.

Summary

In this article, I’ve presented several practices I use in planning Machine Learning projects. I suggest you try at least one of them before starting your next project.


Written by Adam Gabriel Dobrakowski