Image created with the assistance of DALL·E 2

The One True Answer: How to Transform Machine Learning Research into Working Software in Production?

The Journey from AI Proof-of-Concept into a Real Product

Topaz Gilad
15 min read · Mar 16, 2023


Transforming science/machine learning research into working software in production is a journey. We often know where the research journey starts but have uncertainty about how and when it ends.

We will make the required preparations for this journey and face its challenges equipped with tips and methodologies.

This post does not address managers only; it can benefit anyone engaged in AI development. Each of us can adapt the concepts described here to our day-to-day work.

This is a topic I am very passionate about [5]. I get to talk about it quite often and was repeatedly asked to put it into a blog post, so here it is. Let’s start!

[*In the context of this post, I use the term “research” to describe data science and machine-learning algorithms.]

The Pain

Regardless of the company size or data domain, we face challenges when beginning a new research journey.

“To Explore or Not to Explore?”

There is a balance to strike when we keep wanting to explore more and more: “let’s just try this one more approach; here is a new paper, let’s just give it a go.”
While exploration is important, it tends to be unbounded and can drive us away from our target: we aim for our research / science work to end with actual impact, in production, meeting the real world. And we want to repeat that and deliver again and again.

And this is a non-trivial goal when according to Gartner [1]:

Only 54% of AI projects make it from pilot into production

Are You Ready?

Like preparing for any good journey, there are some items to cross off our checklist before we hit the road: locate the starting point, have a pep talk with your fellow travelers, check the weather forecast, and choose a course and a plan.

You Are Here: Identify Your Start Point

Image credit: Unsplash, Fallon Michael

First, identify YOUR starting point.
Your starting point is the maturity of machine learning adoption in your company. This will influence the complexity and length of your journey to the mountaintop.

Is it a data-driven company, where decision-making is based on data and analytics? Is it an AI-driven company, whose main product is an artificial intelligence capability? Has the company had its first successful proof of value based on machine learning? Or is it already heavily invested in ML, with many models powering AI-based products that drive revenue?

In my own experience with companies at different stages: regardless of the stage, from time to time it is important to:

STOP

And refine your methodologies, because before you notice it, you will have moved to a different stage, and you need to get ready for it and adapt. Even companies at that last stage, deriving substantial revenue from AI, may have skipped the adaptation that this stage requires.

The Pep Talk

Based on your starting point, make sure all stakeholders you are involved with are aligned with the EXPERIMENTAL MINDSET that is required for data-driven development.

The journey analogy is the pep talk you have with your fellow travelers before starting the climb. You need to make sure everyone starts this journey with the right mindset for the team effort.

Many times, this educational stage does not go smoothly. People who work with software developers would never accept a developer releasing a new web page while saying they are only 97% confident the page will actually load.
“What do you mean you are not 100% confident?!”
Yet in our case:

Machine learning is a piece of software with a confidence tag pinned into it.

So this alignment stage is important. It can be an offsite for stakeholders, or a presentation with an open discussion: whatever it takes to explain the experimental mindset and how it differs from regular software development. People shift uncomfortably in their chairs when asked to put a lot of funding into a black box. So demystify it: explain how you reduce risk, be transparent in your day-to-day work, and gain organizational trust.

Make sure all stakeholders are aligned with the EXPERIMENTAL MINDSET that is required for data-driven development.

Mind The Weather

Image credits: left: Unsplash, Jonas Kaiser; Right: Unsplash, Annie Spratt

When going on a journey, you should definitely mind the weather, because when it rains, it pours!

Today’s weather is rainy indeed, with market leaders going through cutbacks and layoffs. Companies are being asked to show profit, not only the promise of growth.

Maybe you were walking a path that made for a really nice hike in springtime. But when it starts to rain, you must quickly adjust your course, since that nice path may get really slippery.

But fear not, fetch your umbrellas, and let’s hit the road!

Make an Impact

Always, and when it rains more than ever: we need to make an impact.
Whether you are a manager, a feature owner, or a data scientist: each of us, in our own domain, can adapt these guidelines and improve our impact.

Basically, we balance incremental delivery against big swings. An incremental can be tweaking your loss a bit or retraining with additional data. A “big swing” can be a new state-of-the-art model, or a new architecture pushed into production to replace an existing one.

There are many ways to make it to the top. We can walk up incrementally, step by step. Or take the big swing with the cable car. Incrementals are not always the safest way to go!

Big swings are risky, take more effort, and take longer to deliver. Incrementals enable you to deliver value faster and with certainty. A value. Some value. However, incrementals can only take you so far. Big swings are the differentiators: they make us stand out from the competition, releasing that game-changing new feature never seen before.

Incrementals enable you to deliver value faster and with certainty. Big swings are the differentiators.

Image Credit: left: Pexels, Christian Buergi; Center: Pexels, Tembela Bohle; Right: illustration by the author

MVP Mindset — With End In Mind

MVP — the Minimum Viable Product mindset: let’s use the illustration below to explain. Assume our end goal, what we want to release, is a car. That is a very large task, so you break it down into stages [3].

The problem is that each of the steps in the first row has no value for the end user. Users will have to wait a very long while to get what they want, and we may miss the market opportunity or lose to a competitor.

With an MVP mindset, we make sure each step has some value. How can we release something, even smaller, that creates some impact? But I call it MVP with the end in mind, because we need to keep pushing toward that end goal: our users will not remain satisfied with a scooter for long if they are expecting a car! Do not settle for incrementals only. Make sure you eventually deliver the big swing: the car!

In the context of machine learning, if your end goal is a detection problem, maybe solving localization alone, before you can classify well enough, already has an impact of its own.

Image credit: blog.fastmonkeys.com; Original idea: Spotify product team, Henrik Kniberg

The [Wo]man With The Plan

We should not start a journey without a plan. If you are a project manager / feature owner / tech lead: Epic elaboration is very important.

Plan your work. And work your plan.

Image created with: MemeCreator.org

The Mountain of Too-Vague Targets

Imagine our destination is the distant mountaintop in the image below. If that remains our only defined target, I call it “The Mountain of Too-Vague Targets”. It is too far away, poorly defined, and introduces many uncertainties.

During the climb, you’ll get a sense of “We’ll never make it! It is so far away!” When the target is too vague and too far away, we miss the feeling of progress. If that remains your only target, the climber in the photo below will ditch the climb and take the turn to the beach instead.

We must set visible targets. It is often argued that you cannot put a time estimate on machine learning research tasks: they are too experimental, with too many unknowns.

But in my experience, those huge tasks are exactly the ones you need to break down into smaller pieces. Doing so reduces uncertainty and gives us that energizing sense of progress.

For example, we could divide the end goal into the following intermediate steps and goals: Initial Data Exploration; Literature Review; Development of Business Metrics; Model Development & Evaluation Function, and so on.

Photography: Pexels, Eric Sanman, Icons: Flaticon via Slidesgo

Keep an Eye on the Compass

As we progress toward those intermediate visible targets, we need to make sure we do not lose our course. We need our compass right from the start.
Align with your product manager on business goals as early as possible. Make sure you are in agreement on the following questions:

What is success?
How do we measure it?

Recalculate Route

“If you are going to fail, fail fast.” — K. Richards

You are probably familiar with this sentence in the context of testing ideas for new products fast, even before investing in development, to see what feedback you get from potential users. For AI, it could mean releasing a mockup of a new feature before actually developing it, just to verify that this is the right direction to aim for.

But “fail fast” can also be adapted to the research phase. Have you ever, as a data scientist, been so attached to a specific course of action that it took you way too long to admit you hit a wall? And when you finally think about it, you realize you actually hit that wall a long while ago.

When I define an epic, I add the compass: an “acceptance criteria” / “definition of done” / “what is success” field. But I also like to define an additional field in our projects: “stop criteria”. This field holds early indicators that something may be way off. It is your red flag. If you hit the stop criteria, you need to stop and rethink whether this is still the right course of action, given your updated status of resources and timeline.

A stop criterion can be very low metric results combined with a deadline. It can also be a stated underlying assumption you rely on: if that assumption is proven incorrect, you should not continue with that approach.
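To make this concrete, here is a minimal sketch of how such an epic definition could look as code. This is just one possible shape; the `ResearchEpic` name and every field value are purely illustrative:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ResearchEpic:
    """Illustrative epic template: the compass fields plus the red flag."""
    name: str
    acceptance_criteria: str  # the compass: "what is success"
    stop_criteria: str        # the red flag: an early indicator that something is way off
    review_date: date         # when to check the stop criteria against reality
    assumptions: list = field(default_factory=list)  # if proven wrong, stop and rethink

epic = ResearchEpic(
    name="Pedestrian detection v1",
    acceptance_criteria="recall >= 0.90 at precision >= 0.80 on the benchmark",
    stop_criteria="recall < 0.50 on the benchmark one month before the deadline",
    review_date=date(2023, 6, 1),
    assumptions=["the same model can handle night-time images"],
)
```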

And if you do hit that “stop criteria”, you must take the time to record the reasons the attempt failed. If you don’t, one of two things may happen. The first: in the future, you may repeat the same approach with the same mistakes. The other: you may later come across a different situation and say, “no, this approach is not applicable; we tried it before and it failed”, when it might in fact be the perfect approach, since circumstances have changed, and so have your assumptions and problem conditions. You may skip that solution only because you did not properly record the specific reasons for the failure. Only after you have documented the failure are you ready to adapt and get back on track.

Image credit: Pexels, Cottonbro Studio

Data-Driven Research

Research with insufficient data stands on unstable, weak foundations.
Be an advocate for data-driven development in your company. Make sure to explain the importance of early data collection, before the research / science work is expected to start. In a former workplace, it took me a while, but eventually I conveyed that message, and the product team started setting the algorithmic goals for the next quarter not at the end of the current one, but in its middle. That way, we could immediately start collecting, cleaning, and annotating the data needed for the next quarter’s features.

This allows us to reduce uncertainties, improve research, and have shorter research cycles.

A company with maturity in ML adopts an experimental mindset: planning data collection ahead of research.

MVP Meets Data Subsets

If you aim for this new ML feature to work from day one for all possible users, all possible scenarios, anytime and anywhere, you will have a hard time meeting quality requirements for all of that AND delivering some value in a reasonable time.

While in MVP we ask what minimal proposition we can release that provides value, in our case we look at data subsets [10] and ask:

What is the smallest subset to validate the research AND gain an impact?

This is especially helpful if your company is not fully aligned yet with data-driven research, and you struggle with data resources when the research phase starts.

Data subsets can mean dividing upcoming versions of the new feature by camera type / operating system / user type, and so on.

Data subsets for Machine-learning MVP: Illustration by the author
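As a toy illustration of carving out such a subset, here is how it might look with pandas. The file name and the metadata columns (`camera_type`, `os`) are hypothetical:

```python
import pandas as pd

# Hypothetical dataset with per-sample metadata columns.
df = pd.read_csv("samples.csv")  # assumed columns: image_id, camera_type, os, label

# MVP subset: start with a slice that is both common among users and well covered
# by data, e.g. one camera type on one operating system.
mvp = df[(df["camera_type"] == "rear_wide") & (df["os"] == "android")]

print(f"MVP subset: {len(mvp)}/{len(df)} samples ({len(mvp) / len(df):.0%} of the data)")
```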

From a Concept into a Working Software

Image from Pexels, Jeff Stapleton

Let’s say you experimented a bit and are convinced that this machine-learning problem is solvable. You also have a basic idea of how and what building blocks of algorithms you will need to develop.

What is the next step?

Test-Driven Machine-Learning

The following approach is for professional hikers only. I find it very useful for keeping focus and sustaining progress.

I call it TEST-DRIVEN machine learning, inspired by test-driven development [6–7].

  1. Business Metric: Develop an evaluation function that calculates the metric on a benchmark, early on. This is not just the definition of the metric, but an actual working function to compute it. It will highlight what data you are missing in order to measure where you stand compared to the target.
  2. End to End: Build the pipeline [8] and the API of the blocks: the inputs and outputs of each block, from one model to the next. Each block does not yet function properly; at this stage you only have the interface. This “contract” is especially important when several data scientists work on the full pipeline, each on a different block.
  3. Test: Build a unit test & evaluation metric for each block. Those unit tests will initially fail, since the blocks do not meet performance criteria when you are only getting started.

Only at this point do you invest in research and improvement of each block. And iterate. All the moving parts are connected, and you have a way to measure each block AND the full pipeline.

This enables you to identify release opportunities earlier and to reduce uncertainty about where you stand compared to your end goal. It also enables you to be transparent and clear with your stakeholders every step of the way.
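To make the three steps concrete, here is a minimal sketch for a hypothetical two-block detection pipeline (localization, then classification). All names are illustrative, and the blocks are deliberately empty stubs, so the test fails at first, exactly as described above:

```python
import numpy as np

# Step 1 - Business metric: a working evaluation function, not just a definition.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def recall_at_iou(predicted_boxes, ground_truth_boxes, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one prediction."""
    matched = sum(1 for gt in ground_truth_boxes
                  if any(iou(p, gt) >= iou_threshold for p in predicted_boxes))
    return matched / max(len(ground_truth_boxes), 1)

# Step 2 - End to end: the "contract" between blocks, before either works properly.
class Localizer:
    def predict(self, image: np.ndarray) -> list:
        return []  # stub: interface only, no real model yet

class Classifier:
    def predict(self, image: np.ndarray, boxes: list) -> list:
        return ["unknown"] * len(boxes)  # stub

# Step 3 - Test: fails at first, by design, until the blocks meet the bar.
def test_pipeline_meets_business_metric():
    image = np.zeros((480, 640, 3))        # placeholder benchmark sample
    ground_truth = [(100, 100, 200, 200)]
    boxes = Localizer().predict(image)
    assert recall_at_iou(boxes, ground_truth) >= 0.9
```

Run under pytest, the test keeps failing until the Localizer is actually developed, and that is the point: the measurement scaffolding and the block “contract” exist before the research does.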

Continuous Review

I am not sure how many of you have ever done a code review for a data scientist or a machine learning researcher. As a computer vision researcher, I was once asked to review a colleague’s code when he finished working on a very big new feature: tens of new classes and a great many functions. I also had no part in the design or research of this feature. Insane! And I wish it were a one-time thing.

Epic elaboration is important. Break the large research task down into smaller, measurable segments. Make sure you open a pull request on a weekly basis, even if it has to be to a side branch: open PRs. Even if it is not a full model yet, only the loss function or only the evaluation function. Getting your code reviewed will help catch bugs and wrong assumptions early on, instead of declaring a failed experiment that was in fact caused by a bug.

Opening pull requests is not enough. You need to make sure they are merged! A PR that stays open for more than a week becomes irrelevant and hard to follow. Be responsive and make sure your PR gets the right attention.

Peer reviews do not have to be code reviews only. You can get a peer review on your assumptions, concepts, and design as well [9].

Make sure to break down your huge research tasks into smaller deliverables and keep moving. It is energizing to see those tasks move into “Done” status!

Inner Release

I am a big advocate of inner releases. An MVP can also be working software, end to end, that does not influence any user just yet. So where is the value, you ask? The value is in demonstrating that the approach is applicable and that the company has the ability to fully develop it.

An inner release is a full integration of the new data science feature in a staging environment, or even in production, just without influencing the user yet. You can start working on the full integration even before the research is fully done: for example, when you are certain you will meet the acceptance criteria but just need some more time to train on additionally collected data or to tune thresholds. While you do that, other teams can already start working on the integration: backend, DevOps, UX, and so on. When everything is ready, updating weights or thresholds is not a big deal software-development-wise. This way, you reduce the risk of implementation bugs, gain organizational trust when everyone is aligned on the full integration, and shorten the cycle from research to production. You meet the real world faster and get your foot in the door for new data collection.
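As a sketch of what “in production, without influencing the user yet” can look like in serving code, here is one common pattern, shadow execution. The function and model names are illustrative:

```python
import logging

logger = logging.getLogger("inner_release")

def handle_request(features, current_model, candidate_model):
    """Serve the current model; run the new one in "shadow" so it meets
    real traffic with zero user impact."""
    response = current_model.predict(features)       # the user only ever sees this
    try:
        shadow = candidate_model.predict(features)   # full integration, no user impact
        logger.info("shadow prediction", extra={"current": str(response),
                                                "candidate": str(shadow)})
    except Exception:                                # a shadow failure must never hurt users
        logger.exception("shadow model failed")
    return response
```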

Releasing

When you are finally ready to release, do not skip sharing your release notes internally. They should clearly state the capabilities of your new DS feature, but also its limitations, with examples both ways. If you have remaining open bugs, make sure to list those as well. Alignment of expectations is key to organizational trust. Stakeholders should know what view to expect from that mountaintop. Demystify the black box: make it clear what it does great, but also what the plan is for the next version.

One more thing to consider is your monitoring and roll-out plan. Make sure you have the ability to automatically measure some of your main business metrics from production data. Set a control group and roll out carefully, gradually increasing traffic [11].
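For the gradual part, one simple and common pattern is deterministic hash-based bucketing, sketched below with illustrative names. Users outside the rollout percentage form your control group:

```python
import hashlib

ROLLOUT_PERCENT = 5  # start small; increase gradually as the business metrics hold

def in_rollout(user_id: str, percent: int = ROLLOUT_PERCENT) -> bool:
    """Deterministically bucket users so each keeps a stable variant across sessions."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

# in_rollout("user-123") -> serve the new model; otherwise serve the control experience
```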

When it is finally out and stable, remember to appreciate the view from the mountaintop! Be a machine-learning advocate: raise awareness of the new feature. Share analysis from the new feature in production, and where you stand with the plan for the next version.

Image credit: Pexels, Rodnae Productions

Concluding

At the end of the day, if you are looking for the one answer to all your challenges, it will (as always) be: 42 [4].
Nerdy references aside, there is no single golden recipe. And if there were, it would quickly no longer hold. Our reality keeps changing, and so should we. The scale of our company changes, users’ and clients’ expectations change, and the market climate changes. Agility is not only about development processes. We should be agile in our methodologies as well.

Agility is not only a development methodology. It is a mindset.

As this domain is continuously evolving, subscribe and stay tuned for the follow-up posts: https://medium.com/@topaz.gl

Photo by Mark König on Unsplash


Topaz Gilad

A production-oriented AI researcher. A mother, an amateur painter, and a women-in-tech advocate. I write about AI, management, and creativity.