Agile Data Science: is it as rosy as it seems?

Svetlana Smagina
TotalEnergies Digital Factory
6 min read · Oct 24, 2022

Hi everyone, today I’m going to talk about a topic that caught my attention and inspired me a lot this year — agility applied to data science projects.

And if you are someone who doesn’t believe in the 5 myths of Agile Data Science (https://www.datascience-pm.com/5-myths-of-agile-data-science/), you are in the right place.

Just to give you some context: in our organization, the Delivery Data Scientist is part of a Squad (Dev Team + Product Owner) that builds the application and delivers the MVE (Minimum Valuable Experience) to the final user.

So far, I’ve worked with teams where the agile process was organized differently from one squad to another. In this article, I’d like to gather my thoughts and experience on the agile process and its application to data science projects.

Context

When developing a digital application, we try to deliver usable insights and application features frequently, iterating with the final user to incorporate their feedback as early as possible.

The number of machine learning models deployed in production grows every day, prompting us to treat them with software development practices. Here, agile has a role to play.

It’s no secret that a correctly introduced and well-adapted agile process boosts the team’s productivity. How can we achieve this?

Definition of Agile 🤸

Firstly, let’s understand what agility means. It is the ability to adapt to change, which helps you organize your work around responses from your users. See the four values and twelve principles of the Agile Manifesto: https://agilemanifesto.org/principles.html.

Thus, is Agile Data Science simply the combination of Agile + Data Science?

Traditionally, agile practices are used for classical software projects, but it can be really advantageous to find a way to apply them to the development and industrialization of your machine learning models.

Basically, agile can be a great tool to clarify requirements and refine the real needs in the delivery context.

Incremental delivery, on the other hand, helps users gain value before the end of the project. By continuously iterating on user feedback 🔄, we build trust with the final user, putting them at the center of the application development and ensuring business value and interest in the final product.

Obviously, organizing this communication between the PO (Product Owner), the team, stakeholders, and final users efficiently is an essential ingredient in the product’s success.

Agile Data Science challenge

We clearly understand the benefits of agility, but why is it still so challenging for most data projects?

🔴 Model & Data

First, there is the difference in nature and objectives. A data science project is not only about the code, but also about the model and the data, which means more artifacts to manage:

We need to store and track the candidate and final models of our experiments, their performance compared to previous models, the input/output training data, and the production data.
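For illustration, here is a minimal sketch of what that tracking could look like with MLflow (one common tool choice, not necessarily the one your squad uses; the experiment name, model, and synthetic dataset are hypothetical):

```python
# A minimal experiment-tracking sketch using MLflow (one possible tool choice).
# The experiment name, model type, and synthetic dataset are illustrative only.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("my-data-science-project")  # hypothetical name

with mlflow.start_run(run_name="candidate-rf"):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Log the knobs and the score so candidate runs stay comparable
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Version the model artifact itself, alongside its metrics
    mlflow.sklearn.log_model(model, "model")
```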

On top of tracking, we must also monitor fast-changing production data to ensure the robustness of the model. Unexpected changes can mean more cleaning and more processing to sanitize the model’s inputs.
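As a toy example of that monitoring, a two-sample Kolmogorov–Smirnov test (here via scipy; the distributions and the 0.05 threshold are made up for illustration) can flag when a production feature no longer looks like its training counterpart:

```python
# A toy drift check: compare a training-time feature distribution against
# fresh production data. The data and the 0.05 threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, prod_values, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
    production distribution has shifted away from the training one."""
    _, p_value = ks_2samp(train_values, prod_values)
    return p_value < alpha

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5_000)  # distribution seen at training time
prod = rng.normal(0.4, 1.0, 5_000)   # production data with a shifted mean

if feature_drifted(train, prod):
    print("Drift detected: time to re-clean inputs or consider retraining.")
```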

🔴 Uncertainty

Even for the most refined and clearly defined scopes, there will always be uncertainty to manage (algorithm implementation, data management, etc.), which is normal and should be accepted whenever we start developing a machine learning project.

In fact, we can do a lot of POCs around what we’re going to deploy, but once in production it will be another story. So it’s no surprise that during delivery we’ll find ourselves analyzing and re-analyzing the results of our algorithms and the implementation of our model.

So how do we avoid getting lost in the forest of explorations? One thing to keep in mind is user expectations and feedback: the sooner we get them, the faster we’ll find the best way forward.

Let’s imagine we need to explore which new features will improve the final model’s results. In this case, it’s important to find the tradeoff between the expected outputs (features to prioritize and analyze, error reduced by x%, etc.), the consistency of the feedback (so the user can actually give it), and the time to explore.

Don’t forget to create a Spike (a ticket for a POC) and align in advance on the expectations and the time to spend on this exploration.

These baby steps help us better understand the direction and change it if needed.

🔴 Time to value

Another challenge is getting feedback on data projects and estimating their value. Obviously, we need to give the model time, so users can observe it and “feel” how it works. This challenges and guides us in creating our algorithm, building it brick by brick with each iteration.

So, start small and grow big. In most cases, there is no need to test advanced machine learning models to get the first results.

Even one simple business rule can produce value: good for users, as they’ll see what the product could look like, and good for the team, as we’ll have a baseline to compare against. Simple models also buy the team time for other tasks (like front-end/back-end development, etc.).
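To make the idea concrete, here is a sketch of a rule-based baseline next to a first model (on a synthetic dataset; the “business rule” of thresholding a single feature is purely illustrative):

```python
# Start small: a one-line "business rule" as a baseline, then a first model
# measured against it. The dataset and the rule are synthetic/illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def rule_baseline(features):
    # The "one simple business rule": threshold a single feature.
    return (features[:, 0] > 0).astype(int)

baseline_acc = accuracy_score(y_test, rule_baseline(X_test))

# The first "real" model is judged against that baseline, not in a vacuum.
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f"rule baseline: {baseline_acc:.2f} vs model: {model_acc:.2f}")
```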

After this first version, we can easily show each improvement and how much it brings to the user.

All these points definitely add complexity to the agile process, but at the same time they challenge us as a team.

Agile framework to adapt

So far, we have been talking about effective communication to achieve our final goal. How should we organize the work and find a way to collaborate when there is such a big difference between software and data science projects?

Sure, it demands adapting the agile framework, because when agile practices were created, their initiators didn’t have machine learning projects, and all the complexity they bring, in mind.

Do you remember the twelve principles of agile?

So, when we deal with a data science project, it seems we need to be more than agile, because of all the challenging points mentioned above.

We could use an existing framework (there are a lot of them: Kanban, Scrumban, etc.; see https://www.digite.com/blog/scaled-agile-frameworks/), combine them, or even create a new one. Just remember that we need to instantiate agility for our specific case: finding the rituals, best practices, and framework to be an agile development + data science team at the same time, not only a software development team.

We shouldn’t apply the same practices to every project. Like life situations, each project is unique and needs to be treated differently by re-adapting our style of work 😉.

Final notes

Unfortunately, there is no such thing as a magic list that can cover all your needs, simply because of project differences. 🪄

However, let’s summarize some starting points so you can grow your own agility as a team through “everyday” iterative discussions:

🏁Focus not only on agile itself, but also on how to adapt it effectively to your own data science process. Think of an agile framework that can keep evolving during the building process, and ensure continuous, effective collaboration and communication between all members of the Delivery Team, the Product Owner (PO), and the final users.

🏁Keep asking yourselves iteratively: “Is it still effective? Is it working the way we thought? Is something slowing us down instead of speeding us up? Has something changed that means we need to review some elements of the framework?”

🏁Develop your own dynamic approach to collecting user feedback, challenging users, and treating their reactions.

🏁Prioritize and/or adapt your backlog according to the users’ and PO’s demands (they’re at the center of the application).

🏁Don’t try to deliver a lot of things at the same time; sometimes the user needs more time to get used to new features and to give you feedback. Otherwise, they can get lost.

🏁Make sure that by the end of one complete iteration (finishing with a presentation to the final user) you have learned at least one thing, so you can refine and improve your results.

🏁Measure each iteration via user demos, monitor the KPIs, learn from them, and adjust.

Finally, agile in data science is much more than a process: don’t just try to do agile, be agile!

Useful links

https://www.datascience-pm.com/5-myths-of-agile-data-science/

https://towardsdatascience.com/dont-make-data-scientists-do-scrum-de87bc921a6b

https://www.digite.com/blog/scaled-agile-frameworks/
