Machine Learning Engineers at Wildlife Studios
This article aims to briefly explore how Machine Learning Engineers (or MLEs) can bring more value to the whole ML ecosystem at Wildlife, helping our teams increase their ML-related productivity and reduce technical debt.
In our current scenario, Data Engineers (DEs), Data Scientists (DSs), and MLEs work closely, which could lead to a confusing work dynamic. When we reduce the scope to focus on the two most similar roles, we end up with MLEs and DSs using a very similar tech stack. At a high level, we are talking about Science and Engineering.
While DSs need to deeply understand Math, Statistics, and Probability Theory, MLEs are tasked with building deployable ML projects at scale using Software Engineering best practices. Of course, this is not a rule; it is just a matter of focus!
Role of MLEs
When we talk about the competitive advantage of certain technical roles, we consider the factors and attributes that allow a team to consistently outperform in a given context. With this in mind, MLEs are the team best equipped to take ML models created by DSs and make them scalable, optimal, and serviceable, while making sure they perform well in production.
In a few words, MLEs turn a predictive promise into a reliable and efficient piece of software. For us, there are three main paths to make a process more reliable and efficient:
(i) Automate it, so that it is easier and cheaper to maintain;
(ii) Improve the process itself, making it intrinsically better, for instance by transforming it to gain performance, time, or reliability; and
(iii) Simplify it: ML systems are often very complex and hard to maintain, so keep things simple.
In the next paragraphs, we take a deep dive into how MLEs can use their knowledge to improve ML processes and tasks, and discuss the ongoing costs associated with real-world ML systems and their technical debt.
MLEs and the ML Lifecycle
A typical ML project can raise many technical challenges just by putting a single model into production. Algorithmic complexity, scalability, latency, validation, and monitoring are examples of common concerns that can make ML-based systems hard and costly to maintain in the long run. Ideally, we would design and build a whole pipeline to automate these blocks and solve every engineering problem effectively. In the real world, however, things are more complicated.
Our aim is to deliver business value as fast as possible. In our view, MLEs can play a major role by helping teams grow sustainably. To the best of our knowledge, while DSs concentrate their efforts on creating as much impact as possible using data (first) and ML techniques (second), MLEs can collaborate with them by providing support, tools, and infrastructure to automate the lifecycle of the ML models.
When looking at the ML lifecycle, we can’t help noticing the skills that MLEs can bring to other teams and projects to improve their productivity. We work to create reliable and efficient infrastructure and a set of tools to enhance ML-related tasks. Taking a closer look, we can identify three main stages that are common to most ML projects at Wildlife:
- In the first stage, we are interested in the experiment itself: creating new models and features to generate as much impact as possible using data. Here, MLEs can focus on reducing experimentation time by creating a reliable experiment environment through frameworks, model researchers, or libraries;
- In the second stage, we focus on putting the models into production in an automatically triggered setup. Note that this stage connects the Training Code and Serving Code steps: the experiment code should be adjusted for production. That is, once the model is validated, we should be ready to write the serving code, test it, and deploy it. Here, MLEs must consider potential technical debt. At the same time, it is important to focus on implementing the best coding practices and structure for the ML scenario, improving the reproducibility of models and predictions, governance, and regulatory compliance;
- Lastly, in the third stage, we are interested in serving the models on a reliable infrastructure and monitoring all aspects of model performance and execution. Note, however, that even in the previous stages, DSs, MLEs, and Software Engineers (SEs) should communicate and collaborate to define artifacts, metrics, and other model specifications, and to create diagnostic logs that verify every ML component is performing at the expected level and trigger alerts in case of anomalies or model drift.
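As an illustration of the drift alerts mentioned in the third stage, the sketch below compares a reference sample of one feature (for example, values seen at training time) against a live sample using the Population Stability Index (PSI). This is only one possible approach and not necessarily the one used at Wildlife; the function names `psi` and `drift_alert`, the ten equal-width bins, and the 0.2 alert threshold (a common rule of thumb) are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width and derived from the reference sample only.
    """
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 when the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range live values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_alert(expected: Sequence[float], actual: Sequence[float],
                threshold: float = 0.2) -> bool:
    """Return True when the PSI exceeds the (assumed) alert threshold."""
    return psi(expected, actual) > threshold
```

In a monitoring job, a check like `drift_alert(training_values, recent_values)` could run on a schedule for each input feature and feed whatever alerting channel the team already uses.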
In summary, we understand that MLEs can bring more value to teams and projects by focusing on the following aspects of each ML-lifecycle stage:
- Support data modeling and evaluation for the ML task;
- Provide tools to create scalable and effective training runs;
- Provide tools to facilitate the experimentation;
- Automate processes to avoid manual errors, omissions, or oversights;
- Create ML tools that wrap the models to make them production-ready, reliable, and traceable;
- Provide ML tools, e.g. boilerplates and abstractions, that make it easier to translate experimentation code into production code with few extra steps;
- Design and build ML pipelines managed by a CI/CD system or a scheduler such as Airflow;
- Provide automatic code linters and automated tests to improve code quality and reliability;
- Implement the serving code effectively;
- Create tests to ensure that the validated models are behaving as expected from both execution and performance aspects;
- Provide tools to monitor the models, e.g. the input feature vector and misbehaviors;
- Adapt ML runs to different infrastructures, e.g. Kubernetes and Spark clusters;
- Provide an automated deployment strategy that saves model artifacts and wrappers.
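To make the idea of a wrapper that keeps models production-ready and traceable more concrete, here is a minimal sketch. It assumes a model exposed as a plain Python callable; the class name `ModelWrapper`, its fields, and the in-memory `audit_log` are hypothetical and chosen only for illustration. A real setup would likely send these records to a logging or monitoring system instead.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class ModelWrapper:
    """Hypothetical wrapper adding input validation and traceability."""

    predict_fn: Callable[[Sequence[float]], float]
    model_version: str
    n_features: int
    audit_log: List[dict] = field(default_factory=list)

    def predict(self, features: Sequence[float]) -> float:
        # Reject malformed feature vectors before they reach the model.
        if len(features) != self.n_features:
            raise ValueError(
                f"expected {self.n_features} features, got {len(features)}"
            )
        start = time.perf_counter()
        prediction = self.predict_fn(features)
        latency_ms = (time.perf_counter() - start) * 1000
        # Record enough context to trace a prediction back to its
        # model version and input.
        payload = json.dumps(list(features)).encode()
        self.audit_log.append({
            "model_version": self.model_version,
            "input_hash": hashlib.sha256(payload).hexdigest(),
            "prediction": prediction,
            "latency_ms": latency_ms,
        })
        return prediction
```

For example, `ModelWrapper(predict_fn=lambda x: sum(x) / len(x), model_version="0.1.0", n_features=3)` wraps a toy averaging model; calling `predict([1.0, 2.0, 3.0])` returns the prediction and appends one record to `audit_log`, while a vector of the wrong length raises `ValueError`. The same records can back automated tests that check both behavior and latency of a validated model.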
Hidden Technical Debt in ML
Since only a tiny fraction of real-world ML systems is composed of ML code (as shown in Figure 2), it is easy to end up with technical debt all over the place. This can become particularly bothersome when dealing with legacy systems. Additionally, having each squad spend time rebuilding the setup needed to get an ML model into production can considerably slow down the throughput of a company, since squad members would have less time to work on their core activities.
To make this concrete, imagine that a squad spent two months developing, as a part-time effort, their own deployment solution, and had to deprioritize both feature and model validation tasks during this period. At a company level, this sort of effort could be dramatically reduced by investing in a unified platform that handles all this additional complexity.
Today, we can stay on top of this because the MLE organization is composed of (i) MLEs working within squads and (ii) a platform team.
Another huge opportunity for improvement is having MLEs work in ML-based squads as early as possible. By focusing on MLOps adoption, they can help mitigate technical debt and, in the long run, improve squad results. When MLEs arrive late, they usually have to spend a lot of time refactoring code and performing migrations. In our experience, these types of activities can negatively impact squads’ value delivery and also cause both burnout and frustration for those involved.
- MLEs can bring value to the whole ML ecosystem by helping teams increase their ML-related productivity and reduce technical debt, but also by helping teams grow sustainably.
- MLEs should collaborate with DSs, DEs, and SEs in all stages of the ML lifecycle to provide support, tools, and infrastructure that improve ML-related tasks, especially in the medium to long term.
- ML systems can rapidly accumulate technical debt, since they combine the maintenance challenges of traditional software with ML-specific issues. MLEs can help teams and projects stay on top of it.
Originally published at https://medium.com on December 13, 2021.