The collaborative process of building a Machine Learning platform
In this post, I’ll share my experience of developing a machine learning (ML) platform: how we’re approaching it at Adevinta, and the effort and collaboration it involves. From the moment we started working on our own ML platform, one thing was clear: communication is key. We’ve stuck to this principle throughout the whole process, and I’m sure you’ll see that by the end of this post.
Machine learning has been gaining momentum over the past few years, as more and more companies realise the potential of introducing it into their products. However, it’s not yet a mature market: the companies behind the tooling are themselves relatively young and still growing. There are several ML platform options on the market, but given the scale of our company and the number of data teams working within it, we decided to internalise the process and create a platform that adjusts as closely as possible to the needs and context of our teams.
At the time of writing, it’s been nine months since I joined the CRE team. We’re in charge of developing the in-house ML and Big Data platform used by data teams across the company. The team is split into two squads: the Core squad, formed by DevOps engineers in charge of managing and maintaining the cluster infrastructure; and the Data squad, focused on integrating Big Data and ML services into the platform. The need for our team came from a simple premise: Adevinta wanted to ensure that all data teams could focus their time and resources on actual data projects, without having to worry about infrastructure and all the dependencies that come with it. I could elaborate a bit more, but I truly feel you’ll have a clearer idea of both our team and our product, Unicron, if you take a look at this amazing blog written by my colleague Daniel Megyesi.
The starting point
The process of developing the ML platform started by identifying the different steps that comprised the ML pipeline, from data extraction to model serving:
We also identified other services that were needed to make everything work, but that aren’t necessarily part of the active process of implementing an ML model: a Model Registry, used to store all models and make them easy to serve; Experiment Tracking, to facilitate reproducibility and reusability; and integrations with other platforms within the company.
Another important aspect of our product is that it aims to reduce the cognitive load on our users. To do so, we wanted to offer documentation and tutorials to help them get started, provide them with the right resources to deal with the particularities of the platform, and make sure everything was properly integrated. But creating this experience is a tough task, and at that point the product had some limitations that we also needed to address.
The proposed solution
Based on the current needs of the platform and on a lot of useful user feedback, we defined some principles to describe how the ML platform and its development should work:
- Have a single entry point for all ML activities
- Offer our users a clear and consistent golden path on how to use the available resources
- Focus the iterative delivery process on user productivity, keeping in mind the needs of different users (not only experienced engineers)
- Integrate several internal platforms in a single place, but only for the specific purpose of ML
An iterative process: Kubeflow integration
As mentioned earlier, we started this process with the first part of the ML pipeline covered by Notebooks and Pipelines. Later on, we added Katib and Training Jobs, all solutions provided by Kubeflow. Our intention was to integrate Kubeflow components one by one based on their individual maturity, instead of basing the whole platform on Kubeflow from the beginning. This has helped us deliver value incrementally and right from the start.
So far, it’s been an iterative process that has advanced in parallel to the internal development of Kubeflow. We’ve had good feedback from our clients and we’re constantly expanding the number of teams that operate within our platform. Regarding Kubeflow itself, we think it’s a promising project but it isn’t yet at the maturity level we’d like. It is, however, a good long-term bet.
With this in mind, the team decided to approach the integration of all these new services according to maturity levels. We already had a lifecycle system for clusters, and this was a good opportunity to better define the limits and restrictions of the services that could be enabled in them. Below is a schematic of the relationship between the lifecycle of services and that of clusters. I won’t go into more detail; this is just to show how we integrate new services into our platform.
Next step: MLflow integration
We’ve changed direction slightly for the next stage of our platform and our scope for the coming quarter. Instead of going with the options Kubeflow offers, we’ve decided to try MLflow for the Model Registry and Experiment Tracking. We didn’t come to this conclusion alone; collaboration has played a huge role. We opened up the discussion of these two topics to all interested data teams in the company in order to understand their context and needs. From this we came to some conclusions; bear in mind that these are specific to Adevinta’s context:
- These two features look promising in Kubeflow, but are currently going through an important redesign
- MLflow seems to be the best open source candidate despite its simplicity and is already being used by some of the company’s data teams
- The MLflow SDK is very powerful and easy to use
- The MLflow architecture seems very convenient to build on top of
- MLflow allows us to build bridges between its Model Registry and other data platforms from within Adevinta
- We’re open to returning to Kubeflow once these two features show more maturity
We dedicated several sessions to making these decisions and a lot of people were involved. It’s been an amazing experience to be so close to our clients and to learn from them.
The secret ingredient: cooperation
As I said before, this process wouldn’t have been possible without the constant collaboration of a large number of teams and individuals. We’ve created several spaces to share the development of the ML platform and to discuss technologies, experiences, learnings and, of course, mistakes. For us, it has also been an amazing opportunity to understand the needs and interests of our clients, in both their short-term and long-term plans. We’ve democratised the prioritisation of the ML platform’s development to make the most of each delivery.
The best part of this process has been its decentralisation. The different teams have done their own research, evaluated the pros and cons of different tools and shared their conclusions and perspectives with everyone. To give you an idea of how these processes work, here’s a brief summary of each of our ceremonies:
The Sync
The Sync is a quick meeting of about 15 to 30 minutes that we hold every two weeks. It basically consists of the CRE team sharing updates on the development of the platform. We lean on a schematic of our roadmap to make it easier to understand the team’s scope for the quarter and how we’re advancing against it. Finally, the other teams also have the chance to share any updates they consider relevant.
ML Community Session
This is where the real sharing happens. In this ceremony, the floor is open to anybody who wishes to run a demonstration. It can be a tool, an integration, or any other experience or learning that could be useful to other teams. So far, participation has been great: over the last year and a half we’ve had a total of 26 demos, covering more than 16 different platforms and tools, from 7 different teams plus several individuals. Everything we’ve learned in these sessions has been recorded and is already available to any data team in Adevinta.
Special Interest Group
The Special Interest Group is the latest addition to the open ML ceremonies. The SIG is a group of experts and other interested users who meet regularly to share and discuss ideas on a specific topic. We started this ceremony with Experiment Tracking and Model Registry as the main topics. As you’ve seen above, the result was a success and we arrived at some really useful conclusions. We hope this model of ceremony can be repeated and improved as new challenges come up.
We started the ML platform with some ideas and lots of doubts, but we’ve been certain from the beginning who our clients were and which teams were interested in joining us. As a result, we decided to pay attention to their knowledge and to what they needed from an ML platform. A little over a year later, we’ve established and validated several spaces and strategies for collaborating with any interested Adevinta data team. Our long-term objective is to offer a modular and flexible platform that can quickly adopt new tools based on our clients’ needs. We’re confident that we’re moving in the right direction and, more importantly, that we’re accompanied by the best professionals.
A personal note
Apart from being a Data Engineer, I’m also a philosophy enthusiast. I’ve recently been learning about analytical psychology, the school of thought founded by Carl Jung. Amongst other interesting topics, Jung defines archetypes: a different perspective through which we can understand our behaviours and drives.
Briefly (and probably unfairly) described, archetypes are a way to interpret the different moods, moments and mentalities you can find in your day-to-day life. Some people lean more towards one archetype, but most of us balance several of them. There’s also the belief that we carry all of them inside us, just asleep. In my opinion, that’s an optimistic way of seeing life, as it means you can always learn and change.
I mention this because diversity is one of the pillars on which Adevinta stands. This means we have the opportunity to work with people from many different cultures, backgrounds and perspectives. In terms of Jungian analysis, this would translate into an ecosystem with a large and rich representation of archetypes that offers everybody the possibility to learn and grow, not only as a professional but as a person. Thanks to this we’ve been able to understand and create products that cover the needs of all our users at an international scale, while learning how to be a part of a connected and international society.