Adevinta Tech Blog

Introducing Unicron, our big data and Machine Learning platform

There’s always more to learn from Kubernetes and how it can make your life easier. In this post, I’ll walk you through the setup of Adevinta’s Big Data and Machine Learning platform, Unicron, developed using Kubernetes. This is for the tech part, but what I’d also like to share with you is how technology has affected our organisational culture and the way we work as a team.

The journey from Mesos to Kubernetes

Before 2020, Adevinta already had a common big data platform, enabling various data teams across our marketplaces and Central Teams to run batch data processing jobs. One of the motivations behind having a big data platform was economies of scale. By having one centralised platform, managed by a few dedicated engineers, all internal users, including Data Scientists and Engineers, could focus 100% of their time on “actual data stuff” instead of managing infrastructure or worrying about Hadoop clusters.

Fast forward to the beginning of 2021: our on-call team of four DevOps engineers is already managing a total of 59 Kubernetes clusters, hosting the centralised platform behind Adevinta’s big data processing and Machine Learning jobs. This number constantly fluctuates with business demand; last October, for example, we had around 70 live EKS clusters as part of this infrastructure. In this article, we’ll explain what led us to this achievement, not just on a technical level but also on an organisational one.

Why did we build a new big data platform?

Our previous infrastructure was a multi-tenant, single Mesos cluster, with a corresponding Elastic MapReduce cluster, all running in Amazon’s cloud. The platform was managing terabytes of RAM and many thousands of CPUs in a connected Hadoop setup. Engineers had to create Spark jobs using internal tooling, package up each task as a Docker image and launch it in the shared Mesos infrastructure. We had limited autoscaling capabilities and while we didn’t have serious on-call incidents, it was sometimes challenging to manage the infrastructure because one mistake could affect all tenants at the same time. Additionally, due to the size, we were spending hundreds of thousands of euros per month hosting the system.

Due to growing needs and internal reorganisations in the company, we came up with plans for a new platform, where we could leverage the latest trends in the industry and offer better capabilities to our internal customers — teams at Adevinta — while also operating at a significantly lower cost. We dedicated more than two quarters to building the new system from scratch before we welcomed our first production customer with 24/7 on-call support. We then started migrating teams off the previous platform in just a matter of months.

The new architecture

The new system, called Unicron, consists of many independent Kubernetes clusters. Every customer (data team) gets up to three individual, isolated AWS accounts with an Elastic Kubernetes Service (EKS) cluster in each — one account for each of their environments (development, staging and production).

This gives us more flexibility in managing each customer: some need smaller machines and run lots of small jobs, whereas others run a single job that uses hundreds of gigabytes of memory. Some teams have highly critical, time-sensitive batch jobs, while others only run workloads once every few days, can tolerate outages, and are happy to trade some reliability for lower costs.

Because the clusters are completely isolated from each other, we’ve achieved much better security than ever before, along with very clear cost segregation for each team, down to the last penny spent. The new system has also allowed us to support different reliability requirements and SLAs.

We’ve been operating the Unicron platform for almost two years and so far it has been a success, both financially and technically. On the one hand, we’ve reduced Adevinta’s yearly AWS bills by a million dollars, and on the other, we’ve not experienced any significant incidents or outages and now sleep extremely peacefully thanks to the carefully planned infrastructure and strict quality requirements we’ve set up at a team level. We’ve also been able to onboard more teams and workloads than ever before.

The team behind Unicron

Our recipe for a collaborative, great team spirit: yearly off-site team building events!

There’s no good system without a good team behind it. Before starting to design the new platform, we spent months reorganising ourselves, adding new members to our team and setting up common agreements — not only on technological choices, but also on our ways of working together.

On top of our usual daily tasks and OKRs, every week we created new proposals in the form of a GitHub issue, where everyone could add comments and opinions. We’d then have a dedicated, time-boxed session to debate each topic. It was a very democratic process, and we ended up with agreements that are documented in our internal knowledge base. From time to time, we revisit these documents and change or adapt them if needed. Each time a new member joins, they can go through all of the agreements and understand how we work.

These agreements are about many different topics, for example:

  • Which programming languages are accepted for new projects
  • What are the quality requirements for writing Prometheus alerts (we write unit tests for 100% of our on-call alerting rules!)
  • How to handle technical debt; what future guarantees do we commit to if we need to compromise on quality when facing time pressure
  • How to perform good and meaningful pull request reviews, etc.
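
To give a flavour of the alert-testing agreement, Prometheus alerting rules can be unit tested with `promtool test rules`. This is a minimal sketch in the standard promtool test format; the rule name, job label and series values are hypothetical, not our actual alerts:

```yaml
# alerts_test.yaml — run with: promtool test rules alerts_test.yaml
rule_files:
  - alerts.yaml            # the file containing the alerting rules under test
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate a target that goes down after two minutes
      - series: 'up{job="spark-operator"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # After 10 minutes, the alert should be firing
      - eval_time: 10m
        alertname: SparkOperatorDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: spark-operator
            exp_annotations:
              summary: "spark-operator has been down for more than 5 minutes"
```

Covering every on-call alert with tests like this is what makes “100% of our alerting rules are unit tested” an enforceable rule rather than an aspiration.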

We’ve agreed to dedicate time every quarter to upgrading at least one Kubernetes version, taking control of upgrades ourselves to stay up to date and avoid the external time pressure posed by AWS. We’ve made sure our infrastructure can handle this procedure as part of our normal daily tasks, with close to zero preparation required. Thanks to this approach, last quarter we successfully upgraded through three Kubernetes versions, reaching 1.18, during business hours and in just under two months.
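
As a sketch of what such a controlled, declarative upgrade can look like (assuming an eksctl-style setup; the cluster name and region below are made up), the change itself is just bumping the version field and applying it:

```yaml
# cluster.yaml — hypothetical eksctl ClusterConfig; upgrading means
# changing `version` and running: eksctl upgrade cluster -f cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: unicron-team-a-prod
  region: eu-west-1
  version: "1.18"    # was "1.17" before this quarter's upgrade
```

Keeping the desired version in git alongside everything else is what makes repeating the upgrade across dozens of clusters a routine task rather than a project.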

When we start working on new projects or bigger changes, like introducing single-sign on authentication for our services or planning a migration to a new monitoring platform, we make sure everyone is an equal part of the process and owns the ecosystem we build and maintain together.

Today our team consists of 10 people from four nationalities and several domains of expertise, including DevOps and Software Engineers, Data/Machine Learning specialists and Product and Engineering Managers. We all work together towards a common goal: providing the best possible platform to make our internal users’ lives easier.

After we grew to a certain size, we split our work into two squads: one squad now works on the core platform, a strong and reliable base, while the other focuses on developing the customer-specific tooling on top of it, to provide better value for big data and Machine Learning capabilities.

The technology behind Unicron

Batteries included but swappable

The platform quickly evolved from simply being a place for big data jobs and a drop-in replacement for our previous solution to a platform that can answer new business needs. For example, running daily vulnerability scans for Adevinta’s Security team or allowing Adevinta engineers to leverage various Machine Learning technologies.

By slowly evolving the platform, we’ve isolated core common components, for example a control plane, the logging system or autoscaling. We called this Unicron Core, and on top of this, we offer optional components, for example:

  • Spark jobs
  • Luigi scheduler
  • Kubeflow or Katib
  • Custom Datadog monitoring

The teams are now able to install these well-tested components in a self-service way, like installing an app through the app store on your phone.
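
Purely as a hypothetical illustration of this self-service model (the real interface is internal to Adevinta), enabling optional components could look as simple as a small manifest committed to the team’s repository:

```yaml
# components.yaml — hypothetical self-service manifest; names are illustrative
team: team-a
environment: production
components:
  spark: enabled         # Spark jobs
  luigi: enabled         # Luigi scheduler
  kubeflow: disabled     # Kubeflow / Katib
  datadog: enabled       # custom Datadog monitoring
```

Because each component is well tested against Unicron Core, turning one on is a low-risk, reviewable git change rather than a bespoke installation.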

“Layers. Onions have layers. Ogres have layers. Onions have layers. You get it? We both have layers.” (Shrek)

Here comes Kubernetes

We knew we wanted to run everything on top of Kubernetes, but we had internal debates and proofs of concept to decide whether we wanted to install our own Kubernetes control plane or, for example, use federated clusters. In the end, we decided to outsource all this pain to Amazon for a total of $70/month per cluster and go all-in on EKS. It was worth every penny!

A common misconception about Kubernetes is that with the availability of EKS or Google Cloud’s GKE, you can just install it and then launch your workload. While this might be true for a sandbox environment as AWS takes care of the (otherwise difficult) control plane installation/upgrades and container networking, there are a lot of extra pieces you still need to figure out.

In order to use Kubernetes in production, these are the actions we undertook as a team:

  • install and configure the cluster autoscaler and vertical pod autoscaler — and make sure they were correctly tuned
  • set up AWS Application Load Balancers and wildcard ACM certificates to provide strict HTTPS-only ingress
  • set up logging: Fluentd and Fluent Bit ship logs to S3, and we query and search them with AWS Athena and Jupyter notebooks
  • set up monitoring and observability: we use Prometheus and Grafana for metrics and alerting, and have our own end-to-end test to validate each cluster’s availability and reliability from the customer’s point of view
  • deploy secrets from git in a secure way
  • add smaller but important pieces like Cloud Custodian, which detects when we’re running out of AWS quotas and automatically opens a support ticket to raise EC2 instance limits, VPC limits, etc.
  • add support for graceful termination for spot instances (or, in fact, even for normal machines, with our in-house node drainer app written in Go)
  • set up drivers for GPU or Elastic Inference instances, allow connections to services running in other AWS accounts and VPCs, etc.
  • …and finally, the key piece to Unicron: set up ArgoCD and Argo Workflows

The bits and pieces of the Unicron platform
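
To give a flavour of the HTTPS-only ingress piece from the list above, here is a minimal sketch of an Ingress using AWS Load Balancer Controller annotations; the service name, hostname and certificate ARN are hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: team-a-ui
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    # Only open an HTTPS listener — no plain-HTTP port at all
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    # Wildcard ACM certificate shared across the cluster's hostnames
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-1:111111111111:certificate/example
spec:
  rules:
    - host: team-a.unicron.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: team-a-ui
                port:
                  number: 80
```

Declaring only an HTTPS listener means there is no HTTP endpoint to accidentally expose, which is what makes the ingress “strict”.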

The ecosystem

We rely heavily on the Argo ecosystem and therefore on the GitOps methodology. Our customers write their own tasks, Spark jobs, etc., in their preferred way and then feed them to our internal helper tools. These tools generate a Docker image and create a commit to the customer’s git repository. ArgoCD detects this change in the repo and automatically deploys to the respective cluster. This also makes it very easy for our customers to replicate the same behaviour across multiple environments by promoting the same commit to a different branch of their repository.
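
This GitOps flow maps onto a standard Argo CD `Application` resource. A minimal sketch, with a hypothetical repository URL and namespace, might look like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-a-jobs
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.example.com/team-a/jobs.git
    targetRevision: production   # promoting a commit to this branch deploys it
    path: manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from git
      selfHeal: true  # revert out-of-band changes to match git
```

With `automated` sync enabled, the commit our helper tools create is all it takes: Argo CD notices the new revision on the tracked branch and reconciles the cluster to match it.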

Everything in the Unicron ecosystem is fully decentralised; each customer lives and works in their own private environment, strictly isolated by IAM roles, VPN restrictions and SSO authentication. All of these lists are self-service, meaning the customers can edit their own access lists and commit them to git, then our bells and whistles will take care of the rest.

The clusters and their EC2 instances are expected to tolerate complete redeployments, rolling node replacements and live control plane upgrades. They either handle these transparently in the background or, in the case of a redeployment, pick up the work where it left off before the cluster was deleted (redeploying Kubernetes Secrets safely stored in git, launching jobs, etc.).

There’s no system without technical debt. Although we’ve set up very strict quality requirements, it’s not always possible or realistic to go all-in and end up with a perfect solution. If we have to compromise on something, we make sure it’s agreed and reviewed by the majority of the team. After documenting the solution, we create a tech debt GitHub issue so we don’t forget to fix it.

Every quarter we have a dedicated two- to three-week window to fix technical debt. We open a discussion to vote on the most pressing issues to be fixed or improved and then get them done step by step, making our systems, life and sleep better every day.

The community behind Unicron

One of the best things that helped evolve our internal product was setting up a strong community behind Unicron and its different services. This not only offloads work, but also provides a channel to share ideas and problems, spark innovation, get constant feedback about features and agree on the direction we should follow.

We built thorough documentation for all of our infrastructure, with a lot of examples, best practices and recommendations. Additionally, each time we onboard a new customer, we start with a dedicated workshop to get them up to speed quickly.

We discovered we needed two different communities based on the main focus and interests of different teams. So we now have Machine Learning and big data communities, where we hold bi-weekly sync sessions to discuss current progress, and internal meetups, where team members or customers can present their latest project or idea. We share the slides and recordings of the sessions so others can catch up.

If we have a lot of people contributing even just one rock at a time, we end up with something amazing!

Do you want to be a part of this journey?

Thanks to a lot of love, passion and great team spirit, we came up with technical solutions and interesting engineering ideas while building Unicron. We could write page after page to tell you more about it, but wouldn’t it be better if you saw it yourself? Come on and join the team — we’re hiring! 😉
