State of load balancing at Criteo

Like many internet companies, Criteo is driven by its capacity to scale its infrastructure and answer requests as fast as possible. The first pieces of technology that make this possible in large infrastructures are load balancers. They expose network endpoints to the outside world, but also internally. About two years ago we started to dust off this part of the company, which needed a large ramp-up. Today we are happy to share the results, as we seem to have reached the end of the first step of our journey and the beginning of a new one.

William Dauchy
Criteo Tech Blog
8 min read · Mar 19, 2020


Manual era

Before creating a dedicated team for load balancers, a lot of things relied on humans. To better understand where we were coming from, here is a quick overview of how things were done a few years ago.

Ticket-based Virtual IP (VIP) creation

For a time, creating a new entry in our load balancers was manual, and changes were not frequent. The infrastructure was still manageable by hand and the backends were bare metal, which meant they did not change often enough to justify automation. When we needed a VIP, a developer would create a ticket, and a network engineer would then create the resources and push the configuration to the different load balancers. Of course, this process was open to human mistakes and not very smooth.

Feeding the backends

We were using configuration management to deploy servers automatically. We used Chef, written in Ruby, to which we contribute a fair amount, and which allowed us to do advanced things such as PXE-rebooting servers and installing multiple operating systems starting from an in-memory image.

To add a node as a backend of a VIP, tools were built to run a Chef search and get all the corresponding nodes. This process had to be triggered manually and was mainly suited to physical machines. The drawback was that you had to remember to run the tool each time you added machines for an application. This was more or less acceptable because new machines were added during very specific periods of the year and, more importantly, the number of new services was limited to a small set.

Things became trickier when PaaS became a thing at Criteo. With PaaS, each task lives in a container that needs one or more network ports to communicate, associated with the IP of the machine hosting it. When a task was killed or rescheduled, the IP and ports could change, because in this deployment model the containers share the host's IP stack. To expose those applications to the outside world behind a unique network endpoint, we needed a way to gather all the modifications happening on this platform and notify the load balancers.

Abstractions everywhere

We had a working solution that had been used for a few years at Criteo. As you have seen, it involved manual steps, especially when creating a VIP. All those steps involved humans, which created several drawbacks: it was time-consuming, error-prone, and not enjoyable for the people performing these operations.

VTeam

At Criteo, we have a handy mechanism to propose what we call a VTeam, for Virtual Team. When an issue is identified that impacts a certain number of people and we see an opportunity for optimization, anyone can write a proposal document to resolve it. Once validated, you can join a temporary team, usually for a quarter and sometimes more. It gives us the freedom to work on a specific project or task without interruptions. When a project is successful and proves its value, the VTeam can be transformed into a real team; otherwise, the people involved go back to their respective teams.

VaaS, for VIP as a Service, was one of them. The main objective was to fully automate the VIP process, making it possible to set up new VIPs in seconds through a self-service approach.

Consul as a source of services events

The possible backends for our load balancers were both physical machines and containers. The good news was that we already had service registration in place in both environments using Consul. It was a natural, single source of data for this project.

Every time an event was triggered, the catalog configuration was dumped and sent to a process with two roles (a minimal sketch of this event loop follows the list):

  • Add the remaining information needed to configure the VIP: some of it comes from other sources, e.g. provisioning the required IP addresses and exposing the available TLS certificates
  • Provision each piece of load balancer equipment
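
To make this more concrete, here is a minimal sketch of such an event loop built on Consul blocking queries, using the official Go client (github.com/hashicorp/consul/api). The pushToEnricher function is a hypothetical placeholder for the enrichment and provisioning pipeline described above, not our actual code.

    // Minimal sketch: watch the Consul catalog with blocking queries and
    // hand each new snapshot to the rest of the pipeline.
    package main

    import (
        "log"
        "time"

        "github.com/hashicorp/consul/api"
    )

    // pushToEnricher is a hypothetical stand-in for the component that
    // enriches the catalog dump and provisions the load balancers.
    func pushToEnricher(services map[string][]string) {
        log.Printf("catalog changed: %d services", len(services))
    }

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        var lastIndex uint64
        for {
            // Blocking query: returns when the service list changes or
            // when the wait time expires.
            services, meta, err := client.Catalog().Services(&api.QueryOptions{
                WaitIndex: lastIndex,
                WaitTime:  5 * time.Minute,
            })
            if err != nil {
                log.Printf("consul query failed: %v", err)
                time.Sleep(time.Second)
                continue
            }
            if meta.LastIndex == lastIndex {
                continue // timeout without any change
            }
            lastIndex = meta.LastIndex
            pushToEnricher(services)
        }
    }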

In the end, we implemented separate components so we could extend the system as needed in the future:

  • One component responsible for watching changes through Consul and exposing “enriched” data, augmented with the external information needed, such as:
    * TLS certificates
    * IP address allocation
    * DNS record creation
  • One component per load balancer implementation, responsible for provisioning the equipment. This component watches the changes exposed by the enrichment brick. The implementation differs depending on the equipment (e.g. F5 and HAProxy), meaning we are able to add any new load balancer stack transparently for the user. We will discuss how we transitioned from one stack to another in the last part of this article.
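
As an illustration of the second kind of component, the following Go sketch renders an HAProxy backend section from a list of enriched service instances. The Instance type, its field names, and the chosen HAProxy options are assumptions made for the example; a real provisioner would also handle configuration reloads, draining, and so on.

    // Minimal sketch: turn enriched service data into an HAProxy backend.
    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    // Instance is a hypothetical "enriched" view of one service instance.
    type Instance struct {
        Service string
        Address string
        Port    int
    }

    // renderBackend produces an HAProxy "backend" section with one
    // health-checked server line per instance.
    func renderBackend(name string, instances []Instance) string {
        var b strings.Builder
        fmt.Fprintf(&b, "backend %s\n", name)
        b.WriteString("    balance roundrobin\n")
        for i, inst := range instances {
            fmt.Fprintf(&b, "    server srv%d %s:%d check\n", i, inst.Address, inst.Port)
        }
        return b.String()
    }

    func main() {
        instances := []Instance{
            {Service: "my-app", Address: "10.0.0.1", Port: 31234},
            {Service: "my-app", Address: "10.0.0.2", Port: 31978},
        }
        // A real component would write this into the HAProxy configuration
        // and trigger a reload; here we simply print it.
        fmt.Fprint(os.Stdout, renderBackend("my-app", instances))
    }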

Those components had to support more than 1,000 services per data center. For each service, regular changes had to be taken into account. Container-based applications changed constantly; in some data centers, we saw several changes every few seconds. This represented significant traffic and led us to optimize a few things to avoid broadcasting all the data too frequently.
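
One of those optimizations can be illustrated with a simple coalescing loop: instead of pushing every single change downstream, events arriving within a short window are folded into one consolidated update. This is a generic sketch under that assumption, not our exact implementation.

    // Generic sketch: coalesce bursts of change notifications so that the
    // load balancers are not reconfigured on every single event.
    package main

    import (
        "log"
        "time"
    )

    func coalesce(events <-chan struct{}, window time.Duration, flush func()) {
        var timer *time.Timer
        var timerC <-chan time.Time

        for {
            select {
            case <-events:
                // Open a window on the first event; all events arriving
                // within it are folded into a single downstream update.
                if timer == nil {
                    timer = time.NewTimer(window)
                    timerC = timer.C
                }
            case <-timerC:
                timer, timerC = nil, nil
                flush()
            }
        }
    }

    func main() {
        events := make(chan struct{}, 100)
        go coalesce(events, 2*time.Second, func() {
            log.Println("broadcasting one consolidated update")
        })

        // Simulate a burst of changes coming from Consul.
        for i := 0; i < 10; i++ {
            events <- struct{}{}
            time.Sleep(100 * time.Millisecond)
        }
        time.Sleep(3 * time.Second)
    }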

At that point, we were able to expose any type of application present at Criteo behind a VIP within a few seconds of its registration in Consul.

Consul as a source of data

Even though this could be debated at greater length, the Consul K/V store and service metadata are very powerful mechanisms for users to request additional services or to tune default behaviors. There are a lot of things you might want to associate with your VIP's configuration (a sketch of such a registration follows the list below):

  • HTTP to HTTPS redirect
  • Sticky session enabling
  • DNS naming
  • Any HTTP-related configuration
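
As a hedged example, here is how a user could attach such hints when registering a service, again with the Go Consul client. The Meta keys, the health check URL, and the DNS name are hypothetical; they only illustrate the kind of configuration a VIP pipeline could read from the registration.

    // Minimal sketch: register a service with metadata that the VIP
    // pipeline could read. The Meta keys shown are hypothetical examples.
    package main

    import (
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        reg := &api.AgentServiceRegistration{
            Name: "my-app",
            Port: 8080,
            Meta: map[string]string{
                // Hypothetical keys a user could set to tune the VIP.
                "lb-http-to-https":  "true",
                "lb-sticky-session": "cookie",
                "lb-dns-name":       "my-app.example.com",
            },
            Check: &api.AgentServiceCheck{
                HTTP:     "http://localhost:8080/healthz",
                Interval: "10s",
            },
        }

        if err := client.Agent().ServiceRegister(reg); err != nil {
            log.Fatal(err)
        }
        log.Println("service registered with load-balancer metadata")
    }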

Consul as a state reference

Having shown how we were using the Consul service catalog as a source of data, it became natural to use the health checks provided by the same system.
That is why we introduced a service health view, giving a single point of view built from the multiple health checks performed by Consul. It works as follows: the load balancer queries the Consul endpoint present on the machine hosting the service and determines the resulting health of the given service.
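
A minimal sketch of that query, assuming Consul's aggregated agent health endpoint (/v1/agent/health/service/name/<service>), could look like the following in Go; the agent address and service name are placeholders.

    // Minimal sketch: ask the local Consul agent for the aggregated health
    // of a service and map it to a backend state. The address and service
    // name are placeholders.
    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func backendState(agentAddr, service string) (string, error) {
        url := fmt.Sprintf("http://%s/v1/agent/health/service/name/%s", agentAddr, service)
        resp, err := http.Get(url)
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()

        // The endpoint encodes the aggregated health in the status code:
        // 200 = passing, 429 = warning, 503 = critical.
        switch resp.StatusCode {
        case http.StatusOK:
            return "up", nil
        case http.StatusTooManyRequests:
            return "up (degraded)", nil
        default:
            return "down", nil
        }
    }

    func main() {
        state, err := backendState("10.0.0.1:8500", "my-app")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("backend state:", state)
    }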

As you have seen, our load balancer control plane now relied heavily on Consul. From our point of view, it allowed us to transfer the ownership of the VIP load balancer configuration from human-managed assets to the users. Since they are the ones registering their application in Consul, they now also have full control over the application's health and metadata.

Moving things into data centers

We haven’t mentioned it yet, but there was also ongoing work within the network team that was necessary for this shift. We were coming from a traditional L2 data center design. It caused limitations, and we decided to switch to a Clos matrix design. I won’t go into details, as this subject is already well covered in other technical articles, but it was a good opportunity for us to rethink our load balancer design:

  • Avoid specialized racks: we historically put network devices in dedicated areas with more network capacity, which also avoided any ECMP complexity
  • Be able to scale horizontally as needed, like any other application

Outside of network concerns, we also wanted to address a few things regarding our load balancers:

  • Being able to fix long-running issues ourselves, or at least within a reasonable period of time: we relied on vendor-locked solutions, which showed their limits when facing advanced technical issues
  • Better control the overall costs

To resolve this in a few transparent steps, we decided to disaggregate our software load-balancing stack from an all-in-one model into the following:

  • The L4 level would stay in those specialized racks at first, allowing us to change the underlying layer and tackle one issue at a time
  • The L7 (HTTP + TLS) level would run on machines already deployed within the data center

At that point we progressively switched our traffic from a vendor load balancer to HAProxy. The automation described above helped us switch the traffic from one component to the other in a way that was 100% transparent to our internal users and the outside world. We were now able to move things gradually thanks to the Consul indirection.

This allowed a very smooth move from one technology to another.

Switching our traffic from a vendor load balancer to HAProxy

Even if it is quite hard, in a single article, to explain all the steps necessary to transition from a fully manual configuration system to a fully automated one, we now have a solid basis and are more confident to move forward and evolve rapidly. A typical example is our agility in deploying a new version of any component of the HTTP stack, but also our growing ability to observe what’s happening in every part of our LB stack, to propose patches to the community, to rebuild internally, etc. Now that the L7 part is done, we are working on replacing the L4 part and removing its specialization.

In the future, we aim to present more detailed aspects of our L7 stack and of how we are currently transitioning our L4 stack. Stay tuned!
