Criteo Engineering
Oct 23, 2019 · 4 min read

Cloud Native 2019 took place in London and gathered almost 200 cloud practitioners. It was an opportunity to exchange points of view on various topics, to discuss testing in this context at length, and to share feedback on cloud technologies.

Public or private cloud?

This question was the topic of the opening keynote. It was also a question we were asked many times at our booth. The general consensus was that nowadays it makes more sense to use a public cloud (AWS, Google Cloud Platform, …). There are some good reasons to be reluctant to use a public cloud (such as the collection of sensitive data), but there are also several unjustified ones. For instance, several speakers mentioned the fear of being locked in with a cloud provider and then experiencing a price increase. They dismissed this argument based on price evolution data from the last decade: this data shows that the situation has never occurred. There are also several good reasons to use a public cloud, in particular not having to maintain a stack of technologies not directly related to one’s business (with the security risks that come with it), as well as the velocity gained from relying on a cloud that has already proven itself in production.

We had several questions on that matter since we manage our own private cloud: roughly 50,000 servers and several dozen SREs to maintain it. Criteo is actually a bit older than AWS, so when we started, building our own infrastructure was the only option. We started with bare metal and, as we grew, built up expertise in developing and maintaining our private cloud. Now, considering our scale, tooling and expertise, and having reached a high level of optimization, it turns out to be cheaper and more efficient to manage our own infrastructure. Yet we are open to other approaches, and we also use a public cloud for use cases that make no sense to internalize, such as some specific tests and caches.

At Criteo, performance is everything. That means working smart with the best infrastructure we can afford. Sometimes that is a public cloud, but as of today we’re mainly focused on our own private cloud.

Testing

Interestingly, several talks broached this topic at some point. The cloud brings agility, but if the testing strategy doesn’t evolve with it, this benefit can be lost. Microservices also bring new difficulties. For instance, we don’t want developers to have to spawn dozens of services on their own computers just to be able to run the regression tests locally.
Several aspects were discussed, in particular:
- consumer-driven contract tests, to be able to test a service without having to spawn the whole infrastructure
- testing in prod by first deploying to a prod server that receives no traffic (after all the usual regression tests in pre-production have been done), in order to test the integration with the other services in conditions… as close to prod as possible! One small regret here: this talk was a bit short, and there wasn’t much time to discuss how to handle side effects in this context.
- testing for issues we could never have imagined would occur in prod, thanks to Disaster Recovery Testing games, chaos tests, and solid monitoring
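
To make the first item more concrete, here is a minimal sketch of a consumer-driven contract test in plain Python. It is not taken from any of the talks and does not use a contract-testing library: the `Contract` class, the `verify_contract` helper, the `get_product` provider and the `banner-renderer` consumer are all hypothetical names chosen for illustration. The idea is that the consumer publishes the request it sends and the response fields it relies on, and the provider's build verifies its handler against that contract in-process, without spawning any other service.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Response = Tuple[int, Dict[str, Any]]  # (HTTP-like status, JSON-like body)


@dataclass(frozen=True)
class Contract:
    """What one consumer relies on: a request and the response fields it reads."""
    consumer: str
    request: Dict[str, Any]
    expected_status: int
    expected_fields: Dict[str, type]  # field name -> required type


def verify_contract(handler: Callable[[Dict[str, Any]], Response],
                    contract: Contract) -> List[str]:
    """Call the provider handler in-process and return the contract violations."""
    status, body = handler(contract.request)
    violations = []
    if status != contract.expected_status:
        violations.append(f"status {status} != {contract.expected_status}")
    for name, expected_type in contract.expected_fields.items():
        if name not in body:
            violations.append(f"missing field '{name}'")
        elif not isinstance(body[name], expected_type):
            violations.append(f"field '{name}' is not {expected_type.__name__}")
    return violations


# Hypothetical provider handler. Extra fields are fine: the contract only
# pins down what the consumer actually reads.
def get_product(request: Dict[str, Any]) -> Response:
    return 200, {"id": request["id"], "name": "shoe",
                 "price_cents": 4999, "internal_sku": "X1"}


# Hypothetical contract published by a consumer of the product service.
BANNER_CONTRACT = Contract(
    consumer="banner-renderer",
    request={"id": 42},
    expected_status=200,
    expected_fields={"id": int, "name": str, "price_cents": int},
)

if __name__ == "__main__":
    problems = verify_contract(get_product, BANNER_CONTRACT)
    print("contract satisfied" if not problems else problems)
```

In this sketch the provider stays free to add or rename fields that no consumer depends on, while removing or retyping a field listed in a contract fails the provider's own build.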

My team, Test Services, is in the process of redesigning the way we do end-to-end tests at Criteo. Historically we developed sandboxes: isolated environments used to run tests in our build pipeline. These environments replicate a lightweight Criteo datacenter on a few servers. This was quite useful back when we had only a few services, but it has become more and more complex, and more and more painful to maintain. For the past few quarters we have therefore been rethinking the way we tackle this. These talks were an opportunity to step back from what we’re doing, and they gave us new ideas to go one step further.

Operations

One of our Criteos also gave a lightning talk: “Discovery, Consul and Inversion of Control for the infrastructure” (slides are available here). During this talk, he explained how infrastructure is evolving faster every day, and how mixing several kinds of architectures (real datacenters, containers and various cloud providers) is hard to handle from an infrastructure point of view.
He also explained how Criteo implemented a new architecture pattern for infrastructure called Inversion of Control. With this pattern, the infrastructure becomes a live database of all workloads, enriched by the running services themselves. This makes it possible to decouple all tools, innovate faster and create additional value very easily to match business needs.

Criteo uses this pattern to perform all of its load-balancing, generate alerts automatically, and track versions at large scale, with vendor-independent schedulers, meaning across hybrid cloud, container and legacy systems.
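
As a rough illustration of the pattern (and not of Criteo's actual implementation, which the talk's slides describe), the sketch below uses an in-memory `Catalog` class as a stand-in for a discovery registry such as Consul. Every name in it is hypothetical, including the `version` and `alert_latency_ms` metadata keys. Each service instance registers itself along with its own metadata, and independent tools derive load-balancer backends, alert rules and a version inventory from that single source, without knowing which scheduler started the instance.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServiceInstance:
    name: str
    address: str
    port: int
    meta: Dict[str, str] = field(default_factory=dict)  # declared by the service


class Catalog:
    """In-memory stand-in for a discovery registry (assumption, not a real API)."""

    def __init__(self) -> None:
        self._instances: List[ServiceInstance] = []

    def register(self, instance: ServiceInstance) -> None:
        self._instances.append(instance)

    def instances(self, name: Optional[str] = None) -> List[ServiceInstance]:
        return [i for i in self._instances if name is None or i.name == name]


def load_balancer_backends(catalog: Catalog, name: str) -> List[str]:
    """Backends come from what is registered, not from a hand-written config."""
    return sorted(f"{i.address}:{i.port}" for i in catalog.instances(name))


def alert_rules(catalog: Catalog) -> List[str]:
    """Each service declares its own threshold; the alerting tool only reads it."""
    rules = set()
    for i in catalog.instances():
        threshold = i.meta.get("alert_latency_ms")
        if threshold is not None:
            rules.add(f"{i.name}: latency > {threshold}ms")
    return sorted(rules)


def versions_in_use(catalog: Catalog) -> Dict[str, List[str]]:
    """Version inventory per service, whatever scheduler started the instance."""
    found: Dict[str, set] = {}
    for i in catalog.instances():
        found.setdefault(i.name, set()).add(i.meta.get("version", "unknown"))
    return {name: sorted(versions) for name, versions in found.items()}


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(ServiceInstance("bidder", "10.0.0.2", 8080,
                                     {"version": "1.4.0", "alert_latency_ms": "50"}))
    catalog.register(ServiceInstance("bidder", "10.0.0.1", 8080,
                                     {"version": "1.3.9", "alert_latency_ms": "50"}))
    print(load_balancer_backends(catalog, "bidder"))
    print(alert_rules(catalog))
    print(versions_in_use(catalog))
```

The point of the sketch is the direction of the dependency: the tools do not each keep their own list of services; the services describe themselves once and every tool reads the same catalog.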

Being able to structure your data and bring semantics to infrastructure is probably one of the big challenges of the next few years in the era of microservices, and Criteo is pioneering this approach with encouraging results.

Long story short: Cloud Native 2019 was quite an interesting place to be.
See you next year!

Authors: Guillaume Turri, Pierre Souchay and Clément Boone

Criteo R&D Blog

Tech stories from the R&D team