DevOps as Contract

Published in GoEuro Engineering · 7 min read · Jun 5, 2018

In the past four years, GoEuro has grown rapidly to offer train, bus and flight search and booking for more than 80,000 destinations across Europe, serving 20 million customers every month. For our engineering team to support such rapid growth, scalable DevOps is key. Unlike many tech companies, we do not treat DevOps as a team or a toolkit when scaling it up, but rather as a contract.

What is a DevOps contract?

In the ‘real-world’ the concept of a contract is nothing new: an agreement between several parties, enforced by law. Similarly, a DevOps contract is an agreement between consumer services and provider services, enforced by test cases. At GoEuro, we use such a contract with the goal of making our DevOps much more automatable and scalable. So far, we have implemented contracts in the following areas:

Continuous Integration (CI)
Infrastructure
Routing
Logging
Monitoring

All DevOps contracts are developed under the following principles:

A contract must cover the whole development cycle, from development to production, which is also our view on DevOps.
A contract must be simple; in general, the less configurability, the better.
Test cases are the primary documentation for a contract, which must give engineers a clear understanding of how it works.
We eat our own dog food. We don’t build something only for others. We build and use it first for our own use cases before shipping to other teams.
Each contract must have its own use case and independent customers. It should not be just a ‘pipe’ to another contract. We do not want fancy architecture with ‘fake’ component boxes and arrows.
The contract should not be static, but evolve on demand. Compatibility must be respected, but for a defined time.

By implementing contracts, we purposely reduce the flexibility of our tools and services. This seems counterintuitive but in fact makes sense: Swiss army knife-style flexibility is often desirable for engineers building things, but it can be a nightmare for DevOps when there are hundreds of different services in production (e.g. dependencies become hard to track and processes become difficult to automate). So with DevOps contracts, we stay away from the Swiss army knife and build factory pipelines instead, which in turn makes it possible to automate and scale our DevOps significantly.

For DevOps, simplicity scales.

DevOps contract turns flexible tools that are hard to automate into simple yet scalable pipelines (Icons created by Ben Davis and Laymik from Noun Project)

In the rest of this blog post, we present the improvements we made to Continuous Integration, infrastructure, and routing by implementing contracts.

Continuous Integration

Before implementing the contract, Continuous Integration (CI) at GoEuro was treated as Jenkins-as-Tool. For only eight services, we had huge, complex CI jobs. Those jobs were partially configured first by application teams then by DevOps, without clear boundaries. We had several dedicated release managers who needed to maintain a ‘mind map’ of all releases: branches, versions, and parameter permutations. At the same time, engineers had to ping DevOps teams for each release. In addition, many different CI plugins were installed for different teams. As a result, almost every Jenkins or Jenkins-plugin upgrade broke some other job, causing problems for everyone. Agent configuration, auto-scaling, and job execution were also problematic. We will share more details on those areas in future blog posts.

We tried Jenkinsfile to mitigate the problem, but this wasn’t successful because our mindset remained the same: we still saw CI as a tool. We had a huge number of repetitive configurations with the same random plugins, and faced the same problems as before. At that time, we could only manage six to eight releases per week, which clearly wasn’t enough for our rapidly growing business.

Then we moved to CI-as-Contract. On one hand, application teams simply specified a pipeline job with a container image, scripts, and manual checkpoints in a simple YAML file. On the other hand, DevOps was responsible for the implementation, and focused mainly on improving the infrastructure and DevOps environment without dependencies on other teams. Features like build caching, notifications, autoscaling agents, analytics, organizational context, auditing and much more were progressively and automatically added for everyone using the contract. With this, we knew exactly how CI could and could not be used, which made it possible to upgrade the entire CI infrastructure without breaking existing jobs. Today, we run more than 600 releases and 2,000 jobs a week, spawning more than 150 short-lived VMs on demand every day, as shown in the figure below.
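As a rough illustration of how such a contract could be enforced, the sketch below validates a pipeline spec against a small, closed set of fields. The field names (`image`, `steps`, `name`, `script`, `manual`) and the `validate_pipeline` helper are assumptions made for this example, not GoEuro's actual schema; the spec is shown as the Python dict that a YAML file would parse into.

```python
# Hypothetical CI contract: application teams may only declare these fields.
ALLOWED_PIPELINE_KEYS = {"image", "steps"}
ALLOWED_STEP_KEYS = {"name", "script", "manual"}


def validate_pipeline(spec: dict) -> list:
    """Return a list of contract violations; an empty list means the spec is valid."""
    errors = []

    # Anything outside the contract (e.g. custom plugins) is rejected outright.
    unknown = set(spec) - ALLOWED_PIPELINE_KEYS
    if unknown:
        errors.append(f"unknown pipeline keys: {sorted(unknown)}")

    image = spec.get("image")
    if not isinstance(image, str) or not image:
        errors.append("'image' must be a non-empty string")

    steps = spec.get("steps")
    if not isinstance(steps, list) or not steps:
        errors.append("'steps' must be a non-empty list")
        return errors

    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            errors.append(f"step {i}: must be a mapping")
            continue
        unknown = set(step) - ALLOWED_STEP_KEYS
        if unknown:
            errors.append(f"step {i}: unknown keys: {sorted(unknown)}")
        if not isinstance(step.get("name"), str):
            errors.append(f"step {i}: needs a 'name'")
        is_manual = step.get("manual", False)
        if not isinstance(is_manual, bool):
            errors.append(f"step {i}: 'manual' must be true or false")
        # A manual checkpoint waits for a human; every other step must run a script.
        elif not is_manual and not isinstance(step.get("script"), str):
            errors.append(f"step {i}: needs a 'script' unless it is a manual checkpoint")
    return errors


# Example spec with illustrative values.
example = {
    "image": "maven:3-jdk-8",
    "steps": [
        {"name": "build", "script": "mvn package"},
        {"name": "approve-release", "manual": True},
        {"name": "deploy", "script": "./deploy.sh"},
    ],
}
print(validate_pipeline(example))
```

Because the set of accepted keys is closed, the team implementing the contract can change what happens behind each field (agents, caching, notifications) without touching any service's spec.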

Infrastructure

For a long time before adopting contracts, we had leased machines with centralized configuration management using Salt, and manually packed and scheduled services on VMs. Those packing configurations were heavily customized to the requirements of specific services, and were further fragmented to accommodate CI and other environments. Engineers needed to follow an elaborate process to make any change, which became a constant stream of work that took priority over more important tasks. Not surprisingly, this also made end-to-end testing slow and complex.

In late 2015, we started using Kubernetes. Defining physical resources as an API was great. But Helm, the tool for managing Kubernetes deployments, became the new Salt: setting up a service still required specifying many variables and configurations, and end-to-end testing was still an elusive dream. Then, engineering teams started self-deploying customized configurations without health checks, rollout policies, or resource limits. Our infrastructure was still a Wild West, and we were struggling to scale beyond 8 services.

Then we redesigned Helm-as-contract with a heavily restricted scope (via our whitelisting contract), and we started treating Kubernetes only as a contract. We allowed each team to declare what resources they needed in common Kubernetes and Helm terminology in their own repositories, and we wrote an API that automatically fetched, generated and orchestrated configuration for any service with one variable ‘environment’ specified. With this contract, every engineer in the company could run any service just by knowing the service name and environment name. This allowed engineers to run any subset of GoEuro services in minikube (developer VMs), hyper-VMs (our development VM images on the cloud), end-to-end tests, CI, all the way up to QA and production. All the permutations of variables and configurations disappeared as well. In addition, this contract brought other features, such as the ability to lint, instrument and modify all resource configurations in any environment, resource usage validation and restriction, and health checks.
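A minimal sketch of what a whitelisting contract with a single `environment` variable could look like is shown below. The whitelist contents, the `check_manifest` and `render` helpers, and the one-namespace-per-environment convention are assumptions for illustration only; the manifest keys (`kind`, `resources.limits`, `readinessProbe`) follow standard Kubernetes terminology.

```python
import copy

# Hypothetical whitelist: only these resource kinds are part of the contract.
ALLOWED_KINDS = {"Deployment", "Service", "ConfigMap"}


def check_manifest(manifest: dict) -> list:
    """Return contract violations for one declared Kubernetes resource."""
    kind = manifest.get("kind")
    if kind not in ALLOWED_KINDS:
        return [f"kind {kind!r} is not whitelisted"]

    errors = []
    if kind == "Deployment":
        pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
        for container in pod_spec.get("containers", []):
            name = container.get("name", "<unnamed>")
            # The contract insists on resource limits and a health check.
            if not container.get("resources", {}).get("limits"):
                errors.append(f"container {name}: missing resource limits")
            if "readinessProbe" not in container:
                errors.append(f"container {name}: missing readiness probe")
    return errors


def render(service: str, environment: str, manifests: list) -> list:
    """Validate a service's declared resources and stamp them for one environment."""
    rendered = []
    for manifest in manifests:
        errors = check_manifest(manifest)
        if errors:
            raise ValueError(f"{service}: " + "; ".join(errors))
        stamped = copy.deepcopy(manifest)  # never mutate the team's declaration
        metadata = stamped.setdefault("metadata", {})
        metadata["namespace"] = environment  # assumption: one namespace per environment
        metadata.setdefault("labels", {})["service"] = service
        rendered.append(stamped)
    return rendered
```

With helpers like these, running a service in any environment reduces to a call such as `render("search", "qa", manifests)`, where `"search"` and `"qa"` are placeholder names: the caller supplies only the service and environment, and every policy check lives in one place.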

Our first production cluster was created by the end of 2015. Since then we have upgraded our entire infrastructure from Kubernetes 1.2 to 1.10 without breaking any services. Modelling Helm as a contract allowed us to trace, enforce policies and prevent snowflakes in the single place where the contract is implemented, and gave a streamlined, simplified workflow to all engineers. Today, we have more than 300 services, and it only takes 30 minutes to bootstrap a new service from zero to production.

Improvements on infrastructure from adopting contract (Icons created by Icon Solid, Royal Icon from Noun Project)

Routing

When GoEuro started, we had only a few services, one router handling all traffic, and hardcoded service discovery. As we added more services, we created many custom and inconsistent routing rules, nested rewrites and redirects, and randomly captured URL paths. This led almost every service to carry custom nginx forwarders inside its containers, which made the routing configuration within each service a black box and our routing graph look random. Even worse, we didn’t have infrastructure as contract at that time to limit the usage of VMs, so different groups of VMs also carried their own proxies. As a result, we couldn’t scale beyond 10 services, since those 10 already formed an unmanageable routing graph.

Then, we adopted Ingress-as-Contract with heavily automated yet restricted rules. Services were not allowed to capture “random” routing paths anymore, and every service was assigned a fixed route based on its service ID. Instead of building a router that is too big to fail, with dozens of routing possibilities, we heavily simplified it and made it work from development to production. The implementation of the ingress controller became secondary.
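To make the idea of a fixed route per service ID concrete, here is a small sketch. The path scheme (`/<service-id>/`), the naming rule, and the `route_for` and `resolve` helpers are hypothetical; the post does not describe the actual route format.

```python
import re

# Hypothetical naming rule for service IDs: lowercase letters, digits, hyphens.
SERVICE_ID = re.compile(r"[a-z][a-z0-9-]*")


def route_for(service_id: str) -> str:
    """Derive the one route a service owns; it cannot be customized."""
    if not SERVICE_ID.fullmatch(service_id):
        raise ValueError(f"invalid service id: {service_id!r}")
    return f"/{service_id}/"


def resolve(path: str, services: set):
    """Map a request path to the service that owns it, or None if no service does."""
    # "/search/v1/trips".split("/") yields ["", "search", "v1", "trips"].
    segments = path.split("/")
    if len(segments) < 3 or segments[0] != "":
        return None
    candidate = segments[1]
    return candidate if candidate in services else None


services = {"search", "booking"}
print(route_for("search"))
print(resolve("/search/v1/trips", services))
print(resolve("/checkout/pay", services))
```

Since the route is a pure function of the service ID, the same rule holds in every environment, and the routing graph can be reconstructed from the list of services alone.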

With this routing contract, we quickly had consistent and predictable routing. On top of this, we added more features such as automatic logging and monitoring, global health and SLA checks, extensive instrumentation, scalability, load balancing, edge gateways, cross-zone failovers, dashboards and network policies. Today, we have more than 50 million requests routed through 300 services every day.

What’s next

So far, we have automated most critical and routine DevOps tasks with the contract-centric approach. Given the significant amount of effort we have spent on automation, we are seeing (naturally and not surprisingly) a decreasing gain in productivity from spending extra effort. In a fast-changing environment, there will always be some last-mile tasks that are not automated. For those tasks, we collaborate with application teams in a non-contract-centric way, although instances of this are getting fewer and fewer over time. And of course, we are still automating last-mile tasks that occur often enough.

*Diminishing return on productivity as effort on automation increases*

In the end, the goal of DevOps at GoEuro is to deliver the value created by engineers to our customers in a fast and reliable way, and DevOps-as-Contract helps us become much more efficient at this. In future blog posts, we will dive deeper into the technical details of our use of contracts in CI, infrastructure, logging, routing, and more.

Interested in solving this type of problem? Join us!

– Subhas Dandapani & Boxun Zhang