Skyscanner’s journey to effective observability
The year was 2020 and Skyscanner, like the entire travel industry, faced unprecedented challenges due to the global COVID-19 pandemic. Yet this difficult year also provided an opportunity for introspection, prompting us to enhance our tools and processes so we could emerge more resilient than ever and be the world’s number one travel ally. This is where our journey to completely revolutionise our approach to observability begins.
The image below shows a “simplified” view of what our internal observability platform looked like at the time. As you can see, there was some room for simplification. This platform contained a mix of specialised vendors for RUM, tracing, or synthetics, and a large number of internal systems based on open-source backends like OpenTSDB, Prometheus, or multiple ELK stacks.
However, our challenges were not simply related to cost, or the complexity of running this platform with a small team. We understood that our most important problem to solve was improving the confidence of all engineers to understand and operate their services, to reliably connect more than 110 million users to over 1,200 flight, hotel and car hire partners each month. This required an observability platform that would…
- Reduce cognitive load and context switching for engineers, with one single platform, and one single telemetry language.
- Correlate traces, metrics, logs, and events, across multiple services and frameworks, from client devices all the way down to Kubernetes containers. Our components don’t work in isolation, and neither should the signals we use to observe them.
- Optimise the cost and quality of the data produced, storing the data we need to operate our systems reliably, and no more. Meaningful, contextual data can be cheaper than low-quality, verbose data.
- Implement open standards to future-proof our instrumentation and transport layers, to ready our tech stack for changes in the overall industry while reducing maintenance overhead.
If you know anything about observability, you’ll know that none of those are trivial problems to solve. We didn’t need a lift-and-shift, we needed a mindset shift.
OpenTelemetry and New Relic, central pieces in our cloud native strategy
Solving the challenges above required a long-term strategy and a set of guiding policies. So we got to work, and after weeks of PoCs, reviews, and prototypes, we defined a set of individual strategies based on two major principles:
- One single standard to instrument our services and transport our data, OpenTelemetry.
- One single backend to store and analyse our data, New Relic.
Skyscanner runs on cloud native tooling. We have a team of world-class platform engineers contributing to several CNCF projects like Kubernetes, Istio, or Argo. When a new kid on the block called OpenTelemetry started to make noise back in 2019, we listened. All our distributed tracing was already based on OpenTracing, now deprecated in favour of OpenTelemetry along with OpenCensus and other standards like ECS (Elastic Common Schema). Take that, XKCD 927: we’re a net 2 positive! This was a logical next step for tracing, aligned with our cloud native ethos.
But we didn’t stop there. With OpenTracing, we had already experienced the benefits of decoupling cross-cutting APIs from their implementation, allowing us to switch vendors with no changes to instrumentation code. OpenTelemetry follows the same client design principle. So we knew OpenTelemetry was not only going to be the next step for tracing; it was going to change the whole observability industry. We’re all in on the vision of high-quality, standardised, portable, and ubiquitous telemetry provided by OpenTelemetry. We decided to double down and base our strategy on the use of these open standards for context propagation, semantic conventions, transport protocols, and APIs across traces, metrics, logs, and baggage. The future is OTel-native, not APM agents, and we’re ready for it.
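To make that decoupling concrete, here is a minimal Python sketch (not our actual service code) of vendor-neutral instrumentation: the module depends only on the OpenTelemetry API, and the scope name, span name, and attributes are hypothetical.

```python
# Instrumentation depends only on the vendor-neutral OpenTelemetry API; which
# SDK, exporter, or backend sits behind it is decided wherever the process is
# configured, so the backend can change without touching this code.
from opentelemetry import trace

# Returns a no-op tracer unless an SDK has been installed by the host
# application, so this module carries no dependency on any specific vendor.
tracer = trace.get_tracer("flight-search")  # hypothetical instrumentation scope

def search_flights(origin: str, destination: str) -> list:
    with tracer.start_as_current_span("search_flights") as span:
        span.set_attribute("search.origin", origin)       # illustrative attributes
        span.set_attribute("search.destination", destination)
        return []  # placeholder for the real search logic
```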
We wanted to use a single observability vendor that could support the capabilities we required under a single platform, but we didn’t want to compromise our stack’s readiness for the future of observability. This was one of the main reasons we chose New Relic as our observability platform and started a partnership: a platform that could ingest telemetry using the standard OTLP protocol and use OpenTelemetry semantic conventions to provide enhanced analytics on top of our data. This allowed us to rely on open standards while OpenTelemetry stabilised tracing, metrics, and logs, complementing them with New Relic instrumentation in the limited places where we felt OpenTelemetry was not yet ready at the time, like mobile or browser. We’re now closer than ever to achieving our North Star architecture, illustrated below.
Mindful migrations for higher return-on-investment
Our motto for Skyscanner platform engineering is the following: make the golden path the path of least resistance. If the easiest way is also the one that follows best practices and engineering standards, why would you do anything different?
To help with this, at Skyscanner we provide a set of core libraries that automatically configure certain aspects out of the box, like security, identity, or telemetry SDKs. This is where we implement opinionated defaults: for instance, how we aggregate data — cumulative temporality for Prometheus or delta temporality for New Relic — or where we export that data, which in our case is a centralised Collector Gateway.
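As an illustration, the sketch below shows the kind of defaults such a core library could apply with the OpenTelemetry Python SDK; the Collector Gateway endpoint and the exact temporality mapping are assumptions for the example, not our real configuration.

```python
# Hypothetical core-library defaults: OTLP export to a central Collector
# Gateway, with delta temporality (what New Relic expects); a Prometheus-backed
# setup would keep the default cumulative temporality instead.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import Counter, Histogram, MeterProvider, UpDownCounter
from opentelemetry.sdk.metrics.export import (
    AggregationTemporality,
    PeriodicExportingMetricReader,
)

exporter = OTLPMetricExporter(
    endpoint="http://otel-collector-gateway:4317",  # assumed gateway address
    preferred_temporality={
        Counter: AggregationTemporality.DELTA,
        Histogram: AggregationTemporality.DELTA,
        UpDownCounter: AggregationTemporality.CUMULATIVE,
    },
)
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PeriodicExportingMetricReader(exporter)])
)
```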
To execute a migration, an API design like OpenTracing’s (and OpenTelemetry’s) makes things a lot easier. You can simply swap the underlying implementation with no changes to instrumentation code. This allowed us to migrate over 300 microservices in a matter of weeks, only requiring service owners to bump the version on one of those core libraries.
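For services instrumented with OpenTracing, that swap can be as small as the sketch below (assuming the opentelemetry-opentracing-shim package): the shim exposes an OpenTracing-compatible tracer backed by the OpenTelemetry SDK, so existing call sites keep working unchanged.

```python
# Replace the global OpenTracing tracer with an OpenTelemetry-backed shim;
# existing OpenTracing instrumentation now emits OpenTelemetry spans.
import opentracing
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.shim.opentracing_shim import create_tracer

provider = TracerProvider()  # span processors/exporters configured elsewhere
trace.set_tracer_provider(provider)
opentracing.set_global_tracer(create_tracer(provider))
```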
Other migrations, like logging, required some extra config to be changed in each service. Here, we used one of our own open-sourced tools, Turbolift, to automatically create over 1,000 pull requests to different repos and change their appender settings. Finally, in cases where we still had to rely on New Relic SDKs, we did so through very thin wrappers, with the intention of facilitating a final move to the OpenTelemetry APIs when ready.
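As a rough illustration of the thin-wrapper idea (the module and function names here are hypothetical, not our real code), call sites depend on a tiny internal facade rather than on the vendor SDK, so a later move to the OpenTelemetry API only touches one place:

```python
# Hypothetical internal facade over a vendor SDK. Swapping the body for an
# OpenTelemetry API call later would not require changes at any call site.
import newrelic.agent

def record_business_event(name: str, attributes: dict) -> None:
    """Record a custom event; the backing implementation is an internal detail."""
    newrelic.agent.record_custom_event(name, attributes)
```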
However, we knew that a lot of the data produced by our services was not being used by their owners. This is where a lesser-known benefit of the OpenTelemetry semantic conventions for Resource attributes helped. We now know exactly which service, account, namespace, or cluster telemetry is produced from, because those attributes are standard across all signals. With this, we can distribute telemetry costs back to service owners, so they understand the cost their telemetry generates, and we can provide guidance for better return on investment.
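The sketch below shows what those Resource attributes look like in the OpenTelemetry Python SDK; the attribute keys are standard semantic conventions, while the values are invented for illustration (in practice a core library would populate them from the environment).

```python
# Standard Resource attributes attached to every signal a service emits.
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "flight-search",        # hypothetical service
    "service.namespace": "search",
    "cloud.account.id": "123456789012",     # invented values
    "k8s.cluster.name": "prod-eu-west-1",
    "k8s.namespace.name": "flight-search",
})
# Passing this Resource to the tracer, meter, and logger providers means every
# span, metric point, and log record carries the same ownership attributes,
# which is what makes per-team cost attribution possible.
```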
And then something magical happened: we had teams that wanted to find more effective ways of using telemetry. We taught them about distributed tracing, how it provides much more granular, higher-quality insights into our systems than logging or metrics, and we explained the advanced sampling it enables, storing only data about the transactions that matter — the slowest ones, or those with an error somewhere in the stack. This allows us to store about 4% of the 2M spans and 80K traces we produce every second, without losing any of the data that matters for debugging. When they saw the advantages, they were convinced, and started to rely on tracing rather than verbose logging or high-cardinality metrics. This helped some teams reduce their telemetry costs by over 90%!
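To make the sampling idea concrete, here is an illustrative sketch of the keep/drop decision for a finished trace; in practice this kind of tail-based decision typically runs in a collector processor rather than in application code, and the thresholds and rates below are made up.

```python
# Illustrative tail-based sampling decision: keep every trace with an error or
# unusually high latency, plus a small random baseline of normal traffic.
import random

LATENCY_THRESHOLD_MS = 2_000   # assumed "slow" threshold
BASELINE_SAMPLE_RATE = 0.01    # assumed baseline keep rate

def keep_trace(spans: list[dict]) -> bool:
    """Decide whether to keep a finished trace, given all of its spans."""
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    slowest = max((span.get("duration_ms", 0) for span in spans), default=0)
    if slowest > LATENCY_THRESHOLD_MS:
        return True
    return random.random() < BASELINE_SAMPLE_RATE
```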
Because of the cultural change needed in how we approach telemetry instrumentation, not everything could be a version bump or an automated pull request. We asked service owners to evaluate the telemetry they produce, use the automatic instrumentation provided by OpenTelemetry where possible and, only if needed, rely directly on the OpenTelemetry API to instrument custom aspects, using the right signal (metrics, traces, or logs) for each use case.
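Where custom instrumentation is genuinely needed, it can stay on the vendor-neutral API; the sketch below (with invented scope, metric, and attribute names) records an aggregate as a low-cardinality metric and per-request detail as span attributes instead of verbose logs.

```python
# Custom instrumentation against the OpenTelemetry API only, choosing the
# signal that fits: a counter for aggregates, span attributes for per-request
# detail.
from opentelemetry import metrics, trace

meter = metrics.get_meter("partner-quotes")   # hypothetical scope name
quotes_counter = meter.create_counter(
    "partner.quotes.returned",
    description="Quotes returned per partner response",
)

def handle_partner_response(partner: str, quotes: list) -> None:
    quotes_counter.add(len(quotes), {"partner": partner})
    trace.get_current_span().set_attribute("partner.quotes.count", len(quotes))
```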
Finally, when thinking about dashboards and alerts, we also applied the principles of reusability and modularity we would apply to any software we write, reducing cognitive load and maintenance toil for service owners. Thanks to Terraform modules and common resources, we can provide pre-canned alert definitions, or standard dashboards that every engineer can re-use for their owner service. Paired with the use of Atlantis, this allowed us to follow standard CI/CD workflows, with changes to alerts and critical dashboards being reviewed as code. This has improved the quality of our alerting and dashboards, with no more unwanted changes.
From the technical to the sociotechnical
I could (and did) write pages and pages about the different technical decisions we made, and why we made them. However, the most difficult aspect of a system to change is not the technology, it’s the humans. All this observability data would be fairly useless if it isn’t used by people to ultimately improve how travellers experience Skyscanner.
One aspect of this cultural shift is changing how engineers approach monitoring and debugging a system, so they use observability tooling effectively. Humans are creatures of habit, and when a team has been operating with a certain style of runbooks for years, it’s difficult to show them that there may be better ways, and that there are unknown unknowns lurking in the system that can render those runbooks useless when they’re paged in the middle of the night. The tooling provided by observability platforms, like New Relic in our case, can give engineers access to more advanced analytics to automatically correlate anomalies, forecast trends, or profile errors. This helps them find regressions, or optimisations, relying on facts rather than intuition and prior knowledge.
To tackle this cultural change, we kick-started an initiative called Observability Ambassadors. We wanted to bring the best practices in observability to the application domain. These ambassadors are proficient in the use of the OpenTelemetry API and instrumentation packages, and help bridge the gap between observability engineers, experts in transport pipelines and SDK config, and developers in their corresponding teams.
Observability Ambassadors help to advise others and drive initiatives. This becomes a lot easier if you can make it fun! Last year we started hosting Observability Game Days using the official OpenTelemetry Demo, to gamify system debugging and show engineers how OpenTelemetry instrumentation, and advanced tooling, can help them to understand their systems better.
The final piece of the puzzle, and one that cannot go amiss, is connecting telemetry back to Skyscanner travellers, and to business value. This is where SLOs come into the picture. In the past, our approach to SLOs at Skyscanner was based purely on RED (Rate, Errors, Duration) metrics. But in reality, do travellers care about API response codes? Not really. With access to client-side telemetry, we can drive SLOs from signals that relate directly to our users, like “how many flight searches displayed valid results?”. Most importantly, thanks to distributed tracing, we can now understand which services are part of the critical path for a given experience. This helps set appropriate SLO targets, aligned across services, to meet our overall commitments to our travellers. All part of an interconnected system.
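As a toy example of what that shift means in numbers (all figures invented), an SLI defined on a traveller-facing signal turns directly into an error budget that every team on the critical path shares:

```python
# Hypothetical figures: flight searches as seen from the client, not API codes.
searches_total = 1_000_000
searches_with_valid_results = 997_500

sli = searches_with_valid_results / searches_total   # 0.9975 -> 99.75%
slo_target = 0.995                                    # assumed 99.5% target

error_budget = 1 - slo_target                         # 0.5% of searches may fail
budget_used = (1 - sli) / error_budget                # 0.5 -> half the budget spent
print(f"SLI={sli:.2%}, target={slo_target:.2%}, error budget used={budget_used:.0%}")
```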
This has started a fundamental change in how we look at cross-domain dependencies, and how we approach discussions between the different teams required to provide a reliable service. We’re using observability not only as a technical tool, but also as a sociotechnical tool, to help us reason about our system and make data-driven decisions. We base our commitments on evidence, not intuition.