Microservices and containers multiplied the complexity of Ticketmaster’s software system. Its engineers solved their debugging problems with Jaeger, an open-source tracing tool from Uber incubating at CNCF.
At a glance
Ticketmaster’s software system:
- Billions of transactions per day
- Some remaining monoliths
- Over 300 microservices (and other services)
- 1,000 people on Product and Engineering
Challenges:
- Efficiently debugging a mission-critical, revenue-generating system built on microservices
- Onboarding developers and quickly bringing them up to speed with the system map
- Finding an alternative to log aggregation for debugging, which does not scale for a concurrent system generating terabytes of log data per day
Solution:
- Jaeger tracing
Impact:
- Complete visibility into 50 services run by seven teams in three locations
- Fewer team members needed to find the root cause of issues
- Faster onboarding of new employees, especially those joining on-call rotas
Ticketmaster Entertainment, Inc. is a ticket sales and distribution company based in Beverly Hills, California, with operations around the world. It was founded in 1976. In 2010, it merged with Live Nation to become Live Nation Entertainment.
Ticketmaster’s distributed software system is a superhighway for billions of transactions per day. It consists of over 300 microservices, plus other types of services and monolithic applications. Smooth operations are crucial to its online sales platform and phone-call routing system. Latency or bugs can easily cost the company ticket sales and revenue.
The Ticketmaster team found traditional logging insufficient to monitor and debug their increasingly complex software system. They had broken monoliths into hundreds of microservices, migrated workloads to the AWS cloud and deployed to the Kubernetes platform for container orchestration. The scale to which the system had grown dwarfed anything team members had dealt with in the past. How were they supposed to get visibility across all those disparate systems and repair issues in a reasonable time frame?
Related to that was the fresh challenge of training employees adequately for this ever-evolving system, particularly in terms of onboarding and on-call processes. The growing business was hiring aggressively and found it difficult to integrate new developers quickly and without hiccups. It also wanted an up-to-date visual diagram of the system to help engineers get their bearings and start contributing code as soon as possible.
Ticketmaster Engineering fed both birds with one seed when it decided to adopt Jaeger for end-to-end distributed tracing.
Previously, the company used its own correlation IDs to connect the dots in its system through logs. However, increasing concurrency demanded centralized software that could respond and scale on its own. Louis-Étienne Dorval, Lead System Engineer at Ticketmaster, remembers realizing that logs were no longer a viable solution.
“At a scale of terabytes of logs ingested per day, there was a key moment when we decided to look for an alternative. I used to look at a 20-lines-long logs query with tens of regular expressions, joins and searches just to be able to aggregate the logs from ten services in an ordered way with our correlation id. There had to be a better solution.”
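To make the correlation-ID approach concrete, here is a minimal Python sketch of the kind of stitching such a log query performs: filter lines from several services by a shared ID, then order them by timestamp. The log format, the service names and the `stitch` helper are hypothetical, invented for illustration; they are not Ticketmaster's.

```python
from datetime import datetime

# Hypothetical log lines: "<ISO timestamp> <service> cid=<correlation id> <message>"
LOGS = [
    "2020-01-01T10:00:00.300 payment cid=abc123 charge accepted",
    "2020-01-01T10:00:00.100 gateway cid=abc123 request received",
    "2020-01-01T10:00:00.150 gateway cid=zzz999 request received",
    "2020-01-01T10:00:00.200 inventory cid=abc123 seats reserved",
]

def stitch(lines, correlation_id):
    """Return (timestamp, service, message) for one request, oldest first."""
    events = []
    for line in lines:
        timestamp, service, cid, message = line.split(" ", 3)
        if cid == f"cid={correlation_id}":
            events.append((datetime.fromisoformat(timestamp), service, message))
    return sorted(events)

for timestamp, service, message in stitch(LOGS, "abc123"):
    print(timestamp.time(), service, message)
```

This is manageable for four lines in one list. Across many services and terabytes of logs per day, the same join turns into the long query Dorval describes.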
Kraig Amador, Senior Director at Ticketmaster, set out to find a solution that would help the company visualize its entire software system and connect the dots.
After scrutinizing the options on the market, the Ticketmaster team opted for the Jaeger tracing platform.
They loved that the tool was open source. It meant they could see exactly what the code was doing and could contribute features as necessary. Also, for an organization of Ticketmaster’s size, adopting proprietary software could be costly and time-consuming. With Jaeger, they could start small on top of their own infrastructure without big upfront commitments.
According to Amador:
“Jaeger Tracing is helping us achieve our vision of observability capabilities across different versions of our platforms, how they integrate with each other and how they have grown over time.”
Ticketmaster started by instrumenting the infrastructure layer in the application, and achieved consistent visibility across services faster as a result. Then a slew of other benefits followed.
Jaeger is designed to provide a macro and micro perspective of software systems.
Ticketmaster engineers were thrilled with Jaeger and its ability to visualize their whole system. They can now see the request flow in a directed acyclic graph (DAG) view and in a Gantt chart annotated with the SQL queries executed, latency information and trace diffs. Trace diffs, which Jaeger recently launched, compare the structural aspects of two traces. They highlight the differences between the execution trees with a color-coding system on top of the span chart.
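As a rough illustration of what a structural comparison of two traces involves, the Python sketch below counts parent-to-child operation edges in each trace and reports the edges that one trace has and the other lacks. It is a simplified model, not Jaeger's implementation; the span fields (`id`, `parent`, `op`) and the sample traces are assumptions made for this example.

```python
from collections import Counter

def edges(spans):
    """Count parent-operation -> child-operation edges in one trace."""
    by_id = {span["id"]: span for span in spans}
    counts = Counter()
    for span in spans:
        parent = by_id.get(span["parent"])
        if parent is not None:
            counts[(parent["op"], span["op"])] += 1
    return counts

def diff_traces(a, b):
    """Return edges that occur more often in b (added) or in a (removed)."""
    edges_a, edges_b = edges(a), edges(b)
    return {"added": dict(edges_b - edges_a), "removed": dict(edges_a - edges_b)}

# Hypothetical traces of the same request before and after a change.
TRACE_A = [
    {"id": 1, "parent": None, "op": "GET /event"},
    {"id": 2, "parent": 1, "op": "SELECT seats"},
]
TRACE_B = [
    {"id": 1, "parent": None, "op": "GET /event"},
    {"id": 2, "parent": 1, "op": "SELECT seats"},
    {"id": 3, "parent": 1, "op": "SELECT seats"},
    {"id": 4, "parent": 1, "op": "GET /pricing"},
]

print(diff_traces(TRACE_A, TRACE_B))
```

In this example the second trace runs the seat query one extra time and adds a pricing call, which is the kind of difference a color-coded diff view would surface.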
Sampling is a technique used to record only a subset of all traces to keep storage bills manageable. Jaeger’s Remotely Controlled Sampling allows central configuration and management of sampling strategies. Ticketmaster relies on this feature to adjust sampling on a per-service basis at runtime, so teams do not need to redeploy every time they change it.
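The idea of centrally managed, per-service sampling can be sketched in a few lines of Python. The `StrategyStore` class and `should_sample` function below are hypothetical stand-ins written for this example, not Jaeger APIs; in a real deployment the instrumented services would fetch their strategy from a central backend rather than share an in-process object.

```python
import hashlib

class StrategyStore:
    """Stand-in for a central service that holds per-service sampling rates."""

    def __init__(self, default_rate):
        self.default_rate = default_rate
        self.rates = {}

    def set_rate(self, service, rate):
        self.rates[service] = rate

    def rate_for(self, service):
        return self.rates.get(service, self.default_rate)

def should_sample(store, service, trace_id):
    """Decide deterministically: hash the trace ID into [0, 1) and compare."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < store.rate_for(service)

store = StrategyStore(default_rate=0.0)
print(should_sample(store, "checkout", "trace-1"))  # rate 0.0: never sampled
store.set_rate("checkout", 1.0)                      # changed centrally, no redeploy
print(should_sample(store, "checkout", "trace-1"))  # rate 1.0: always sampled
```

Because the decision reads the rate at call time, changing the rate in the store changes behavior immediately, which mirrors the "no redeploy" property described above.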
Jaeger’s Adaptive Sampling is an advanced type of Remotely Controlled Sampling. It adds two capabilities that regular sampling lacks: (a) it guarantees that at least a minimal number of traces is collected from service endpoints with low QPS (queries per second), and (b) it allows fine-grained control of sampling strategies per endpoint, rather than per service or globally.
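The two properties can be illustrated with a small Python sketch that combines a probabilistic decision with a per-endpoint floor of sampled traces per second. This is a simplified model under stated assumptions, not Jaeger's algorithm: the class name, its parameters and the one-second window are all invented for this example.

```python
import random

class GuaranteedThroughputSampler:
    """Probabilistic sampling plus a per-endpoint floor of traces per second."""

    def __init__(self, rate, min_per_second, rng=None):
        self.rate = rate                      # probability of sampling a request
        self.min_per_second = min_per_second  # floor applied to each endpoint
        self.rng = rng or random.Random()
        self._window = {}  # endpoint -> (second, traces sampled in that second)

    def should_sample(self, endpoint, now):
        second = int(now)
        window_second, count = self._window.get(endpoint, (second, 0))
        if window_second != second:
            count = 0                         # a new one-second window starts
        sampled = self.rng.random() < self.rate
        if not sampled and count < self.min_per_second:
            sampled = True                    # floor: keep low-QPS endpoints visible
        if sampled:
            count += 1
        self._window[endpoint] = (second, count)
        return sampled

sampler = GuaranteedThroughputSampler(rate=0.0, min_per_second=1)
print(sampler.should_sample("/checkout", now=10.0))  # floor applies
print(sampler.should_sample("/checkout", now=10.5))  # floor already met this second
print(sampler.should_sample("/search", now=10.6))    # separate floor per endpoint
```

Even with the probability set to zero, each endpoint still yields one trace per second here, so a rarely called endpoint does not disappear from the tracing data.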
Jaeger incorporates Kafka-based ingestion so users can build data-mining tools. For example, they can build real-time service graphs using Apache Flink streaming jobs on top of tracing data, something Ticketmaster has experimented with. This was not possible in the past because no other technology could replicate data captured by traces.
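To show why trace data lends itself to this kind of mining, here is a minimal Python sketch that derives a service graph from the parent and child links between spans. It runs as a batch function over an in-memory list and uses no Kafka or Flink APIs; a streaming job would apply comparable logic to spans as they arrive. The span fields and service names are assumptions made for the example.

```python
from collections import Counter

def service_edges(spans):
    """Count caller -> callee service pairs from parent/child span links."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    graph = Counter()
    for span in spans:
        caller = service_of.get(span["parent_id"])
        callee = span["service"]
        if caller is not None and caller != callee:
            graph[(caller, callee)] += 1
    return graph

# Hypothetical spans from a single trace.
SPANS = [
    {"span_id": "a", "parent_id": None, "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "service": "inventory"},
    {"span_id": "c", "parent_id": "a", "service": "payment"},
    {"span_id": "d", "parent_id": "c", "service": "payment"},  # internal span, no edge
    {"span_id": "e", "parent_id": "c", "service": "fraud-check"},
]

for (caller, callee), calls in sorted(service_edges(SPANS).items()):
    print(f"{caller} -> {callee}: {calls}")
```

Logs and metrics do not carry these parent and child links between calls, which is why the graph is hard to reconstruct from them.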
Jaeger helps the Ticketmaster team surface the right information quickly. UI features such as “Custom Tag Links” make searching easier. During an outage, engineers can locate a trace that displays tags such as “Product Code.” One click on a tag shows information about the service, such as who is on call for it and which Slack channel to use. This reduces time to resolution when issues arise.
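Conceptually, a tag link is a URL template filled in with a tag's value. The Python sketch below shows that idea only; it does not reproduce Jaeger's UI configuration format. The tag key `product.code`, the `wiki.example.com` URL and the `links_for` helper are hypothetical.

```python
from urllib.parse import quote

# Hypothetical mapping from a span tag key to a link template.
LINK_TEMPLATES = {
    "product.code": "https://wiki.example.com/services/{value}",
}

def links_for(tags):
    """Return {tag key: URL} for every tag that has a link template."""
    return {
        key: LINK_TEMPLATES[key].format(value=quote(str(value), safe=""))
        for key, value in tags.items()
        if key in LINK_TEMPLATES
    }

print(links_for({"product.code": "TM/42", "http.method": "GET"}))
```

The target page could then list the on-call contact and the chat channel for the service, so that information sits one click away from the trace.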
Onboarding, workflow and on-call processes for Ticketmaster teammates run more smoothly now, thanks to Jaeger. The tool is now an integral part of the engineer-onboarding process. It provides immediate access to the set of dependencies between systems and services, helping new hires understand the system architecture. Jaeger is becoming an ever-updating system map for the organization.
Ticketmaster now has over 50 services instrumented and nine groups of engineering teams actively using Jaeger. As a result, the company has significantly improved its on-call processes and outcomes. Satisfaction with on-call processes and workflow has shot up for Louis-Étienne Dorval and the whole team.
“In a distributed system, often there is a rather complicated chain of service-calls. With Jaeger, the on-call engineer is able to reach the possible root cause of the problem without needing to reach out to developers responsible for each step of the chain. For instance, if the problem occurred because of the 5th service in a hierarchy of service calls, Jaeger will visualize that, and the on-call engineer does not need to page anyone but the person responsible for that service.”
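The reasoning in that example can be sketched in Python: given the spans of one failed request, follow the errored spans down the call tree until no errored child remains, and report the service at that point. This is an illustration under simplifying assumptions (errors propagate up to the root, and only one branch fails); the span fields and service names are hypothetical.

```python
def first_failing_service(spans):
    """Return the service of the deepest errored span, or None if no errors."""
    children = {}
    for span in spans:
        children.setdefault(span["parent"], []).append(span)

    def walk(span):
        for child in children.get(span["id"], []):
            if child["error"]:
                return walk(child)   # the error came from further down
        return span["service"]       # no errored child: the error starts here

    roots = [span for span in children.get(None, []) if span["error"]]
    return walk(roots[0]) if roots else None

# Hypothetical chain of five services; the fifth one fails.
CHAIN = [
    {"id": 1, "parent": None, "service": "gateway", "error": True},
    {"id": 2, "parent": 1, "service": "cart", "error": True},
    {"id": 3, "parent": 2, "service": "checkout", "error": True},
    {"id": 4, "parent": 3, "service": "inventory", "error": False},
    {"id": 5, "parent": 3, "service": "payment", "error": True},
    {"id": 6, "parent": 5, "service": "bank-adapter", "error": True},
]

print(first_failing_service(CHAIN))
```

With the failing service identified, the on-call engineer only needs to contact the owner of that one service.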
Ticketmaster believes other enterprises with distributed systems could see similar benefits from Jaeger. The team recommends that folks considering Jaeger, and tracing in general, start small, measure progress and learn as they go. For Dorval, culture is key.
“Build the culture first. Start small, get people onboard, identify possible evangelists and move forward step-by-step, showing the results and using the lessons learned the best you can.”
Amador advises practitioners to embrace tracing as an indispensable tool for running software systems at scale. Tracing, logs and metrics form a potent monitoring-and-debugging trifecta for today’s complex distributed IT.
“Just as with logs and metrics, I have never heard of a good reason for not doing tracing!”