Weaveworks Combines Jaeger Tracing With Logs and Metrics for a Troubleshooting Swiss Army Knife
At a glance
- All cloud-native infrastructure running on Kubernetes
- Approximately seventy services corresponding to six-hundred containers
- Weave Cloud, an automation and management platform for development and DevOps teams
- Approximately twenty engineers
- Metrics provided only a vague, general overview of system; logging data increased storage costs, slowed programs
- Lacked a means to quickly surface individual requests
- Needed a more complete picture of system to fix issues like latency and slow queries
- Jaeger tracing
- Scalable, targeted tracing for quicker resolution of internal and customer-facing issues
- Ability to easily search for specific requests helps engineers zoom in on root causes
- Selective sampling cuts storage costs and prevents slowing of programs
- Tracing combined with metrics and logging provides a multidimensional view of system issues
Founded in 2014, Weaveworks Inc. makes software that helps developers and DevOps teams build, run and manage containerized applications on Kubernetes. Its products include GitOps-based cluster management, and application delivery, observability and monitoring solutions for services running on Kubernetes. It is a founding member of the Cloud Native Computing Foundation.
The company offers both free open-source and commercial software, as well as paid services, including: Weave Cloud, a platform as a service (PaaS) for deploying, monitoring and managing containers and microservices in cloud-native applications; Weave Scope, a real-time monitoring and visualization tool for distributed applications, available in open-source and hosted versions; Weave Cortex, an open-source Prometheus-based multitenant time-series database and monitoring system for applications and microservices; and the Weaveworks Kubernetes Platform for managing and automating clusters.
Weaveworks employs approximately 20 engineers who support around 70 different services deployed in over 600 containers. They must continually tune and troubleshoot its software system to retain monitoring customers spoilt for choice.
“Selling monitoring is quite tough; it’s a crowded market,” said Engineering Director at Weaveworks, Bryan Boreham. Monitoring-software users always seem to want more — more visibility, more information, more detailed answers. For example, in addition to a broad overview of their systems, customers may want a closer look at details like individual requests. If one provider can’t deliver what they desire, they may find another who can.
To better serve these customers, Weaveworks engineers sought a sharper lens for seeing small details. They needed to quickly surface specific information to help troubleshoot internal and customer-facing issues. They had a rather large collection of logging and metrics tools, but none that could readily zero in on particulars. The team finally found the versatile zoom tool they wished for in Jaeger for distributed tracing.
In the past, Weaveworks engineers often troubleshot system behavior through Prometheus metrics. However, many of those metrics were not particularly useful because their aggregate nature hides outliers and doesn’t allow engineers to rapidly zero in on specific problems within systems. “If you do a thousand really fast things and one really slow thing, then what’s the average? Even a histogram — it’s just not very interesting,” Boreham said.
Logging can also be practically useless in the event of an outage or customer-service issue. A logging system spewing out tons of identical statements does not help engineers quickly find a root cause. In the past, Weaveworks generated about 50 gigabytes of logs per day, most of which consisted of the same message from ten different components, according to Boreham. This is because log generation tends to be limited and hard to tune. “It’s almost infinitely likely that whoever wrote that bit of code did not log the one thing you need to know,” he said.
As an example of the need for greater visibility, Weave Scope shows diagrams of how components in distributed applications interact. It’s always been quite common for users to ask: “Can I see an individual request?”
Exposing such granular information from inside programs can be quite challenging.
Weave Scope features about 20 metrics, none of which provide visibility into individual requests. They offer only a broad, aggregated view of how applications are behaving.
Weaveworks engineers turned to tracing, hoping it would accomplish what logs and metrics could not. First, they instrumented their code with the OpenTracing API, and tried Tom Wilkie’s Loki, an opinionated reimplementation of OpenZipkin that used a pull-based model similar to that of Prometheus (when Wilkie joined GrafanaLabs, he redesigned Loki into a log-management solution). However, this method did not scale well, according to Boreham. This is partly because, like Prometheus, Loki was designed as a single-process architecture.
Then they found Jaeger, an OpenTracing-native distributed-tracing platform. Jaeger not only scales, it allows them to easily expose specific requests and speed up root-cause discovery. And its search feature enables them to ask their own questions pertaining to a unique case, not just the ones pre-programmed into a log generator. What is more, they can creatively combine Jaeger with logs, metrics and time-series data for enhanced views into system issues.
The engineering team easily set up Jaeger for basic instrumentation with just three or four lines of code, Boreham said; one additional line allowed for further inspection of SQL query execution. Weaveworks does not expose Jaeger tool directly to customers. Instead, engineers use it to troubleshoot internal as well as customer-facing problems. The company also leverages Jaeger to sharpen the monitoring capabilities of its offerings. For example, thanks to Jaeger, Weave Scope users can now actually view individual requests.
The versatility of Jaeger allows the engineers to not only isolate specific requests, but also examine patterns. These are not the mass aggregates and averages of their metrics tools. They allow them to view a manageably narrow set of spans within a trace to gauge how a service is performing.
Boreham mentioned these in a talk about optimizing queries with Jaeger and Prometheus at KubeCon + CloudNativeCon Europe 2018. He identified several interesting tracing patterns to pay attention to, including:
- The longest span, which is the best place to start looking for slow queries
- A gap between spans, indicating an additional span is needed to see what is happening within the gap
- A long diagonal group of spans, or “staircase,” which indicates serialized requests
- Many spans extending to exactly the same length, suggesting an artificial constraint like a timeout
- Lots of spans finishing at the same time, which suggests some sort of interlock
Jaeger can search through the past, react in the present and optimize for the future. After an outage or other incident, the team can perform forensics to discover the culprit; when the team receives an alert in production, they can use Jaeger to quickly drill down to the root cause; and in tasks of intensive optimization, they can proactively comb through data for possible points of improvement.
Engineers can tune Jaeger to investigate different types of issues. The remotely controlled sampling feature lets them choose how much data to collect from individual services; they configure sampling policies in a central location, and the Jaeger platform automatically distributes them to Jaeger tracers running inside the services. Multitenant-software providers like Weaveworks can turn sampling up or down on different customers’ data. This allows them to hit the sweet spot on different processes, user instances, etc., and render collected data in diagrams that convey useful information.
A Weave Scope user once complained that the software’s UI was slow. He also said that Weave Scope was chewing up resources on his system in the process of monitoring its data. Logging and metrics tools could not pinpoint the cause, so the engineers turned to Jaeger to trace the data in this customer’s system. At first, the traces returned about 30,000 lines on one operation, Boreham said. Those were too many to handle, so he tweaked the code so that Jaeger would receive less information in the traces.
“What I discovered, eventually, was that the system in question had 25,000 zombie processes on it,” Boreham said. Investigating further, he determined that the customer’s machine was iterating through a huge chunk of data. Boreham made three or four changes to solve the problem. He called the customer and explained the importance of periodically clearing zombie processes from their machine. The team also changed its hosted code to prevent this problem from recurring. Basically, information gathering is now capped at 2,000 processes per machine, since any in excess of that are likely zombie processes.
The patterns in tracing spans give engineers the “Goldilocks” view into systems — not too far away, not too close up. With Jaeger traces, Boreham’s team discovered the “staircase” pattern in Weave Cortex. They found that the querier was executing two 12-seconds-long requests, one after the other. Engineers made adjustments to the code calling into DynamoDB — the database Cortex uses at its storage layer — to parallelize these requests for greater speed.
Jaeger can also conserve storage space and spending. Constantly putting lots of data into logs can get expensive, and may slow down programs. Jaeger sampling captures an adequate amount of data without overloading storage with useless, repetitive information. Sampling and a 14-day retention policy for trace data have cut Weaveworks’ storage costs. The team can always turn the sampling rate up or down if need be. And no one on the team has ever complained about Jaeger slowing down a program, Boreham said.
Powerful Jaeger combos
The team often learns much better information through tracing than they do with their older tools; however, they have not thrown them out. They often combine them with tracing for a more complete picture of a system issue. For example, they still use standard text file logs, but now they put trace ids into logs in Cortex, and sometimes link the two together.
The team often examines tracing data and Prometheus time-series data side by side. For instance, while investigating latency, the team surfaced a Jaeger trace showing a slow call to DynamoDB; then Prometheus time-series data showed an ongoing load on that table. They fixed the problem by tweaking the provisioning of DynamoDB.
Boreham looks forward to the introduction of more advanced sampling features in Jaeger. For one, he would like to sample based on whether a trace is “interesting” or not, which, he admits, sounds a bit like magic. However, the tail-based sampling feature on Jaeger’s road map could deliver something quite close. While head-based sampling makes the sampling decision at the start of a trace, tail-based sampling collects all information — in nonpersistent storage for efficiency — and then decides if it’s worth sampling. If the trace is deemed “interesting,” it is sampled and stored. In this way, traces that are likely to coincide with system issues — those showing anomalies, unexpected error codes, etc. — are captured.
Other features Weaveworks would like to see introduced include:
- A means to intelligently search within trace URLs containing many lines of detail. This may become possible through Jupyter Notebooks; Jaeger wants to make traces work with this data-mining tool for user-driven custom searches and feature extraction (see issue #1639 for more information).
- A “cookbook” of useful ways to work with trace data in AWS Elasticsearch on which Weaveworks runs its Jaeger implementation (such as the feature extraction technique described in Chapter 12 of Yuri Shkuro’s book Mastering Distributed Tracing).
Jaeger provides a versatile means to discover the root causes of system issues. Targeted search allows it to display an individual request, so even small details can’t hide from it. And trace patterns and diagrams allow users to visualize data from different angles and in variable amounts. This is a big improvement over metrics that provide only a rough outline of complex systems and services. And unlike traditional logging, Jaeger allows users to collect data selectively and economically.
“It’s just a much better way to work. I can’t imagine working on build-and-run distributed systems without some way to figure out what’s going on after the fact. It’s just much better technology — doing that in the tracing style than in the old fashion of logging text files,” Boreham said.
Engineers can also use Jaeger in conjunction with logging, metrics and other tools. This enables them to come up with creative visibility and troubleshooting recipes on the fly. The variety of things that could break in complex systems meet their match in the variety of ways tracing can help fix them.
As Boreham concluded:
“If you don’t have a microscope, you can’t see bacteria; and if you don’t have a telescope, you can’t see Jupiter. Jaeger is like that — both the big picture and the fine-grained details at the same time.”