At a glance
Grafana Labs’ software system:
- Over 150,000 active installations
- Close to 30 engineers
- Hundreds of concurrent requests per second
- Running Jaeger for query optimization
Challenges:
- Lacked a means to visualize/follow the path of specific requests end-to-end to solve technical issues
- Needed to shorten time to resolution of customer-facing issues
Solution:
- Jaeger tracing
Results:
- Guesswork replaced by targeted troubleshooting approaches
- Up to 10x better query performance
- Happier customers
Grafana Labs works every day to break traditional data boundaries with metric-visualization tools accessible across entire organizations. It began as a pure open-source project and has since expanded into supported subscription services. The Grafana open-source project is a platform for monitoring and analyzing time series data. Subscription offerings include the supported Grafana Enterprise version. Grafana Labs’ engineers service more than 150,000 active installations. Users include companies such as PayPal, eBay and Booking.com.
In 2017, Grafana Labs launched Grafana Cloud, a fully managed OpenSaaS metrics platform. The Grafana platform itself isn’t the only component that makes up Grafana Cloud. The mix includes Grafana Labs’ own Metrictank, a Graphite-compatible metrics service available in both an open-source and a hosted version, and Cortex, an open-source, managed version of Prometheus to which Grafana Cloud contributes. These tools are integrated in subscribers’ Grafana Cloud instances. Grafana Cloud engineers also use them to troubleshoot their own and individual customers’ technical issues.
All these components of Grafana Cloud make up a vast, varied and, at times, vexing software system. Its engineering teams’ existing tools did plenty of heavy lifting observing and monitoring the system. For example, both Cortex and Metrictank process hundreds of simultaneous service requests per second.
The teams at Grafana Cloud weren’t lacking logging, metrics and observability tools. But finding solutions to problems often required time-consuming guesswork and trial and error, because they could not visualize specific requests. They needed something to help them deep dive into and quickly troubleshoot highly concurrent requests.
Request latency was sometimes higher than desired at Grafana. High concurrency and opaque request paths made query optimization challenging. Cortex and Metrictank process so many requests per second that a roadblock anywhere could choke the software system. In Cortex, too many requests to the memory cache resulted in a high rate of cache misses. At times, concurrent requests to the BigTable NoSQL data store soared to 20,000, leading to latency of up to ten seconds.
The high latency put Grafana Cloud end users’ experience at risk. When Grafana’s product performance directly affects an end user’s implementation, the situation is particularly urgent. The Grafana Cloud teams needed to address customer concerns quickly or risk cancelled subscriptions.
In the past, the teams turned to their metrics and observability tools to investigate alerts, debug and solve end-user issues. The weakness of these otherwise useful tools is that they provide a rather broad view of systems. They could lead Grafana Cloud teams in the right direction but leave them guessing at the final answer. They desired a direct route to final answers, a way to drill down to specific requests and fix issues like slow queries quickly.
The teams at Grafana Cloud had heard that tracing could eliminate guesswork and swiftly uncover the causes of slow queries and other issues. They desired tracing technology that could work across their entire software system. They browsed a number of competing options, including Zipkin, and determined that Jaeger was the best distributed-tracing tool for their needs.
Jaeger allows users to search for queries by duration, quickly pick out the slowest ones and begin investigating. This helped the teams single out poor performers with less guessing. They were also impressed by features that combine tracing and logging to expedite debugging. For example, the ability to place Jaeger trace IDs into system logs sped up troubleshooting.
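As a rough illustration of placing trace IDs into logs, the sketch below stamps every log line with the ID of the request being handled, so a log entry can be matched to its trace. It uses only the Python standard library and does not call Jaeger; the logger name `query-service`, the `current_trace_id` variable and the sample ID `abc123` are hypothetical.

```python
import io
import logging
from contextvars import ContextVar

# Hypothetical holder for the trace ID of the request currently being handled.
current_trace_id = ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    """Create a logger whose lines carry a trace_id field."""
    logger = logging.getLogger("query-service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id):
    """Set the trace ID for the duration of one request, then restore it."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("query finished")
    finally:
        current_trace_id.reset(token)


buffer = io.StringIO()
log = build_logger(buffer)
handle_request(log, "abc123")
output = buffer.getvalue()
```

With a real tracer, the ID would come from the active span rather than being passed in by hand.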
Another feature they liked was Jaeger’s contextual logging, which records logs in tracing spans. It allows users to conveniently view all logs for a single request in a Jaeger trace. This stood out as a means to quickly visualize all slow queries in their system. They can also use it to generate reports showing slow queries for individual customers.
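The idea of recording logs inside spans can be sketched with a simplified data model. The `Span` and `Trace` classes below are hypothetical stand-ins, not Jaeger's client API; they only show how attaching log entries to spans makes it possible to read every log for one request in a single place.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified stand-in for a tracing span."""

    operation: str
    logs: list = field(default_factory=list)

    def log(self, message):
        # Each entry is stored with the time at which it was recorded.
        self.logs.append((time.time(), message))


@dataclass
class Trace:
    """Hypothetical container for all spans of one request."""

    trace_id: str
    spans: list = field(default_factory=list)

    def start_span(self, operation):
        span = Span(operation)
        self.spans.append(span)
        return span

    def all_logs(self):
        """Return every log message in the trace, tagged with its span."""
        return [(s.operation, msg) for s in self.spans for _, msg in s.logs]


trace = Trace("abc123")
fetch = trace.start_span("fetch-chunks")
fetch.log("cache miss")
fetch.log("read from store")
merge = trace.start_span("merge-series")
merge.log("merged 3 series")
request_logs = trace.all_logs()
```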
Implementing Jaeger provided a quicker, more direct route to system problems and solutions. Together with their logging and metrics tools, it rounds out a comprehensive troubleshooting tool kit.
Goutham Veeramachaneni, a member of Grafana Cloud’s Cortex team, said that Jaeger removed the guesswork from their DevOps. Once the Cortex team instruments a service for Jaeger tracing, they can quickly identify poorly performing queries. The logs in Cortex print trace IDs to make troubleshooting easier. They can then pinpoint root causes with Jaeger tracing views.
“Before Jaeger, we were doing mostly guesswork. After having it in place, we’ve been relying on Jaeger to determine what’s slow or showing performance degradations on our system, and we were able to perform a bunch of improvements since then.”
Jaeger has helped the Grafana Cloud teams focus their efforts. Using the Jaeger UI to identify requests above a certain time threshold makes the worst offenders quickly apparent.
“If Cortex is having slow queries, Prometheus will send us an alert, and then I go to Jaeger, and I say, ‘minimum duration of two seconds,’ and then I click ‘Search.’ It gives me all the trace results. I go to the [longest] one, then I try to figure out why it is slow.”
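The workflow in the quote amounts to filtering traces by a minimum duration and starting with the longest. The sketch below applies that logic to an in-memory list; the `TraceSummary` type and the sample durations are hypothetical, and in practice the Jaeger UI performs this search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSummary:
    """Hypothetical record of one trace and its total duration."""

    trace_id: str
    duration_seconds: float


def slow_traces(traces, min_duration_seconds):
    """Return traces at or above the threshold, slowest first."""
    matches = [t for t in traces if t.duration_seconds >= min_duration_seconds]
    return sorted(matches, key=lambda t: t.duration_seconds, reverse=True)


traces = [
    TraceSummary("t1", 0.4),
    TraceSummary("t2", 2.5),
    TraceSummary("t3", 9.8),
    TraceSummary("t4", 1.9),
]
results = slow_traces(traces, 2.0)  # "minimum duration of two seconds"
slowest = results[0]
```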
As a result, the Cortex team went on a “query optimization spree,” Veeramachaneni says.
“You can take a look on our repo to see that the activity in the last two or three months is super high, and most of those are query optimizations.”
With this superior repair kit in hand, the Cortex team has seen query performance jump 5x to 10x in some instances. For example, they traced memcache misses to a timeout configuration. Simply increasing the timeout, and subsequently the hit rate, solved the problem. Tracing also helped them find the speed bumps on the path of concurrent BigTable requests. They successfully cut down latency by optimizing reads and writes to the database.
Of course, tracing may not reveal the cause and solution to every problem that ever occurs in an entire system. But even when it can’t, that itself can provide a useful clue, according to Dieter Plaetinck, Principal Engineer of the Metrictank team at Grafana Cloud. Dieter’s team benefited from Jaeger’s ability to quickly eliminate possible causes in the past. When his team could not discover the cause of a slow query with Jaeger, they knew the culprit was probably deeper in the system.
“Often, there are more low-level problems going on with our Go runtime being stuck somewhere that is not visible in the trace.”
In this way, Jaeger can help initiate productive conversations and new investigations within engineering teams.
Expansion within Grafana
The Metrictank team has used Jaeger in the past with some success. They particularly liked its contextual logging capability that puts logs into spans. However, the team experienced issues with tracing data overloading the collector.
The team plans to redeploy Jaeger in the near future. They are currently looking into Jaeger’s Remotely Controlled Sampling feature to solve their overloading problem. Sampling allows users to select how many traces Jaeger collects from a particular service.
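To make the sampling idea concrete, the sketch below keeps a fixed fraction of traces per service by comparing a random 64-bit trace ID against a threshold. It is an illustration under stated assumptions, not Jaeger's implementation: the service names, the rates in `SAMPLING_RATES` and the default rate are all hypothetical, standing in for values a central configuration might supply.

```python
import random

MAX_TRACE_ID = 2**64  # assumes trace IDs are uniformly random 64-bit integers

# Hypothetical per-service sampling rates, as a central config might supply.
SAMPLING_RATES = {"metrictank": 0.1, "cortex": 0.5}
DEFAULT_RATE = 0.001


def should_sample(service, trace_id):
    """Keep a trace when its ID falls below the service's rate threshold."""
    rate = SAMPLING_RATES.get(service, DEFAULT_RATE)
    return trace_id < rate * MAX_TRACE_ID


rng = random.Random(42)  # seeded so the run is repeatable
trace_ids = [rng.getrandbits(64) for _ in range(10_000)]
kept = sum(should_sample("metrictank", t) for t in trace_ids)
```

Because the decision depends only on the trace ID, every service that sees the same ID and rate makes the same keep-or-drop choice. Lowering a service's rate reduces the volume of tracing data sent to the collector.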
Plaetinck is also looking forward to two new Jaeger features in the pipeline. One is the use of gRPC for Jaeger-internal communications, which will improve load balancing and further help prevent overloading the collector. The other will produce latency histograms, which are visual representations of query performance across a whole system. Users can instantly visualize the slowest queries without having to search for them by duration.
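A latency histogram groups request durations into buckets so that slow outliers stand out without a search. The sketch below shows only the bucketing idea; the bucket bounds and sample durations are hypothetical and do not reflect how Jaeger computes or renders its histograms.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds, in seconds.
BUCKET_BOUNDS = [0.1, 0.5, 1.0, 2.0, 5.0]


def bucket_label(duration_seconds):
    """Name the first bucket whose upper bound covers the duration."""
    index = bisect_left(BUCKET_BOUNDS, duration_seconds)
    if index == len(BUCKET_BOUNDS):
        return f"> {BUCKET_BOUNDS[-1]}s"
    return f"<= {BUCKET_BOUNDS[index]}s"


def latency_histogram(durations):
    """Count how many durations fall into each bucket."""
    return Counter(bucket_label(d) for d in durations)


durations = [0.05, 0.3, 0.4, 1.5, 7.2, 9.9]
histogram = latency_histogram(durations)
```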
With histograms and existing features like Apache Flink streaming jobs for real-time service graphs, Jaeger is shifting from search towards data aggregates. These further reduce guesswork, allowing engineers to effortlessly surface queries in need of tuning without searching or estimating.
The engineering team that works on hosted Grafana also plans to implement Jaeger in the future.
Subscription software providers must resolve issues quickly to satisfy customers. Tools like Jaeger that enable them to swiftly improve customer experience are as valuable to them as their code. They offer one of the best ways for SaaS companies to secure customer loyalty — improving quality of service. Implementing Jaeger has made Grafana Cloud engineers’ lives easier and improved service in ways users are willing to pay for month after month. Put simply, Jaeger has resulted in “happier customers,” according to Veeramachaneni. That is the ultimate measure of any troubleshooting tool for service providers.
Jaeger resolves issues in more than one way. It both answers questions and raises them. Engineering teams with complex systems will ask the right questions with Jaeger to guide them. It can lead them to discover and fix issues they had no idea about. This is why Veeramachaneni’s words of advice to anyone considering Jaeger are: “Instrument first, ask questions later.”