Weave trimmed troubleshooting fat, cut API response time from seconds to milliseconds with Jaeger
At a glance
- 250 microservices running in 1,700 containers on Kubernetes
- Numerous software services written in Go
- 101-person Product & Development team including 72 engineers
- Troubleshooting complex microservices-based system required a number of point solutions, tedious processes
- Lacked the manpower to properly operate log-based tracing system built in house
- Tracing products on the market were too costly, labor-intensive
- Needed a single easy-to-use tracing tool for root-cause analysis, pinpointing timing-related issues, etc.
- Jaeger tracing
- Single, easy-to-wield tracing tool provided direct route to causes of diverse system problems
- Ability to quickly pinpoint root causes simplified and shortened troubleshooting processes
- Logs, request timing grouped with traces in UI conveniently provided fuller picture of system issues
- Improved API response time from several seconds to tens of milliseconds
- Provided detailed observability across 250 microservices to complement the statistical metrics provided by Prometheus
Weave is a VoIP phone system coupled with desktop software and a mobile app for small and medium business owners to manage customer relationships and streamline business processes. Used in thousands of businesses across the United States, Weave’s solutions enable personalized interactions between business owners and their customers and potential customers. The software is named for its ability to weave together customer data with customer interactions to make integrated, up-to-the-minute information available. Weave is a complete toolset that combines phone, two-way text messaging, email, fax, payments and over 35 other features to simplify office life.
When Weave decided to break its architecture into microservices, the result was both a boon and a bane to its system-maintenance team. They suddenly had to manage 150 microservices running in 500 containers on Kubernetes. The benefits they gained were greater flexibility and agility. For instance, engineers and developers could single out smaller application components for rapid, targeted tests and upgrades. However, all those microservices ultimately added up to a very large number of moving parts that might malfunction or require tuning at any time.
The team realized that their new system required a new approach to analyzing issues and debugging. Would they be forced to purchase and learn to use a number of additional monitoring and troubleshooting tools? Would the company have to hire more engineers and support agents to solve internal and customer-facing problems?
Problem: Finding the right tool for the job
Like many IT departments today, they had plenty of point solutions that provided some piece to the larger system-maintenance puzzle. With complex microservices-based architecture now on their hands, the last thing they wanted to do was add more point solutions; nor did they look forward to finding and training new engineers to manage them. They desired a single tool powerful and versatile enough to troubleshoot their entire system. Ideally, it would not be a burden to implement, or require extra manpower to operate. It was a tall order, but they believed that tracing might deliver what they were seeking.
The Weave team set out to design their own tracing system based on logs. Unfortunately, what they were able to produce was not the lean, low-maintenance aid they desired; it was complicated, operationally demanding and ineffective, according to Weave Chief Architect, Jason Newman.
“We’ve never had the manpower or the people power to actually make it work the way it needed to. We were not able to have true observability on our services, or the real ability to analyze timing-related issues, or to leverage root-cause analysis on top of real issues,” Newman said.
The tracing products on the market were too “resource intensive” in terms of cost and labor for Weave at that time, he added.
Solution: Jaeger tracing
Despite their initial failure, the Weave team did not give up on tracing; they still believed it was key to troubleshooting microservices-based systems. They hoped their ideal tracing tool would someday materialize in the real world. Then, in July 2017, they found it: Jaeger, a distributed tracing tool that was easy to wield, broadly applicable, and didn’t gobble precious resources.
“When Jaeger [by Uber] was made open-source, we jumped immediately in because it was what we had wanted to build, and someone did it for us,” Newman stated.
One major advantage was that Jaeger was written in Go. This made it a natural fit for Weave, most of whose services were also written in Go. This spared the team from having to master a new programming language and sped their progress through implementation into everyday use. “That Jaeger was in Go made it much easier for us to deploy, manage and actually look at the code and know what was going on,” Newman said.
They began instrumenting services simply by adding OpenTracing middleware into libraries shared by HTTP, gRPC and message-queue consumers and producers. Over time, as applications were redeployed, they automatically picked up these new, modified versions of the libraries. So far, the company has instrumented 90% of its microservices for tracing. This represents just about everything in its system that can be instrumented, Newman said. It collects 2.2 million spans per day and retains them for a week before discarding to keep storage costs low.
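The shared-library approach described above can be sketched in Go. This is a minimal illustration using only the standard library, not Weave's code and not the OpenTracing API: the Span and Recorder types, the X-Trace-Id header name and the runDemo helper are all hypothetical stand-ins for what a real tracing library would provide. The point it shows is that a handler wrapped by middleware in a shared package produces a span without any change to the handler itself.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// Span is a hypothetical, minimal stand-in for a tracing span.
// A real deployment would use the span type of its tracing library.
type Span struct {
	TraceID   string
	Operation string
	Duration  time.Duration
}

// Recorder collects finished spans in memory. It stands in for a
// reporter that would ship spans to a tracing backend.
type Recorder struct {
	mu    sync.Mutex
	spans []Span
}

// Record stores one finished span.
func (r *Recorder) Record(s Span) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.spans = append(r.spans, s)
}

// Spans returns a copy of the recorded spans.
func (r *Recorder) Spans() []Span {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]Span(nil), r.spans...)
}

// traceHeader is an assumed header name for carrying the trace ID.
const traceHeader = "X-Trace-Id"

// TraceMiddleware wraps any http.Handler so that every request yields a span.
// It reuses an incoming trace ID when present and otherwise asks newID for one.
func TraceMiddleware(rec *Recorder, newID func() string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		traceID := r.Header.Get(traceHeader)
		if traceID == "" {
			traceID = newID()
		}
		w.Header().Set(traceHeader, traceID)

		start := time.Now()
		next.ServeHTTP(w, r)
		rec.Record(Span{
			TraceID:   traceID,
			Operation: r.Method + " " + r.URL.Path,
			Duration:  time.Since(start),
		})
	})
}

// runDemo sends one in-process request through the middleware and
// returns the spans that were recorded.
func runDemo() []Span {
	rec := &Recorder{}
	handler := TraceMiddleware(rec, func() string { return "generated-id" },
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, "ok") // the business handler knows nothing about tracing
		}))

	req := httptest.NewRequest(http.MethodGet, "/contacts", nil)
	req.Header.Set(traceHeader, "abc123")
	handler.ServeHTTP(httptest.NewRecorder(), req)
	return rec.Spans()
}

func main() {
	for _, s := range runDemo() {
		fmt.Printf("trace=%s op=%q\n", s.TraceID, s.Operation)
	}
}
```

Because the wrapping happens in one shared place, services would pick up tracing simply by being rebuilt against the updated library, which matches the rollout the team describes.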
Thanks to Jaeger’s simple user interface and keyboard shortcuts, Weave has been able to democratize tracing throughout the company. Developers, engineers and even support agents use Jaeger to make their jobs easier, improve Weave’s software and quickly resolve customer complaints.
Benefit #1: Trimmed troubleshooting fat
Jaeger added observability and targeted debugging capabilities to Weave’s troubleshooting arsenal. Equally important are the things it subtracted; Jaeger simplified root-cause discovery so that some tasks and tools are no longer necessary. It reduced the time and toil needed to resolve issues, resulting in a less stressed staff and more satisfied end users.
The Weave team has found that system measurements aren’t needed as much since implementing Jaeger, Newman said. Jaeger alone is usually sufficient for all of their application-observability needs. Team members overwhelmingly prefer it over Prometheus metrics. This is partly because the UI in Jaeger is much simpler than the one in Prometheus, Newman said.
Jaeger simplified and shortened the process of log discovery within Weave. Before Jaeger, team members had to filter through three or four systems to find the logs they sought. When they attached their logs to tracing spans via OpenTracing APIs, the logs were suddenly there at their fingertips. Now, in general, when there are logs, they are conveniently grouped with a specific trace right there in the Jaeger UI, no scavenger hunt required.
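The idea of grouping logs with the trace they belong to can be illustrated with a small Go sketch. Everything here is hypothetical and standard-library only: the Span, LogEntry and Trace types and the buildExample helper are invented for illustration and are not the OpenTracing or Jaeger data model. The sketch only shows why attaching a log message to a span makes it retrievable by trace rather than by searching several log systems.

```go
package main

import (
	"fmt"
	"time"
)

// LogEntry is a hypothetical timestamped log record attached to a span.
type LogEntry struct {
	At      time.Time
	Message string
}

// Span is a hypothetical stand-in for a tracing span that carries its own logs.
type Span struct {
	Operation string
	Logs      []LogEntry
}

// Log appends a message to the span instead of writing it to a separate log system.
func (s *Span) Log(msg string) {
	s.Logs = append(s.Logs, LogEntry{At: time.Now(), Message: msg})
}

// Trace groups the spans that share one trace ID.
type Trace struct {
	ID    string
	Spans []*Span
}

// Messages returns every log message in the trace, in span order,
// each prefixed with the operation that produced it.
func (t *Trace) Messages() []string {
	var out []string
	for _, s := range t.Spans {
		for _, l := range s.Logs {
			out = append(out, s.Operation+": "+l.Message)
		}
	}
	return out
}

// buildExample assembles a two-span trace with logs attached to each span.
func buildExample() *Trace {
	api := &Span{Operation: "GET /contacts"}
	api.Log("request received")

	db := &Span{Operation: "db.query"}
	db.Log("query returned no rows")

	api.Log("responding with empty list")
	return &Trace{ID: "abc123", Spans: []*Span{api, db}}
}

func main() {
	t := buildExample()
	fmt.Println("trace", t.ID)
	for _, m := range t.Messages() {
		fmt.Println("  " + m)
	}
}
```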
Through traces which show log messages and database requests, the team often gleans sufficient information to solve system problems, Newman said. The details of request timing, for example, are now easily viewed in the trace, whereas in the past, they were difficult to dig up. Tracing can spare the team an intensive rummage all the way down to the Go pprof level of application inspection. “They’re much happier,” Newman said.
Benefit #2: Direct path to root causes
Several times, tracing has led the team straight to root causes that nobody suspected, Newman said. This helped them speed up issue resolution and optimize query performance in some cases. For example, it revealed that some microservices were sending too many messages to each other, draining resources and adding to latency. Also, sequential requests were found to be slowing down some services. One particular service was making 30 requests, one after another; simply parallelizing them improved service performance immediately, Newman said.
In another case, parallelizing metadata collection dramatically improved API response time. While investigating an API endpoint that was taking several seconds to respond, instead of digging through code or logs, the team quickly identified the root issue with a Jaeger span visualization. The endpoint in question was a list which was attaching metadata to each of its items, one by one. Again, parallelizing all of those requests dramatically cut the time needed to complete the task. “It [went] from several-seconds response times to tens of milliseconds to return the list,” Newman said.
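The kind of change described in the two cases above, turning one-by-one requests into concurrent ones, might look like the following Go sketch. It is an assumption-laden illustration, not Weave's code: the Item type, the fetchMetadata function (which only simulates a 20-millisecond remote call) and the item count of 30 are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Item is a hypothetical list entry whose Metadata comes from a separate lookup.
type Item struct {
	ID       int
	Metadata string
}

// fetchMetadata simulates a remote call with a fixed latency (hypothetical).
func fetchMetadata(id int) string {
	time.Sleep(20 * time.Millisecond)
	return fmt.Sprintf("meta-%d", id)
}

// attachSequential looks up metadata one item at a time,
// so total latency grows with the length of the list.
func attachSequential(items []Item) {
	for i := range items {
		items[i].Metadata = fetchMetadata(items[i].ID)
	}
}

// attachParallel starts one goroutine per item and waits for all of them.
// Each goroutine writes only to its own slice element, so no lock is needed.
func attachParallel(items []Item) {
	var wg sync.WaitGroup
	for i := range items {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			items[i].Metadata = fetchMetadata(items[i].ID)
		}(i)
	}
	wg.Wait()
}

// newItems builds n items with IDs 1..n and empty metadata.
func newItems(n int) []Item {
	items := make([]Item, n)
	for i := range items {
		items[i].ID = i + 1
	}
	return items
}

func main() {
	seq := newItems(30)
	start := time.Now()
	attachSequential(seq)
	fmt.Println("sequential:", time.Since(start).Round(time.Millisecond))

	par := newItems(30)
	start = time.Now()
	attachParallel(par)
	fmt.Println("parallel:  ", time.Since(start).Round(time.Millisecond))
}
```

With the simulated latency, the sequential version should take roughly 30 times as long as a single lookup, while the parallel version should take about as long as one lookup plus scheduling overhead; exact timings vary by machine. A production version would likely cap the number of concurrent requests and propagate errors, which this sketch omits.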
Benefit #3: Easy buttons empower all
Jaeger has slimmed down processes not just for the troubleshooting team, but also for developers and customer-support agents. The company added keyboard shortcuts to all of its internal tools, so those with minimal knowledge of Jaeger’s technology can still benefit from it. In fact, the tool’s UI is easy enough for support agents to navigate through if need be.
Developers can go to the Jaeger UI and enter a trace ID to see a visualization of spans within a trace, attached logs, and timing. When working on a bug or new feature in an application, they may notice a suspicious trace and make a note of it. They can return to it later and potentially nip issues in the bud before they cause major problems.
Support agents can tap keyboard shortcuts to quickly pull up trace IDs connected to specific user issues. They can include the trace IDs in support tickets and then send them directly to engineers. Streamlining the technical-support process in this way can expedite problem solving and help retain customers.
Jaeger allows team members to easily perform a quick checkup on services in production. “With a new service or whatever, it’s nice to be able to pull up a trace and see that other services are getting the data we expect them to, and that they’re returning the responses that we expect as well,” Newman said.
Weave’s Jaeger wishlist
Jaeger’s query interface allows users to narrow down searches in a number of ways, including by application or endpoint. This can be a great way to pinpoint root causes, but sometimes the Weave team simply doesn’t know which application to search within. Newman and his team would like to see broader search parameters in the future. For example, they’d like to tune queries to search across all services for certain defined trace types. This would be highly beneficial to Weave’s developers, he said.
Luckily for Weave, Jaeger does have a feature for searching across services on its roadmap. For the moment, to meet its need for additional search capabilities, Weave is considering switching from Apache Cassandra to Elasticsearch, Newman added.
These concerns aside, the Weave team rarely comes to Newman with complaints about Jaeger. So far, it is “exceeding everyone’s expectations for how little effort it has taken to run it,” he said.
When Weave broke its monolithic software system into a large number of microservices and containers, Jaeger arrived on time, economical and versatile. Sparing the company a software shopping spree, this single tool provided a powerful means to troubleshoot its complex IT environment. In fact, it helped Weave declutter its existing problem-solving toolkit. Fewer tools and lighter, streamlined processes help employees finish tasks in less time, and ensure a smooth-running product for end users.
Jaeger’s simple UI and targeted search features help engineers quickly zoom in on root causes. This facilitates the resolution of individual issues much better than metrics tools that provide a broad, aggregate picture of systems.
However, it cannot lead engineers to problems in applications and services that have not been instrumented. This is why companies that want to get the most from Jaeger, to derive from it a direct line to system issues and make it Plan A in their troubleshooting toolkit, should not adopt it halfheartedly, according to Newman. They should plan to instrument everything in their system that can be instrumented, he said.
Offering a final word of advice to those considering Jaeger, or any tracing tool, Newman said:
“Jaeger is a great product that provides insights and observability into our backend systems. For the people considering whether to use it or not, or tracing in general, either you do it all in on tracing, or you don’t. There’s some groundwork that you need to do, and if you leave gaps in your applications, it turns out to be less useful.”