Twilio Engineer Shares How They Achieve Five 9s of Availability

LightstepHQ

May 17, 2018 | Dennis Chu

In our recent tech talk on SD Times — Managing the Performance of Applications in the Microservices Era — Tyler Wells, Director of Engineering at Twilio, shared his insights on how to effectively manage the performance of microservices-based applications and how they achieve five 9s of availability and success.

Tyler said that integrating new tools and solutions into a developer’s workflow can be a challenge for any organization: there needs to be a big carrot. For Twilio, the carrot was a 92% reduction in mean time to resolution (MTTR) for production incidents and a 70% improvement in mean latency for critical services. Now, they can also detect failures before they impact customers. This article shows how they accomplished these results and how other organizations can do the same.

How Twilio Integrated [x]PM into its Engineering Process and Workflow

Tyler described why his team was motivated to try [x]PM and how it fit into their workflow. “Twilio was born and raised in the cloud and has always been built on distributed microservices. My team was an early adopter of LightStep. We were excited about the opportunity to instrument and add tracing to the complex distributed systems we have in the Programmable Video group. You can imagine that setting up a video call involves a lot of steps, and there are a lot of systems. The orchestration messages have to pass through: authorization, authentication, creating the Room [session], orchestrating the Room, adding Participants to the Room. These are all distributed systems, so we added tracing, including Tags and rich information specific to our business, and we started watching. We watched the p99 latency, and we started honing in on the outliers. As we highlighted these outliers, we pulled the information we needed to help identify one of these Rooms using [the Room’s] Sid or GUIDs. We used those IDs to look through [LightStep] and figure out, from the highlighted spans showing the latency, exactly what was going on. That was our first experience with LightStep and how we started to derive value.”

Monitor latency, alert on SLA violations, and focus on the outliers to quickly determine root cause
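
To make the "watch the p99, then focus on the outliers" workflow concrete, here is a minimal sketch in plain Python. It is not LightStep's API or Twilio's code: the `Span` class, the `room.sid` tag name, and the sample data are all assumptions made for illustration. It computes a nearest-rank p99 over span durations and then uses the business tag on each slow span to recover the Room it belongs to.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a finished trace span (not a real tracer type)."""
    operation: str
    duration_ms: float
    tags: dict = field(default_factory=dict)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def outliers_above_p99(spans):
    """Return the p99 duration and the spans that are slower than it."""
    threshold = percentile([s.duration_ms for s in spans], 99)
    slow = [s for s in spans if s.duration_ms > threshold]
    return threshold, slow


# Assumed sample data: 199 ordinary spans plus one very slow Room creation.
spans = [
    Span("create_room", 20.0 + i % 10, {"room.sid": f"RM{i:04d}"})
    for i in range(199)
]
spans.append(Span("create_room", 950.0, {"room.sid": "RM-slow"}))

threshold, slow = outliers_above_p99(spans)
# The tag ties each slow span back to a specific Room for further digging.
slow_room_sids = [s.tags["room.sid"] for s in slow]
```

In a real tracing system the percentile and the filtering happen in the backend; the point of the sketch is only that a business-specific tag on each span is what turns "something is slow" into "this Room was slow".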

How Chaos Actually Helps

Tyler talked about the benefits of always assuming that things will break. “We like to break our systems before we put them into the hands of our customers, so we do a lot of Chaos Engineering. We use a tool like Gremlin to start breaking things. LightStep makes it easy for us to be able to hone in on what happens when things go wrong. We know when you’re operating in the cloud, everything is going to break at some point in time. Using LightStep in conjunction with our ‘Game Days,’ we got a ton of visualization, so we could create the SLA alerts, which we have integrated into PagerDuty and Slack. If incidents are triggered, our team immediately shows up in a Slack channel and all of the rich LightStep information is there for us to help identify issues.”
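
The SLA alerts Tyler mentions can be pictured with a small sketch. This is an assumption-laden illustration, not how LightStep, PagerDuty, or Slack implement alerting: the function name, the thresholds, and the `notify` callback (which stands in for whatever delivers the message to a pager or chat channel) are all hypothetical.

```python
def check_latency_sla(latencies_ms, limit_ms, max_violation_ratio, notify):
    """Call notify(message) when too large a share of requests exceeds limit_ms.

    Returns True if an alert was sent, False otherwise.
    """
    latencies = list(latencies_ms)
    if not latencies:
        return False  # nothing observed in this window, nothing to alert on
    violations = sum(1 for value in latencies if value > limit_ms)
    ratio = violations / len(latencies)
    if ratio > max_violation_ratio:
        notify(
            f"SLA violation: {violations}/{len(latencies)} "
            f"requests over {limit_ms} ms"
        )
        return True
    return False


# Assumed example window, e.g. collected while a Game Day experiment is running.
sent = []  # stands in for a pager or chat integration
fired = check_latency_sla(
    [120, 95, 2400, 110, 3100],
    limit_ms=500,
    max_violation_ratio=0.1,
    notify=sent.append,
)
```

Running a check like this during a Game Day, while failures are being injected on purpose, is one way to confirm that the alert actually fires before a real incident depends on it.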

Achieving Five 9s of Availability and Success

Tyler explained how they achieve operational excellence. “We have a program at Twilio called Operational Maturity Model (OMM). It’s a program all teams must follow when pushing product into production. The program has a number of different dimensions: LightStep sits in the Operations dimension. We have a specific policy in the Operations dimension that’s literally called LightStep. There are a number of items in every dimension that teams need to check off to reach a specific grade, with the highest grade being Iron Man. In order for any team to go into production and claim general availability, they have to implement LightStep, use LightStep as part of their Game Days, and they have to achieve Iron Man status. That’s how we use it at Twilio.”

Tyler summarized Twilio’s focus on operational excellence to build customer confidence: “We typically target five 9s [99.999%] of availability and five 9s of success. Generally speaking, 5 9s is discipline, not luck.”
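
To put five 9s in perspective, the arithmetic below converts an availability or success target into an error budget. It is a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes a 365-day year and that every second of downtime counts equally.

```python
def allowed_downtime_seconds(nines, period_days=365):
    """Downtime budget for an availability target written as a count of nines.

    nines=5 means 99.999% availability, i.e. an unavailability of 10**-5.
    """
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * 10 ** -nines


def allowed_failures(nines, total_requests):
    """Failed-request budget for a success-rate target with the same notation."""
    return total_requests * 10 ** -nines


for nines in (3, 4, 5):
    minutes = allowed_downtime_seconds(nines) / 60
    print(f"{nines} nines: about {minutes:.2f} minutes of downtime per year")
```

At five 9s the yearly budget works out to 31,536,000 s × 0.00001 ≈ 315 seconds, or roughly 5.26 minutes, and a five 9s success rate allows about one failed request in every 100,000.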

Overcoming Resistance to Change

Tyler described how his team was able to show results and convince other teams at Twilio to use [x]PM. “Any time you try to introduce a new tool to engineers, there’s always going to be some level of resistance. Everybody has more work on their plates and in their backlog than they can handle, and then someone shows up and says: ‘hey, here’s this really cool tool that you should try.’ It’s always met with a healthy dose of skepticism. We had some teams that were early adopters that really derived incredible value from using LightStep. We were able to articulate those results and show other teams (that may have been skeptics). We showed how it helped us solve production-level issues, meet our goals on the operational excellence front, and deliver that higher level of operational maturity to our customers.”

Watch the tech talk, Managing the Performance of Applications in the Microservices Era, to get all of the details about how Twilio is using [x]PM. Don’t miss the demo to see [x]PM in action.

Originally published at lightstep.com on May 17, 2018.


Lightstep enables teams to detect and resolve regressions quickly, regardless of system scale or complexity.