Moving to Defcon 3

Mike Jones
Salesloft Engineering
May 6, 2019

Is that a spike? Is it a blip? It’s an outage indicator! You know that moment when you are staring at your dashboard and there is a bump in the line.

Did a server restart? Was there a traffic spike to an endpoint? Was there a third-party issue? Most importantly, is there about to be an outage? These are all questions we, as part of the TechOps department, have to answer. And we need to answer them immediately.

Monitoring tools have been around for as long as there have been systems engineers. We have spent time building them on our servers and provisioning third-party integrations to help us see under the hood of our systems. They are robust and usually supply more information than a single person can comprehend. They’re integral to our work: they let us see what is happening in real time and, in theory, should help us identify and remediate a problem before our customers even see an issue.

Sometimes, you don’t know what you don’t know. And that’s why you need a solid tech stack to help you investigate, diagnose, and eventually resolve issues.

At SalesLoft we utilize several tools: Sumo Logic, New Relic, Bugsnag and AWS CloudWatch. It means that, at any moment, I have four screens I’m watching, all with graphs on them. I feel a lot like an air traffic controller — where I’ll have many things in flight at once — but over time, I’ve built the ability to discern what’s important. In actuality, I am trying to predict system health — not as much fun as watching an aircraft lift off successfully, but still just as stressful.

Setting up monitoring for applications can be cumbersome. You have things like infrastructure, code, and third-party integrations. Which one is more important? What metric is our leading indicator of an issue? These questions have continually plagued System Engineers.

You have to start by knowing which metrics you can actually report on. What do you have visibility into? What tools do you have to help you monitor system and application health?

So, you may be wondering: How did we decide what to measure and alert on? Simple answer: We had outages, we learned from them, and now we monitor for specifics that caused them. Yeah, that’s right — we failed, and then, we learned from that failure — what a concept. You have to fall on your face sometimes to figure out where to place your feet.

So where do we place our feet? Good question. We utilize Sumo Logic to stream all of our logs, and we have queries that alert on specific items in the logs. Things like 404s, or someone running specific commands on production servers, trigger alerts that let us know something is amiss. Fun fact: we also use these logs as a security measure, monitoring production servers to make sure they are not compromised. Sumo Logic is a powerful tool; you just have to know how to harness it. It uses a proprietary query language to search logs. There is a ramp-up time, but once you learn it, it can be a mighty powerful companion. Also, once you set your servers to stream logs to Sumo Logic, you no longer have to store them on the servers themselves, which is a win!
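To make the idea of alerting on log contents concrete, here is a minimal sketch in plain Python. It is not Sumo Logic's query language and not our production setup; the log format (status code as the last field), the `MAX_404S` limit, and the function names are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical limit: alert when a batch of log lines contains more 404s than this.
MAX_404S = 3


def count_statuses(log_lines):
    """Count HTTP status codes, assuming the status is the last field on each line."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields and fields[-1].isdigit():
            counts[fields[-1]] += 1
    return counts


def should_alert(log_lines, status="404", limit=MAX_404S):
    """Return True when the given status appears more often than the limit allows."""
    return count_statuses(log_lines)[status] > limit


# Made-up sample batch of access-log lines.
sample = [
    "GET /home 200",
    "GET /missing 404",
    "GET /old-page 404",
    "GET /api/users 200",
    "GET /favicon.ico 404",
    "GET /typo 404",
]

if should_alert(sample):
    print("ALERT: unusual number of 404s in this batch")
```

A real log platform runs this kind of check continuously over a time window, but the shape is the same: pick a signal in the logs, count it, and alert when it crosses a line you chose on purpose.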

New Relic is one of those tools that we have embraced wholeheartedly. New Relic allows us to monitor application response time with its Apdex score. Apdex (Application Performance Index) is an open standard, not something New Relic invented, that scores response times against a target threshold; New Relic graphs it to let you know how well your application is responding for customers. I mean, what is more important than that? But that is not all New Relic does. It also allows you to dive into databases to see queries, application endpoints to see what is slow, third-party response times, and throughput. And if that was not enough, it allows you to see errors and trace through the code to find exactly where the issue originates.
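For anyone curious what an Apdex score actually measures, here is a small sketch of the calculation as it is usually defined: requests at or under a target threshold T count as satisfied, requests between T and 4T count as tolerating (half credit), and anything slower counts as frustrated. The sample response times and the 0.5 second threshold are invented for illustration, and this is not New Relic's code.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= threshold
    tolerating: threshold < response time <= 4 * threshold
    frustrated: everything slower (contributes nothing to the score)
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Made-up response times in seconds, scored against a 0.5 s target.
samples = [0.2, 0.4, 0.9, 1.5, 3.0]
score = apdex(samples, threshold=0.5)
print(f"Apdex: {score:.2f}")  # 2 satisfied, 2 tolerating, 1 frustrated -> 0.60
```

A score of 1.0 means every request came in under the target; the closer it drifts toward 0, the more customers are waiting.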

Come on, that cannot all be true. There is no way that a single tool can do all that.

Stop asking that question, and embrace the awesome. The initial perception of New Relic is overwhelming. There are so many places that are giving you so much data. Eventually though, like most things, you learn what is important. You start to read the graphs with precision. You know where to click to drill down to issues. Soon enough, you may do what we do in our office: set up a huge television with a New Relic dashboard. That way, everyone can see what is happening in-app without having to ask an engineer.

Bugsnag is the place you should capture your error logs. Let’s be honest with each other: errors happen, and digging through logs to find them and understand the importance of each one is complex. Aggregating the errors and who they affect in a large stack can be overwhelming. Why not have someone do that for you? Here is where Bugsnag has become an incredible asset for SalesLoft. We have been using it for years to help us identify not only the error but the actual user who was affected. It also allows us to see code snippets where the error occurred. While aggregating the errors, Bugsnag lets us know the number of times each error was triggered and what hardware the issue was on. Users can assign engineers to errors and even link Jira tickets. In a world where we don’t want to have errors, it’s always nice to have a tool that will help guide you through what was causing the error and the ecosystem that the error lives in.
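To show why aggregation matters, here is a rough sketch of the core idea: group raw error events by type, then count occurrences and distinct affected users. This is not the Bugsnag API; the event fields (`error_class`, `user_id`) and the sample data are hypothetical.

```python
from collections import defaultdict


def aggregate_errors(events):
    """Group error events by class, tracking occurrence count and distinct users."""
    groups = defaultdict(lambda: {"count": 0, "users": set()})
    for event in events:
        group = groups[event["error_class"]]
        group["count"] += 1
        group["users"].add(event["user_id"])
    return dict(groups)


# Made-up error events, as they might be pulled out of raw logs.
events = [
    {"error_class": "TimeoutError", "user_id": "u1"},
    {"error_class": "TimeoutError", "user_id": "u2"},
    {"error_class": "KeyError", "user_id": "u1"},
    {"error_class": "TimeoutError", "user_id": "u1"},
]

summary = aggregate_errors(events)
for error_class, info in sorted(summary.items()):
    print(f"{error_class}: {info['count']} occurrences, {len(info['users'])} users affected")
```

An error that fires three times for one user and an error that fires three times across three users are very different problems, which is exactly the distinction an aggregator surfaces for you.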

That brings us to CloudWatch. Once a powerhouse of monitoring, it has become something we only use to monitor Amazon-specific hardware. CloudWatch is the native AWS monitoring tool. It does a great job at monitoring AWS hardware, but beyond that, its benefits are non-existent. Don’t get me wrong, in the early days of our application, we relied on it very heavily. But over time, it has moved to the back burner behind newer, more robust monitoring tools.

Ok, we have gone through how we monitor and the services that we use, but we still haven’t gone over what you need to monitor. No, I didn’t mean to dupe you. I am not here to tell you what you need to monitor, because all applications are different. It took SalesLoft years to hone our monitoring. Even today, we are sometimes still guessing, albeit with educated guesses.

That’s what monitoring boils down to. You will always have things that everyone knows they should monitor: CPU utilization, number of connections, disk space, and endpoint response time. But those are surface-level items. You and your team have to sit down and ask yourselves three questions. One: What caused our last outage? Two: Where do we have choke points in the application? And three: How can we monitor those choke points and predict issues before our customers notice?
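As a starting point for the surface-level items, a threshold check can be as simple as the sketch below. The metric names and limits are placeholders I made up, not SalesLoft's actual values; the right numbers come from your own outages and choke points.

```python
# Hypothetical limits for the baseline metrics; tune these to your own system.
THRESHOLDS = {
    "cpu_percent": 85.0,
    "open_connections": 900,
    "disk_used_percent": 90.0,
    "response_time_ms": 500.0,
}


def breached(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics whose current value exceeds its limit."""
    return sorted(
        name for name, limit in thresholds.items() if metrics.get(name, 0) > limit
    )


# Made-up snapshot of current readings.
snapshot = {
    "cpu_percent": 92.0,
    "open_connections": 410,
    "disk_used_percent": 91.5,
    "response_time_ms": 230.0,
}

alerts = breached(snapshot)
print(alerts)  # ['cpu_percent', 'disk_used_percent']
```

Once the three questions surface a choke point, it becomes one more entry in a table like this, with a limit chosen from what you learned the last time it hurt.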

It’s a scary world out there. Memory leaks and poorly provisioned hardware are lurking in back alleys and dark hallways. But you can be proactive. You can be prepared. Air traffic controllers need three years of relevant experience, according to the FAA, before they can direct air traffic. I feel like that is a good start before an engineer can parse monitoring tools with samurai precision.
