Starting out, I didn’t exactly establish a set of clearly defined goals. Overall, I just wanted to find a good service to report exceptions that might come up across the various browsers and devices we support. Roughly speaking, here’s a list of what the service should be able to do for us:
- It should provide us with enough information to reproduce errors and help resolve them. Obviously, we want the typical reporting information: stack traces, browser/OS versions, frequency, etc.
- Related to the first point, we also want to be able to provide extra information ourselves through the reporting client, such as AJAX params, member information, or other environment variables (see the sketch below).
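To make that second point concrete, here is roughly what attaching extra context looks like from a reporting client. The sketch uses Sentry's Raven.js API (spoiler: Sentry is where we landed); the member, params, APP_VERSION, and submitSurvey names are just stand-ins, not real identifiers from our codebase.

```js
// Minimal sketch with Raven.js; member, params, APP_VERSION, and submitSurvey
// are placeholders.
Raven.config('https://<public-dsn>@app.getsentry.com/<project-id>').install();

// Context attached once, sent along with every report.
Raven.setUserContext({ id: member.id, plan: member.plan });
Raven.setExtraContext({ release: APP_VERSION });

// One-off context for a handled exception.
try {
  submitSurvey(params);
} catch (err) {
  Raven.captureException(err, { extra: { ajaxParams: params } });
}
```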
There are a ton of tools out there for error monitoring, and I’m sure most of them would suit our needs described above. I didn’t want to get too caught up in trying out a bunch of different services, so here are the four options I considered:
- New Relic. We’ve been using New Relic since I started working here to monitor our app ecosystem’s general health, and it’s fantastic. I think it was sometime last year that they announced New Relic Browser, their monitoring service for the front end. The AJAX insights are where that service really shines, but the error reporting aspect didn’t give us enough flexibility (there was no reporting client available).
- Honeybadger. Our back end Rails apps reported exceptions to Honeybadger, and it was pretty good overall. I did try using HB for one of our front end libraries, but it ended up being too noisy since it hooked into window.onerror without filtering anything out (see the sketch after this list).
- TrackJS. This service looked really promising. It had a flexible reporting client, a beautiful dashboard, and an exciting feature called Telemetry Timeline, which provided context for the events leading up to a thrown exception. I tried it out for a week, and, at least for the errors captured during that time, it didn’t seem to do a great job of aggregating similar errors, the Telemetry Timeline wasn’t very useful, and it was quite noisy. In hindsight, I probably should have given it more time in production. With a bit of playing around, it may have turned out to be a nice solution to our problem.
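As an aside on the window.onerror problem: a global handler catches everything on the page, including noise from browser extensions and third-party scripts, so some filtering at the source matters. A rough sketch of the idea (the URL pattern and the reportToService helper are hypothetical):

```js
// Hypothetical filtered global handler; assets.example.com and reportToService
// are placeholders, not real names from our stack.
window.onerror = function (message, source, lineno, colno, error) {
  // Browsers mask errors from cross-origin scripts as a bare "Script error."
  var masked = message === 'Script error.';
  // Only report errors that come from our own bundles.
  var ours = typeof source === 'string' && source.indexOf('assets.example.com') !== -1;
  if (masked || !ours) {
    return false; // ignore noise from extensions, ad pixels, third-party scripts
  }
  reportToService({
    message: message,
    source: source,
    line: lineno,
    column: colno,
    stack: error && error.stack
  });
};
```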
The Winner: Sentry
Sentry was the last service I tried, and it stuck. There are a bunch of reasons why I liked it the most; I’ll boil it down to three.
First and foremost, the dashboard did the best job of surfacing the most important errors using its Priority sort, which is a weighted score based on when an error was last seen and how often it occurs. Both its list views and detail views provide us with the information we care about most, in the simplest, most concise manner out of the group. Minimal clutter, maximum readability.
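Sentry’s actual scoring is internal to the service, but the intuition is easy to sketch: recency and frequency both push an issue up the list. Something along these lines (illustrative only, not Sentry’s formula; the field names are made up):

```js
// Illustrative only: recent, frequent errors score high; stale or rare ones sink.
function priorityScore(issue) {
  var hoursSinceLastSeen = (Date.now() - issue.lastSeenAt) / 36e5;
  var recency = 1 / (1 + hoursSinceLastSeen);      // decays as the issue goes quiet
  var frequency = Math.log(1 + issue.eventCount);  // diminishing returns for huge counts
  return recency * frequency;
}
```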
Secondly, it has a bunch of core and community-driven integrations with other services. For us, it allows us to hook into HipChat for channel notifications and JIRA to create tickets. In addition to these integrations, Sentry lets us create notification rules, e.g. we only want a HipChat notification the first time an error is seen, or once it’s hit a threshold of x events reported in a given minute (roughly the logic sketched below).
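These rules are configured in the Sentry UI, not in code, but the shape of what we set up is simple enough to spell out. The threshold of 10 per minute below is just an example value, and the function is not Sentry’s API:

```js
// Not Sentry's API: just the shape of the notification rules we configure in the UI.
function shouldNotifyHipChat(issue) {
  var firstSeen = issue.timesSeen === 1;          // a brand-new error
  var spiking = issue.eventsInLastMinute >= 10;   // error rate crossed the threshold
  return firstSeen || spiking;                    // either condition warrants a ping
}
```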
Lastly, it’s open source and supports a variety of platforms. With the source code up on GitHub, we could actually run our own Sentry server if we ever needed to. The paid options are good enough that we’ll probably stick with them. Regardless, we still get the benefit of community-driven improvements and bug fixes. The multi-platform support turned out to be an unexpected benefit, since we ended up moving all of our server-side apps over to report to Sentry as well.
Things I Learned
Signal vs. Noise
There was a common theme among the cons of all the services I looked into: noise. All the services dumped an overwhelming number of errors. Some were legitimate, some were out of our control, and many were duplicates of others.
The stark reality is that the noise is not going away any time soon. We’ll most likely never reach a state of zero error zen, but that’s okay. That shouldn’t have been a goal in the first place. The real goal for us is to be able to identify actionable and important errors, with some clues on how to reproduce and resolve them.
The State of Stack Traces
How We Use Sentry
We look for the spikes in the stream. Sorting by Priority is usually the best way to filter errors. You can see how noisy the stream is, but looking at the error frequency and the number of users affected, relative to the entire feed, is a good indicator of a major issue.
Let’s dive into the cryptic, vague [object Event] exception.
As you can see, this is one of the more useless error messages. It also doesn’t have any stack trace whatsoever. We then ask ourselves: is this error actionable and high priority? The graph shows it’s happening more than just intermittently, and that it affects browsers we support, so the answer is yes.
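For what it’s worth, a message like this typically means something other than a real Error reached the reporter, so the only thing left to display is its default string form. A hypothetical way it can happen, and the corresponding fix:

```js
// Hypothetical reproduction: throwing a DOM Event instead of an Error means the
// reporter can only show String(event), i.e. "[object Event]", and no stack.
var img = new Image();
img.onerror = function (event) {
  throw event; // surfaces as "[object Event]" with no usable stack trace
};
img.src = 'https://example.com/missing.png';

// Wrapping it in a real Error preserves a message and a stack trace:
img.onerror = function (event) {
  throw new Error('Image failed to load: ' + img.src);
};
```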
Sometimes we’ll come across errors with nearly zero context from which we can try to reproduce them. Or, we’ll find errors with bizarre messages. For the latter, if it’s coming from a single browser version, it’s usually an exception from a browser plugin or third-party library. These are examples of unactionable errors which we simply ignore and mute.
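Much of that muting can happen at the source, before events ever reach the dashboard. Raven.js exposes config options for exactly this; the specific patterns below are examples, not our production list:

```js
// whitelistUrls, ignoreErrors, and ignoreUrls are real Raven.js options;
// the patterns themselves are illustrative placeholders.
Raven.config('https://<public-dsn>@app.getsentry.com/<project-id>', {
  // only report errors raised from our own bundles
  whitelistUrls: [/assets\.example\.com/],
  // drop masked cross-origin errors and known plugin chatter
  ignoreErrors: ['Script error.', /atomicFindClose/],
  // never report anything coming from extension code
  ignoreUrls: [/^chrome:\/\//i, /extensions\//i]
}).install();
```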
At this point, a JIRA ticket is easily created through the built-in integration (see the Create JIRA Issue link in the sidebar). The person assigned this bug will likely weep in despair, but we’ve done our best to ease the pain. Our Sentry setup aims to provide as much contextual information as possible and lead us in the right direction.
It’s Not Over
PS. If you’re into the sort of things described in this article, Crowdtap is hiring!