Aggressive Frontend Error Reporting

Josh Goldberg
Codecademy Engineering
7 min read · Dec 16, 2021
Chart showing a blue line of errors below yellow and red thresholds

Frontend crashes are bad! Users hate it when apps crash on them. At the very least it disrupts their experience and might make them lose some work. Users with low self-confidence may also think they’ve done something wrong and be discouraged.

Education users such as Codecademy learners are no exception. We’ve seen people turn away from learning to code at all because of a single bug in an education product. Providing a friendly, stable environment is therefore even more important for us.

At Codecademy, we’ve been moving our core website piece-by-piece to a new Next.js system integrated with Datadog for telemetry. This blog post will cover how we’re using Datadog dashboards, monitors, and queries at Codecademy to keep frontend crashes to a minimum.
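
For context on how those errors reach Datadog in the first place: the @datadog/browser-logs package gets initialized once on the client so that uncaught exceptions are forwarded as error logs. A minimal sketch, using placeholder values for the token, service name, and environment rather than our real configuration:

    import { datadogLogs } from "@datadog/browser-logs";

    datadogLogs.init({
      clientToken: "<public client token>",
      site: "datadoghq.com",
      service: "portal-app",
      env: "production",
      forwardErrorsToLogs: true, // uncaught exceptions become Datadog error logs
    });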

Frontend Error Goals

Our goal is not to have zero frontend crashes in production. That is a nice idea but practically impossible. There will always be some crashes: both in our control, such as from newly introduced edge cases, and out of our control, such as from external browser shenanigans.

The steps we took to reduce errors can be summarized by two main areas:

  1. Classifying: being aware of what already exists on the site, knowing what is or isn’t critical to fix, and linking alerts that fire on consistent elevations.
  2. Reducing: chipping away at removing our top-hitting crashes by either fixing or re-classifying them as appropriate.

By applying consistent effort over time to those two areas, we’ve been able to reduce our user-impacting frontend errors down to almost zero, which is as good as we think is realistically possible to get!

Classifying Errors

We generally consider there to be three classes of errors, with errors defaulted to 🔥 Actionable until & unless they’re downgraded:

  • 🔥 Actionable: real crashes in code that we should treat as bugs and fix
  • 🤷 All: both 🔥 Actionable errors and errors generally out of our control whose spikes we’d still want to know about, such as failures to load resources
  • 🙈 Ignored: those completely outside our control, such as from browser crashes and browser extensions

The distinction between 🔥 Actionable and 🤷 All is important. We’ve seen bugs introduced to production that seem to be spammy errors at first, such as server misconfigurations preventing clients from properly loading some pages, so we don’t want to completely ignore them. At the same time, there will always be a baseline level of spam that we don’t want to alarm over.

A Unified Dashboard

We set up a central Datadog dashboard containing visualizations of queries on 🔥 Actionable and 🤷 All errors. The dashboard starts off with a link to our central Notion documentation page. Each query is given a row with a table showing its highest-hitting errors, as well as a nice colored timeseries chart of those same errors:

The errors dashboard while we were still fixing errors, showing both 🔥 Actionable errors and 🤷 All errors.

Each query also has a brief explanation on the right with a link to its associated Datadog monitor that uses the same query to fire an alert if the count exceeds a threshold over 5 minutes.
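
For reference, a log monitor built on one of these queries is shaped roughly like the following, with a simplified query and a placeholder threshold rather than our real values:

    logs("status:error *codecademy*").index("*").rollup("count").last("5m") > 100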

Queries, Explained

The backing query for 🤷 All errors runs over all errors reported to Datadog in production by its browser-logs package, with a visualization that counts occurrences per unique error message. It filters out a few known causes of external issues using roughly this Log Search syntax:

*codecademy* -"*chrome-extension*" -"*ChunkLoadError*"

Explaining each portion:

  • *codecademy*: only include error stacks that mention our site name (so, exclude browser extensions that have nothing to do with us)
  • -*chrome-extension*: also exclude Chrome extensions, since they’re very common and sometimes wrap around global APIs that our code calls
  • -*ChunkLoadError*: ignore errors caused by the network dropping in the middle of page load, which cause Webpack to fail to load a chunk

The backing query for 🔥 Actionable errors is similar. It also excludes known 🤷 All errors. For example, excluding a known Chrome bug with React apps: -"Failed to execute '*': The node to be * is not a child of this node".
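
Put together, the 🔥 Actionable query ends up shaped roughly like this (abbreviated; the real exclusion list is longer and changes over time):

    *codecademy* -"*chrome-extension*" -"*ChunkLoadError*" -"Failed to execute '*': The node to be * is not a child of this node"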

Reducing Errors

Visibility into errors is great, but it’s not extremely useful unless developers know how to fix those errors and actively use that knowledge. Any good system needs multiple layers of documentation to explain it and raise interest.

Presentation slide titled “Portal App RUM” from Josh at Codecademy. Mentions “Monitoring the living heck out of frontend crashes on portal-app”.
Slide from our all hands presentation. In retrospect, we could have used more emojis.

For this frontend errors work we went with:

  • A summary docs page in our team Notion, with more detailed sub-pages
  • A team-wide “brown bag” presentation when it was released
  • A heads-up and demo in our monthly engineering “all hands” sync
  • A series of informal “mob”-style pairings where we tackled a top-hitting error as a group and explained the debugging process

Slack post from Josh offering to mob pair on infrequent, mysterious crashes with a link to a Datadog dashboard.
Slack post offering signups for group sessions on tackling errors. This time, we did use a few emojis.

Error Debugging

Some developers dread needing to debug frontend errors on production. That’s a reasonable reaction: call stacks sometimes don’t exist or are misleading; different browsers or versions of the same browser may act differently; it can be hard to pinpoint what users were doing before the error.

But! By the power of Datadog tooling, JavaScript sourcemaps, and general knowledge of browser quirks, we can turn error investigations into a fun mystery. Our error investigations tend to follow a track something like:

  1. Identifying context: Some errors are unique to a particular user situation that often hints at or even indicates the root cause.
  2. Understanding call stacks: Using sourcemaps to pinpoint where in code the error is being fired, and from what call chain of functions.
  3. Recreating locally: Using the context and source to try to reproduce the crash locally for debugging, if possible.

Identifying Error Context

The dashboard table views of queries include a context menu action on errors to Query in RUM that leads to a Datadog RUM Analytics page searching for that error’s message.

Clicking on an error to bring up its context menu.
Linked RUM query page for that same error.
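
Under the hood, that linked search is roughly of this shape, using Datadog’s standard RUM error attributes with a placeholder message:

    @type:error @error.message:"TypeError: Illegal invocation"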

Knowing the context causing an error is often invaluable in determining why the error is happening. We tend to check at least these facets for an error:

  • Browser name: some errors are unique to one browser, such as Firefox’s unique and powerful privacy and security permissions controls.
  • Browser version: some errors only fire in older (or newer) versions, such as old versions of Chrome having edge cases in their this scope handling.
  • Language: some errors only occur in particular locales, such as unusual or invalid time zones set by users or their privacy settings.
  • URI: some errors only happen on particular pages and/or page queries.

🐛 Past error example: TypeError: Illegal Invocation. RUM facets showed the error to be unique to 2-year-old versions of Chrome. That helped narrow Googling results down to navigator.sendBeacon scope issues.
Resolution: Codecademy/client-modules/pull/7.
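
As a generic illustration of that bug class (not our exact code): detaching a native method like navigator.sendBeacon from its receiver loses the this binding, which is what produces the “Illegal invocation” crash.

    const payload = JSON.stringify({ event: "ping" });

    // Detached from `navigator`, the method loses its `this` binding and throws
    // "TypeError: Illegal invocation" when called:
    const send = navigator.sendBeacon;
    // send("/telemetry", payload); // ❌ crashes

    // Keeping the receiver, or binding it explicitly, avoids the crash:
    navigator.sendBeacon("/telemetry", payload); // ✅
    const sendBound = navigator.sendBeacon.bind(navigator);
    sendBound("/telemetry", payload); // ✅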

Understanding Call Stacks

Source maps are the way a developer tool such as a browser can map from minified production scripts back to their original code. At the time of writing, Datadog unfortunately doesn’t yet support automatic source map detection the way Sentry does. But you can upload source maps manually in your builds to inform the call stacks shown in the details pane for RUM errors.
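
The upload itself can be a single CLI step in the production build; something along these lines, where the paths, service name, and version are placeholders:

    npx @datadog/datadog-ci sourcemaps upload ./build/static/js \
      --service=portal-app \
      --release-version=1.2.3 \
      --minified-path-prefix=https://www.codecademy.com/static/js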

Error details pane for “Cannot read properties of undefined (reading ‘Name’)”.
Details pane for an error that uses a sourcemap to show the original source location for a crash.

Once sourcemaps are available, search up and down an error’s call stack to see what functions call what, and where. You can sometimes learn useful information just by seeing the chain of calls, such as functions you didn’t expect to be able to call other functions in a particular way.

🐛 Past error example: Cannot read property 'style' of null. Sourcemaps showed this to be in code called by old React class components using legacy ref callbacks with hard-to-understand handling of asynchronous logic. Resolution: we refactored to cleaner, more understandable React hooks.
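
A loose sketch of that kind of cleanup, with the component and timing details invented for illustration: the hook-based version checks the ref before touching .style and cancels pending async work on unmount, removing the window where the node can be null.

    import { useEffect, useRef } from "react";

    function Banner() {
      const nodeRef = useRef<HTMLDivElement>(null);

      useEffect(() => {
        // The legacy version stored the node from a ref callback and later read
        // node.style inside async logic, sometimes after unmount (node === null).
        const timer = setTimeout(() => {
          nodeRef.current?.style.setProperty("opacity", "1");
        }, 500);
        return () => clearTimeout(timer); // cancel pending work on unmount
      }, []);

      return <div ref={nodeRef} style={{ opacity: 0 }} />;
    }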

Recreating Locally

No matter how confident you are in a code change or understanding an error, nothing is quite the same as seeing it happen in your browser dev tools. We ask that all bug investigations include attempting to create reliable, repeatable steps to reproduce the error.

In particular, our template that code change pull requests must fill out includes testing instructions. Developers are expected to explain to code reviewers how to reproduce the crash and verify the code change fixes it. This serves a double purpose of documenting the crash and fix for future devs, which is particularly helpful for similar bugs or regressions of the same bug.
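
A minimal version of that template section might look like this; illustrative wording, not our exact template:

    ## Testing Instructions

    1. How to reproduce the crash before this change
    2. How to verify the crash no longer occurs after this change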

🐛 Past error example: Unable to fetch data from progress. Recreating this locally only worked when viewing the site anonymously, hinting that anonymous users were failing to fetch from the “progress” service.
Resolution: we stopped anonymous users from trying to fetch user progress.
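
A skeletal version of that kind of guard, with the function and endpoint names invented for illustration:

    interface User {
      id: string;
    }

    // Anonymous visitors have no progress to fetch, so skip the request entirely
    // instead of letting it fail and surface as a frontend error.
    async function loadProgress(user: User | undefined) {
      if (!user) {
        return undefined;
      }

      const response = await fetch(`/progress/${user.id}`);
      if (!response.ok) {
        throw new Error(`Unable to fetch data from progress: ${response.status}`);
      }
      return response.json();
    }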

Closing Thoughts

Presenting a smooth, stable app experience for users is important in keeping them confident and engaged through their user journey. We recommend setting up clear documentation, dashboards, and monitors on your app so that you can understand where the crash points are for users and how to resolve them.

If creating stable and enjoyable web apps is of interest to you, we’re hiring! codecademy.com/about/careers

Many thanks to my excellent coworker Melanie Mohn for proofreading!
