Front end observability. A practical guide to browser error monitoring with window.onerror 💂‍

omrilotan · Published in Fiverr Tech · Sep 11, 2019 · 12 min read

Front end application code, more than any other, runs on environments we have little to no control over.

Each browser has its unique set of attributes, feature support, connectivity levels, and more. In modern applications users configure half of the features, A/B tests alter the rest, and user-installed browser extensions impact your data transit and code execution. All of this creates a highly volatile environment for browser application code to execute in.

Because the execution is remote from our infrastructure and the runtime environment is especially noisy, we are inclined to neglect the errors firing from the browser and sink into a blissful lull of silence from our browser applications.

At Fiverr we have become acutely aware of the richness of browser application errors, and have gradually improved the flow and quality of collecting and handling error logs, to the point where we rigorously monitor our browser applications. Over time I’ve learned lessons that may benefit others. I know what you’re thinking: “Why won’t you use Sentry?” Well, we’re already not doing that.

🎉 window.onerror: Good news, everyone!

Our web applications usually run JavaScript and share a global scope called window. When a runtime error is not caught and handled by your functional flow, it ends up triggering a window.onerror event (as well as window's 'error' event listeners). This interface may furnish a great opportunity for insights into obstacles your visitors encounter while trying to fulfill their endeavours.

We must not squander the gift of window error invocation. The fact that we get this catch-all event listener for free is only the first step — now we must do something with it.
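
A minimal sketch of such a handler, assuming a hypothetical /browser-errors collection endpoint, might look like this:

window.onerror = (message, source, lineno, colno, error) => {
  // Build a small, serialisable record out of the handler's arguments
  const record = { message: String(message), source, lineno, colno };
  // Fire and forget: logging should never block the page
  navigator.sendBeacon('/browser-errors', JSON.stringify(record));
};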

I intend to present the simple design required to log your errors, protect your infrastructure, reduce false positives, and finally create the right alerts. My design specifically catches uncaught errors: unexpected errors thrown outside a try/catch block or a promise catch clause. It then forwards them to a log aggregation service through an optional mediator layer.

💁‍♂️ There is no particular order in which to read this article. Feel free to jump to topics that interest you more than others. Also, the implementations in this article are meant to inspire; apply them at your own discretion.

Key players of our system

  1. Browser — The web application runs in your visitors’ browsers. In its global scope we will write the error handler that catches and forwards the error as a log record, preferably early in its life cycle, somewhere near the top of the document.
  2. Mediator (optional) — This layer allows us to apply logic before forwarding logs to our log aggregator, such as visitor authentication (session tokens), filtering known issues, modifying log levels, enriching log records, and collecting statistics in side systems. It can be implemented as a serverless function connected to an API gateway or a sustainable HTTP API service — anything goes.
  3. Log Aggregator — This layer can be a self-hosted database like Elasticsearch with a system on top that manages streams and alerts, like Graylog, or a hosted log solution. This layer is the first place your developers will start their investigations.

It’s really a very simple system

🖥 Browser

Make sure you’ve set CORS headers

Before we start catching and sending errors, this prerequisite is usually quite necessary.

Script files hosted on domains other than your web application (maybe your CDN) will not reveal where an error occurred, what the error was, or its stack trace. Instead you will see the error message: “Script error.”

This, obviously, does not contribute to visibility. Adding the crossorigin attribute to script tags makes the browser request them using CORS. The value anonymous means that there will be no exchange of user credentials, unless the script is served from the same origin.

<script src="..." crossorigin="anonymous"></script>

To make a long story short — you will now be privy to the full details of the errors.

Here’s a little caveat — beware of browser cache. If you only add the attribute to an existing script tag that is cached by the browser, users may get a CORS error because the response was already cached without CORS headers. A workaround for such a situation is to add a query parameter.
<script src="...?v=1" crossorigin="anonymous"></script>

Don’t bully your tenants

We’re going to catch unhandled errors using the window.onerror attribute. You should be aware that someone may have already registered an onerror handler in your runtime.

Be considerate of other occupants sharing the global runtime. It is in everyone’s best interest that vendors are able to monitor and address their own errors.

When overriding an existing handler, make sure to trigger it yourself. You can call it before or after your own logic.

Also, don’t return true; it will prevent the default event handler from firing.
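
One way to stay considerate, sketched below with a placeholder report function standing in for our own forwarding logic: keep a reference to any previously registered handler and call it as well.

const previousHandler = window.onerror;
window.onerror = function (...args) {
  report(...args); // our own log forwarding (placeholder)
  // Give the previous occupant its turn as well
  if (typeof previousHandler === 'function') {
    previousHandler.apply(this, args);
  }
  // Deliberately not returning true, so the default handler still fires
};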

Create limitations

Once set up — errors may start flooding your system. Consider what conditions constitute an error you don’t care about, and filter them early on. This will help your team focus on the real issues.

For example, a broken page may throw gobs of errors that all originate in one breakage. It won’t do us much good to get all of them — I limit the number of errors on the same page to 10.
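
A simple counter is enough to implement such a limit; a rough sketch:

let errorCount = 0;
const MAX_ERRORS_PER_PAGE = 10; // an arbitrary ceiling
window.addEventListener('error', () => {
  if (++errorCount > MAX_ERRORS_PER_PAGE) {
    return; // a broken page has made its point; stop forwarding
  }
  // ...forward the error to the logger
});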

Details about the error

The interface of window.onerror exposes details that help us understand what the error is and where it originated. The error object cannot be serialised to JSON for an HTTP request payload, but you should extract its stack.
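
For example, JSON.stringify(new Error('Oh no')) yields "{}" because the interesting fields are non-enumerable. A small serialiser along these lines (the function name is arbitrary) pulls them out explicitly:

function serialise (error) {
  if (!(error instanceof Error)) {
    return { message: String(error) };
  }
  // name, message and stack are non-enumerable, so copy them out by hand
  const { name, message, stack } = error;
  return { name, message, stack };
}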

Lots and lots of details

The more the merrier. Your developers will have to investigate the error logs; they will want details that help them reproduce the issue, speculate on reasons for its occurrence, and hypothesise about the circumstances of its manifestation.

We can derive plenty of enrichments from browser APIs
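
For example (navigator.connection is not supported everywhere, hence the guard):

const enrichment = {
  url: document.location.href,
  referrer: document.referrer,
  userAgent: navigator.userAgent,
  language: navigator.language,
  online: navigator.onLine,
  screen: `${screen.width}x${screen.height}`,
  viewport: `${window.innerWidth}x${window.innerHeight}`,
  connection: navigator.connection && navigator.connection.effectiveType
};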

Really, the sky is the limit here. But your unique business might have more light to shed:

Add some unique details

Don’t take it from me — take a look at GitHub’s source code for a web page. Meta elements carry information from the server to the client including but not limited to:

  • Request ID (Check out universally unique identifiers for log correlation).
  • Username and user ID
  • Timestamp with date of the request
  • List of enabled features
  • Analytics information
<meta name="correlation-id" content="123e4567-e89b-12d3-a456-426655440000">
<meta name="user-logged-in" content="true">

I like this approach but you can pass information using a global scope variable rendered on the server or any other way you can imagine.

The important thing is to attach these details to the error log. It will prove very helpful when investigating recurring errors for common denominators or patterns.
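
A small sketch of how the browser can pick these meta values up and attach them to every record:

const pageDetails = {};
document.querySelectorAll('meta[name]').forEach(({ name, content }) => {
  pageDetails[name] = content;
});
// pageDetails['correlation-id'] === '123e4567-e89b-12d3-a456-426655440000'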

A correlation ID will prove especially helpful in correlating with server logs in case you implement log correlation methodology.

Custom error fields

Now that your errors are beautifully collected and enriched, your developers may prefer to simply throw errors instead of actively sending them to a logger. Allow your developers to add details to the errors they throw.

Then you collect the custom fields, just like you picked up the stack trace.
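
A sketch of both sides: the throw site adds fields, and the global handler collects them (the field names, the userInput value, and the send transport are placeholders):

// In the global handler: collect custom fields alongside the stack
window.addEventListener('error', ({ error }) => {
  const { message, stack, code, details } = error || {};
  send({ message, stack, code, details }); // send is a placeholder transport
});
// At a throw site, a developer enriches the error before throwing it
const error = new RangeError('Invalid date format');
error.code = 'INVALID_DATE_FORMAT';
error.details = { input: userInput }; // userInput is a placeholder
throw error;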

Declare log stream / subsystem

My developers can add a meta tag to the page so my mediator knows where to divert the logs. It allows teams to take full ownership of their errors.

<meta name="log-subsystem" content="user-page">

☁️ ️Mediator

The mediator is an optional layer, but my mediator service has proved very helpful — I use it to enrich log records, decide on the record severity, paint certain errors with special error codes, and refer records to relevant streams.

The mediator can be as simple or as elaborate as you want, and can run as a lambda function diverting traffic — or as a sustainable service. The client should not wait for a response from this service, and it should not work as a data retrieval system — it should simply relay messages to the correct endpoints.

It could, preferably, add a verification layer and act as a buffer to protect the log system from mistakes and overflow situations.

More Enrichment

My server can add some details that the client does not necessarily know, or simply spare the browser some calculations (a sketch follows the list below).

  1. Identify known crawlers
  2. Add IP, country, user-agent string.
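
A minimal sketch of such an enrichment step, assuming a Node.js handler behind something that has already parsed the JSON request body, and a placeholder forwardToAggregator function:

const crawlers = /bot|crawler|spider|headless/i;
module.exports = (request, response) => {
  const record = request.body; // assumes body parsing happened upstream
  record.ip = request.headers['x-forwarded-for'] || request.socket.remoteAddress;
  record.userAgent = request.headers['user-agent'];
  record.isCrawler = crawlers.test(record.userAgent || '');
  // A geo-IP lookup could add record.country here
  forwardToAggregator(record); // placeholder for the actual log shipping
  response.statusCode = 202;
  response.end();
};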

Divert “known” issues

Most on-call developers suffer from a condition I’m coining right now called “log fatigue”. I’ll take any chance to “reduce noise” — shift issues I do not necessarily expect my developers to address as regularly and as urgently as ordinary application errors. These logs have their own thresholds. I also lower the log level from “error” to “warn”. Here are some of them:

  • CORS errors (Script error.)
  • Errors coming from (identified) bots
  • Errors thrown from browser extensions (Source file is in protocol moz-extension://, chrome-extension://)
  • Missing global dependencies (React is not defined)
  • Scripts which have only external sources in their stack trace (Also addressed in the next segment)
  • Missing basic polyfills for some reason (Uncaught ReferenceError: Set is not defined)
  • Syntax errors caused by network issues (SyntaxError: Unexpected end of input)
  • Any other error you want (like localStorage access on a private session in Safari)

This is not to say we do not set alerts on these logs — they’re just different in sensitivity and urgency.

Figure out from your logs which errors are considered acceptable to you and make it easy for developers to suggest edits and additions. Document this process rigorously.

All logs are tested against these conditions by the mediator (from most common to least), and are either redirected to their respective streams (like 3rd party providers) or to another bulk stream with alerts based on pre-declared error codes (SCRIPT_ERROR, MISSING_DEPENDENCY, MISSING_POLYFILL, etc.). This practice has proved impactful.
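
A sketch of that classification step; the regular expressions and some of the codes here are illustrative, not an exhaustive list:

function classify ({ message = '', stack = '' }) {
  if (message.startsWith('Script error')) {
    return { code: 'SCRIPT_ERROR', level: 'warn' };
  }
  if (/(moz|chrome)-extension:\/\//.test(stack)) {
    return { code: 'BROWSER_EXTENSION', level: 'warn' };
  }
  if (/is not defined$/.test(message)) {
    return { code: 'MISSING_DEPENDENCY', level: 'warn' };
  }
  if (message.includes('Unexpected end of input')) {
    return { code: 'SYNTAX_NETWORK_ERROR', level: 'warn' };
  }
  return { code: null, level: 'error' }; // an "ordinary" application error
}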

Create separate streams for providers

When the file from which the error was thrown is provided by a 3rd party vendor — I choose to divert the logs to specialised streams:

All unidentified 3rd party errors can get their own group stream — but this practice allows us to enforce a tolerance policy and to disable 3rd party scripts on the fly if they introduce critical issues.

🗄 Logs aggregator

We send this load of information to our logging system hoping we can make some sense of all of it. Now is the time to look into it and prioritise.

Don’t be discouraged if the volume and variety are intimidating at first. We’ve placed mechanisms to divert streams and tone the noise down. Don’t hesitate to add more rules and exclude items from the main stream to make it more approachable. The goal is to have a proactive resolution derived from this stream and to get its volume down — even by means of excluding messages of lower priority.

Create alerts

Eventually you’ve had your system running for a while, and you should stop looking at logs and get back to introducing more ~~bugs~~ features. Now is the time to set an upper threshold for the number of error logs. The system should alert you when the status quo has been challenged. Alerts are very important: they bring us back to the system when it deteriorates, or when you’ve made a horrible mistake (before customer support starts calling you up), and, more importantly — they keep us away when everything is fine.

Log reduce / Loggregation

We send as many details as possible to the logs and we want a system that can help us find the patterns once the alerts are firing. Look for this feature in your choice of log provider.

Be elaborative in alert descriptions

Alerts can be intimidating. I find that developers tend to ignore alerts if they seem hard to tackle or are descriptively cryptic.

The nature of the errors we are talking about in this article is that we don’t expect them (they are unhandled) — this makes them prime candidates for developers to ignore.

This is not a practice unique to browser errors — we found it extremely beneficial to spell out the first couple of steps for your on-call developer to take, and to pin informative wikis or links to useful dashboards in the alert content or alerts channel.

For the alerts of our “known” issues (see above) I go as far as adding a paragraph explaining what the error means.

Help your database recognise important patterns.

We’ve been adding a lot of details to each log record. If you want to keep your database maintainable, you should choose which fields out of the logs to index, or at least which fields not to index.

  • Index fields that distinguish between errors: message, file, url, and error-code (in case you’ve added one, see “known” issues).
  • Index fields that may distinguish between groups of visitors you may have neglected to test: user-agent (or parsed OS and browser names and versions), geo-location, localisation.
  • Do not index extremely unique or elaborate fields, like breadcrumbs or a failed request body, since they are usually used individually to try and replicate flows.

Remember — the records always remain searchable as strings.

💂‍ Who watches the watchmen?

We have made browser errors visible and actionable. Finally we have the whole operation running like clockwork. We’ve been able to tackle recurring issues and our various alerts are keeping quiet.

But what happens when the onerror handler itself throws an error? There’s no catch clause for this one. This is the end game.

Be vigilant

In this particular partition of your codebase, make sure you have good code test coverage. Consider exclusively using long-supported features that need no polyfill (instead of [].includes(x) use [].indexOf(x) !== -1, etc.).

Catch errors in the error handler

Wrap the whole enrichment process in a try/catch block, and in case of breakage replace the record with the newly caught error before sending. Firefox on Linux, for example, will not even allow CORS errors to read the stack: Exception sending window error: Permission denied to access property "stack".
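
A sketch of that safety net, with hypothetical enrich and send functions standing in for the heavy lifting:

window.onerror = (message, source, lineno, colno, error) => {
  let record;
  try {
    record = enrich({ message, source, lineno, colno, error });
  } catch (loggingError) {
    // The watcher itself failed; report that instead, with minimal processing
    record = {
      message: `Exception sending window error: ${loggingError.message}`,
      original: String(message)
    };
  }
  send(record);
};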

Monitor the error log stream

As any parent can tell you, if it’s been quiet for a while — something must be wrong. Monitor your stream for the absence of errors. My favourite alerts are the ones I have set up on the lower boundary of streams. My Slack calls me up, saying:

🚨 [ALERT] Not enough error logs coming in from the browser
This is very suspicious, there were hardly any logs for the past 15 minutes. You should verify the system is working properly:

• 🖥 Browser should be sending log messages to mediator service
• ☁️ Mediator service is up and running
• 🗄 Logging system is receiving and displaying logs

ℹ️ Further information can be found in the wiki under the entry “Front End Logs and Errors (browser)”

🤸‍‍ Extracurricular Activities

There are always more ways to improve visibility. Here are some features you can add to your system to enrich log records or to reduce noise from the system.

Breadcrumbs

Odds are your development team is still going to get plenty of errors they cannot reproduce. A trail of user interaction can offer an inspiring window into the situation leading up to the error. I suggest collecting interactions in a global array and sending it along with every error.

You can expose an interface for your developers to add breadcrumbs manually from their code (which will probably never happen), or choose to collect a set of pre-defined user interactions globally, such as all clicks, touch events, and form submissions.
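
A rough sketch of the global collection approach (window.trail is an arbitrary name, and the set of event types is up to you):

window.trail = [];
const MAX_BREADCRUMBS = 20; // an arbitrary cap
['click', 'submit'].forEach((type) => {
  document.addEventListener(type, ({ target }) => {
    window.trail.push({
      type,
      target: (target.tagName || '') + (target.id ? '#' + target.id : ''),
      time: Date.now()
    });
    if (window.trail.length > MAX_BREADCRUMBS) {
      window.trail.shift(); // keep only the most recent interactions
    }
  }, true); // capture phase, so we see events even when propagation is stopped
});
// Attach window.trail to every error record before sending it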

Avoid errors from old, cached pages

Okay, this one is a bit tricky to pull off but I think it’s totally worth the hassle. I was able to mark errors from old cached web pages by applying this flow.

  1. Server-side rendering of the page adds a meta tag with the UTC timestamp of the moment it was rendered.
  2. The browser picks it up and sends it along with error logs.
  3. The mediator service calculates how many hours have passed since the page was created, and adds a field to the record (see the sketch after this list).
  4. The alert system puts a threshold on, let’s say, pages older than 24 hours.
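
A sketch of steps 1–3, using a hypothetical rendered-at meta name (record stands for the error log record being built):

// 1. Rendered by the server: <meta name="rendered-at" content="2019-09-11T07:00:00.000Z">
// 2. The browser attaches it to each record
const renderedAtTag = document.querySelector('meta[name="rendered-at"]');
record.renderedAt = renderedAtTag && renderedAtTag.content;
// 3. The mediator computes the page age in hours (36e5 milliseconds in an hour)
record.pageAgeHours = record.renderedAt
  ? (Date.now() - Date.parse(record.renderedAt)) / 36e5
  : null;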

Page snapshot

Store a snapshot of the HTML in a temporary bucket (with low TTL), to allow a print-screen upon error investigation.

Sourcemap integration

Map the error location using a sourcemap file corresponding to the bundle file. Optionally — send the surrounding 10 lines of code.
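
A sketch of the lookup on the mediator side, assuming the mozilla “source-map” npm package and that the raw sourcemap JSON for the bundle is available to the service:

const { SourceMapConsumer } = require('source-map');
async function locateOriginal (rawSourceMap, line, column) {
  const consumer = await new SourceMapConsumer(rawSourceMap);
  const position = consumer.originalPositionFor({ line, column });
  consumer.destroy();
  return position; // { source, line, column, name }
}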

Make suggestions

This list can go on, I’m sure. I’d love to hear some of your ideas for improving this flow. Please don’t say “Use Sentry”.

🙌 Thanks to Yossi Eynav for originally pushing to start monitoring browser errors on Fiverr.

Cover image by Paweł Czerwiński on Unsplash
Originally published at https://dev.to on September 11, 2019.
