Web Metrics with OpenCensus and Stackdriver

Introduction

This post describes an approach to measuring and understanding application health as experienced by the population of users of a web application in near realtime. It discusses the example Monitoring Web Metrics with OpenCensus and Stackdriver in the OpenCensus Node.js project and how to extend that to more complex web applications. In addition to the OpenCensus project libraries, the example solution is enabled by leveraging new JavaScript features that are part of recent standards including ES6 and the Resource Timing Level 1 candidate recommendation released by the W3C in 2017. The solution also demonstrates storage and visualization of the metrics with Stackdriver on Google Cloud Platform. The same approach is applicable to monitoring products from other clouds and open source systems via the flexible OpenCensus Export interface.

Use Cases

Collecting metrics is an essential task for web application owners who want to understand performance, user experience, and user engagement. There are three main use cases and three corresponding flavors of metrics:

  1. The health of the business: indicated by business and user engagement metrics such as number of active users and ‘conversions,’ which may be new registrations or online sales, on a day-to-day or month-to-month timeframe
  2. Performance as experienced by the developer: a detailed view of latency and resource usage, measured from the developer’s own perspective as an application user before the code is released
  3. Operational health of the application: real-time latency as experienced by the user population, for example, resource loading times for users in different geographies, and aggregate counts of errors and user-driven events, on a minute-to-minute basis

The third use case is the focus of this post.

Examples of existing fully managed solutions for the first use case include Google Analytics and the Google Marketing Platform, which include very powerful features for analysis of business metrics. The second use case is addressed by tools such as Chrome Developer Tools. The approach described here for the third use case gives you full control over the collection, storage, presentation, and analysis of metrics data for the population of users in near realtime. For example, you can formulate alerts through Stackdriver Alerts to page on-call Site Reliability Engineering (SRE) staff when increased client errors or elevated latency are detected.

OpenCensus, a multi-language framework for collecting traces and monitoring metrics, is the core of the solution described here. The solution uses modern JavaScript (ES6) on both client and server to instrument, collect, and save web metrics in Stackdriver, where a continuously updated dashboard can be viewed. JavaScript was chosen due to the popularity of Node.js with web application developers. In addition to its use as a server, Node.js has become central to the deployment of both client application frameworks, such as Angular and React, and user interface component libraries like Material Web.

If all this seems like overkill for your application, you may want to try an uptime service like Stackdriver uptime alerts.

Architecture

In the example, the server runs Express and can be deployed to App Engine Flex or anywhere else Express is supported. The OpenCensus APIs can also be used in other computing environments, including other clouds, and with monitoring backends other than Stackdriver. The example code uses the @opencensus/core and @opencensus/exporter-stackdriver Node.js modules.

An alternative approach to the one followed in the example would be to make OpenCensus calls directly from a web browser, skipping the backend, which is an extra resource that needs to be managed. You can indeed make GCP API calls from JavaScript in a web client, just like this example for BigQuery. However, in order to do that you will need to use an API key and the user’s credentials to go through the OAuth 2.0 flow. When saving data to Stackdriver, that is a disruptive user experience, and it is not possible at all for users who do not have Google accounts. Therefore, that approach was ruled out.

Another measure of the health of your web application is given in server-side charts for HTTP response code counts and server latency, such as the ready-built charts provided by App Engine. However, in the event of a network problem your users may not be able to reach your server despite the fact that it is healthy. Therefore, it is important to measure the experience of the user population.

Collecting HTML-related performance metrics is another area that has benefited from new developments in JavaScript, specifically the Resource Timing candidate recommendation released by the W3C in 2017 and Navigation Timing Level 2, released in 2018. The Resource Timing interface surfaces metrics such as DNS lookup time, TLS negotiation time, and total page download time that were difficult or impossible to measure in JavaScript previously.
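
As a rough illustration of how these timings can be read, the sketch below derives three phase latencies from a single timing entry. The function name phaseLatencies and the returned field names are hypothetical and are not taken from the example project. The entry properties used here (domainLookupStart, secureConnectionStart, responseEnd, and so on) are the standard ones defined by the Resource Timing interface.

```javascript
// Sketch: derive per-phase latencies (in milliseconds) from a
// PerformanceNavigationTiming or PerformanceResourceTiming entry.
// The function and field names are illustrative assumptions.
function phaseLatencies(entry) {
  // Zero when the browser already had the IP address cached.
  const dns = entry.domainLookupEnd - entry.domainLookupStart;
  // secureConnectionStart is 0 when no TLS handshake took place,
  // for example on plain HTTP or on a reused connection.
  const tls = entry.secureConnectionStart > 0
    ? entry.connectEnd - entry.secureConnectionStart
    : 0;
  // Time from the start of the fetch to the last response byte.
  const total = entry.responseEnd - entry.startTime;
  return { dns, tls, total };
}

// In a browser, the entry for the page itself can be read like this.
if (typeof window !== "undefined" && window.performance) {
  const [nav] = performance.getEntriesByType("navigation");
  if (nav) console.log(phaseLatencies(nav));
}
```

The same function should work for entries returned by performance.getEntriesByType("resource"), since resource and navigation entries share these properties.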

A schematic diagram of the main components of the solution is shown below.

Architecture Schematic

Running the Example

The code and detailed instructions for running the example are in the README.md on GitHub. Once deployed, the application displays the very simple web page shown below.

Screenshot of Example Web Page

Timing statements for loading the page and the number of clicks of the ‘Click me’ button are recorded and sent to the server when the user clicks the ‘Flush’ button. Reloading the page refreshes the loading metrics and zeros the click count.

After you have deployed the app, try clicking the counter button and flushing the metrics a few times. Then go to the Stackdriver console and create the charts. It may take a few minutes for the metrics data to propagate. In the Stackdriver menu click Dashboards | Create Dashboard. Then in the new dashboard click Add Chart. Select the options shown in the screenshot below.

Adding a chart to the dashboard in Stackdriver

If you type part of the metric name, Stackdriver will help complete it with type-ahead. Select Resource type: global, Metric: webmetrics/latency, Group By: phase, client. Create another chart for webmetrics/click_count. After creating the dashboard, you should see something like the screenshot below.

Screenshot of Stackdriver Dashboard

Notice that the values for DNS lookup time and TLS connection time are mostly zero. That is an important piece of information in itself. DNS lookup time is zero when the browser uses a cached IP address matching the domain name. The TLS connection time may be zero if the browser already has an established connection with the server or if the protocol used is QUIC. App Engine and other Google services support QUIC, which is a UDP-based protocol that combines connection setup with encryption negotiation, so data can typically be sent without a separate TLS handshake. If you try from a variety of devices and browsers you should get a few non-zero values for DNS lookup and TLS connection time.

Breaking Down Web Metrics by Category

The example uses OpenCensus Tags to provide contextual information and group related metrics. The latency metrics use a ‘phase’ tag, which represents the different phases of an HTTPS request: DNS lookup, TLS connection establishment, and transfer of the payload. The ‘client’ tag is a placeholder for the kind of client sending the information, for example, a mobile client versus a web client. Alternatively, we could use the browser user agent for the client, which can be retrieved from the Express Request object. For example, replace the line

const valueWeb = "web";

in app.js with

const valueWeb = req.header("User-Agent");

After modifying the code the dashboard looks like the screenshot shown below.

Screenshot with Breakdown by User Agent

A minor problem that we see now is that the user agent strings for the different browsers are long and complex. It would be more pleasant to view the browser types and operating systems in a simpler form. That is not too difficult to achieve with some string parsing code that extracts the browser type and operating system.
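
A minimal sketch of such parsing code is shown below. The function name simplifyUserAgent and the label values are hypothetical, and the matching rules are a coarse heuristic rather than a complete classifier. For example, Chrome on iOS would be reported as Safari here. A maintained user agent parsing library may be a better choice in production.

```javascript
// Sketch: reduce a User-Agent string to coarse browser and OS labels.
// The names and label values are illustrative assumptions.
function simplifyUserAgent(ua) {
  const s = ua || "";

  // Order matters: Edge user agents also contain "Chrome", and
  // Chrome user agents also contain "Safari".
  let browser = "Other";
  if (/Edg(e|A|iOS)?\//.test(s)) browser = "Edge";
  else if (/Firefox\//.test(s)) browser = "Firefox";
  else if (/Chrome\//.test(s)) browser = "Chrome";
  else if (/Safari\//.test(s)) browser = "Safari";

  // Android user agents contain "Linux", and iOS user agents
  // contain "Mac OS X", so check the more specific names first.
  let os = "Other";
  if (/Android/.test(s)) os = "Android";
  else if (/iPhone|iPad|iPod/.test(s)) os = "iOS";
  else if (/Windows/.test(s)) os = "Windows";
  else if (/Mac OS X/.test(s)) os = "macOS";
  else if (/Linux/.test(s)) os = "Linux";

  return { browser, os };
}
```

On the server, the result could be used for the tag value in place of the raw header, for example simplifyUserAgent(req.header("User-Agent")).browser.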

We might also want to break performance and user engagement metrics down by geography since Internet service levels vary from area to area. Also, this will help to identify the areas of local network disruptions, such as packet loss. App Engine and Google Cloud Load Balancer (GCLB) provide additional HTTP headers that allow you to find the user’s geography in terms of country, region, and city. That information can be added using OpenCensus Tags for grouping in Stackdriver.
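
As a sketch of how the geography could be extracted on the server, the function below reads the X-AppEngine-Country, X-AppEngine-Region, and X-AppEngine-City request headers that App Engine adds. The function name geoTagsFromHeaders and the "unknown" fallback label are hypothetical. The exact placeholder values App Engine uses for unresolved locations are treated as an assumption here, so verify them against the App Engine documentation.

```javascript
// Sketch: build geography tag values from App Engine request headers.
// Expects a Node.js style headers object, where names are lowercased
// (for example, req.headers in Express).
function geoTagsFromHeaders(headers) {
  // Assumption: "ZZ", "?" and empty values mean the location is unknown.
  const clean = (value) =>
    (!value || value === "?" || value === "ZZ" ? "unknown" : value);
  return {
    country: clean(headers["x-appengine-country"]),
    region: clean(headers["x-appengine-region"]),
    city: clean(headers["x-appengine-city"]),
  };
}
```

Each of the returned values could then be set as an OpenCensus tag alongside the existing ‘phase’ and ‘client’ tags.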

You may want to investigate other attributes as well, such as payload size, protocol (HTTP, HTTPS, HTTP/2, or QUIC), and network type, each of which can have a big effect on latency. The payload size can be found from the browser JavaScript property PerformanceResourceTiming.transferSize. The protocol information can be found if using a Google Cloud Load Balancer with User Defined Headers. The tls_version header includes the TLS version or QUIC, if it is used. The experimental browser JavaScript Navigator.connection property returns a NetworkInformation object that gives network type, such as ‘cellular’, ‘wifi’, and ‘ethernet.’
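
The two browser-side attributes mentioned above can be gathered with a few lines of code. The sketch below is illustrative only. The function name clientAttributes and the returned field names are hypothetical, and because Navigator.connection is experimental, the code falls back to "unknown" when it is absent.

```javascript
// Sketch: collect payload size and network type for one timing entry.
// `entry` is a PerformanceResourceTiming-like object and `nav` is the
// browser's navigator object (passed in so the function is testable).
function clientAttributes(entry, nav) {
  const connection = nav && nav.connection;
  return {
    // transferSize is 0 for responses served from the local cache and
    // for cross-origin resources that do not send Timing-Allow-Origin.
    transferSize: entry.transferSize || 0,
    // connection.type is not available in every browser.
    networkType: (connection && connection.type) || "unknown",
  };
}
```

In a browser this might be called as clientAttributes(performance.getEntriesByType("navigation")[0], navigator).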

Error counts are another important metric to record, specifically HTTP errors in retrieving resources or in XMLHttpRequest calls, and errors in browser JavaScript execution. A sudden increase in error count indicates an application health problem requiring prompt attention.
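
One possible way to count these errors in the browser is sketched below. The createErrorCounter function and its count categories are hypothetical and are not part of the example project. The sketch relies on the standard behavior that script errors arrive as error events carrying a message, while failed resource loads dispatch plain error events that do not bubble and so need a capturing listener.

```javascript
// Sketch: count client-side errors by category.
// All names here are illustrative assumptions.
function createErrorCounter() {
  const counts = { script: 0, resource: 0, http: 0 };
  return {
    counts,
    // Attach to window. The capture flag (true) is needed because
    // resource load errors do not bubble up to window.
    attach(target) {
      target.addEventListener("error", (event) => {
        if (event && typeof event.message === "string") {
          counts.script += 1; // uncaught JavaScript error
        } else {
          counts.resource += 1; // failed image, script, or stylesheet load
        }
      }, true);
    },
    // Call with the status code of each XMLHttpRequest or fetch response.
    recordStatus(status) {
      if (status >= 400) counts.http += 1;
    },
  };
}
```

In a page, usage might look like const errors = createErrorCounter(); errors.attach(window); with errors.counts sent to the server together with the other metrics.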

Extending to More Complex Web Applications

The example web page is similar to web sites written in plain HTML or with server-side frameworks that generate HTML pages, like JavaServer Pages, Python with Django, or Go HTML templates. One approach for capturing more user metrics in these regular (multi-page) web applications is to use a DOM selector that returns multiple DOM elements.
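
A minimal sketch of that selector approach follows. The function name instrumentClicks and the choice of the element id (falling back to the tag name) as the counting key are assumptions made for illustration.

```javascript
// Sketch: count clicks on every element matched by a selector.
// `root` is usually `document`. `counts` is a plain object that
// accumulates click counts keyed by element id or tag name.
function instrumentClicks(root, selector, counts) {
  const elements = root.querySelectorAll(selector);
  elements.forEach((element) => {
    element.addEventListener("click", () => {
      const key = element.id || element.tagName;
      counts[key] = (counts[key] || 0) + 1;
    });
  });
  // Return how many elements were instrumented.
  return elements.length;
}
```

For example, instrumentClicks(document, "button, a", clickCounts) would instrument all buttons and links on the page with one call.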

The example code sends the analytics data back in an HTTP POST using the Fetch API. However, it may be more appropriate to use the Navigator.sendBeacon() API instead, which is specifically intended for sending small amounts of analytics data to the server. This can avoid potential problems when the document is unloading. Note that sendBeacon is not supported on all browsers yet.
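
A sketch of using sendBeacon with a fallback to fetch for browsers that lack it is shown below. The sendMetrics function and the "/metrics" URL in the usage note are hypothetical. Note also that a string passed to sendBeacon is sent as text/plain, so the server would need to accept that content type in addition to JSON.

```javascript
// Sketch: prefer navigator.sendBeacon and fall back to fetch.
// `nav` and `fetchFn` are passed in so the function can be tested
// outside a browser. All names here are illustrative assumptions.
function sendMetrics(url, metrics, nav, fetchFn) {
  const body = JSON.stringify(metrics);
  if (nav && typeof nav.sendBeacon === "function") {
    // sendBeacon returns true when the browser queued the data.
    // A string body is sent with a text/plain content type.
    if (nav.sendBeacon(url, body)) return "beacon";
  }
  // keepalive asks the browser to let the request outlive the page.
  const pending = fetchFn(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
    keepalive: true,
  });
  // Ignore delivery failures; metrics are best effort.
  Promise.resolve(pending).catch(() => {});
  return "fetch";
}
```

In a page this might be called as sendMetrics("/metrics", data, navigator, fetch), typically from a visibilitychange or pagehide handler.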

Single page applications (SPAs) using frameworks like AngularJS, React, or Vue.js may need a slightly different approach due to the dynamic nature of the generated DOM. For SPAs, the single HTML page is loaded only once. Changes to the views displayed to users in SPAs are driven by data transferred to and from the server via XMLHttpRequest (AJAX). In that context the relevant metric to record is the latency of the XMLHttpRequest calls. XMLHttpRequest latency is also captured in the Resource Timing API, so the same basic approach is still applicable. You may want to flush the data to the server on a regular heartbeat, so that an alert can be triggered as soon as a problem occurs.
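
The sketch below shows one way this could look: it filters Resource Timing entries down to XMLHttpRequest and fetch calls and collects them on a heartbeat. The function name xhrLatencies and the 60 second interval are assumptions for illustration. The initiatorType values "xmlhttprequest" and "fetch" are the ones defined by the Resource Timing interface.

```javascript
// Sketch: extract the latency of AJAX calls from Resource Timing entries.
function xhrLatencies(entries) {
  return entries
    .filter((entry) =>
      entry.initiatorType === "xmlhttprequest" ||
      entry.initiatorType === "fetch")
    .map((entry) => ({
      name: entry.name, // the requested URL
      latency: entry.responseEnd - entry.startTime, // milliseconds
    }));
}

// In a browser, collect new entries on a regular heartbeat.
if (typeof window !== "undefined" && window.performance) {
  setInterval(() => {
    const latencies = xhrLatencies(performance.getEntriesByType("resource"));
    // Clear the buffer so the next heartbeat only sees new entries.
    performance.clearResourceTimings();
    // Hand `latencies` to whatever upload function the application uses.
    if (latencies.length > 0) console.log(latencies);
  }, 60000);
}
```

A shorter interval gives faster alerting at the cost of more requests to the server.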

The Web Fundamentals articles User-centric Performance Metrics and Assessing Loading Performance in Real Life with Navigation and Resource Timing, and the Mozilla Developer Network article Using the Resource Timing API, give more details on these aspects of browser JavaScript coding.