Web Metrics with OpenCensus and Stackdriver
Collecting metrics helps web application owners understand performance, user experience, and user engagement. There are three main use cases and three corresponding flavors of metrics:
- The health of the business: indicated by business and user engagement metrics such as the number of active users and ‘conversions’ (for example, new registrations or online sales), on a day-to-day or month-to-month timeframe
- Performance as experienced by the developer: a developer’s detailed view of latency and resource usage, observed from their own perspective as an application user before the code is released
- Operational health of the application: real-time latency as experienced by the user population (for example, resource loading times for users in different geographies) and aggregate counts of errors and user-driven events, on a minute-to-minute basis
The third use case is the focus of this post.
Examples of existing fully managed solutions for the first use case include Google Analytics and the Google Marketing Platform, which include powerful features for analyzing business metrics. The second use case is addressed by tools such as Chrome Developer Tools. The approach described here for the third use case gives you full control over the collection, storage, presentation, and analysis of metrics data for the population of users in near real time. For example, you can configure Stackdriver alerts to page on-call Site Reliability Engineering (SRE) staff when increased client errors or elevated latency are detected.
If all this seems like overkill for your application, you may want to try an uptime service such as Stackdriver uptime checks.
In the example, the server runs Express and can be deployed to App Engine Flex or anywhere else Express is supported. The OpenCensus APIs can also be used in other computing environments, including other clouds, and with monitoring backends other than Stackdriver. The example code uses the @opencensus/core and @opencensus/exporter-stackdriver Node.js modules.
Another measure of the health of your web application is given in server-side charts for HTTP response code counts and server latency, such as the ready-built charts provided by App Engine. However, in the event of a network problem your users may not be able to reach your server despite the fact that it is healthy. Therefore, it is important to measure the experience of the user population.
A schematic diagram of the main components of the solution is shown below.
Running the Example
The code and detailed instructions for running the example are in the README.md on GitHub. Once deployed, the application displays the very simple web page shown below.
Timing statements for loading the page and the number of clicks of the ‘Click me’ button are recorded and sent to the server when the user clicks the ‘Flush’ button. Reloading the page refreshes the loading metrics and zeros the click count.
After you have deployed the app, try clicking the ‘Click me’ button and then the ‘Flush’ button a few times. Then go to the Stackdriver console and create the charts. It may take a few minutes for the metrics data to propagate. In the Stackdriver menu click Dashboards | Create Dashboard. Then in the new dashboard click Add Chart. Select the options shown in the screenshot below.
If you type part of the metric name, Stackdriver will help complete it with type-ahead. Select Resource type: global, Metric: webmetrics/latency, Group By: phase, client. Create another chart for webmetrics/click_count. After creating the dashboard, you should see something like the screenshot below.
Notice that the values for DNS lookup time and TLS connection time are mostly zero. That is an important piece of information in itself. DNS lookup time is zero when the browser uses a cached IP address matching the domain name. The TLS connection time may be zero if the browser already has an established connection with the server or if the protocol used is QUIC. App Engine and other Google services support QUIC, a UDP-based protocol in which data can typically be sent without waiting for a separate connection handshake. If you try from a variety of devices and browsers, you should get a few non-zero values for DNS lookup and TLS connection time.
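To make the phase values concrete, here is a minimal sketch of how the three phases could be derived from a browser timing entry. The property names come from the Resource Timing API; the function name phaseTimings and the shape of its result are assumptions for illustration and are not taken from the example repository.

```javascript
// Derive per-phase durations (in milliseconds) from a Navigation Timing or
// Resource Timing entry. The function and result names are illustrative.
function phaseTimings(entry) {
  // DNS lookup is zero when the browser reuses a cached address.
  const dns = entry.domainLookupEnd - entry.domainLookupStart;

  // secureConnectionStart is 0 when no new TLS handshake took place,
  // for example when an existing connection is reused.
  const tls = entry.secureConnectionStart > 0
    ? entry.connectEnd - entry.secureConnectionStart
    : 0;

  // Payload transfer: first byte of the response to the last byte.
  const transfer = entry.responseEnd - entry.responseStart;

  return { dns, tls, transfer };
}

// In a browser, the entry for the page itself can be read like this:
//   const [nav] = performance.getEntriesByType('navigation');
//   console.log(phaseTimings(nav));
```

A reused connection shows up as equal start and end timestamps, which is why the DNS and TLS values on the chart are often zero.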
Breaking Down Web Metrics by Category
The example uses OpenCensus Tags to provide contextual information and group related metrics. The latency metrics use a ‘phase’ tag, which represents the different phases of an HTTPS request: DNS lookup, TLS connection establishment, and transfer of the payload. The ‘client’ tag is a placeholder for the kind of client sending the information, for example, a mobile client versus a web client. Alternatively, we could use the browser user agent for the client, which can be retrieved from the Express Request object. For example, replace the line
const valueWeb = "web";
in app.js with
const valueWeb = req.header("User-Agent");
After modifying the code the dashboard looks like the screenshot shown below.
A minor problem that we see now is that the user agent strings for the different browsers are complex. It would be more pleasant to view the browser types and operating systems in a simpler form. That is not too difficult to achieve with some string parsing code that extracts the browser type and operating system.
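As a rough sketch of that string parsing, the function below reduces a user agent string to a short ‘browser/os’ label suitable for a tag value. The function name simplifyUserAgent, the label format, and the rule lists are assumptions for illustration. User agent sniffing is inherently approximate, so treat the result as a coarse grouping rather than a reliable identification.

```javascript
// Reduce a User-Agent header to a coarse "browser/os" label.
// Rule order matters: Edge and Opera also contain "Chrome/" and "Safari/",
// and Chrome also contains "Safari/", so the more specific tokens come first.
function simplifyUserAgent(ua) {
  if (!ua) return 'unknown/unknown';

  const browsers = [
    ['Edge', /Edg(e|A|iOS)?\//],
    ['Opera', /OPR\//],
    ['Firefox', /Firefox\//],
    ['Chrome', /(Chrome|CriOS)\//],
    ['Safari', /Safari\//],
  ];
  // Android and iOS are checked before Linux and macOS because their
  // user agents also mention "Linux" and "Mac OS X" respectively.
  const systems = [
    ['Windows', /Windows/],
    ['Android', /Android/],
    ['iOS', /iPhone|iPad|iPod/],
    ['macOS', /Mac OS X/],
    ['Linux', /Linux/],
  ];

  const pick = (rules) => {
    const hit = rules.find(([, pattern]) => pattern.test(ua));
    return hit ? hit[0] : 'other';
  };
  return `${pick(browsers)}/${pick(systems)}`;
}

// Possible use in app.js (hypothetical wiring):
//   const valueWeb = simplifyUserAgent(req.header("User-Agent"));
```

Keeping the set of labels small also limits the number of distinct time series that the ‘client’ tag produces.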
We might also want to break performance and user engagement metrics down by geography, since Internet service levels vary from area to area. This also helps to identify areas with local network disruptions, such as packet loss. App Engine and Google Cloud Load Balancing (GCLB) provide additional HTTP headers that allow you to find the user’s geography in terms of country, region, and city. That information can be added using OpenCensus Tags for grouping in Stackdriver.
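The sketch below shows one way to turn such headers into tag values. It assumes the App Engine header names X-AppEngine-Country, X-AppEngine-Region, and X-AppEngine-City; check the documentation for your serving environment, since a load balancer may need custom headers configured and may use different names. The function name geoFromHeaders and the ‘unknown’ fallback are illustrative choices.

```javascript
// Assumed header names (lower-cased, as Node.js presents them in req.headers).
const GEO_HEADERS = {
  country: 'x-appengine-country',
  region: 'x-appengine-region',
  city: 'x-appengine-city',
};

// Build a { country, region, city } object from request headers,
// falling back to 'unknown' when a header is missing or empty.
function geoFromHeaders(headers) {
  const geo = {};
  for (const [tag, header] of Object.entries(GEO_HEADERS)) {
    const value = headers[header];
    geo[tag] = typeof value === 'string' && value !== '' ? value : 'unknown';
  }
  return geo;
}

// In an Express handler (hypothetical wiring):
//   const geo = geoFromHeaders(req.headers);
```

Each of the three values could then be set on its own OpenCensus tag key alongside ‘phase’ and ‘client’. City-level tags can produce a very large number of distinct series, so country or region may be a more practical starting point.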
Extending to More Complex Web Applications
The example web page is similar to web sites written in plain HTML or with server-side frameworks that generate HTML pages, such as JavaServer Pages, Python with Django, or Go HTML templates. One approach for capturing more user metrics in these regular (multi-page) web applications is to use a DOM selector that returns multiple DOM elements.
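A minimal sketch of the selector approach follows. It assumes that the elements of interest are marked with a data-metric attribute; that attribute name, the function name attachClickCounters, and the shape of the counts object are illustrative and not part of the example code.

```javascript
// Count clicks on every element that carries a data-metric attribute.
// `root` is anything with querySelectorAll, typically `document`.
function attachClickCounters(root, counts = {}) {
  root.querySelectorAll('[data-metric]').forEach((el) => {
    const name = el.dataset.metric; // value of the data-metric attribute
    counts[name] = counts[name] || 0;
    el.addEventListener('click', () => {
      counts[name] += 1;
    });
  });
  return counts;
}

// In a browser, with markup such as <button data-metric="buy">Buy</button>:
//   const counts = attachClickCounters(document);
//   // later, send `counts` to the server together with the latency data
```

Because the selector matches any number of elements, new buttons or links can be instrumented by adding the attribute in the HTML template, without changing the script.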
The example code sends the analytics data back in an HTTP POST using the Fetch API. However, it may be more appropriate to use the Navigator.sendBeacon() API instead, which is specifically intended for sending small amounts of analytics data to the server. This can avoid potential problems when the document is unloading. Note that sendBeacon is not supported on all browsers yet.
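One way to combine the two is to prefer sendBeacon and fall back to fetch when it is unavailable or refuses the data. The sketch below is an illustration under assumptions: the function name sendMetrics is hypothetical, and the navigator and fetch objects are passed as parameters only so that the function can be exercised outside a browser.

```javascript
// Send metrics with sendBeacon when possible, otherwise with fetch.
// Returns which transport was used: 'beacon' or 'fetch'.
function sendMetrics(url, metrics, nav = globalThis.navigator, fetchFn = globalThis.fetch) {
  const body = JSON.stringify(metrics);

  if (nav && typeof nav.sendBeacon === 'function') {
    // sendBeacon returns false if the browser could not queue the data.
    // A string body is sent with a text/plain content type, so the server
    // must be prepared to parse JSON from a text body.
    if (nav.sendBeacon(url, body)) return 'beacon';
  }

  // keepalive asks the browser to let the request outlive the page.
  Promise.resolve(fetchFn(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body,
    keepalive: true,
  })).catch(() => { /* metrics are best effort; ignore network errors */ });
  return 'fetch';
}

// In a browser, with a hypothetical '/metrics' endpoint:
//   sendMetrics('/metrics', { click_count: 3 });
```

Note that the two transports arrive with different content types in this sketch, so the server-side handler needs to accept both.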
Single page applications (SPAs) using frameworks like AngularJS, React, or Vue.js may need a slightly different approach due to the dynamic nature of the generated DOM. For SPAs, the single HTML page is loaded only once. The changing views displayed to users are driven by data transferred to and from the server via XMLHttpRequest (AJAX). In that context, the relevant metric to record is the latency of the XMLHttpRequest calls. XMLHttpRequest latency is also captured by the Resource Timing API, so the same basic approach still applies. You may want to flush the data to the server on a regular heartbeat, so that an alert can be triggered as soon as a problem occurs.
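The following sketch illustrates both ideas: picking the XMLHttpRequest and fetch entries out of the Resource Timing data, and flushing only the new ones on each heartbeat. The function names xhrLatencies and makeHeartbeat, the ten second interval, and the ‘/metrics’ endpoint in the usage comment are assumptions for illustration.

```javascript
// Keep only entries initiated by XMLHttpRequest or fetch and reduce them
// to the URL and the total duration in milliseconds.
function xhrLatencies(entries) {
  return entries
    .filter((e) => e.initiatorType === 'xmlhttprequest' || e.initiatorType === 'fetch')
    .map((e) => ({ url: e.name, latencyMs: e.duration }));
}

// Return a function that, each time it is called, flushes the latencies of
// entries that have appeared since the previous call.
function makeHeartbeat(getEntries, flush) {
  let sent = 0; // number of entries already examined
  return function tick() {
    const entries = getEntries();
    if (entries.length < sent) sent = 0; // the timing buffer was cleared
    const fresh = xhrLatencies(entries.slice(sent));
    sent = entries.length;
    if (fresh.length > 0) flush(fresh);
  };
}

// In a browser (hypothetical wiring):
//   const tick = makeHeartbeat(
//     () => performance.getEntriesByType('resource'),
//     (metrics) => navigator.sendBeacon('/metrics', JSON.stringify(metrics)));
//   setInterval(tick, 10000);
```

Browsers cap the number of resource timing entries they keep, so a long-lived SPA may also need to clear or enlarge that buffer; the sketch only handles the case where the buffer shrinks between ticks.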