zPages in OpenTelemetry

Janet Vu
Aug 28 · 9 min read

This post celebrates the accomplishments of interns working on OpenTelemetry-related projects this summer. zPages was developed as an experimental feature; it is aligned with the OpenTelemetry vision, but does not block GA deliverables. We hope to see zPages (not a final name) move from experimental status to a feature that delights OpenTelemetry users.

What are zPages?

zPages are in-process web pages that allow users to view diagnostics of the application without any telemetry backend. Because they are built-in, zPages have a couple of unique benefits for gaining insights into instrumented applications. Some zPages have already been created for OpenTelemetry (OTel) C# and the Go Collector, while Java and C++ have initial iterations that are complete (or nearly complete) and merged to their respective repository locations. There’s also an in-progress cross-language specification for zPages.

Ramping up development with zPages is quicker than installing external tracing systems (e.g., Jaeger and Zipkin), since zPages are lighter and don’t require a database. In addition, although zPages support only a limited set of scenarios, they can analyze more telemetry than external exporters, especially for high-throughput applications. This is because external exporters are typically configured to send only a subset of telemetry out of process for rich analysis, in order to save costs.

Proprietary solutions similar to zPages are frequently used at companies like Uber and Google. Spring Boot also has a similar feature called Production Ready Endpoints. Within the open-source community, zPages was first released through OpenCensus in the Java, Go, and Node.js language repositories. This post will explain what zPages are and share some technical nuances of developing zPages by comparing our experiences from the perspective of the authors of zPages for OTel Java and C++.

Screenshot of the OTel C++ Tracez zPage when viewed within a browser, which shows aggregation counts for two span names.

Types of zPages

At the moment, there are 4 types of zPages: Tracez, TraceConfigz, RPCz, and Statsz. Their uses are defined below:

  • Tracez aggregates the running and completed spans from the instrumented application and displays their data in a summary-level table. Users can also view details on sampled spans, which are limited within buckets for a given span name. Tracez typically only shows client and server spans.
The detailed sampled running span view for OTel C++ Tracez, which is shown when a running bucket is clicked.
Another detailed sampled span view, but for latency spans; clicking on error sampled spans shows a similar view as well.

Tracez Architecture

For all zPages, especially web page rendering logic, there is an overlap in architecture. For simplicity, we will primarily focus on Tracez — note that all zPages have similarities in their implementations. Tracez in OTel Java and C++ currently consists of three components (the span processor, data aggregator, and HTTP server), which are described as follows:

  • The span processor watches the lifecycle of each span, invoking functions each time a span starts or ends.

In general, Tracez needs to collect spans (running, error, and latency that are bucketed by completion time), store and group them by name in-memory, limit the number of spans per bucket for each name, and render them on a web page all in a scalable manner.

Diagram of the basic data flow from the three Tracez parts described above. The span processor collects span information from an application via the sampler.
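
The responsibilities described above (collect spans, group them by name in memory, bucket them, and cap the samples per bucket) can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the OTel Java or C++ implementation: the names TracezSketch, SpanRecord, NameSummary, onStart, and onEnd, the latency boundaries, and the cap of 5 samples per bucket are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the OpenTelemetry SDK.
class TracezSketch {
    // Assumed lower bounds (in microseconds) of each latency bucket.
    static final long[] LATENCY_BOUNDS_US = {0, 10, 100, 1_000, 10_000, 100_000};
    // Assumed cap on sampled spans kept per bucket.
    static final int MAX_SAMPLES_PER_BUCKET = 5;

    static final class SpanRecord {
        final String name;
        final long startUs;
        long endUs = -1;
        boolean error;
        SpanRecord(String name, long startUs) { this.name = name; this.startUs = startUs; }
    }

    // Per-span-name aggregation: counts plus a bounded set of sampled spans.
    static final class NameSummary {
        int running;
        int errors;
        final int[] latencyCounts = new int[LATENCY_BOUNDS_US.length];
        final Deque<SpanRecord> errorSamples = new ArrayDeque<>();
        final List<Deque<SpanRecord>> latencySamples = new ArrayList<>();
        NameSummary() {
            for (int i = 0; i < LATENCY_BOUNDS_US.length; i++) latencySamples.add(new ArrayDeque<SpanRecord>());
        }
    }

    private final Map<String, NameSummary> byName = new HashMap<>();

    synchronized NameSummary summary(String name) {
        NameSummary s = byName.get(name);
        if (s == null) { s = new NameSummary(); byName.put(name, s); }
        return s;
    }

    // Called when a span starts: it counts as running until it ends.
    synchronized SpanRecord onStart(String name, long startUs) {
        summary(name).running++;
        return new SpanRecord(name, startUs);
    }

    // Called when a span ends: move it to the error bucket or a latency bucket.
    synchronized void onEnd(SpanRecord span, long endUs, boolean error) {
        NameSummary s = summary(span.name);
        s.running--;
        span.endUs = endUs;
        span.error = error;
        if (error) {
            s.errors++;
            addCapped(s.errorSamples, span);
            return;
        }
        int bucket = bucketFor(endUs - span.startUs);
        s.latencyCounts[bucket]++;
        addCapped(s.latencySamples.get(bucket), span);
    }

    // Index of the last bucket whose lower bound is <= the latency.
    static int bucketFor(long latencyUs) {
        int bucket = 0;
        for (int i = 0; i < LATENCY_BOUNDS_US.length; i++) {
            if (latencyUs >= LATENCY_BOUNDS_US[i]) bucket = i;
        }
        return bucket;
    }

    // Keep only the newest MAX_SAMPLES_PER_BUCKET spans; counts are unaffected.
    private static void addCapped(Deque<SpanRecord> samples, SpanRecord span) {
        if (samples.size() == MAX_SAMPLES_PER_BUCKET) samples.removeFirst();
        samples.addLast(span);
    }
}
```

Note that the counts keep growing while the sampled spans stay bounded, which is what keeps memory use predictable for long-running applications.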

Implementation Variation

The difficulty of developing zPages for OTel depends on a few factors. The authors of Tracez in Java and C++ discovered this firsthand and had considerably different experiences creating Tracez, despite the concept itself being the same. Below, we look at the differences between Java and C++ to get more insight into what can make designing and implementing zPages so dissimilar across languages.

Language

Programming languages themselves shape how zPages and Tracez solutions are built; relevant factors include thread-safety support, HTTP server standardization, and the availability of data structures (particularly across language versions).

Programming languages can offer several levels of thread-safety support that change how challenging it is to create zPages. Java was fairly straightforward in this regard and only needed the synchronized keyword to create blocks of code that synchronize on a particular object. For C++, Tracez used locks as an initial solution; while also simple, this is not optimal and will need to be revisited for further optimization.
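
As a small illustration of the Java side, the sketch below guards a shared map with synchronized blocks on a single lock object. The class and method names (RunningSpanCounts, spanStarted, spanEnded, runningFor) are hypothetical and are not taken from the OTel Java code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the class and method names are illustrative only.
class RunningSpanCounts {
    private final Object lock = new Object();
    private final Map<String, Integer> running = new HashMap<>();

    void spanStarted(String name) {
        // Only one thread at a time may read-modify-write the map.
        synchronized (lock) {
            Integer current = running.get(name);
            running.put(name, current == null ? 1 : current + 1);
        }
    }

    void spanEnded(String name) {
        synchronized (lock) {
            Integer current = running.get(name);
            if (current != null && current > 0) running.put(name, current - 1);
        }
    }

    int runningFor(String name) {
        // Reads take the same lock so they never observe a half-finished update.
        synchronized (lock) {
            Integer current = running.get(name);
            return current == null ? 0 : current;
        }
    }
}
```

Every reader and writer contends for the same lock here, which is simple but serializes all access; that is the same kind of trade-off described for the initial C++ locking approach.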

Similarly, languages have various degrees of difficulty for setting up HTTP servers to display zPages, which may require external packages or custom solutions depending on the circumstances. In Java, most of the HTTP server was already available in the JDK, and it was built on the com.sun.net.httpserver package. By contrast, the C++ HTTP server used a custom class that will require changes to ensure application security and thread-safety.
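
To give a sense of how little setup the JDK's built-in server needs, here is a minimal sketch that serves a server-side rendered table from com.sun.net.httpserver. The /tracez path, the TracezPageServer class, and the table layout are assumptions made for illustration; this is not the OTel Java zPages server.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch: the path, class name, and page layout are assumptions.
class TracezPageServer {
    // Renders running-span counts per span name as a small HTML table.
    static String renderPage(Map<String, Integer> runningByName) {
        StringBuilder html = new StringBuilder("<table><tr><th>Span name</th><th>Running</th></tr>");
        for (Map.Entry<String, Integer> entry : runningByName.entrySet()) {
            html.append("<tr><td>").append(escape(entry.getKey())).append("</td><td>")
                .append(entry.getValue()).append("</td></tr>");
        }
        return html.append("</table>").toString();
    }

    // Span names are untrusted input, so escape the characters HTML treats specially.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // The caller is assumed to pass a thread-safe map (or a snapshot), since
    // requests are handled on a different thread from the one updating counts.
    static HttpServer start(int port, final Map<String, Integer> runningByName) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/tracez", new HttpHandler() {
            @Override
            public void handle(HttpExchange exchange) throws IOException {
                byte[] body = renderPage(runningByName).getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
                exchange.sendResponseHeaders(200, body.length);
                OutputStream out = exchange.getResponseBody();
                try { out.write(body); } finally { out.close(); }
            }
        });
        server.start();
        return server;
    }
}
```

Calling TracezPageServer.start(8080, counts) and opening http://localhost:8080/tracez in a browser would show the table; the port is arbitrary.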

OTel libraries are optimized to run on as many versions of a language as possible, which can add obstacles to zPages development. For instance, OTel Java uses Java 7, so streams and collectors were unavailable to use for creating aggregations. C++ couldn’t use read-write locks to allow concurrent reads due to the C++ version the OTel repository uses.
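
For example, counting spans per name is a one-liner with Java 8 streams (Collectors.groupingBy combined with Collectors.counting), but on Java 7 it has to be written as an explicit loop. The sketch below shows the Java 7 style; SpanNameCounter and countByName are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a Java 7 compatible aggregation (no streams or lambdas).
class SpanNameCounter {
    static Map<String, Integer> countByName(List<String> spanNames) {
        // TreeMap keeps names sorted, which gives the rendered table a stable order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : spanNames) {
            Integer seen = counts.get(name);
            counts.put(name, seen == null ? 1 : seen + 1);
        }
        return counts;
    }
}
```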

Literature

OpenTelemetry and zPages are generally new, so they lack some documentation and code examples. The amount of zPages documentation and code available in a specific language can also play a considerable role in difficulty. With more documentation being added for zPages and OTel, including this article, we anticipate that future planning and development will be smoother.

Limited open-source documentation and examples for zPages and Tracez created obstacles for both Java and C++, including having few references for installing and instrumenting applications using OpenTelemetry. To determine whether a feature was supported, both teams needed to search through their respective repositories for any relevant APIs or classes, and there weren’t design docs to reference. Creating useful demos and examples for zPages was also a challenge because of this.

Since Tracez had already been implemented in OpenCensus Java, that implementation substantially influenced design choices for the OTel Java zPages despite the different components. In contrast, zPages had never been implemented in C++ in a popular open-source library. Languages vary in features and best practices, so adding zPages in OTel C++ meant there was less guidance on language-specific best practices for solutions like thread-safety and HTTP servers.

Repository Maturity

OTel repositories for specific languages also have varying degrees of maturity. This determines how many missing or unstable parts there are, which influences design choices and ease of development.

OpenTelemetry Java was in beta during its zPages development. That meant that almost everything needed to implement Tracez was present, except for a few missing aspects like attributes. OTel C++ is currently pre-alpha, so many specification features (such as span and trace IDs) were unimplemented. Other workarounds were also required to fill in gaps; these included creating a custom thread-safe span data class to access running span data, and choosing not to render span events, since they weren’t stable at the time.

Other Design Possibilities

Another part that makes creating zPages interesting is the choices available for deciding what work is done across components and how that work is distributed. The two main distinct design choices between the OTel Java and C++ Tracez teams were 1) aggregation and pruning of spans (since it’s not practical to save every span in-memory) and 2) how data is rendered in the HTTP server. Approaches used by both teams are viable, and individual contexts for any given OTel repository should be considered when making decisions. This is not an exhaustive list of all the possible areas to innovate solutions, and we encourage exploration in other areas as well.

Java Tracez in OTel stores spans and imposes span sampling limits for the buckets within each span name, all within the span processor. This Tracez implementation also has a stateless aggregator that returns data in a more HTTP-server-friendly format when a user tries to view zPages, rather than doing so in a background thread. In contrast, C++ Tracez has both the span processor and data aggregator store spans. This stateful data aggregator runs in the background and takes ownership of all completed spans, organizes spans by name, increments bucket counts, and enforces span limits at periodic intervals.

There are trade-offs for both designs. OTel Java Tracez’s samples are updated as soon as they change, but they’re more prone to high memory usage. C++ benefits memory-wise and performance-wise from batching aggregation work, but may have more stale data. Implementations can also differ in the number of sampled spans they store; for example, Java keeps up to 16 and 8 spans for the latency and error buckets respectively, while C++ stores up to 5 for any given bucket.
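
The batching design can be illustrated with the following sketch, written in Java for consistency with the other examples rather than in C++. It is a simplified illustration under stated assumptions, not the OTel C++ aggregator: BatchingAggregator, CompletedSpan, offer, aggregateOnce, and startBackground are hypothetical names, latency buckets are omitted, and the cap of 5 simply mirrors the figure mentioned above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a stateful aggregator that works in periodic batches.
class BatchingAggregator {
    static final int MAX_SAMPLES = 5; // assumed per-name cap for this sketch

    static final class CompletedSpan {
        final String name;
        final long latencyUs;
        CompletedSpan(String name, long latencyUs) { this.name = name; this.latencyUs = latencyUs; }
    }

    private final Object pendingLock = new Object();
    private List<CompletedSpan> pending = new ArrayList<>();

    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, Deque<CompletedSpan>> samples = new HashMap<>();

    // Called by the span processor when a span ends; it only appends, so it stays cheap.
    void offer(CompletedSpan span) {
        synchronized (pendingLock) { pending.add(span); }
    }

    // Takes ownership of everything pending, then aggregates away from the hot path.
    synchronized void aggregateOnce() {
        List<CompletedSpan> batch;
        synchronized (pendingLock) {
            batch = pending;
            pending = new ArrayList<>();
        }
        for (CompletedSpan span : batch) {
            Integer seen = counts.get(span.name);
            counts.put(span.name, seen == null ? 1 : seen + 1);
            Deque<CompletedSpan> kept = samples.get(span.name);
            if (kept == null) { kept = new ArrayDeque<>(); samples.put(span.name, kept); }
            if (kept.size() == MAX_SAMPLES) kept.removeFirst(); // enforce the span limit
            kept.addLast(span);
        }
    }

    synchronized int countFor(String name) {
        Integer count = counts.get(name);
        return count == null ? 0 : count;
    }

    synchronized int samplesFor(String name) {
        Deque<CompletedSpan> kept = samples.get(name);
        return kept == null ? 0 : kept.size();
    }

    // Runs aggregateOnce at a fixed period; the caller should shut the executor down.
    ScheduledExecutorService startBackground(long periodMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() { aggregateOnce(); }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

Until the next aggregateOnce call, countFor still reports the old totals, which is the stale-data cost of batching; in exchange, ending a span costs only a list append.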

Diagram of the OTel Java Tracez HTTP server web page rendering logic.
Diagram of the OTel C++ Tracez HTTP server web page rendering logic. The data and UI layers are separated here.

With regard to the HTTP server, the OTel Java zPages follows the example of OpenCensus Java, with server-side rendered HTML displaying the data directly. C++ zPages differs by serving static HTML and adding the data to the HTML DOM through JavaScript and client-side rendering; this is done by adding a REST API that separates the data and UI layers. With this approach, the HTTP server sends either static files or the stored aggregation data in JSON form depending on the URL endpoint hit, which also lets users view the raw data in a readable format if desired. Another benefit of client-side rendering is that static file logic is easier to reason about and share across OTel repositories. A trade-off is that extra computation is required to translate data to JSON strings, which potentially adds a dependency. Browser-wise, a purely server-side approach has quicker initial renders of the web pages and doesn’t require JavaScript. Adding client-side functionality means JavaScript is required within the browser, but viewing sampled spans is quicker because static files and the entire DOM aren’t re-rendered.
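
The separation of data and UI layers can be sketched as a router that returns either a static page or JSON depending on the path. The example is in Java for consistency with the other sketches and does not reproduce the OTel C++ server: the endpoint paths, TracezRouter, and the page contents are hypothetical, and the JSON is built by hand only to keep the sketch free of dependencies.

```java
import java.util.Map;

// Hypothetical sketch: the endpoint paths and names are assumptions.
class TracezRouter {
    static final class Response {
        final int status;
        final String contentType;
        final String body;
        Response(int status, String contentType, String body) {
            this.status = status;
            this.contentType = contentType;
            this.body = body;
        }
    }

    // Static UI shell; a script on the page would fetch the JSON and fill the DOM.
    static final String INDEX_HTML =
        "<!doctype html><title>Tracez</title><div id=\"app\"></div>"
        + "<script src=\"/tracez/script.js\"></script>";

    static Response route(String path, Map<String, Integer> countsByName) {
        if (path.equals("/tracez/get/aggregations")) {
            return new Response(200, "application/json", toJson(countsByName)); // data layer
        }
        if (path.equals("/tracez")) {
            return new Response(200, "text/html", INDEX_HTML); // UI layer
        }
        return new Response(404, "text/plain", "not found");
    }

    static String toJson(Map<String, Integer> countsByName) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Integer> entry : countsByName.entrySet()) {
            if (!first) json.append(',');
            first = false;
            json.append('"').append(escapeJson(entry.getKey())).append("\":").append(entry.getValue());
        }
        return json.append('}').toString();
    }

    // Escapes quotes, backslashes, and control characters inside a JSON string.
    static String escapeJson(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"' || c == '\\') out.append('\\').append(c);
            else if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
            else out.append(c);
        }
        return out.toString();
    }
}
```

Because the JSON endpoint is independent of the page, the same static files could in principle be reused by another language's server that exposes data in the same shape.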

Try it out

If you want to try running the OTel C++ Tracez example pictured in the article, instructions can be found here. You can also try Java OTel’s Tracez and TraceConfigz solution, or read its in-depth walkthrough on the Google OSS blog. You can also try RPCz in OTel C#.

Conclusion

With scalable observability more important than ever for ensuring the reliability and high performance of large systems, solutions similar to zPages are widely used in the industry to gain application insights, thanks to their ease of use and unique properties. zPages are a useful, technically challenging, and highly visible feature, and we strongly encourage their implementation within all OTel language repositories. Thanks for reading, and we hope you enjoyed learning about the uses and challenges of zPages.

If you’re interested in contributing to zPages, you can read the experimental specification and start a pull request in any OTel language repository today!

Thank you to all who helped us this summer. Thank you Sergey Kanzhelev, Amelia Mango, and Morgan McLean for reviewing and editing this article! Also, a special shoutout to Sergey for his support with the zPages spec.

Authors

OTel Java zPages Interns: William Hu (Yale University, LinkedIn) and Terry Wang (University of Waterloo, LinkedIn).
OTel C++ zPages Interns: Janet Vu (University of Michigan, LinkedIn) and Keshav Manghat (South Dakota School of Mines, LinkedIn).

OpenTelemetry

OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software, and is the next major version of both OpenTracing and OpenCensus.