This post celebrates the accomplishments of interns working on OpenTelemetry-related projects this summer. zPages was developed as an experimental feature; it is aligned with the OpenTelemetry vision but does not block GA deliverables. We hope to see zPages (not a final name) move from experimental status to a feature that delights OpenTelemetry users.
What are zPages?
zPages are in-process web pages that allow users to view diagnostics of the application without any telemetry backend. Because they are built-in, zPages have a couple of unique benefits for gaining insights into instrumented applications. Some zPages have already been created for OpenTelemetry (OTel) C# and the Go Collector, while Java and C++ have initial iterations that are complete (or nearly complete) and merged to their respective repository locations. There’s also an in-progress cross-language specification for zPages.
Ramping up development with zPages is quicker than installing external tracing systems (e.g., Jaeger and Zipkin), since zPages are lighter and don’t require a database. In addition, for the limited set of scenarios they support, zPages can analyze more telemetry than external exporters, especially in high-throughput applications. This is because external exporters are typically configured to send only a subset of telemetry out of process for richer analysis, which saves costs.
Proprietary solutions similar to zPages are frequently used at companies like Uber and Google. Spring Boot also has a similar feature called Production Ready Endpoints. Within the open-source community, zPages was first released through OpenCensus in the Java, Go, and Node.js language repositories. This post explains what zPages are and shares some technical nuances of developing them by comparing the experiences of the authors of zPages for OTel Java and C++.
Types of zPages
At the moment, there are 4 types of zPages: Tracez, TraceConfigz, RPCz, and Statsz. Their uses are defined below:
- Tracez aggregates the running and completed spans from the instrumented application and displays their data in a summary-level table. Users can also view details on sampled spans, which are limited within buckets for a given span name. Tracez typically only shows client and server spans.
- TraceConfigz displays the currently active tracing configuration and allows users to change the tracing parameters in real-time.
- RPCz shows details on sent and received gRPC messages for internal spans, which are categorized by RPC method. This includes overall and error counts, average latency per call, RPCs sent per second, and input/output size per second.
- Statsz is used for displaying metrics and measures for exported views. These views are grouped into directories using their namespaces.
All zPages overlap in architecture, especially in their web page rendering logic. For simplicity, we will primarily focus on Tracez, but the other zPages are implemented in similar ways. Tracez in OTel Java and C++ currently consists of three components (the span processor, data aggregator, and HTTP server), which are described as follows:
- The span processor watches the lifecycle of each span, invoking functions each time a span starts or ends.
- The data aggregator filters and aggregates the data from the span processor into an accessible format for the UI to display.
- The HTTP server is responsible for listening to incoming requests, obtaining requested data, and rendering that information into a web-compatible format.
In general, Tracez needs to collect spans (running spans, error spans, and completed spans bucketed by how long they took to finish), store and group them by name in memory, limit the number of spans per bucket for each name, and render them on a web page, all in a scalable manner.
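The latency bucketing described above can be sketched as follows. This is a minimal illustration, not the code used in either repository: the class name `LatencyBuckets` and the power-of-ten boundaries are assumptions chosen for the example.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that maps a completed span's latency to a bucket index. */
final class LatencyBuckets {
  // Exclusive upper bounds in nanoseconds. These boundaries are assumed for
  // illustration; a real implementation may choose different ones.
  private static final long[] UPPER_BOUNDS_NANOS = {
    TimeUnit.MICROSECONDS.toNanos(10),
    TimeUnit.MICROSECONDS.toNanos(100),
    TimeUnit.MILLISECONDS.toNanos(1),
    TimeUnit.MILLISECONDS.toNanos(10),
    TimeUnit.MILLISECONDS.toNanos(100),
    TimeUnit.SECONDS.toNanos(1),
    TimeUnit.SECONDS.toNanos(10),
    TimeUnit.SECONDS.toNanos(100),
  };

  private LatencyBuckets() {}

  /** One bucket per bound, plus a final bucket for everything slower. */
  static int bucketCount() {
    return UPPER_BOUNDS_NANOS.length + 1;
  }

  /** Returns the index of the first bucket whose upper bound exceeds the latency. */
  static int bucketFor(long latencyNanos) {
    for (int i = 0; i < UPPER_BOUNDS_NANOS.length; i++) {
      if (latencyNanos < UPPER_BOUNDS_NANOS[i]) {
        return i;
      }
    }
    return UPPER_BOUNDS_NANOS.length;
  }
}
```

A span processor could call `LatencyBuckets.bucketFor(endNanos - startNanos)` when a span ends to decide which per-name bucket the span belongs in.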
When developing zPages for OTel, the difficulty varies with a few factors. The authors of Tracez in Java and C++ had considerably different experiences creating it, even though the concept is the same. We’ll look at the differences between Java and C++ to understand what can make designing and implementing zPages so dissimilar across languages.
Generally, programming languages themselves present interesting factors for contributing zPage and Tracez solutions; this includes thread-safety, HTTP server standardization, and other data structure availability (particularly with versioning).
Programming languages offer different levels of thread-safety support, which changes how challenging it is to create zPages. Java was fairly straightforward in this regard and only needed the synchronized keyword to create blocks of code that synchronize on a particular object. For C++, Tracez used locks as an initial solution. While this is also simple, it is not optimal and will need to be revisited for further optimization.
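As a rough illustration of the Java approach, the sketch below guards a shared list of running span names with synchronized blocks on a private lock object. The class `RunningSpanNames` and its methods are hypothetical and are not taken from the OTel Java code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical store of running span names, shared between threads. */
final class RunningSpanNames {
  private final Object lock = new Object();
  private final List<String> names = new ArrayList<String>();

  /** Called when a span starts; may run on any application thread. */
  void onStart(String spanName) {
    synchronized (lock) {
      names.add(spanName);
    }
  }

  /** Called when a span ends; removes one occurrence of the name. */
  void onEnd(String spanName) {
    synchronized (lock) {
      names.remove(spanName);
    }
  }

  /** Returns a copy so the caller can render it without holding the lock. */
  List<String> snapshot() {
    synchronized (lock) {
      return new ArrayList<String>(names);
    }
  }
}
```

Returning a copy from `snapshot()` keeps the critical section short, at the cost of an allocation on every page view.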
Similarly, languages differ in how difficult it is to set up an HTTP server to display zPages, which may require external packages or custom solutions depending on the circumstances. In Java, the HTTP server was mostly available out of the box and was built on the JDK’s com.sun.net.httpserver package. By contrast, the C++ HTTP server used a custom class that will require changes to ensure application security and thread-safety.
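For a sense of how little setup the JDK server needs, here is a minimal sketch that serves a static page with com.sun.net.httpserver. The handler class, the `/tracez` path, and the page contents are assumptions for illustration; a real handler would render aggregated span data.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.Charset;

/** Hypothetical handler that returns a placeholder Tracez page. */
final class TracezPageHandler implements HttpHandler {
  /** Stand-in for the real rendering logic. */
  static String renderPage() {
    return "<html><body><h1>Tracez</h1></body></html>";
  }

  @Override
  public void handle(HttpExchange exchange) throws IOException {
    byte[] body = renderPage().getBytes(Charset.forName("UTF-8"));
    exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
    exchange.sendResponseHeaders(200, body.length);
    OutputStream out = exchange.getResponseBody();
    try {
      out.write(body);
    } finally {
      out.close();
    }
  }
}

/** Hypothetical entry point that wires the handler into the JDK server. */
final class ZPagesServerSketch {
  static HttpServer start(int port) throws IOException {
    // Bind to localhost only, since diagnostics pages can expose internal data.
    HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
    server.createContext("/tracez", new TracezPageHandler());
    server.start();
    return server;
  }
}
```

Calling `ZPagesServerSketch.start(8080)` and opening `http://localhost:8080/tracez` in a browser would show the placeholder page; `server.stop(0)` shuts it down.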
OTel libraries are designed to run on as many versions of a language as possible, which can add obstacles to zPages development. For instance, OTel Java targets Java 7, so streams and collectors (both introduced in Java 8) were unavailable for building aggregations. Similarly, the C++ version used by the OTel repository meant that C++ Tracez couldn’t use read-write locks to allow concurrent reads.
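Without streams, a grouping step has to be written as an explicit loop. The sketch below counts spans per name using only Java 7 features; `SpanCounts` is a hypothetical name, and the real aggregator works on span objects rather than plain strings.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical Java 7 compatible aggregation: counts spans per span name. */
final class SpanCounts {
  private SpanCounts() {}

  static Map<String, Integer> countByName(List<String> spanNames) {
    // TreeMap keeps names sorted, which is convenient for rendering a table.
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    for (String name : spanNames) {
      Integer current = counts.get(name);
      counts.put(name, current == null ? 1 : current + 1);
    }
    return counts;
  }
}
```

On Java 8 or later, the same result could be expressed with `Collectors.groupingBy` and `Collectors.counting`, which is the kind of convenience the Java team had to do without.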
OpenTelemetry and zPages are both relatively new, so they lack some documentation and code examples. The amount of zPages documentation and code available in a specific language can also play a considerable role in difficulty. With more documentation being added for zPages and OTel, including this article, we anticipate that future planning and development will be smoother.
For both Java and C++, the limited amount of existing open-source documentation and examples for zPages and Tracez created obstacles. There were few references for installing OpenTelemetry and instrumenting applications with it. To determine whether a feature was supported, both teams needed to search through their respective repositories for relevant APIs or classes, and there weren’t design docs to reference. This also made it challenging to create useful demos and examples for zPages.
Since Tracez had already been implemented in Java for OpenCensus, that implementation substantially influenced design choices for the OTel Java zPages, despite the different components. By contrast, zPages had never been implemented in C++ in a popular open-source library. Languages vary in features and best practices, so adding zPages in OTel C++ meant there was less guidance on language-specific best practices for concerns like thread-safety and HTTP servers.
OTel repositories for specific languages also have varying degrees of maturity. This determines how many missing or unstable parts there are, which influence design choices and ease of development.
OpenTelemetry Java was in beta during its zPages development. That meant that almost everything needed to implement Tracez was present, except for a few missing aspects like attributes. OTel C++ was pre-alpha during development (and still is at the time of writing), so many specification features (such as span and trace IDs) were unimplemented. Other workarounds were also required to fill in gaps; these included creating a custom thread-safe span data class to access running span data, and choosing not to render span events, since they weren’t stable at the time.
Other Design Possibilities
Another part that makes creating zPages interesting is the choices available for deciding what work is done across components and how that work is distributed. The two main distinct design choices between the OTel Java and C++ Tracez teams were 1) aggregation and pruning of spans (since it’s not practical to save every span in-memory) and 2) how data is rendered in the HTTP server. Approaches used by both teams are viable, and individual contexts for any given OTel repository should be considered when making decisions. This is not an exhaustive list of all the possible areas to innovate solutions, and we encourage exploration in other areas as well.
Java Tracez in OTel stores spans and imposes sampling limits for the buckets within each span name, all within the span processor. This Tracez implementation also has a stateless aggregator that returns data in a more HTTP-server-friendly format when a user tries to view zPages, rather than doing so in a background thread. By contrast, C++ Tracez has both the span processor and the data aggregator store spans. This stateful data aggregator runs in the background and, at periodic intervals, takes ownership of all completed spans, organizes spans by name, increments bucket counts, and enforces span limits.
There are trade-offs for both designs. OTel Java Tracez’s samples are updated as soon as they change, but they’re more prone to high memory usage. C++ benefits in memory and performance from batching aggregation work, but may show more stale data. Implementations can also differ in the number of sampled spans they store: for example, Java keeps up to 16 and 8 spans for each latency and error bucket respectively, while C++ stores up to 5 for any given bucket.
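A bounded sample bucket of the kind described above might look like the following sketch. The class `SampleBucket` is hypothetical, and dropping the oldest sample when the bucket is full is an assumed eviction policy, not necessarily what either implementation does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bucket that keeps at most a fixed number of sampled span IDs. */
final class SampleBucket {
  private final int capacity;
  private final ArrayDeque<String> spanIds = new ArrayDeque<String>();

  SampleBucket(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Adds a sample, evicting the oldest one if the bucket is full (assumed policy). */
  synchronized void add(String spanId) {
    if (spanIds.size() == capacity) {
      spanIds.removeFirst();
    }
    spanIds.addLast(spanId);
  }

  /** Returns the current samples, oldest first. */
  synchronized List<String> snapshot() {
    return new ArrayList<String>(spanIds);
  }
}
```

With the limits mentioned in the text, a Java-style latency bucket would be created as `new SampleBucket(16)`, an error bucket as `new SampleBucket(8)`, and a C++-style bucket would correspond to a capacity of 5.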
Try it out
If you want to try running the OTel C++ Tracez example pictured in the article, instructions can be found here. You can also try Java OTel’s Tracez and TraceConfigz solution, or read its in-depth walkthrough on the Google OSS blog. You can also try RPCz in OTel C#.
With scalable observability more important than ever for ensuring the reliability and high performance of large systems, solutions similar to zPages are widely used in the industry for their ease of use and unique properties. zPages are a useful, technically challenging, and highly visible feature, and we strongly encourage their implementation within all OTel language repositories. Thanks for reading, and we hope you enjoyed learning about the uses and challenges of zPages.
If you’re interested in contributing through zPages, you can read its experimental specification and start a pull request in any OTel language repository today!
Thank you to all who helped us this summer. Thank you Sergey Kanzhelev, Amelia Mango, and Morgan McLean for reviewing and editing this article! Also, a special shoutout to Sergey for his support with the zPages spec.