Logging, tracing and other heavy lifting

Arjun Dutt
Towards Application Data Monitoring
2 min read · Dec 28, 2020

For microservices-based systems running at scale, observability is a must. From searching logs for insightful patterns to visualizing performance metrics to drilling into individual traces, there are several distinct needs, each served by its own products. However, a rarely discussed but important aspect of observability is the foundational work and constant effort needed to get value from today’s approaches and tools.

Independent of the relative merits of Prometheus vs. StatsD, to analyze any metric today a developer has to have the foresight to realize they will need it in the future and publish it from their service for later consumption in a metrics aggregation and visualization tool. When seeking advice on what to instrument, the short answer is typically: instrument everything. This is for good reason. If something fails, it is more likely than not failing in an unexpected way (otherwise you would have built the recovery logic in, right?).

Needing to understand how unknown patterns may impact your system leads to a substantial amount of work publishing and aggregating every conceivable metric, most of which will never be examined or prove useful.

In reality, developers decide which metrics to publish after they’ve actually encountered a problem and realized that they’re missing a helpful indicator. A lack of foresight is not to blame: instrumentation takes work. Code has to be written to publish the right metrics in the most efficient way, and those metrics then need to be consumed and displayed in an intuitive manner to provide any value.

This is the developer burden we are alleviating at Layer 9. Taking on substantial effort for unclear benefit is not an attractive proposition. We decided that developers should not have to make up-front choices about individual metrics, sampling rates, or batching. They have enough on their plates already. Our platform collects metrics for every service without any instrumentation effort. We’ve also built machine learning models to detect patterns and find anomalies. This means a very large catalog of metrics, from throughput to latency to data-volume anomalies and more, is available for analysis in real time whenever a service failure or anomaly is detected.

All of this is made available in an intuitive interface that lets you navigate your service graph at a high level or drill into the statistics for a single schema element in a service endpoint. Any metric you could want, collected for you and delivered when you need it the most.


If you’re interested in learning more, please drop us a note via layer9.ai, or follow us on Twitter @layer9ai or on LinkedIn @ Layer 9 AI.


Co-founder and CEO of Layer 9, the Application Data Monitoring company.