Build an end-to-end JSON logging system for client apps
Liang Ma | Software Engineer, Core Eng; Wei Zhu | Software Engineer, Observability
In early 2020, during a critical iOS out-of-memory incident (we have a blog post about it), we realized that we had little visibility into how the app was running, and no good system to turn to for monitoring and troubleshooting.
State of logging
At that time, client developers had a few ways to log in their daily work:
- Context logging: built for logging and reporting impressions or anything else business-related, and therefore a time-critical, first-class endpoint. Developers need to explicitly define keys; undefined keys are rejected by the endpoint. Some companies call it “analytics logging.”
- Misc: logging to a local file on disk, or even logging to a crash tracking service as an error type.
The problems are:
- Not all logs fall into those categories, and people often abuse certain types of logging.
- None of these tools provides a good way to visualize or aggregate logs. For example, developers need to make code changes to break a metric down by dimension, e.g. “what does the metric look like on app version A, on device B, and under network type C?”
- There isn’t a system that can easily monitor logs in real time, not to mention set up real-time alerts on log-based custom metrics.
We decided to create an end-to-end pipeline with the following characteristics:
- It offers the least resistance: the log payload is schemaless and flexible, basically key-value pairs. That’s one of the reasons we call it JSON logging.
- It provides ready-to-use logging APIs on each platform.
- Developers don’t need to touch any backend code.
- It’s easy to query and visualize logs.
- It performs in real time!
With these in mind, the following key design decisions were made:
- The logging service endpoint will handle log validation, parsing, and processing.
- Logs will be persisted in Hive, thus supporting any SQL-based queries.
- A single, shared Kafka topic will be used for all logs going through this pipeline.
- It’s integrated with OpenSearch (Amazon’s fork of Elasticsearch and Kibana) as a real-time visualization and query tool.
- It will be easy to set up real-time alerting with log-based custom metrics.
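To make the first design decision more concrete, here is a minimal sketch of what validating and parsing a single schemaless log at the endpoint could look like. It is an illustration only, not the actual service code: the function name `parse_log`, the field names `name`, `payload`, and `metadata`, and the size cap are all assumptions. A real endpoint would then publish the parsed record to the shared Kafka topic.

```python
import json
from typing import Any, Dict

# Hypothetical size cap; the real limit (if any) is not stated in this post.
MAX_LOG_BYTES = 64 * 1024


def parse_log(raw: bytes) -> Dict[str, Any]:
    """Validate and parse one raw log record; raise ValueError if malformed."""
    if len(raw) > MAX_LOG_BYTES:
        raise ValueError("log too large")
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("log must be a JSON object")

    # The log name is the only field the developer must always supply.
    name = record.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError("missing log name")

    # Schemaless: payload keys are accepted as-is, with no predefined key list.
    payload = record.get("payload", {})
    if not isinstance(payload, dict):
        raise ValueError("payload must be key-value pairs")

    return {
        "name": name,
        "payload": payload,
        "metadata": record.get("metadata", {}),
    }
```

Unlike the context logging endpoint described earlier, nothing here rejects a key simply because it was not declared in advance; only the overall shape of the record is checked.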
The client-side service integration provides the metadata; developers only need to provide the name of the log and the actual log payload. Nothing else is required.
A sample payload
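As a rough illustration only, and not the actual payload format used by this pipeline, the sketch below shows how a client helper might wrap a developer-supplied name and payload with automatically collected metadata. Every identifier and field here (`build_log`, `CLIENT_METADATA`, `app_version`, `platform`, `network_type`, `timestamp_ms`, and the log name and values) is hypothetical.

```python
import json
import time
from typing import Any, Dict

# Hypothetical metadata that the client-side integration would fill in automatically.
CLIENT_METADATA: Dict[str, Any] = {
    "app_version": "1.0.0",
    "platform": "ios",
    "network_type": "wifi",
}


def build_log(name: str, payload: Dict[str, Any], metadata: Dict[str, Any]) -> str:
    """Wrap a developer-supplied name and payload with client metadata."""
    record = {
        "name": name,          # supplied by the developer
        "payload": payload,    # supplied by the developer, free-form key-value pairs
        "metadata": metadata,  # supplied by the client integration
        "timestamp_ms": int(time.time() * 1000),
    }
    return json.dumps(record, sort_keys=True)


# The developer only passes the log name and the payload.
sample = build_log(
    "pin_create_result",
    {"success": False, "error_code": 500},
    CLIENT_METADATA,
)
print(sample)
```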
Visualize and query
Visualizing logs in OpenSearch is relatively simple with the self-service guidance provided for this pipeline. Developers can also query logs with SQL or with any other query and visualization tool that the pipeline supports.
Log-based metrics are a cost-efficient way to summarize log data from the entire ingest stream. With log-based metrics, users can generate a count metric of logs that match a Lucene query. For more advanced use cases, users can generate metrics from an OpenSearch term aggregation query to dissect log data across different dimensions.
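The two kinds of log-based metrics described above map naturally onto an OpenSearch search request body. The sketch below builds such a body: a `query_string` query carries the Lucene expression, and an optional `terms` aggregation splits the count by one field. The helper name `log_metric_query` and the field names in the usage example are assumptions, and how the pipeline actually turns query results into metrics is not shown here.

```python
from typing import Any, Dict, Optional


def log_metric_query(lucene_query: str, group_by: Optional[str] = None) -> Dict[str, Any]:
    """Build an OpenSearch search body that counts logs matching a Lucene query.

    If group_by is given, the count is also broken down by that field's values.
    """
    body: Dict[str, Any] = {
        "size": 0,                 # return no documents, only counts
        "track_total_hits": True,  # ask for an exact total hit count
        "query": {"query_string": {"query": lucene_query}},
    }
    if group_by is not None:
        # Terms aggregation: one bucket (with a doc count) per distinct value.
        body["aggs"] = {"by_term": {"terms": {"field": group_by, "size": 10}}}
    return body


# Hypothetical usage: count failed Pin creations, split by app version.
body = log_metric_query(
    "name:pin_create_result AND payload.success:false",
    group_by="metadata.app_version",
)
```

The total hit count gives the simple count metric, while the `by_term` buckets answer questions such as how the metric looks per app version without any client code change.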
Log-based metrics can be used to build dashboards and real-time alerts.
Since this pipeline was built, and without any real push, developers have been proactively adopting it, mainly for:
- Networking metrics and crash metrics, so they know better how the clients perform and can feed client-side signals into the topline Pinner Uptime metric
- Performance insight, such as information provided by iOS MetricKit
- Custom error reporting, such as exceptions, soft errors, and assertions that were previously either not reported at all or reported somewhere without a good tool to analyze them
Product surface/feature SLA
- Some product teams leverage this system to report product feature health, such as Pin creation results, so they can monitor success/failure rates in real time. This often catches issues much earlier than the usual daily metric aggregation, and it’s especially useful for issues that API-side monitoring wouldn’t alert on right away.
- Developers like to use this pipeline to gain visibility into certain logic or code paths in production, e.g. “has this code ever run?”, “how often does this happen?”, and many similar questions that nothing but the data can answer.
- Developers add logs to help troubleshoot odd bugs that are very hard to reproduce locally or issues that only occur on certain device models, OS versions, etc.
Real-time alerting
- Because reporting and alerting are so easy to set up, product teams often use the pipeline just for the sake of real-time alerting.
Future work
- On the OpenSearch side, create sub-level indexes by name, which could boost query performance and also better isolate logs
- Explore the alerting function provided by OpenSearch
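To make the real-time alerting use case more concrete, here is a minimal sketch of a threshold check on a feature's failure rate over a recent window of logs, in the spirit of the Pin creation example above. It does not use the OpenSearch alerting function; the `success` field, the function names, the 5% threshold, and the minimum sample size are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Mapping


def failure_rate(logs: Iterable[Mapping[str, object]]) -> float:
    """Fraction of logs in the window whose 'success' field is missing or falsy."""
    total = 0
    failures = 0
    for log in logs:
        total += 1
        if not log.get("success", False):
            failures += 1
    return failures / total if total else 0.0


def should_alert(
    logs: Iterable[Mapping[str, object]],
    threshold: float = 0.05,  # hypothetical: alert above a 5% failure rate
    min_samples: int = 20,    # hypothetical: ignore windows with too little data
) -> bool:
    """Return True when the window has enough logs and too many failures."""
    window: List[Mapping[str, object]] = list(logs)
    return len(window) >= min_samples and failure_rate(window) > threshold


# Hypothetical window: 18 successful and 2 failed Pin creations.
window = [{"success": True}] * 18 + [{"success": False}] * 2
print(should_alert(window))
```

The minimum sample size guards against paging on a single failed request in a quiet window, which is a common design choice for rate-based alerts.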
Acknowledgements: huge thanks to Stephen Blanco, Darren Gyles, Sha Sha Chu, Nadine Harik, Roger Wang, and our data & infra team for their contribution, feedback and support.