Build an end-to-end JSON logging system for client apps

Published in

Pinterest Engineering Blog

Jan 10, 2023

Liang Ma | Software Engineer, Core Eng; Wei Zhu | Software Engineer, Observability

Flow map: the Pinterest app sends JSON log batches to Logservice, a batch logging endpoint (/log) that handles perf logs, device info (Android), and the new JSON log type. Logservice passes JSON messages to Singer, which publishes to Pub/Sub, with arrows to Logstash, Merced, and other analytics tools. Logstash goes to OpenSearch. OpenSearch goes to OpenSearch Dashboards and to a metric generator that feeds Statsboard. Merced goes to S3/Hive.

In early 2020, during a critical iOS out-of-memory incident (we have a blog post about that), we realized that we had little visibility into how the app was running and no good system to turn to for monitoring and troubleshooting.

State of logging

At that time, client developers had a few ways to log in their daily work:

Context logging: built for logging and reporting impressions or anything else business-related, and therefore a time-critical, first-class endpoint. Developers must explicitly define keys; undefined keys are rejected by the endpoint. Some companies call it “analytics logging.”
Misc: logging to a local file on disk, or even logging to a crash tracking service as an error type.

The problems are:

Not all logs fall into those categories, and people often abuse certain types of logging.
None of these tools provides a good way to visualize or aggregate logs. For example, developers need to make code changes to populate information such as “what the metric looks like on app version A, on device B, and under network type C.”
There isn’t a system that can easily monitor logs in real time, let alone set up real-time alerts with log-based custom metrics.

Goal

We decided to create an end-to-end pipeline with the following characteristics:

It’s built for the least resistance: the log payload is schemaless and flexible, basically key-value pairs. That’s one of the reasons we call it JSON logging.
It provides ready-to-use logging APIs on each platform.
Developers don’t need to touch anything on the backend.
It’s easy to query and visualize logs.
It performs in real time.

With these in mind, the following key design decisions were made:

The logging service endpoint will handle log validation, parsing, and processing.
Logs will be persisted in Hive, thus supporting any SQL-based queries.
A single, shared Kafka topic will be used for all logs going through this pipeline.
It’s integrated with OpenSearch (Amazon’s fork of Elasticsearch and Kibana) as a real-time visualization and query tool.
It will be easy to set up real-time alerting with log-based custom metrics.
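As an illustration of the first decision, here is a minimal Python sketch of the kind of validation and parsing an endpoint like this might do on a batch. It is not the actual Logservice code: the function names `parse_log` and `process_batch` are hypothetical, and the required keys are taken from the sample payload shown later in this post.

```python
import json
from typing import Any

# Required top-level keys, taken from the sample payload in this post.
REQUIRED_KEYS = ("name", "timestamp", "metadata", "payload")


def parse_log(raw: str) -> dict[str, Any]:
    """Parse one JSON log and raise ValueError if it is malformed."""
    try:
        log = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(log, dict):
        raise ValueError("log must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in log]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    if not isinstance(log["name"], str) or not log["name"]:
        raise ValueError("'name' must be a non-empty string")
    # The payload itself is schemaless: any key-value pairs are accepted.
    if not isinstance(log["payload"], dict):
        raise ValueError("'payload' must be a JSON object")
    return log


def process_batch(raw_logs: list[str]) -> tuple[list[dict[str, Any]], list[str]]:
    """Split a batch into accepted logs and rejection reasons."""
    accepted: list[dict[str, Any]] = []
    rejected: list[str] = []
    for raw in raw_logs:
        try:
            accepted.append(parse_log(raw))
        except ValueError as exc:
            rejected.append(str(exc))
    return accepted, rejected
```

Rejecting bad entries one by one, rather than failing the whole batch, is a design choice of this sketch; a real endpoint could reasonably do either.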

Architecture

High level

Schema

The client-side service integration provides the metadata; developers only need to provide the name of the log and the actual log payload. Nothing else is required.

A sample payload

{
  "name" = "network_metrics"; // required, set by users
  "timestamp" = 2022121512345; // required, set by pipeline
  "metadata" = { // required, set by pipeline
    "app_version" = "8.40";
    "os_version" = "14.0";
    "device_model" = "IPHONE11,2";
    "build_type" = "Production"; // "OTA", "Development", "Alpha", etc.
    "network_type" = "wifi"; // or "cellular"
    "country" = "United States";
    "platform" = "Android";
    …
  };
  "payload" = {
    // users reported payload will appear here
  };
};
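To make the split of responsibilities concrete, here is a minimal Python sketch of what a client-side logging API with this shape could look like. The class `JsonLogger`, its `metadata_provider` argument, and the epoch-millisecond timestamp are assumptions for illustration; the real client APIs are platform-specific and are not shown in this post.

```python
import json
import time
from typing import Any, Callable


class JsonLogger:
    """Hypothetical client API: callers pass only a name and a payload."""

    def __init__(self, metadata_provider: Callable[[], dict[str, Any]]) -> None:
        # The integration layer supplies app version, OS version, and so on.
        self._metadata_provider = metadata_provider
        self._pending: list[dict[str, Any]] = []

    def log(self, name: str, payload: dict[str, Any]) -> None:
        """Wrap the caller's payload in the envelope and buffer it."""
        self._pending.append({
            "name": name,                           # set by the caller
            "timestamp": int(time.time() * 1000),   # set by the pipeline (epoch ms assumed)
            "metadata": self._metadata_provider(),  # set by the pipeline
            "payload": payload,                     # set by the caller
        })

    def flush(self) -> str:
        """Serialize the buffered logs as one JSON batch and clear the buffer."""
        batch, self._pending = self._pending, []
        return json.dumps(batch)
```

A call site then only needs something like `logger.log("network_metrics", {"latency_ms": 120})`, and the string returned by `flush()` stands in for the body of a batch request.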

Visualize and query

Visualizing logs on OpenSearch is relatively simple when following the self-service guidance provided for this pipeline. Developers can also query the logs with SQL or with any other query and visualization tool that this pipeline supports.

Figure 2: a sample dashboard of network logs from both iOS and Android apps. It shows how to visualize network metrics in real time with six separate graphs: mobile_json_log::platforms, mobile_networking::host, mobile_json_log::total_count_timeline, mobile_networking::req_num_by_ver, mobile_networking::request_latency, and mobile_networking::status.

Real-time alerting

Log-based metrics are a cost-efficient way to summarize log data from the entire ingest stream. With log-based metrics, users can generate a count metric of logs that match a Lucene query. For more advanced use cases, users can generate metrics from an OpenSearch term aggregation query to dissect log data across different dimensions.

Figure 3: example of how to create a log-based metric. The form reads: “Succeeded. Metric Name: es.mobile_json.story_pin_by_event_type. Query Name: name: story_pin_creation_event AND metadata.build_type:Production. Index Name: mobile_json_log. Begin: -30mins. End: -5min. Term Aggs (optional) Field: payload.eventType.key. Tag Key: event_type. Size: 10. Order: desc. Field: metadata.platform.key. Tag Key: platform. Size: 10. Order: desc.”
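For readers who think in query terms, here is a rough Python sketch of an OpenSearch search body that a metric like the one in Figure 3 could translate to. How the metric generator actually builds its queries is not described in this post, so `build_metric_query` is hypothetical, and the `timestamp` date field is an assumption about the index mapping. The field names in the usage note are copied from Figure 3 as shown.

```python
from typing import Any


def build_metric_query(
    lucene_query: str,
    term_aggs: list[tuple[str, str]],  # (tag key, field) pairs, outermost first
    begin: str = "now-30m",
    end: str = "now-5m",
    size: int = 10,
    time_field: str = "timestamp",  # assumption: name of the date field in the index
) -> dict[str, Any]:
    """Build a search body that counts matching logs, bucketed per term."""
    body: dict[str, Any] = {
        "size": 0,                 # only counts are needed, not the documents
        "track_total_hits": True,  # report the exact number of matching logs
        "query": {
            "bool": {
                "filter": [
                    {"query_string": {"query": lucene_query}},
                    {"range": {time_field: {"gte": begin, "lt": end}}},
                ]
            }
        },
    }
    # Nest one terms aggregation per tag so each bucket is one tag combination.
    level = body
    for tag_key, field in term_aggs:
        level["aggs"] = {
            tag_key: {
                "terms": {"field": field, "size": size, "order": {"_count": "desc"}}
            }
        }
        level = level["aggs"][tag_key]
    return body
```

With the values from Figure 3, the call would be `build_metric_query("name:story_pin_creation_event AND metadata.build_type:Production", [("event_type", "payload.eventType.key"), ("platform", "metadata.platform.key")])`. With no term aggregations, the total hit count alone gives the simple count metric.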

Log-based metrics can be used to build dashboards and real-time alerts:

Figure 4: example of a real-time alerting setup based on the log-based metric, on Statsboard. The tab is titled “ES Mobile JSON Story Pin Event” with sum_aggregator: zimsum:1m-avg-none, and shows two Statsboard graphs with red lines and dots titled “iOS story pin event SR” and “Android story pin event SR”.

Use-cases

Since this pipeline was built, and without any real push, developers have been proactively adopting this logging system, mainly for:

Client visibility

Networking metrics and crash metrics, so developers better understand how the clients perform and can feed those client-side signals into the topline Pinner Uptime metric
Performance insight, such as information provided by iOS MetricKit
Custom error reporting, such as exceptions, soft errors, and assertions that were previously either not reported at all or reported somewhere without a good tool to analyze them

Product surface/feature SLA

Some product teams use this system to report product feature health, such as Pin creation results, so they can monitor success and failure rates in real time. This often catches issues much earlier than the usual daily metric aggregation, and it’s especially useful for issues that API-side monitoring wouldn’t alert on right away.
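As a rough illustration of the arithmetic behind such a success-rate (SR) alert, here is a small Python sketch. The 95% threshold and the minimum event count are made-up values, and the function names are hypothetical; real alerts of this kind are configured on Statsboard rather than written by hand.

```python
from typing import Optional


def success_rate(succeeded: int, failed: int) -> Optional[float]:
    """Return the success rate, or None when the window has no events."""
    total = succeeded + failed
    return succeeded / total if total else None


def should_alert(
    succeeded: int,
    failed: int,
    threshold: float = 0.95,  # hypothetical alert threshold
    min_events: int = 50,     # hypothetical floor to avoid noisy alerts
) -> bool:
    """Alert when the success rate drops below the threshold on enough traffic."""
    if succeeded + failed < min_events:
        return False  # too few events in the window to draw a conclusion
    rate = success_rate(succeeded, failed)
    return rate is not None and rate < threshold
```

The minimum event count is there so that a single failure in a quiet minute does not page anyone.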

Developer logs

Developers like to use this pipeline to gain visibility into certain logic or code paths in production, e.g. “has this code ever run?”, “how often does this happen?”, and many similar questions that only the data can answer.
Developers add logs to help troubleshoot odd bugs that are very hard to reproduce locally or issues that only occur on certain device models, OS versions, etc.

Real-time alerting

Because reporting and alerting are so easy to set up, product teams often use the pipeline solely for real-time alerting.

Future

On the OpenSearch side, create sub-level indexes by log name, which could boost query performance and also better isolate logs
Explore the alerting function provided by OpenSearch

Acknowledgements: huge thanks to Stephen Blanco, Darren Gyles, Sha Sha Chu, Nadine Harik, Roger Wang, and our data & infra team for their contribution, feedback and support.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore life at Pinterest, visit our Careers page.