Pharos: The Observability Platform at Workday

Kishore Pusukuri
Workday Technology
Jul 8, 2022 · 8 min read

Distributed systems have been growing rapidly to meet the demands of emerging applications such as business analytics, biomedical informatics, media streaming, and so on. With this rapid growth comes inexorably increasing complexity, which is often at odds with the challenge of guaranteeing important quality of service (QoS) properties such as reliability, availability, and performance. To alleviate this, observability is emerging as a key capability of modern distributed systems: it uses telemetry data collected at runtime to debug and maintain complex distributed applications and to ensure their quality of service. Observability also explores properties and patterns not defined in advance. In other words, observability enables users to harness new insights from the monitored telemetry data as applications grow in complexity.

One of the main goals of observability is to minimize “time to insight”, defined as the time to understand what is happening in the system. The other goal is to minimize “time to resolve”, defined as the time to alleviate issues with QoS properties such as reliability, availability, and performance [3, 4, 5, 6]. Observability is therefore a time-sensitive process, and it is more than monitoring. It is inherently a data-intensive (or BigData) application, as it ingests, processes, stores, and analyzes vast amounts of telemetry data. Moreover, observing the quality of the telemetry data itself is critical for giving users of the services confidence in it. As shown in Fig.1, a typical observability system contains the following five components, in line with a typical data pipeline system (a toy sketch follows the list below).

Fig.1 A typical Observability Platform Architecture

  1. Data Collection: collect telemetry data from different sources using agents or collectors
  2. Data Ingestion: ingest data from sources to the processing component via a message bus
  3. Data Processing: process the telemetry data to provide structure (or schema) to it if necessary
  4. Data Storage: organize and store processed data
  5. Data Analysis: analyze incoming data, detect anomalies, generate alerts, and present data in graphs, charts, etc.
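
To make the data flow concrete, here is a toy, in-memory Python sketch of the five stages. Everything in it is a stand-in: a real deployment uses agents, a message bus, a processing framework, data stores, and analytics systems, as described in the rest of this article.

```python
from typing import Dict, Iterable, List

# Toy in-memory stand-ins for the five stages; real deployments use agents,
# a message bus (e.g., Kafka), a processing framework, data stores, and analytics systems.

def collect() -> Iterable[Dict]:
    # Stands in for agents reading telemetry off each node.
    yield {"source": "app-1", "raw": "2022-07-08T10:00:00Z ERROR payment failed"}
    yield {"source": "app-2", "raw": "2022-07-08T10:00:01Z INFO report generated"}

def ingest(events: Iterable[Dict], bus: List[Dict]) -> None:
    # Stands in for publishing to a message bus.
    bus.extend(events)

def process(bus: List[Dict]) -> List[Dict]:
    # Stands in for giving raw text a structure (schema).
    return [{"source": e["source"], "level": e["raw"].split()[1], "message": e["raw"]}
            for e in bus]

def store(records: List[Dict], sink: List[Dict]) -> None:
    # Stands in for writing to a data store or search index.
    sink.extend(records)

def analyze(sink: List[Dict]) -> List[str]:
    # Stands in for anomaly detection / alerting.
    return [f"ALERT: error reported by {r['source']}" for r in sink if r["level"] == "ERROR"]

bus, sink = [], []
ingest(collect(), bus)
store(process(bus), sink)
print(analyze(sink))
```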

Telemetry data is broadly divided into logs, metrics, and traces. Logs are structured, semi-structured, or unstructured lines of text produced by applications (or systems). In other words, a log is a record of an event that happened within an application or a system. Logs are easy to generate, as most application frameworks, libraries, and languages come with support for logging. They usually contain contextual information that can help uncover unpredictable and emergent behaviors exhibited by a service [3, 4]. Metrics are quantitative measurements that can be used to determine a service’s or component’s overall behavior over time. Unlike a log, which records specific events, a metric is a measured value derived from system performance, reliability, and so on; metrics also include business- or application-specific measurements. Typically, metrics are used to trigger alerts whenever a system value crosses a certain threshold.

Fig.2 Observability is a BigData problem

While logs and metrics are adequate for understanding the behavior and performance of an individual system (or component), traces help uncover bottlenecks, identify and resolve issues faster, and prioritize high-value areas for optimization and improvement. A trace represents the entire path of a request as it moves through all the components of a distributed system, somewhat similar to a call graph [3, 4, 5, 6]. Together, all three types of telemetry data help us understand the behavior of complex modern distributed systems. As shown in Fig.2, a large-scale BigData infrastructure is needed to collect, process, and analyze telemetry data for a modern observability system. In essence, the observability system not only evaluates signals in near real-time to proactively alert teams for quick issue resolution, but also democratizes data access by providing analytics and data engineering capabilities.
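
To illustrate the call-graph structure of a trace, here is a toy Python example of two spans that belong to the same trace; the ids, names, and timings are made up and do not reflect any Workday service.

```python
# Two spans sharing a trace_id; parent_span_id links them into the call path
# "HTTP request -> database query" (all values are illustrative).
spans = [
    {"trace_id": "t-42", "span_id": "s-1", "parent_span_id": None,
     "name": "GET /payroll/report", "duration_ms": 180},
    {"trace_id": "t-42", "span_id": "s-2", "parent_span_id": "s-1",
     "name": "query: payroll_db.reports", "duration_ms": 95},
]

# Reconstructing the request path is a matter of following parent_span_id links.
by_id = {s["span_id"]: s for s in spans}
for span in spans:
    parent = by_id.get(span["parent_span_id"])
    print(f"{span['name']} (child of {parent['name'] if parent else 'root'})")
```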

To collect telemetry data, agents (e.g., Fluent Bit, Telegraf [8, 9, 10]) run on the nodes where applications are running. Through either a pull or a push model, the telemetry data is ingested and then processed to extract insights or to apply a common schema (or structure). The structured data is then sent to various services (or stored in various data stores) for different applications. For example, logs are typically stored and indexed in Elasticsearch services so that users can run rich queries and obtain insightful information about the behavior of their systems. Alerting and data visualization are handled by data analytics systems such as Kibana [11].
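
For example, a query over indexed logs can be sketched against the Elasticsearch search API; the endpoint, index pattern, and field names below are hypothetical and only illustrate the kind of question users ask.

```python
import requests

# Hypothetical Elasticsearch endpoint and index pattern.
ES_URL = "http://elasticsearch.internal:9200"

# "Most recent ERROR-level log events for the payments service."
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"service_id": "payments"}},
                {"match": {"level": "ERROR"}},
            ]
        }
    },
    "sort": [{"timestamp": "desc"}],
    "size": 10,
}

resp = requests.post(f"{ES_URL}/pharos-logs-*/_search", json=query, timeout=10)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```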

In this article, we provide an overview of Pharos, the Observability platform built on the Public Cloud for Workday’s various Software-as-a-Service (SaaS) applications. Workday, Inc. provides SaaS services for enterprise financial management, human capital management, and more.

Pharos: The Observability Platform at Workday

Design Goals. We set the following goals while designing Pharos, our Observability platform at Workday.

  • Generic. Observability systems typically target specific applications or have specific goals, such as performance debugging or root-cause analysis of security failures. Our goal, however, is to design a generic Observability platform that serves as many different applications as possible, for example, threat intelligence applications, data analytics applications, and so on.
  • High Reliability & Availability. High availability of the platform as well as of the telemetry data, and high resilience to failures, for example, shutting down the Observability data pipelines gracefully during failures, i.e., cleaning up resources before terminating the pipelines.
  • Scalability. Scaling the platform across various applications and meeting SLAs as telemetry data volumes grow.
  • Data Quality (or Data Observability). Users should neither lose data nor receive bad-quality data; here, quality means no unexpected modifications to the telemetry data.
  • Efficiency. The platform should operate as efficiently as possible, striking the right balance between performance and resource utilization via adaptive resource management, and minimizing analysis time to derive meaningful alerts.

Fig.3 Pharos: The Observability Platform at Workday Inc.

Data Model & Data Storage. The input format of logs is Avro, and we have chosen the Parquet format for output, as it is widely used and has inherently better support for nested structures. We also built schema evolution logic as part of the processing stage (i.e., the Processing Framework in Fig.3) of the platform to handle changes in log messages.
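
The Avro-in, Parquet-out step can be sketched with PySpark as shown below; the paths, partition column, and package coordinates are hypothetical, and the actual Pharos processing jobs (including the schema evolution logic) are more involved.

```python
from pyspark.sql import SparkSession

# Reading Avro requires the external spark-avro package, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.3.0 process_logs.py
spark = SparkSession.builder.appName("pharos-log-processing").getOrCreate()

# Hypothetical input location for raw Avro log events.
raw_logs = spark.read.format("avro").load("s3://pharos-raw/logs/2022/07/08/")

# Write Parquet, partitioned for efficient downstream queries (partition column is illustrative).
(raw_logs
 .write
 .mode("append")
 .partitionBy("service_id")
 .parquet("s3://pharos-processed/logs/"))
```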

To achieve high data quality (data observability), we created a common telemetry data schema that lets us handle data-loss scenarios. We define the common schema as follows; it is in line with a trace event. The checksum, together with the unique id, helps realize the goal of “data observability” (a minimal sketch follows the field list below).

LogEvent (common schema)

  • Id — String: A unique id representing the log/trace.
  • App/Service Id — String: An id of the service/app.
  • Timestamp — Long: Start time of the operation.
  • Name — String: Name of the event.
  • Tags — Set<String, TypedValue>: Set of key-value pairs (optional) [4].
  • Payload — String/Bytes: The actual data (optional).
  • Checksum of the payload — String.
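
Here is a minimal Python sketch of this schema; it assumes SHA-256 for the payload checksum (the hash function is not specified above) and represents tags as a simple string-to-string mapping.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LogEvent:
    """Minimal sketch of the common telemetry schema (field names are illustrative)."""
    id: str                                   # unique id representing the log/trace
    service_id: str                           # id of the service/app
    timestamp: int                            # start time of the operation (epoch millis)
    name: str                                 # name of the event
    tags: Dict[str, str] = field(default_factory=dict)   # optional key-value pairs
    payload: Optional[bytes] = None           # the actual data (optional)
    checksum: Optional[str] = None            # checksum of the payload

    def __post_init__(self) -> None:
        # Compute the payload checksum at creation time; a downstream consumer can
        # recompute and compare it to detect loss or unexpected modification,
        # which is the "data observability" goal described above.
        if self.payload is not None and self.checksum is None:
            self.checksum = hashlib.sha256(self.payload).hexdigest()
```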

The open-source OpenTelemetry data model [8] is used for metrics: each metric is a series of timestamped samples, where a sample consists of a float64 value and a millisecond-precision timestamp and is associated with labels or tags [4].
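
For illustration, a sample under this model can be represented as follows; the metric, host, and label names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MetricSample:
    """One sample in a time series: a float64 value at a millisecond-precision timestamp."""
    name: str
    timestamp_ms: int
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


# e.g., CPU utilization of one node at a point in time (values are made up).
sample = MetricSample("node_cpu_utilization", 1657274400000, 0.42, {"host": "node-1"})
```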

Fig.3 shows a high-level architecture of our Observability platform, Pharos. Fluent Bit agents are used at the Data Source Systems to collect telemetry data and push it to the Collection Gateway. The Collection Gateway is a containerized application that proxies telemetry data to the downstream Processing Framework [1, 2]. Moreover, it allows us to apply policies (e.g., data retention policies) to the data. Kafka [7] running on a Public Cloud Kubernetes Service is used as the Message Bus.
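
The gateway-to-message-bus hop can be sketched with the kafka-python client as below; the broker address, topic name, and event fields are assumptions, and the real Collection Gateway additionally enforces policies, authentication, and so on.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address; serialize events as JSON for the sketch.
producer = KafkaProducer(
    bootstrap_servers=["kafka.internal:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# An event shaped like the LogEvent schema described earlier (values are illustrative).
event = {
    "id": "8f14e45f-ceea-467f-a8cb-0123456789ab",
    "service_id": "payments",
    "timestamp": 1657274400000,
    "name": "request.completed",
    "tags": {"region": "us-west"},
}

producer.send("telemetry-logs", value=event)
producer.flush()
```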

The Processing Framework provides both batch and stream processing capabilities, for example, using Spark for batch processing and Flink for real-time processing. The data management layer provides capabilities such as data policy management, data security, and data governance. For example, the ingestion gateway gets data retention policy information from the data management component. We use Public Cloud data services, including object stores and Elasticsearch, as data sink systems.
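
As one sketch of the streaming pattern (Pharos uses Flink for the real-time path, so this is illustrative only), the following PySpark Structured Streaming job reads the telemetry topic and lands it as Parquet; the broker, topic, schema, and paths are hypothetical, and it requires the spark-sql-kafka package.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("pharos-stream-sketch").getOrCreate()

# Schema mirroring the common LogEvent fields (simplified).
event_schema = StructType([
    StructField("id", StringType()),
    StructField("service_id", StringType()),
    StructField("timestamp", LongType()),
    StructField("name", StringType()),
])

# Read from the message bus and parse the JSON payload.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka.internal:9092")
          .option("subscribe", "telemetry-logs")
          .load()
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Continuously write the structured events to the Parquet sink.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://pharos-processed/logs/")
         .option("checkpointLocation", "s3://pharos-checkpoints/logs/")
         .start())
query.awaitTermination()
```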

A data access layer (API/service), or query engine, built on Spark is used to execute rich queries and extract insights. We also use data analytics systems such as Kibana for alerting and dashboards (data visualization).
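
An insight query against the stored Parquet data might look like the sketch below; the paths, view name, and columns are hypothetical, and the real data access layer exposes such queries behind an API/service.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pharos-query-sketch").getOrCreate()

# Register the processed log events as a temporary view (path is illustrative).
spark.read.parquet("s3://pharos-processed/logs/").createOrReplaceTempView("log_events")

# Example insight: which services emitted the most error events?
errors_per_service = spark.sql("""
    SELECT service_id, COUNT(*) AS error_count
    FROM log_events
    WHERE name = 'error'
    GROUP BY service_id
    ORDER BY error_count DESC
""")
errors_per_service.show()
```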

Keeper: Observing Pharos (the Observability Platform)

As we can see in Fig.3, there are many moving parts in Pharos. It is therefore necessary to be able to observe Pharos itself in order to improve its reliability, availability, and other QoS properties. To meet this need, we designed Keeper, a lightweight observability system built on Public Cloud managed services (for monitoring and alerting, Elasticsearch, etc.) to observe Pharos.

Fig 4. Keeper: A system using Public Cloud managed services to observe Pharos

As we can see in Fig.4, the system uses Public Cloud managed services to ingest, process, and store telemetry data. Similar to Pharos, various agents can be used to collect telemetry data. Public Cloud managed monitoring tools collect both infrastructure metrics, such as CPU utilization, and custom application metrics, such as the number of log statements ingested. Elasticsearch and other widely used telemetry data stores are used for storage. Data analysis and visualization systems such as Kibana are used for alerting and data visualization.
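
As an illustration of emitting such a custom application metric, the sketch below uses the OpenTelemetry Python API; the article does not say which client Keeper uses, the metric and attribute names are hypothetical, and a configured MeterProvider/exporter is needed for the measurements to reach a monitoring backend.

```python
from opentelemetry import metrics

# Without a configured MeterProvider and exporter this is effectively a no-op;
# in practice the SDK would be wired to the Public Cloud monitoring service.
meter = metrics.get_meter("keeper")

log_statements_ingested = meter.create_counter(
    "log_statements_ingested",
    unit="1",
    description="Number of log statements ingested by the Pharos pipeline",
)

# Record a batch of 500 ingested log statements for one pipeline (values are illustrative).
log_statements_ingested.add(500, {"pipeline": "logs", "environment": "prod"})
```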

Other alternatives are using Pharos itself to observe Pharos, or using a second instance of Pharos. Compared to the Public Cloud managed-services approach described above, these alternatives have two advantages: a) it is easy to extend the platform to custom use cases, and b) they help improve the stability of the software/platform over time. They also have two disadvantages: a) a circular dependency, i.e., how do you debug the platform if one of its own components is unavailable, and b) higher operational management overhead compared to Public Cloud managed services.

Conclusions

This article presented our experience designing and building Pharos, the observability platform for Workday SaaS applications. We also presented Keeper, a lightweight observability system based on Public Cloud managed services that observes Pharos.

Acknowledgements

Several members of the Data Platform & Observability Engineering group of Workday contributed to Pharos and are key to its success.

References

[1] Matei Zaharia et al. Spark: Cluster Computing with Working Sets. In USENIX Conference on Hot Topics in Cloud Computing (HotCloud), 2010.

[2] Paris Carbone et al. Apache Flink: Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull., 2015.

[3] Jörg Thalheim et al. Sieve: Actionable Insights from Monitored Metrics in Distributed Systems. Middleware, 2017.

[4] Suman Karumuri et al. Towards Observability Data Management at Scale. SIGMOD Record, 2020.

[5] Jonathan Kaldor et al. Canopy: An End-to-End Performance Tracing and Analysis System. SOSP, 2017.

[6] Colin Adams et al. Monarch: Google’s Planet-Scale In-Memory Time Series Database. VLDB, 2020.

[7] Apache Kafka. http://kafka.apache.org.

[8] Open Telemetry. https://opentelemetry.io/

[9] Hadoop. https://hadoop.apache.org/

[10] FluentBit. https://fluentbit.io/

[11] Kibana. https://github.com/elastic/kibana
