Logging at Box: More efficient and more cost effective!

Deepak Wadhwani
Published in Box Tech Blog
May 31, 2019 · 5 min read

Every day, millions of enterprise users around the globe rely on Box systems to execute their most critical business workflows. Box requires a powerful and robust infrastructure to support the wide variety of use cases from over 90,000 enterprise customers. To build enterprise-grade systems that perform at the guaranteed SLAs, it is of the utmost importance for Box engineers to have visibility into their systems and the logs they generate. For the Observability team at Box, this means building the most efficient and robust logging system possible, one that offers sophisticated querying capabilities with minimal indexing overhead. Additionally, we need to ensure the cost of the solution does not become prohibitive at Box scale.

Existing Logging System

Our traditional logging system has been based on a self-hosted Splunk solution. Given the scale and organic growth at Box, the system frequently runs into roadblocks related to indexer capacity, query latency, log retention, and ever-increasing operational overhead. Additionally, at this scale, the cost was proving to be prohibitive. The Observability team set out on a mission to provide an alternate logging solution with some strict goals: the new system needed to be robust and reliable, deliver sophisticated querying capability with a near-real-time query experience, and, most importantly, be cost effective at Box scale.

Introducing Arta!

In this blog post, we discuss our new Elasticsearch-based logging system and the data pipeline associated with it.

We very dearly named our new system Arta: Almost-Real-Time Analytics (pronounced ar-ta). We decided to call it analytics instead of logging not only for strategic reasons (logs are, after all, just events), but also because the data pipeline was implemented in a way that aligns well with our existing analytics ingestion.

Ingestion Pipeline

[Figure: Ingestion pipeline]

As shown in the diagram above, as logs are written into files on local disks, a log-tailer agent enriches them with contextual information and then emits them into Kafka. Kafka consumers provide additional capabilities around governance and linting, and forward the logs to the backend. These consumers also read schemas for all logging streams from a schema repository. The schemas are created by the owners of the logging streams and are published into the repository through a self-serve system. The consumer uses these schemas to convert logging streams into batch index operations and sends them to Elasticsearch.
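
To make the consumer's role concrete, here is a minimal sketch of what a schema-aware indexing consumer could look like, assuming the kafka-python client, jsonschema, and the official Elasticsearch Python client. The topic name, schema source, and index naming are illustrative rather than Arta's actual implementation.

```python
# Minimal sketch of a schema-aware Kafka consumer (illustrative only).
# Topic names, the schema source, and index naming are hypothetical.
import json
from datetime import datetime, timezone

from kafka import KafkaConsumer
from jsonschema import validate, ValidationError
from elasticsearch import Elasticsearch, helpers

# In Arta, schemas come from a self-serve schema repository; here we simply
# load one from disk to keep the sketch self-contained.
with open("schemas/service_logs.json") as f:
    SERVICE_LOG_SCHEMA = json.load(f)

es = Elasticsearch(["http://localhost:9200"])
consumer = KafkaConsumer(
    "service-logs",                       # hypothetical logging topic
    bootstrap_servers=["localhost:9092"],
    group_id="arta-es-indexer",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

def to_bulk_actions(events):
    """Turn validated log events into Elasticsearch bulk index operations."""
    for event in events:
        try:
            validate(instance=event, schema=SERVICE_LOG_SCHEMA)
        except ValidationError:
            continue  # a real pipeline would route this to a dead-letter stream
        day = datetime.now(timezone.utc).strftime("%Y.%m.%d")
        yield {"_index": f"arta-service-logs-{day}", "_source": event}

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:                 # flush in batches to amortize overhead
        helpers.bulk(es, to_bulk_actions(batch))
        batch.clear()
```

In the real pipeline, the schemas are pulled from the self-serve repository and invalid events are handled rather than silently dropped.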

Retention and Durability

A significant chunk of logging costs emanates from a variety of retention requirements, such as compliance or issue remediation. Even though the utility of logs diminishes with time, the cost to retain them for querying doesn’t necessarily decrease. To make the logging solution even more cost effective, logs are retained in Elasticsearch only for a duration that is long enough for debugging recent issues. The logs are additionally sent to a data warehouse, where it is more cost effective to store data for long durations and where longer query latencies don’t matter much. The added redundancy also provides a high durability guarantee. The ingestion pipeline lets us do this while providing a seamless user experience: the same schema is used and is published in one single repository. Service code instrumented with analytics APIs emits data into Kafka, and from there two separate consumers ingest the logs into two different stores. This disassociates retention requirements from near-real-time log aggregation requirements and allows us to build a logging system that is as cost effective as possible.
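
To sketch how one stream can feed two stores with different retention, the example below (with hypothetical names) shows a second consumer group on the same Kafka topic loading events into long-term storage, alongside a small job that enforces a short Elasticsearch retention window by dropping old daily indices.

```python
# Sketch of the dual-retention idea (illustrative; names and windows are hypothetical).
import json
from datetime import datetime, timedelta, timezone

from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

ES_RETENTION_DAYS = 14  # assumption: only recent logs stay queryable in Elasticsearch

def archive_to_warehouse():
    """Second consumer group: same topic, different destination."""
    consumer = KafkaConsumer(
        "service-logs",
        bootstrap_servers=["localhost:9092"],
        group_id="arta-warehouse-loader",   # offsets independent of the ES indexer
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    with open("/data/warehouse/service_logs.jsonl", "a") as out:
        for message in consumer:
            out.write(json.dumps(message.value) + "\n")  # stand-in for a real warehouse load

def expire_old_indices(es: Elasticsearch):
    """Delete daily indices that fall outside the short Elasticsearch window."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=ES_RETENTION_DAYS)
    for name in es.indices.get(index="arta-service-logs-*"):
        day = datetime.strptime(name.rsplit("-", 1)[-1], "%Y.%m.%d").replace(tzinfo=timezone.utc)
        if day < cutoff:
            es.indices.delete(index=name)
```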

Operational Logs

The Arta ingestion pipeline requires logs to be structured based on a published schema. To that end, we built an analytics API to provide an easy and efficient way of producing structured logs. This works great for Box-owned systems, but operational or third-party systems and libraries whose source could not be modified were still producing largely unstructured logs. To address these third-party operational logs, we essentially had two design alternatives. The first was to enhance our log-tailer to transform the unstructured logs into a format supported by the schemas. Although this seemed like a clean design and would have distributed the CPU cost across the fleet, we decided against it in favor of keeping the fleet-wide log-tailer simple and lightweight.
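
For a rough sense of what producing a structured log through such an analytics API can look like from a service owner's point of view, here is a hypothetical sketch; the actual Box API, field names, and transport are internal and may differ.

```python
# Hypothetical sketch of a structured-logging ("analytics") API as a service
# owner might use it; the real Box API, fields, and transport are not shown here.
import json
import socket
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def emit_event(stream: str, **fields):
    """Emit one structured log event, pre-enriched with common context."""
    event = {
        "stream": stream,                      # ties the event to a published schema
        "timestamp_ms": int(time.time() * 1000),
        "host": socket.gethostname(),
        **fields,                              # schema-defined, service-specific fields
    }
    producer.send("service-logs", event)

# Usage: a service logs a structured event instead of a free-form string.
emit_event("upload-service", action="file_upload", status="success", latency_ms=42)
```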

Instead of transforming logs at the edge, we use a centralized Logstash cluster to transform them. As shown in the pipeline diagram above, the operational logs are sent to Kafka, from where Logstash picks them up, transforms them to adhere to a published schema, and pushes them back into the pipeline to be ingested into Elasticsearch.
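
Conceptually, the transformation is a mapping from an unstructured line to the fields of a published schema. The sketch below illustrates the idea in Python rather than as an actual Logstash filter configuration; the log format, regular expression, and field names are made up.

```python
# Python sketch of what the centralized transform step conceptually does to a
# third-party log line (the real work happens in Logstash filters; this regex
# and the field names are invented for illustration).
import json
import re

# e.g. "2019-05-31 12:00:01,123 ERROR worker-3 Connection refused to upstream"
OPERATIONAL_LINE = re.compile(
    r"(?P<timestamp>\S+ \S+)\s+(?P<level>[A-Z]+)\s+(?P<thread>\S+)\s+(?P<message>.*)"
)

def structure_operational_log(raw_line, source):
    """Map an unstructured line onto the fields a published schema expects."""
    match = OPERATIONAL_LINE.match(raw_line)
    if match is None:
        return None  # a real pipeline would count or route unparseable lines
    event = match.groupdict()
    event["stream"] = source  # which logging stream / schema this belongs to
    return event

line = "2019-05-31 12:00:01,123 ERROR worker-3 Connection refused to upstream"
print(json.dumps(structure_operational_log(line, source="nginx-operational")))
```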

Disaster Recovery

[Figure: Disaster recovery pipeline]

Arta is, in many cases, the source of truth for the critical information needed to identify root causes and restore our systems to health when issues occur. Owing to its vast adoption by systems servicing critical customer use cases, Arta is required to provide a high level of guarantees around durability and availability. To achieve the guaranteed SLAs, the Observability team set up a redundant pipeline in a geographically remote region to protect against disasters. The components that exist within our environment have redundant counterparts in multiple regions, as they are relatively cost effective and easy to manage. However, we only run a single instance of Elasticsearch: owing to the relatively higher costs associated with Elasticsearch licensing and hosting, redundancy and high availability are sacrificed to keep costs in check. To minimize impact in disaster scenarios, we designed a simple and fast process to switch ingestion to the Kafka consumer reading from the healthy pipeline.
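
As an illustration of that switch-over, the sketch below points the single Elasticsearch indexer at whichever region's Kafka pipeline is healthy; the broker addresses, health check, and topic name are all hypothetical.

```python
# Hedged sketch of the failover idea: the Elasticsearch indexer consumes from
# whichever regional Kafka pipeline is healthy. Brokers, the health check, and
# the topic name are hypothetical.
import json
from kafka import KafkaConsumer

REGIONAL_BROKERS = {
    "primary":   ["kafka-primary-1:9092", "kafka-primary-2:9092"],
    "secondary": ["kafka-secondary-1:9092", "kafka-secondary-2:9092"],
}

def pipeline_is_healthy(region):
    """Stand-in for a real health check on the regional ingestion pipeline."""
    return region == "primary"  # flipped (manually or by automation) during a disaster

def build_indexer_consumer():
    """Attach the single Elasticsearch indexer to the healthy region's Kafka."""
    region = "primary" if pipeline_is_healthy("primary") else "secondary"
    return KafkaConsumer(
        "service-logs",
        bootstrap_servers=REGIONAL_BROKERS[region],
        group_id="arta-es-indexer",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
```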

Adoption

As with any infrastructure system, the gains from Arta could only be realized through adoption across all streams. Even though the benefits of Arta were apparent, adoption required an effort to migrate unstructured logs to structured logs with schemas. The analytics APIs were designed to minimize the migration effort and use generic schemas whenever possible, but some services still required a more focused approach. The Observability team continues to partner with these service owners, carefully looking at how they consume data from the existing logging solution in order to design the structure of the logs they ingest into Arta.

Success Story!

We are now running almost all of our production, developer-focused logging streams on Arta, ingesting data on the order of petabytes every month. Arta not only provides a low-latency, feature-rich, and robust logging system, it does so at less than half the cost, which is already producing significant savings in logging costs every year. As we continue to onboard new and existing logging streams to Arta, we are looking into further improving stability, resilience, and operational efficiency, and bringing costs down even further.
