Co-Authored with: Vipul Sodha
A scalable and feature-rich log management and exploration platform is imperative for the successful operations of any medium to large backend engineering team. It is one of the most important, but often neglected, tools for infrastructure observability. There are a plethora of solutions, ranging from cloud provider-managed services and independent Software-as-a-Service (SaaS) vendors to open source software that can be self-hosted on your own infrastructure.
However, as with most engineering decisions, it would be an oversight to treat this as a simple selection problem based on just a couple of factors, especially because none of the solutions are “perfect”. We should do our best to classify the solutions based on their appropriateness for the requirements at hand. As we discovered, there are various intricacies which, when accounted for, help pick the solution that most closely matches our set of trade-offs.
In this article, we will explore the set of challenges and considerations that we dealt with while solving the problem of log exploration at Carousell, the solution we adopted, and the various advantages and disadvantages we encountered.
A sample of the volume of logs to give an idea of the scale of the challenge
Being an internal tool, the primary consumers of a log platform are engineers themselves. The best way to identify the requirements was to discuss them with the engineers who would be using the platform. This was a relatively smooth process, as engineers understand the value of providing clear requirements. Here are the requirements distilled from the problems highlighted by our engineers:
Centralised access to all logs in a microservices environment
Observability is one of the main challenges with microservices. Having a way to centrally access logs from all services is invaluable when engineers need to pinpoint the root cause of complex bugs. The fundamental requirement of the log platform being built was to facilitate a centralised point of access for logs across all services.
Searching and filtering
What’s the use of mountains of data if it can’t be retrieved and consumed in a meaningful way? The platform should support querying capabilities, with “field-level” filters as well as full-text search of log payloads. It should also provide basic aggregation and faceting on selected filters to provide quantitative insights.
Defining and parsing known log formats
Logs should be parsable in order to derive useful fields from the text payload. For instance, the log level (DEBUG, INFO, ERROR), timestamp, and log origin (service name, instance, etc.) are some of the fields that would be important for a functional log exploration experience.
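As an illustration, a parser for a hypothetical plain-text format (“timestamp level service message” — illustrative, not our actual format) might look like this in Go:

```go
package main

import (
	"fmt"
	"regexp"
)

// lineRe matches a hypothetical "timestamp level service message" layout.
var lineRe = regexp.MustCompile(`^(\S+)\s+(DEBUG|INFO|WARN|ERROR)\s+(\S+)\s+(.*)$`)

// parseLogLine extracts the level, timestamp, and origin fields from a
// single log line, reporting false if the line does not match the format.
func parseLogLine(line string) (map[string]string, bool) {
	m := lineRe.FindStringSubmatch(line)
	if m == nil {
		return nil, false
	}
	return map[string]string{
		"timestamp": m[1],
		"level":     m[2],
		"service":   m[3],
		"message":   m[4],
	}, true
}

func main() {
	fields, ok := parseLogLine("2023-01-02T15:04:05Z ERROR checkout payment declined")
	fmt.Println(ok, fields["level"], fields["service"]) // true ERROR checkout
}
```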
Live tailing of logs
An interesting requirement which was highlighted during the discussions was the ability to live “tail” logs in the central dashboard. This would provide a real-time view of the logs as they stream in, with all the necessary filters applied, resulting in powerful debugging capabilities.
Control ingestion volume to be cost effective
A major problem of the log-processing solutions that were previously adopted at Carousell was cost. Log payloads can be substantial due to the size of individual logs as well as their sheer quantity. Needless to say, processing such large amounts of logs requires proportional compute resources — CPU, RAM, disks, network bandwidth — all of which directly impact infrastructure cost. This approach is inefficient when the majority of logs will never be consumed.
A fundamental requirement of the log platform was to provide controls for log ingestion. Engineers should be able to selectively enable the ingestion of logs at a particular level of any service. This will ensure that logs are ingested only as required.
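A minimal sketch of such a control, assuming a hypothetical per-service configuration that maps each service to a minimum ingestion level (the type names and the ERROR-only default here are illustrative, not our production schema):

```go
package main

import "fmt"

// severity orders log levels for threshold comparison.
var severity = map[string]int{"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}

// IngestionConfig maps a service name to the minimum level it ingests.
type IngestionConfig map[string]string

// ShouldIngest reports whether a log at the given level should be ingested
// for a service; services without a config default to ERROR-only.
func (c IngestionConfig) ShouldIngest(service, level string) bool {
	min, ok := c[service]
	if !ok {
		min = "ERROR"
	}
	return severity[level] >= severity[min]
}

func main() {
	cfg := IngestionConfig{"checkout": "DEBUG"}
	fmt.Println(cfg.ShouldIngest("checkout", "DEBUG")) // true: debugging enabled
	fmt.Println(cfg.ShouldIngest("search", "INFO"))    // false: defaults to ERROR-only
}
```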
Avoid loss of log data during ingestion
Log consistency (completeness and order) is of paramount importance. Lossy logs would cause enormous frustration for engineers relying on the platform to debug production issues, and would erode trust in and adoption of the solution. A key requirement was to ensure log consistency.
Minimal onboarding effort
At Carousell, we strive to ensure that product engineers can focus on creating value for users without having to worry too much about the underlying tech stack. In order to ensure minimal disruption to engineers’ productivity, we had to ensure that onboarding to the log platform was a smooth and effortless experience. An ideal scenario would be for product engineers to magically get access to logs without having to manually create a new account.
We chose to go ahead with a cloud provider-managed log dashboard as it closely matched our needs for filtering, searching and aggregation. After much deliberation, we arrived at the following solution:
- Build a custom log collection agent attached as a sidecar container to each service instance
- Have a log service which consumes all log messages and enqueues them to a Kafka deployment
- Consume logs from the Kafka cluster, parse them and apply filtering rules
- Finally, ingest the processed logs into Google Cloud Logging
Log Service Agent
The log service agent performs the core task of extracting logs and forwarding them to the log service. The agent runs as a sidecar on all running pods in our microservices, and shares the mounted log storage volume with the main application. This allows the log agent to be managed independently, with no deployment dependency between the main application container and the log agent sidecar.
The log service agent uses minimal CPU (about 30 millicores) and memory, which keeps our overall infrastructure costs manageable.
To understand how a log service agent works, let’s first understand how application code writes logs. All the services use a common framework which provides application code with APIs to write logs to disk in individual log files.
Log files are rotated by a log daemon running on the application pod once the file size reaches a configured limit. If the log volume for any service is huge, the files are rotated very quickly, which requires the log agent to keep track of the files being rotated and the quantity of logs read from each file.
While application code keeps writing logs to the log files, the log agent in the sidecar continuously reads the log files and streams logs to the log service.
Since we are running a log agent in a sidecar, it acts like a plug and play service that we can attach or detach whenever required.
Log Service Agent Components
Scanner
- A continuously-running goroutine that scans for any log file rotation and starts a new reader for every new file
- Also reads the initial state of each file from the state store to determine where to start reading the log file, especially if there was a failure previously

Reader
- Responsible for reading the assigned log files and sending logs to the buffer
- Stops when the end-of-file (EOF) is reached

Buffer
- Accepts logs from all readers
- Flushes logs to the publisher when the buffer is full or at a configured interval

Publisher
- Responsible for streaming the logs to the defined store or API

State Store
- Maintains checkpoints of how far each log file has been read
Go Channel between Reader and Buffer
- A buffered channel sits between the readers and the buffer
- All the readers push the logs they read from files into this channel, which is consumed by the buffer
- The channel acts as a queue between the readers and the buffer
- If the channel fills up, it applies back pressure, blocking all readers until the buffer starts consuming from the channel again
Log Service
In our current design, all the publishers in the log agents publish log messages in bulk to the log service.
- A gRPC sink for logs exported by log agent
- Publishes logs to Kafka with minimal processing
- Does not parse logs
- Manages logging configurations and serves as the backend for the log dashboard UI used for configuration
- Provides an interface for log workers to read configurations
The log service acts as a mediator between the log agents and Kafka. We did not want to connect our log agents directly to Kafka: we have around 2,000 pods running in production, which scale in and out as required, and creating that many connections to Kafka could become a problem in the long term. Instead, all the log agents send log messages to the log service over gRPC, and the log service forwards them to Kafka. This decouples the complexity of sending messages to Kafka away from the log agents.
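The mediator role can be sketched with an in-memory stand-in for the Kafka client; the type and method names below are hypothetical, and the real transport between agents and the service is gRPC:

```go
package main

import (
	"fmt"
	"sync"
)

// kafkaProducer is a stand-in for a real Kafka client; the log service keeps
// a small, fixed set of producer connections instead of one per agent pod.
type kafkaProducer interface {
	Send(topic string, msgs []string) error
}

// LogService receives bulk log batches from agents and forwards them to
// Kafka with minimal processing and no parsing.
type LogService struct {
	mu       sync.Mutex
	producer kafkaProducer
	topic    string
}

// Export is the handler an agent's publisher would call with a batch.
func (s *LogService) Export(batch []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.producer.Send(s.topic, batch)
}

// memProducer records sends so the sketch is runnable without Kafka.
type memProducer struct{ sent [][]string }

func (p *memProducer) Send(topic string, msgs []string) error {
	p.sent = append(p.sent, msgs)
	return nil
}

func main() {
	p := &memProducer{}
	svc := &LogService{producer: p, topic: "logs"}
	svc.Export([]string{"log1", "log2"})
	fmt.Println(len(p.sent)) // 1
}
```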
Log Workers
Workers are Kafka consumers that decouple log processing from ingestion. They read the log configuration for each service to decide whether incoming logs should be ingested, and parse the logs to extract metadata such as the log level and service name. Workers sit at the end of the logging pipeline, writing logs into the log storage.
This setup allowed us to meet our requirements to have multiple workers running at the same time, and writing to various storage systems.
Log control Dashboard
The dashboard is a minimal user interface that allows engineers to configure the ingestion of logs, like the log level and sampling rate.
Services are initialised by default to ingest only ERROR logs, capturing known errors and exceptions. When things go wrong, or when further troubleshooting is needed, the dashboard can be used to change the log level to DEBUG to ingest more detailed logs. This allows us to store logs that are meaningful for the current state of our services, and helps keep our logging costs under control.
Onboarding of services to the platform
Since the log agent is the only component running in service pods, the rest of the logging infrastructure can be independently managed by our platform team.
Since the log agent was deployed as a sidecar, the deployment required no effort from engineers while onboarding their services. In fact, the entire rollout was transparent and engineers were able to use the platform immediately.
Issues faced during rollout
While we did not face any major incidents during the rollout, we did have to tackle a few unforeseen challenges.
Unintentional log ingestion
Since we initially allowed ingestion at all log levels, including logs which could not be parsed, the log volume was well beyond our initial calculations. We had to quickly push a fix to disable the ingestion of unparsable logs, which brought the volume under control.
High CPU usage with Go tickers
We faced a small issue where we missed stopping some time.Ticker instances used in the code, which caused CPU usage to slowly creep up over time. We quickly found and fixed the issue; how we debugged it deserves a separate blog post, which we will publish in the future.
The logs ingested into Google Cloud Logging are currently configured with a short retention period. We are also exploring options to have an alternate and more affordable log storage system with long term retention regardless of the log level, which can be extracted and queried on an ad-hoc basis.
Log handling at scale is a formidable challenge. We took on the daunting task of building an in-house framework while relying on some managed solutions when we saw the best fit. The log platform has proved effective in providing Carousell engineers with a one-stop solution for all their log access needs. Most importantly, it has provided us with an invaluable opportunity to learn and work on core engineering problems, which ultimately enables us to deliver more value to our users.
When your code ditches you, logs become your best friend!