Transitioning Logging and Monitoring Systems at The Economist
At The Economist, transitioning from a monolithic legacy system to a distributed, microservice-based architecture required a fundamental shift in the way we worked as a digital team. While we were fully engaged in the exciting and challenging process of transforming our technology, our logging and monitoring approach wasn’t able to keep up. We quickly hit the pain points that come from not translating logging practices from a single, integrated system to a distributed one:
- Logs from different applications have differing schema or no schema at all
- Containerization creates multiple applications running on the same server
- Dynamic host IPs as instances scale or are regenerated
- HTTP requests not easily traceable through distributed applications and services
Because we were stuck in the monolithic approach of simply sending system logs to an aggregation tool, the usability of our visibility tools quickly degraded. As a result, Economist digital teams were often unable to determine why a system had failed, and response and resolution times for business-critical issues suffered. The overall effect was a lack of transparency into our systems and diminished confidence in their ability to perform at scale.
After going cross-eyed staring at jumbled logs, it was time for a change. A small, grassroots initiative began in the development groups, working with architects and product owners to understand each team’s current logging situation and biggest pain points. These discussions inspired the creation of a logging standard and format guidelines that would ensure all applications used structured logging. Key requirements included:
- All logs in a structured JSON format with key-value pairs
- Use of log levels to control what is sent to aggregators and to filter results in query tools
- Implementation of a Trace ID, generated at the start of a system event and passed to all subsequent calls and subroutines
- Required fields with aligned naming such as service, msec, host, and message
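As an illustration of these requirements, here is a minimal sketch using Python's standard logging module. It is not the library any Economist team actually built. The service name "content-api", the "trace_id" and "level" keys, and the reading of msec as an epoch timestamp in milliseconds are all assumptions made for the example; only service, msec, host, and message come from the standard described above.

```python
import io
import json
import logging
import socket


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with aligned field names."""

    def __init__(self, service):
        super().__init__()
        self.service = service
        self.host = socket.gethostname()

    def format(self, record):
        entry = {
            "service": self.service,
            "host": self.host,
            "level": record.levelname,
            # Assumption: msec is the event time in epoch milliseconds.
            "msec": int(record.created * 1000),
            # Assumption: the Trace ID is supplied per call through `extra`.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service, stream, level=logging.INFO):
    """Create a logger that writes JSON lines at or above `level`."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    logger.setLevel(level)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger("content-api", stream)
log.debug("cache miss")  # below INFO, so never reaches the aggregator
log.info("article served", extra={"trace_id": "abc123"})
lines = stream.getvalue().splitlines()
print(lines[0])
```

Because every entry is a JSON object with the same keys, a query tool can filter on level or trace_id without parsing free-form text.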
Additional required fields were documented based on the application type, such as request method and status for client applications. Specific recommendations for implementing the standards in a logging library included:
- Review of existing language specific libraries
- Middleware for log initialization and Trace ID generation
- SHA-1/Base32 encoding method for generating a unique Trace ID
- Use of contexts for passing Trace IDs to subroutines and requests
These standards were circulated around the teams until a consensus was reached, at which point, we had a beautiful document.
The Actual Solution
While agreeing on a standard is no insignificant milestone, once it’s been achieved the time comes to dig in and do the work. As this work touched every team in Digital Solutions and was not specific to any one product or project, it was important to coordinate implementation across all engineering teams. An offhand suggestion to put developers in a room with coffee and sugar for a few days quickly evolved into a proposal for a hackathon focused on logging and monitoring.
To expedite the process of applying logging standards and ensure alignment, The Economist hosted a Logging and Monitoring Hackathon with participants from each digital team. The goal was to implement the new logging standards and develop dashboards and alerts using the logging format. The hackathon would ensure each team had a participant who could learn the new standards and share them back with their teams.
Day 1: Kindling
The first day of the hackathon prioritized implementing the logging standards across our platforms. The morning kicked off with an overview of the new standards and recommendations for how logging libraries should be designed. Each team was then prompted to spend roughly an hour on logging design for their system, examining where specific challenges, such as mobile and third-party applications, would require more custom solutions. As teams shared their initial designs, we identified key collaboration areas, such as data encryption and security, and held feisty discussions around centralized vs decentralized formatting. As the day went on, teams focused on their specific implementation challenges, including:
- Which library implementation would best provide ubiquitous logging and ease of use for developers
- Abstracting log formatting to be reusable by different applications
- Lightweight solutions for serverless frameworks
As the sun set and the coffee pots emptied, PRs were issued and the seeds of standardized logging were planted.
Day 2: Timber!
With the beginnings of standardized logs in place, Day 2 focused on monitoring requirements. The goal was to define key methods for understanding system latency, traffic, errors, and saturation, as per the Google SRE recommendations. Members of the infrastructure team joined to provide feedback and advice on how to configure different types of monitoring tools. Bringing developers and infrastructure together to focus on the big picture was an opportunity for the team to hash out new ideas and debate the merits of different alerts and dashboards. The age-old questions of “Will anyone actually look at this dashboard a month from now?” and “Will I get so many email alerts I start to ignore them all?” arose as we discussed methods for understanding overall system health. There were no obvious answers to these questions. Rather, developers and devops worked together on building solutions that balanced visualizations, dashboards, and alerts aligned with the most critical elements of system stability. For example, an email alert was built to respond to sudden spikes in HTTP error responses, while a simple green-red dashboard was recommended for quick visibility of deployment statuses.
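To make the error-spike alert concrete, here is a minimal sketch of one way such a rule could be expressed. It is not the alert built at the hackathon, which would have lived in a monitoring tool rather than application code. The class name, window size, threshold, and the choice to count only 5xx responses are assumptions for illustration.

```python
from collections import deque


class ErrorSpikeDetector:
    """Flag when the share of 5xx responses in a sliding window is too high."""

    def __init__(self, window=100, threshold=0.2, min_samples=20):
        self.statuses = deque(maxlen=window)  # keeps only the newest entries
        self.threshold = threshold
        self.min_samples = min_samples

    def error_rate(self):
        if not self.statuses:
            return 0.0
        errors = sum(1 for status in self.statuses if status >= 500)
        return errors / len(self.statuses)

    def record(self, status):
        """Add one response status and report whether an alert should fire."""
        self.statuses.append(status)
        # Requiring a minimum sample size avoids alerting on the first
        # failed request after a quiet period.
        enough = len(self.statuses) >= self.min_samples
        return enough and self.error_rate() > self.threshold


detector = ErrorSpikeDetector(window=10, threshold=0.3, min_samples=5)
healthy = [detector.record(200) for _ in range(5)]
failing = [detector.record(500) for _ in range(3)]
print(healthy, failing)
```

Tuning the threshold and minimum sample size is one way to approach the "too many email alerts" concern: the rule stays silent on isolated errors and fires only on a sustained rise.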
As the day came to an end it was time for the teams to present their progress. Participants organized a brief presentation on their work and plans for moving the work forward. Pizza and beer were ordered, and the rest of the office was invited to watch and participate with questions or suggestions. By decree of the loudest round of applause, a winner was chosen and crowned in The Economist T-shirts as the Logging and Monitoring monarch.
The work continues. Logging and monitoring are iterative processes, and teams should incorporate logging and monitoring reviews as part of regular maintenance and performance work. This is critical for ensuring that the relevant information is recorded and analyzed. At the end of the hackathon, each team left with a strategy for their logging and monitoring, and with stories in their backlog or defined next steps to move the work forward. Participants also scheduled times to present their work back to their teams. So while the work is never done, the hackathon laid the groundwork and built a shared vision for where our logging and monitoring can grow. More importantly, it brought developers together to share ideas, learn from one another, and maybe, just maybe, fall in love with standardized logging.