Common Misconceptions About Distributed Logging Systems

For many operators, logging represents an immutable record of application behavior and a necessary ledger for auditing, security, and debugging. As teams have moved toward cloud native development practices, they have likely invested in products and tooling like Splunk or the ELK stack to index and gain insight from their logs. Heavily regulated industries also have costly requirements for long-term storage of these logs, creating a valuable niche in the write-once storage market. That said, operators are not always clear on the challenges of various approaches to log aggregation and can have unrealistic expectations about their platform's reliability.

The first and most common misconception is that writing to persistent storage in the container running your process can achieve a lossless implementation, even with sophisticated scheduling. This is a false premise because delivery of logs from that persistent storage to a syslog server is not guaranteed. There are two challenges to consider here. First, the persistent storage needing collection must be managed by a redundant collector of some sort. Second, that collector must either retain logs in storage indefinitely or have a guaranteed connection to the syslog server. These are not impossible problems to solve individually, but taken together they run into the tradeoffs outlined by Brewer's CAP theorem: you cannot have consistency, availability, and partition tolerance all at once.
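To make the retention tradeoff concrete, here is a minimal sketch (not Loggregator's actual code, and the type names are my own) of a collector whose buffer is necessarily finite: while the syslog endpoint is unreachable, messages accumulate until capacity is hit, and then something must be evicted.

```go
package main

import "fmt"

// boundedBuffer models a collector's retention while the downstream
// syslog endpoint is unreachable: messages accumulate until the buffer
// is full, then the oldest are evicted. Choosing a finite bound means
// choosing availability over a complete record.
type boundedBuffer struct {
	max     int
	entries []string
	dropped int
}

func (b *boundedBuffer) add(msg string) {
	if len(b.entries) == b.max {
		b.entries = b.entries[1:] // evict the oldest entry
		b.dropped++
	}
	b.entries = append(b.entries, msg)
}

// flush simulates the syslog endpoint coming back: whatever is still
// buffered is delivered, but evicted messages are gone for good.
func (b *boundedBuffer) flush() (delivered, dropped int) {
	delivered = len(b.entries)
	b.entries = nil
	return delivered, b.dropped
}

func main() {
	buf := &boundedBuffer{max: 3}
	for i := 1; i <= 5; i++ {
		buf.add(fmt.Sprintf("event-%d", i))
	}
	delivered, dropped := buf.flush()
	fmt.Printf("delivered=%d dropped=%d\n", delivered, dropped) // delivered=3 dropped=2
}
```

The only way to make `dropped` provably zero is unbounded storage or a partition-free network, which is exactly the bind described above.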

The second common misconception is that reliable network protocols, plus redundant and highly available components, will guarantee delivery. To better understand why this is not entirely true, check out this primer on service level objectives. The constraint here is that this approach can only be as reliable as your underlying network, which, spoiler alert, is not guaranteed (though it is highly reliable on a public cloud).

This brings me to the final common misconception among operators and architects: that logging should produce backpressure if delivery is not guaranteed. This strikes me as a poor design tradeoff for logs. While you may have specific use cases that warrant serial processing of events, deciding to constrain all development by blocking on logging to stdout will show its downside in your first popular API that logs in a hot code path.
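The alternative to blocking is a non-blocking enqueue. A minimal sketch of the idea (my own illustration, not code from any of the projects mentioned here) uses Go's `select` with a `default` case so a full buffer drops the line instead of stalling the request:

```go
package main

import "fmt"

// tryLog enqueues a log line without ever blocking the caller: if the
// buffer is full, the line is dropped, so a slow log sink can never
// apply backpressure to a hot code path.
func tryLog(ch chan string, line string) bool {
	select {
	case ch <- line:
		return true
	default:
		return false // buffer full: drop rather than stall the request
	}
}

func main() {
	logs := make(chan string, 2) // deliberately tiny buffer, no consumer, for the demo
	sent, dropped := 0, 0
	for i := 0; i < 5; i++ {
		if tryLog(logs, fmt.Sprintf("request %d", i)) {
			sent++
		} else {
			dropped++
		}
	}
	fmt.Printf("sent=%d dropped=%d\n", sent, dropped) // sent=2 dropped=3
}
```

The hot path pays a constant, tiny cost per log line regardless of how slow the sink is; the price is that loss becomes possible, which is precisely the tradeoff being argued for.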

This is all part of the feedback and synthesis that has led to the design principles for Loggregator, the logging system for Cloud Foundry. Fittingly, “apply no backpressure” is the first design principle of not only Loggregator but also Garden and the platform as a whole. A cloud native developer shouldn’t have to consider (nor are they capable of considering) how disk I/O would affect the performance of their application.

The second principle is to deliver as many messages as possible. Over the last nine months, Loggregator has moved from UDP to gRPC to improve this. You can hear more about how we measured this and the improvements we saw in my talk at CF Summit this year.

Our third and final principle is to provide a secure transport. This is still a maturing feature across the platform and a major part of making Cloud Foundry enterprise ready. It helps operators meet their security needs even when using public cloud providers and SaaS logging tools.

Stay tuned for more details about capacity planning techniques for logging, and if you are in Basel next month, come check out my talk on Defining Service Level Objectives for Loggregator.