Rapid Troubleshooting of the Cloud Foundry Logging System

Adam Hevenor
2 min read · Feb 15, 2018


Last year, when the Google SRE Team performed their application readiness review of the Cloud Foundry platform, they identified two Frequent and Damaging issues related to Loggregator, the log transport system. Since then we have developed tooling, improved our documentation, and made product improvements that make Loggregator more reliable and easier to operate.

This post summarizes the results of that feedback.

⭐ Metric — Reliability ⭐

The star metric for the Loggregator system is the reliability of log transport, and Operators should focus their SLOs around this metric. We recommend Operators set a Service Level Objective of 99%. In fact, our research has shown that most teams don’t notice when they dip slightly below this, but they do notice dips below 90%.

That said, with proper scaling and monitoring, Operators can easily achieve the 99% reliability target and beyond by following these two steps. (Read my previous post about Defining Service Level Objectives for Loggregator.)
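
To make the metric concrete, here is a minimal sketch of how a reliability figure could be computed and compared against the 99% objective and the 90% "teams notice" threshold mentioned above. It assumes you already have a count of log messages emitted and a count of messages received at the consumer. The function names and status labels are hypothetical and are not part of Loggregator or any Cloud Foundry tool.

```python
def log_reliability(sent: int, received: int) -> float:
    """Return the fraction of emitted log messages that reached the consumer."""
    if sent <= 0:
        raise ValueError("sent must be a positive number of messages")
    # Cap at 1.0 so duplicate deliveries cannot push the ratio above 100%.
    return min(received, sent) / sent


def slo_status(reliability: float, slo: float = 0.99, noticeable: float = 0.90) -> str:
    """Classify a reliability ratio against the two thresholds from the text.

    The labels returned here are illustrative only.
    """
    if reliability >= slo:
        return "meeting SLO"
    if reliability >= noticeable:
        return "below SLO"
    return "noticeable loss"


if __name__ == "__main__":
    ratio = log_reliability(sent=10_000, received=9_950)
    print(f"reliability={ratio:.2%} status={slo_status(ratio)}")
```

Keeping the two thresholds as parameters lets an Operator tighten or relax them without changing the calculation itself.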

📈 Scaling Indicators 📈

The most common reason that Loggregator has reliability issues is that it is not properly scaled. The fastest way to check your log reliability on any distribution of Cloud Foundry is to use the open source logmon application. If you are using the Pivotal Cloud Foundry distribution, this is another reason to upgrade to PCF Healthwatch, which includes the key scaling indicators for Loggregator. Open source operators who want a more in-depth analysis of capacity planning can read the Loggregator Operator Handbook.

🔊 Noisy Neighbors 🔊

Another common cause of transport reliability issues is a “noisy” application producing so many logs that it drowns out the other applications on the platform. To help reduce the Mean Time to Discover (MTTD) for noisy neighbors, we created the noisy-neighbor-nozzle.

This nozzle reads and counts all the application logs on the platform and comes with a handy CLI plugin for quickly getting a list of the applications producing the most logs.

Sample noisy-neighbor assessment using the CLI tool
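
The counting idea behind this kind of tool can be illustrated with a short sketch. This is not the nozzle's actual implementation: it assumes each log message has already been reduced to the ID of the application that produced it, and the function name and sample app IDs are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_log_producers(app_ids: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Count log messages per application and return the noisiest ones first.

    app_ids yields one application ID per log message observed.
    """
    counts = Counter(app_ids)
    # most_common sorts by count, highest first, and truncates to `limit`.
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical sample: one entry per log line seen during a time window.
    observed = ["app-a"] * 5 + ["app-b"] * 2 + ["app-c"]
    for app_id, count in top_log_producers(observed, limit=2):
        print(f"{app_id}\t{count}")
```

Ranking by raw message count over a fixed window is the simplest approach; an application that dominates the list is a likely candidate for the noisy neighbor.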

🐊 Conclusion 🐊

These two scenarios cover nearly all log reliability problems we have seen over the last year. Hopefully these tools will help Operators quickly and easily troubleshoot their systems. If you have questions, you can reach out to us on the #loggregator Slack channel.
