How we reduced our log usage by 75% without losing visibility into our applications

Gabriel Eisbruch
Published in nullplatform
Jan 2, 2024 · 4 min read

If you’ve ever found logs to be an urgent topic of discussion in your team, you’re not alone. In my previous role as CTO of one of LATAM’s largest e-commerce platforms, logs became a primary topic of conversation when, one summer, our Datadog bill accounted for 35% of our cloud expenses! If this sounds familiar, here’s a spoiler: there’s a solution, and it’s simpler than it seems.

In application development, diagnostic tools for operational issues are crucial. Among these tools, “logs” — records of what’s happening in an application, like transaction status or process activity — are fundamental. However, determining when and how to log information can quickly become complex.

This leads to a big question: why is effective logging so challenging?

Logging is akin to predicting the future. We need to record information to understand various situations, both expected and unexpected, in unfamiliar contexts, often for unpredictable traffic volumes. For example, on a payment platform, we need to diagnose potential issues, whether internal or third-party, in a high-transaction environment, often distributed across multiple servers. This complexity can push log volumes past the volume of the transaction data itself, making processing cumbersome and sending costs skyrocketing, especially for software handling numerous transactions.

To understand how companies tackle these issues, let’s look at typical log architectures (there are endless variants, but we’ll outline the most common) and a different approach to solving frequent problems.

Disk Logs

The simplest method is to log everything we need and store it on the server’s disk. While economical and straightforward, this method makes the information hard to leverage. Reading the logs requires direct access to the server, which poses a security issue. With multiple servers, manual aggregation is tedious. And in dynamic cloud environments, where servers are quickly created and destroyed, the approach breaks down entirely.
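As a rough sketch of what this looks like in practice, assuming a Python application and a writable /var/log/myapp directory (both hypothetical), rotation is typically the only safeguard against unbounded growth:

    import logging
    from logging.handlers import RotatingFileHandler

    # Write logs to the local disk, rotating at 10 MB and keeping five
    # backups so the files don't grow without bound.
    handler = RotatingFileHandler(
        "/var/log/myapp/app.log", maxBytes=10_000_000, backupCount=5
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )

    logger = logging.getLogger("payments")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    logger.info("transaction %s settled in %d ms", "tx-42", 187)

Everything ends up on that one machine, which is exactly why aggregation and access become painful as the fleet grows.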

Log Aggregators

A more efficient, widespread method is using log aggregators and centralized storage services like Datadog and CloudWatch Logs. Developers focus on logging, while DevOps teams choose where to aggregate logs.
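As an illustration, here is roughly what direct shipping to CloudWatch Logs looks like with boto3. The group and stream names are hypothetical, and AWS credentials are assumed to be configured in the environment:

    import time
    import boto3

    logs = boto3.client("logs")
    GROUP, STREAM = "/myapp/payments", "server-01"  # hypothetical names

    # Create the group and stream once; ignore "already exists" on restart.
    for fn, kwargs in [
        (logs.create_log_group, {"logGroupName": GROUP}),
        (logs.create_log_stream, {"logGroupName": GROUP, "logStreamName": STREAM}),
    ]:
        try:
            fn(**kwargs)
        except logs.exceptions.ResourceAlreadyExistsException:
            pass

    # Every log line the application emits is shipped, indexed, and billed.
    logs.put_log_events(
        logGroupName=GROUP,
        logStreamName=STREAM,
        logEvents=[{"timestamp": int(time.time() * 1000),
                    "message": "transaction tx-42 settled in 187 ms"}],
    )

Note that every event the application emits is transmitted and billed as-is; nothing filters it on the way out.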

However, this can be deceptive. As log volumes increase, so do costs, often exceeding computing costs. This creates tension between DevOps, concerned about costs, and developers, focused on application functionality. Restrictions on logging can lead to important information being missed during troubleshooting. While useful, this architecture can become costly and unmanageable.

Dynamic Local Pipelines

At nullplatform, we developed a solution we term “Dynamic Local Pipelines,” aimed at resolving the log management dilemma without imposing the usual trade-off between necessary logging and cost control. The architecture pivots around a ‘log sidecar’ component and a SaaS control plane.

Here’s how it works:

  • Localized Log Management: Each server, Kubernetes cluster, or serverless function is equipped with its log sidecar. This sidecar acts as a local log manager, dynamically configuring and controlling the logs generated at its specific location.
  • Dynamic Configuration: What sets this apart is the ability to modify log settings dynamically, per independent service (pod, function, container, etc.), offering real-time flexibility and adaptability (see the sketch after this list).
  • Developer-DevOps Collaboration: This architecture fosters a collaborative environment between developers and DevOps teams. Developers can embed comprehensive logs into the application, ensuring detailed insights for troubleshooting. Simultaneously, DevOps teams can enforce overarching logging policies that align with budgetary and operational constraints.
  • Cost-Effective and Efficient: By controlling logs at their source, this approach significantly reduces unnecessary data transmission, storage, and processing. It leads to substantial savings in terms of storage and network costs, as well as efficiency in log processing and management.
  • Temporary Overrides: A unique feature is the ability for developers to temporarily override default logging policies during critical debugging or troubleshooting periods. This flexibility ensures that, when necessary, detailed logs are available without permanently increasing costs.
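To make the idea concrete, here is a heavily simplified sketch of such a sidecar in Python. It is an illustration of the technique, not nullplatform’s actual implementation: the control-plane endpoint, service id, and policy shape are all hypothetical, and the application is assumed to emit JSON log lines. The sidecar periodically pulls the policy for its service, honors a temporary developer override while it is valid, and filters and samples records locally before anything is forwarded:

    import json
    import random
    import time
    import urllib.request

    CONTROL_PLANE = "https://controlplane.example.com/policy"  # hypothetical endpoint
    SERVICE_ID = "payments-pod-7"                              # hypothetical service id
    LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}

    def fetch_policy():
        # Pull the current logging policy for this service from the control
        # plane. A temporary developer override (with an expiry) takes
        # precedence over the DevOps-set default.
        url = f"{CONTROL_PLANE}?service={SERVICE_ID}"
        with urllib.request.urlopen(url) as resp:
            policy = json.load(resp)
        override = policy.get("override")
        if override and override["expires_at"] > time.time():
            return override       # e.g. {"min_level": "DEBUG", "sample_rate": 1.0}
        return policy["default"]  # e.g. {"min_level": "WARN", "sample_rate": 0.1}

    def should_forward(record, policy):
        # Drop records below the policy's level, then sample the remainder,
        # all at the source, before anything leaves the server.
        if LEVELS[record["level"]] < LEVELS[policy["min_level"]]:
            return False
        return random.random() < policy.get("sample_rate", 1.0)

    def run(log_lines, forward, refresh_every=30):
        # Tail the application's log stream, re-reading the policy
        # periodically so changes take effect without a restart or redeploy.
        policy, last_refresh = fetch_policy(), time.time()
        for line in log_lines:
            if time.time() - last_refresh > refresh_every:
                policy, last_refresh = fetch_policy(), time.time()
            record = json.loads(line)  # assumes JSON log lines
            if should_forward(record, policy):
                forward(record)

Because filtering and sampling happen at the source, dropped records never reach the network or the aggregator’s bill; temporarily switching the override to DEBUG with a sample rate of 1.0 restores full detail without a redeploy.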

Through “Dynamic Local Pipelines,” we at nullplatform achieved a remarkable 75% reduction in log consumption, maintaining deep visibility into our applications without the burden of excessive costs.

In Conclusion

The “Dynamic Local Pipelines” approach fundamentally changes the game in log management by:

  • Empowering Developers: Developers are no longer constrained by rigid logging policies. They have the freedom to embed as much detail as necessary into their application logs, ensuring that critical information is always available when needed.
  • Cost Control: Despite the increased potential for detailed logging, overall costs are kept in check. The dynamic nature of the policy enforcement ensures that log verbosity increases only when absolutely necessary.
  • Seamless Integration: This solution seamlessly integrates into existing environments without necessitating changes in the application code or deployment strategies. It is adaptable, scalable, and non-intrusive.
  • Balancing Act: The real achievement of this system is in striking the perfect balance between operational visibility and cost-effectiveness. It demonstrates that it’s possible to have detailed logging without the associated high costs.

For more information or to explore how this solution can be applied to your organization, don’t hesitate to reach out to me (gabriel@nullplatform.io).
