Building a comprehensive and cost-efficient network observability platform using Grafana Loki

Christophe Collot · Published in The Qonto Way · Jun 15, 2023

The struggle of troubleshooting network traffic on AWS

Buzz, buzz, buzz… Your pocket starts vibrating: an incident has occurred, and you’ve been paged to fix it! Feeling like a superhero, you start investigating. It takes you a second to figure out that the issue is with application A (so easy).

The application’s logs show the following error when trying to reach application B: "connection timeout". On application B's side, everything is running smoothly.

All signs point to a network issue; most likely, a change to a security group is blocking the traffic. No pull requests on the Terraform code? Did someone change something manually? You can’t figure out what’s actually blocking the traffic, and you start wondering whether it’s DNS or a routing issue after all.

Instead of wandering around your network infrastructure in despair, you decide to analyze AWS VPC Flow logs following this AWS guide.

However, your excitement soon turns to frustration as you realize the complex steps involved: deploying an AWS Glue database with partitioned tables, setting up an Athena Workgroup using CloudFormation, and manipulating data with Athena using pseudo-SQL queries. Each query takes an agonizingly long time to execute — you’re frequently required to repartition the data — and you find yourself tediously copying, pasting, and rewriting S3 bucket keys.

You break into a cold sweat, and suddenly networking isn’t fun anymore.

Building our dream network observability platform

At Qonto, we manage significant amounts of logs. They may be produced by business applications, core infrastructure systems, or even managed services.

When I joined Qonto, we were exclusively relying on Elasticsearch to ingest, store, and search logs. While it’s a great solution, we realized it wasn’t the right fit (mostly for cost-efficiency reasons, but more on that later) for the following use case: providing SRE-grade observability and alerting on network logs. Specifically, we were dealing with a staggering volume of approximately 40,000 records per second generated by AWS VPC Flow Logs.

As the SRE team, we spend a considerable amount of time debugging network systems (firewalling, routing, etc.). The ability to swiftly search network traffic and issue alerts plays a vital role in minimizing downtime. Additionally, being able to explore logs spanning any time period is necessary to perform thorough network analysis and conduct valuable post-mortems.

As a result, sampling or filtering logs was not a viable option. Our objective was to achieve complete observability of all traffic flowing through our networks by ingesting those 40,000 records per second (~1 TB/day).

This article details how we built a network observability platform in a cost-efficient way, leveraging Loki, Prometheus, Grafana, and Lambda Promtail.

Managing large amounts of infrequently-accessed data

Network logs are often high-throughput log streams (dozens of terabytes each month in our case) when all traffic is recorded.

That being said, network logs are only accessed when we need to troubleshoot an issue happening with an application (the most common symptom being a connection timeout error message).

As a result, we end up with a huge amount of infrequently accessed logs.

This implies that, to build a cost-efficient platform, we should focus on optimizing storage and ingestion costs rather than query costs.

Selecting the appropriate datastore: Loki vs. Elasticsearch

Loki and Elasticsearch were our two main candidates. After conducting in-depth Value Engineering, we decided to use Grafana Loki to ingest all VPC Flow logs we produce, instead of using our existing Elasticsearch deployments.

Here are the main conclusions that guided our decision:

  1. The cost of storing significant amounts of data using object storage vs. block storage. Loki can leverage S3 to store data chunks and indexes, whereas Elasticsearch uses local file storage (EBS). In our case, in the eu-west-3 AWS region (Paris), it’s currently 4 to 5 times cheaper to store data on AWS S3 rather than on EBS, making it a no-brainer for this use case.
    Note: data transfer costs within AWS are negligible in our case because we use VPC endpoints.
  2. Loki consumes fewer resources when ingesting data but a lot more at query time. This is due to the difference in indexing behaviors between the two solutions and the cost of using Loki parsers at query time.

Because network logs are high-volume & infrequently accessed, we decided Loki was the best candidate.

Note: We quickly discarded using CloudWatch Logs because of its daunting ingestion pricing of $0.59/GB of data at the time of writing this article.

Solution overview

Here is a high-level overview of our setup:

  • We deployed Loki in microservice mode using the loki-distributed Helm chart (a minimal configuration sketch follows this list).
  • We deployed the VPC Flow Logs, the S3 buckets, the Lambda function, and the SQS queues using Terraform.
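For illustration, here is a minimal sketch of the storage-related part of a Loki configuration as it could be passed through the Helm chart’s values (depending on the chart version, via loki.config or loki.structuredConfig). The bucket name and schema date below are placeholders, not our actual setup:

# Illustrative Loki storage configuration (bucket name and dates are placeholders)
loki:
  structuredConfig:
    storage_config:
      aws:
        region: eu-west-3
        bucketnames: my-loki-chunks-bucket    # hypothetical S3 bucket holding chunks and indexes
    schema_config:
      configs:
        - from: "2023-01-01"
          store: boltdb-shipper               # index files are shipped to object storage too
          object_store: s3
          schema: v12
          index:
            prefix: loki_index_
            period: 24h

Storing both chunks and indexes on S3 is what makes the object-storage cost advantage described above possible.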

Ingesting VPC flow logs in Loki using Lambda Promtail

Lambda Promtail is a subproject of Loki that we actively contribute to. It’s a utility tool enabling the ingestion of logs produced by AWS services such as CloudFront, Elastic Load Balancers, and, of course, VPC Flow Logs.

In our case, the Lambda Promtail function retrieves S3 notification events from an SQS queue whenever a new VPC Flow Log file is created in a centralized S3 flow-log-bucket. It then downloads and parses the file, and streams its content to Loki for ingestion.
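However the function is deployed (we use Terraform), lambda-promtail itself is configured through environment variables. As a rough, hypothetical illustration (placeholder values, shown as plain YAML rather than Terraform), the most relevant ones look like this:

# Illustrative lambda-promtail environment variables (all values are placeholders)
WRITE_ADDRESS: "https://loki.internal.example/loki/api/v1/push"   # Loki push API endpoint
EXTRA_LABELS: "team,sre"      # extra static labels added to every ingested stream (name,value pairs)
BATCH_SIZE: "131072"          # approximate batch size, in bytes, before flushing to Loki
KEEP_STREAM: "false"          # whether to keep the originating log stream as a Loki label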

Don’t lose a single log record

When there’s an error in the ingestion flow, Lambda sends the S3 event to a Dead-Letter Queue (DLQ). The messages can later be reprocessed using AWS SQS’s DLQ redrive feature, which sends them back to the main queue to be picked up again by the Lambda function.
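One common way to wire this, sketched below under the assumption that the DLQ is attached to the main queue through a redrive policy (the ARN and threshold are placeholders):

# Illustrative redrive policy on the main SQS queue: after maxReceiveCount failed
# processing attempts, SQS moves the message to the DLQ. The DLQ redrive feature
# can later send it back to the source queue for reprocessing.
RedrivePolicy:
  deadLetterTargetArn: arn:aws:sqs:eu-west-3:123456789012:flow-log-dlq
  maxReceiveCount: 3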

Loki itself is a very resilient system leveraging all the power of Kubernetes. Still, things can always go wrong, and this setup has helped us recover from incidents without losing a single log record on several occasions.

In fact, we are quite proud of this setup since we implemented the SQS feature ourselves in this PR.

Note: We’re currently working on extending Lambda Promtail’s capabilities to ingest logs from other AWS services. We have successfully integrated CloudTrail log ingestion (resulting in a significant improvement to our IAM debugging experience) and CloudFront log ingestion in batch mode.

Visualizing flow logs using Grafana: make networking fun again

LogQL is a powerful query language for logs stored in Loki. Since its syntax is very similar to the PromQL syntax we already use extensively with Prometheus, we felt at home very quickly.

After adding Loki as a datasource in Grafana, we were ready to explore our VPC Flow Logs.
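For reference, this can be done declaratively with a Grafana provisioning file like the following (the URL is a placeholder pointing at the Loki gateway or query frontend):

# Illustrative Grafana datasource provisioning file (URL is a placeholder)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.observability.svc.cluster.local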

Coming back to our previous example with Application A being unable to reach Application B, let’s now suppose we want to inspect rejected traffic from A to B.

The following query retrieves all REJECTED network traffic with a source IP from the 10.0.0.0/8 CIDR range (where application A is) and a destination IP of "10.36.81.82" (Application B’s IP):

{__aws_log_type="s3_vpc_flow", __aws_s3_vpc_flow_owner="<aws-account-id>"}  
|= "REJECT"
| pattern `<acc> <vpc> <eni> <srcIP> <destIP> <sPort> <dPort> <protocol> <pkt> <bytes> <start> <end> <action> <pktsrc> <pktdest> <tcpflag>`
| srcIP = ip("10.0.0.0/8")
| destIP = "10.36.81.82"
| line_format `srcIP: {{.srcIP}} destIP: {{.destIP}} srcPort: {{.sPort}} destPort: {{.dPort}} action: {{.action}} protocol {{.protocol}} tcpflag: {{.tcpflag}}`

The pattern expression parses the log records at query time, adding extra labels that can later be used to create a more fine-grained query.

The line_format expression rewrites the log line displayed in the query result.

And voilà:

Oops! Looks like there’s an issue reaching Application B on IP 10.36.81.82 and port 30060.

Leveraging the flexibility of LogQL, we can easily build queries to detect blocked traffic to specific TCP ports or IP ranges, across hundreds of gigabytes, in only a few seconds.

At last, investigating networking issues is (almost) fun again!

Alerting with metrics generated from logs

At Qonto, we use Prometheus and AlertManager to manage metrics and alerts. Loki’s integration with Prometheus enables metrics creation from log records using Loki Recording Rules.

For VPC Flow Logs, we created custom metrics to improve observability and alerting. Here are a few of them:

  • Rejected traffic count over the last minute:
# Name of the Prometheus metric created
- record: loki:per_aws_account:rejectedflowlog:count1m
  # LogQL query
  expr: sum by (__aws_s3_vpc_flow_owner) (count_over_time({__aws_log_type="s3_vpc_flow"}[1m] |= `REJECT` | pattern `<_> <_> <_> <srcIP> <_> <_> <_> <_> <_> <_> <_> <_> <_> <_> <_> <tflag>` | srcIP = ip("10.0.0.0/8")))
  # Extra labels
  labels:
    team: sre

We use this metric in a PrometheusRule-based alert to detect blocked traffic and intervene as quickly as possible (an example rule is sketched at the end of this section).

  • Top 20 destination IPs with the largest egress throughput over the last minute:
- record: loki:top20dstip:egressbytes:sum1m
  expr: topk(20, sum by (__aws_s3_vpc_flow_owner, destIP) (sum_over_time({__aws_log_type="s3_vpc_flow", __aws_s3_vpc_flow_owner="<account-id>"} |= "ACCEPT" | pattern `<_> <_> <_> <_> <destIP> <_> <_> <_> <_> <bytes> <_> <_> <_> <_> <_> <_>` | destIP != ip("10.0.0.0/8") | unwrap bytes [1m]) > 1000))
  labels:
    team: sre

This metric proved useful in reducing our data transfer costs on AWS. It helped us detect that a considerable amount of data was being transferred to external services. We were then able to optimize those flows: it turned out compression was not enabled on some data transfers to external SaaS services, and VPC Endpoints for AWS ECR were missing.
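Coming back to the rejected-traffic metric, here is what the alert mentioned above could look like as a PrometheusRule. This is an illustrative sketch: the rule name, threshold, and duration are placeholders, not our production values.

# Illustrative PrometheusRule built on the recording rule above (threshold is a placeholder)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vpc-flow-rejected-traffic
spec:
  groups:
    - name: network-observability
      rules:
        - alert: RejectedVpcTrafficDetected
          expr: loki:per_aws_account:rejectedflowlog:count1m > 100
          for: 5m
          labels:
            team: sre
          annotations:
            summary: "Unusually high volume of rejected VPC traffic detected"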

Conclusion

By leveraging Loki, Lambda Promtail, Prometheus, and Grafana, we built a cost-efficient network observability platform at Qonto. The ability to ingest and query VPC Flow Logs in near real-time has significantly improved our troubleshooting capabilities and reduced incident resolution times.

We were originally concerned about the cost implications and the strain it would put on our existing stack to ingest such a high volume of logs. However, introducing Loki has provided us with new opportunities for efficient ingestion, enabling us to address the gaps in our observability. Now, we can ingest all logs from AWS and other sources, unlocking comprehensive visibility into our systems.

We’ve significantly strengthened our network monitoring capabilities by leveraging custom metrics derived from log records and by implementing appropriate alerting rules. Encouraged by the success of this approach, we plan on expanding its application to numerous other use cases.

We hope this article has provided insights into building your next network observability platform using open-source tools.

Happy logging and monitoring!

Qonto is a finance solution designed for SMEs and freelancers, founded in 2016 by Steve Anavi and Alexandre Prot. Since our launch in July 2017, Qonto has made business financing easy for more than 350,000 companies.

Business owners save time thanks to Qonto’s streamlined account set-up, an intuitive day-to-day user experience with unlimited transaction history, accounting exports, and a practical expense management feature.

They stay in control while being able to give their teams more autonomy via real-time notifications and a user-rights management system.

They benefit from improved cash-flow visibility by means of smart dashboards, transaction auto-tagging, and cash-flow monitoring tools.

They also enjoy stellar customer support at a fair and transparent price.

Interested in joining a challenging and game-changing company? Consult our job offers!
