Get Your Ducks in a Row: Analyzing AWS VPC flow logs with DuckDB
AWS VPC flow logs provide a lot of value for troubleshooting network setup, analyzing traffic patterns, bottlenecks and for security analytics. At a high level these can provide valuable insights for Ops, SRE, Security, FinOps teams. Unfortunately enabling flow logs can sometimes be a costly proposition. Let’s look at some use-cases where we can leverage these without breaking the bank
Flow Logs
The cheapest option is to setup VPC flow logs logging to S3 bucket. Logging to CloudWatch Logs can cost more than 20 times. Scanning the data with CloudWatch Insights will cost even more. What is the point of logging if one can’t query or get insights :). I usually avoid CloudWatch unless that is the only logging option (Lambda’s, etc.) Once setup, logging to S3 bucket will show up in the bill under CloudWatch > [Region]-S3-Egress-Bytes. Depending on the damage, one can figure out whether to keep it or turn it on/off when necessary.
Settings I recommend:
- Filter: All
- Maximum aggregation interval: 10 minutes
- Destination: S3
- Log record format: Custom
- Log file format: Parquet
- Hive-compatible S3 prefix: Enable
- Partition logs by time: Every 1 hour
Let’s dwell into the Log format fields that are useful. ecs-* fields can be included if ECS service is used
- start, end
- action, log-status, protocol, flow-direction, traffic-path
- account-id, region, az-id, instance-id, interface-id, subnet-id
- bytes, packets, tcp-flags
- srcaddr, srcport, dstaddr, dstport, pkt-srcadd, pkt-dstaddr
- pkt-dst-aws-service, pkt-src-aws-service
Setup
In my last gig, I did VPC analysis using ClickHouse. Anything that leverages columnar storage format is ideal for VPC flow logs because it compresses nicely. This is where parquet comes into play. AWS supports saving logs in parquet and DuckDB can directly consume these files. First, let’s get some data that will help later in the analysis. Assuming EC2 instances are tagged, identify the tag key that groups the instances. For example: If Nginx is used as a reverse proxy, one might have them tagged as role: proxy
. Get the instance Ids, private IP address and the value of the role tag key. I do something like the following to capture this information in a TSV file:
for i in us-east-1 us-west-2 <all regions with instances>; do
aws ec2 describe-instances \
--filters Name=vpc-id,Values=vpc-1234,vpc-abcd,<all vpcs> \
--query 'Reservations[*].Instances[*].{instance:InstanceId,ip:PrivateIpAddress,role:Tags[?Key==`role`]|[0].Value}' \
--output text \
--region $i
done > instances.tsv
In the instances.tsv file, add a header line (separated by tabs) so that DuckDB can use those as column names:
instance\tip\trole
Fire up an EC2 instance (I use 8 vCPUs / 64 GB) in the same region where an S3 bucket with VPC flow logs exists. Make sure to attach an instance profile role that allows the instance to read data from flow log bucket. Use a bigger EBS volume (100 GB) depending on how much flow log data is being ingested. Install DuckDB on the instance. Copy instances.tsv to the EC2 instance. Fire up duckdb CLI with a DB name and run the following SQL (replace the paths, region, year, month, day and other details as appropriate)
PRAGMA max_temp_directory_size='100GiB';
PRAGMA temp_directory='/home/ec2-user/tmp';
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET s3 (
TYPE S3,
PROVIDER CREDENTIAL_CHAIN,
CHAIN 'instance',
REGION 'us-west-2'
);
CREATE TABLE flowlogs AS (
SELECT *
FROM read_parquet('s3://my-vpcflowlogs-bucket/prefix/AWSLogs/aws-account-id=123456789012/aws-service=vpcflowlogs/aws-region=us-west-2/*/*/*/*.parquet', hive_partitioning = true)
WHERE year = 2024 AND month = 05 AND day = 31);
CREATE TABLE instances AS (
SELECT *
FROM read_csv('/home/ec2-user/instances.csv'));
CREATE VIEW logs AS (
SELECT f.*, i.role AS srcrole, j.role AS dstrole
FROM flowlogs f
LEFT OUTER JOIN instances i
ON f.instance_id = i.instance
LEFT OUTER JOIN instances j
ON f.pkt_dstaddr = j.ip);
CHECKPOINT;
Once this is done, the DuckDB database file can be copied from the instance for offline analysis and the EC2 instance can be terminated. The DB now has a view called logs
that is enriched with the role
information. Let’s examine some use-cases.
Analyze
Right sizing Security Group outbound rules to the Internet
SELECT DISTINCT srcrole, dstport, protocol
FROM logs
WHERE log_status = 'OK'
AND action = 'ACCEPT'
AND flow_direction = 'egress'
AND dstaddr NOT LIKE '10.%'
AND dstport < 32768
ORDER BY srcrole, dstport;
This will need to be tweaked to exclude VPC CIDRs (in my case 10/8) and ephemeral port range.
Which roles are sending most traffic to S3?
SELECT SUM(bytes) AS total, srcrole
FROM logs
WHERE log_status = 'OK'
AND action = 'ACCEPT'
AND pkt_dst_aws_service = 'S3'
GROUP BY srcrole
ORDER BY total DESC;
Intra region VPC peering traffic across roles
SELECT SIM(bytes) AS total, srcrole, dstrole
FROM logs
WHERE log_status = 'OK'
AND action = 'ACCEPT'
AND traffic_path = 4
GROUP BY srcrole, dstrole
ORDER BY total DESC;
For inter region VPC peering, change traffic_path to 5. See explanation here
Exercise to readers
I can come up with lots of use-cases, but most of these can be specific to the workloads and AWS environment. Same SQL might not work in all the environments.
- Cross-AZ traffic. If we add az-id in instance.tsv, we can have source and destination AZs
- Roles abusing NAT Gateway :)
- Traffic hot-spots: Specific services getting overwhelmed
- Traffic from/to know bad IP addresses (malware, C2, etc.)
- NACLs right-sizing for intranet and Internet
- Unused resources: No traffic to/from a resource (cost saving)
- Unnecessary traffic: NTP traffic to Internet (can leverage AWS NTP services), DNS requests to unexpected resolvers, etc.
AWS improvements
AWS can significantly improve the VPC flow log data. pkt-src-aws-service and pkt-dst-aws-service are restricted to a handful of AWS services. Not sure why they don’t tag every ENI with something like aws:service
and include that in the logs. One has to comb through the description and other fields to figure out if the ENI is associated with a Lambda, NAT Gateway or something else!
It would be great, if customers can select one or more ENI tags to include in the logs. This will avoid the joins we did above.
Cloud Security
None of the Cloud Security, CNAPP vendors have anything for VPC flow logs. Couple of years back, we did this at Ariksa, but I haven’t come across any decent vendor doing this! EDR vendors have some of this data, but only for Cloud VMs and not for other services like Lambda’s, etc. Some cloud cost vendors (Vantage, are there others?) seem to be solving the cost specific use-cases!
DuckDB
DuckDB was ideal for this analysis. I was playing with 1.1 billion flow logs and almost all queries aggregations were returned in seconds. It has an inet extension which I didn’t use, but possibly can come in handy. Although, it doesn’t look like it has Is IP in CIDR functionality!
Conclusion
This post was born out of real world use-cases. VPC flow logs helped me identify and fix a few issues in our setup related to Ops, SRE, Cost, etc. This setup cost less than a dollar to create the DB. Once the DB was created, I was able to analyze offline on my laptop without incurring additional costs.