Analyzing AWS VPC Flow Logs

Mikuláš
4 min read · Jul 28, 2017


Take these simple steps to reduce your AWS bill by half! Well, it worked for us.

You open an AWS Bill and one entry surprises you:

Searching for the “data transfer out beyond the global free tier” yields no actionable results. Even if you open CloudWatch and find which instance had a lot of traffic, there is not much you can do about it retroactively. Unless you are extremely lucky, the traffic (and the bill) will repeat next month.

CloudWatch > Metrics > Per-Instance Metrics > search for NetworkOut > chart all. Make sure to set Statistic to Sum; the default is Average, which is not very helpful in this case!

What you have to do immediately is turn on VPC Flow Logs: a log of almost all traffic, including source and destination addresses and ports. Go to VPC > Your VPCs > select the VPC you want to monitor > switch to the Flow Logs tab > Create Flow Log. The logs are then saved into a CloudWatch Log Group. Wait for the data to gather (again, it is not populated retroactively; logging starts from the time you set it up).

To process the Flow Log, you need to export the data; I’ve chosen to download the data and import it into PostgreSQL. In the CloudWatch dashboard, select Logs and, with the Flow Log’s Log Group selected, click Action > Export all data to Amazon S3. There seems to be no notification when this is completed, so wait a reasonable amount of time and then download the data.

Most fields are optional. I’ve added extra crosszone and external fields, which are not in the Flow Log; we will populate them from the other fields.
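A minimal sketch of such a table, assuming the standard version 2 Flow Log fields and using int columns for the two extra flags (the column names here are illustrative):

    CREATE TABLE vpc (
        event_ts     text,    -- timestamp prefix added by the CloudWatch export; drop if your files lack it
        version      int,
        account_id   text,
        interface_id text,
        srcaddr      inet,
        dstaddr      inet,
        srcport      int,
        dstport      int,
        protocol     int,
        packets      bigint,
        bytes        bigint,
        start_time   bigint,  -- Unix timestamps of the capture window
        end_time     bigint,
        action       text,
        log_status   text,
        crosszone    int,     -- not in the Flow Log; populated later
        external     int      -- not in the Flow Log; populated later
    );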
Import the Flow Log files into a PostgreSQL table. The NULL option is especially important!
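A sketch of the import, assuming the exported files have been decompressed into a single space-delimited file; the NULL '-' option turns the dashes in NODATA/SKIPDATA records into SQL NULLs:

    -- adjust the path; use \copy from psql if the server cannot read your files directly
    -- drop event_ts from the column list if your exported lines have no timestamp prefix
    COPY vpc (event_ts, version, account_id, interface_id, srcaddr, dstaddr,
              srcport, dstport, protocol, packets, bytes,
              start_time, end_time, action, log_status)
    FROM '/path/to/flowlogs.txt'
    WITH (FORMAT text, DELIMITER ' ', NULL '-');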

Unfortunately, this data is too granular and, again, hard to act on. Since our goal is finding what traffic contributes to the data-out cost, we will find which rows are between availability zones.

List the subnets in your VPC and note the CIDRs.

If the srcaddr and dstaddr do not fall within the same subnet CIDR, it’s cross-zone traffic. Those rows count toward the high bill we got. I’ve also included an external field, which is true if the traffic does not stay within your VPC.

Your subnet CIDRs may be different. Update accordingly.
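As a sketch, assuming a default VPC of 172.31.0.0/16 split into three example subnets (172.31.0.0/20, 172.31.16.0/20 and 172.31.32.0/20; substitute your own), the two extra columns can be populated like this:

    -- crosszone = 1 unless both addresses fall within the same subnet
    UPDATE vpc SET crosszone = CASE
        WHEN (srcaddr << inet '172.31.0.0/20'  AND dstaddr << inet '172.31.0.0/20')
          OR (srcaddr << inet '172.31.16.0/20' AND dstaddr << inet '172.31.16.0/20')
          OR (srcaddr << inet '172.31.32.0/20' AND dstaddr << inet '172.31.32.0/20')
        THEN 0
        ELSE 1
    END;

    -- external = 1 when either end of the connection lies outside the VPC CIDR
    UPDATE vpc SET external = CASE
        WHEN srcaddr << inet '172.31.0.0/16' AND dstaddr << inet '172.31.0.0/16' THEN 0
        ELSE 1
    END;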

If you are going to explore the data beyond the queries here, I suggest creating proper indexes. Our 14 days of traffic logs are roughly 20 GB, so the queries can take many minutes.
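For example (illustrative; which indexes help depends on the queries you actually run):

    CREATE INDEX ON vpc (crosszone);
    CREATE INDEX ON vpc (srcaddr);
    CREATE INDEX ON vpc (dstaddr);
    CREATE INDEX ON vpc (dstport);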

Unfortunately, if the traffic is between two interfaces inside the VPC you monitor, there will be two rows for the same packets. You could try to match those rows up by start/end + ports + packets + bytes, but I didn’t bother.

This block is a sanity check. You can skip it if you believe I made no mistake so far, but I suggest you verify.

A simple SELECT Sum(bytes) FROM vpc WHERE crosszone=1 should return between 100% and 200% of your current data-out bill. If none of the traffic was between your interfaces, it would be equal; if all of the traffic was inside your VPC, it would be logged exactly twice. This may tell you the ratio of internal vs. external traffic.

This query yielded 5.6 TB for me, but the bill shows 3.7 TB. Internal traffic is counted twice in the log but only once in the bill, so the difference tells me there is 1.9 TB of traffic between our instances. That traffic is logged twice, for a total of 3.8 TB, which leaves 1.8 TB of external traffic.

Running SELECT external, Sum(bytes) FROM vpc WHERE crosszone=1 GROUP BY external returns 4.4 TB of internal (doubled) and 1.2 TB of external traffic. I don’t understand where the discrepancy came from.

This overview is interesting, but again, not immediately actionable.

At this point, we are ready to find out what causes the traffic. Let’s list our most used application ports so we can query those later.
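One way to do that, assuming the schema above, is to group by destination port (a sketch, not necessarily the exact query):

    SELECT dstport, Sum(bytes) AS total_bytes
    FROM vpc
    GROUP BY dstport
    ORDER BY total_bytes DESC
    LIMIT 25;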

This query lists recipients of cross-zone traffic inside our VPC.
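A sketch, using the crosszone and external columns defined earlier:

    SELECT dstaddr, Sum(bytes) AS total_bytes
    FROM vpc
    WHERE crosszone = 1 AND external = 0
    GROUP BY dstaddr
    ORDER BY total_bytes DESC
    LIMIT 20;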

I was delighted to find there are 2 TB of traffic associated with a single recipient IP 172.31.29.150, which incidentally corresponds to a running EC2 instance. Note that if the instance had already been removed, we wouldn’t be able to associate the IP with any service; for this reason, I think AWS should log all IP allocations and also provide an easy overview of all currently allocated IPs.

Next I would like to know where the traffic to this single offender originates:
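The following is a sketch, with the recipient IP found above hard-coded:

    SELECT srcaddr, srcport, Sum(bytes) AS total_bytes
    FROM vpc
    WHERE crosszone = 1 AND dstaddr = inet '172.31.29.150'
    GROUP BY srcaddr, srcport
    ORDER BY total_bytes DESC
    LIMIT 20;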

OK, so there is definitely a problem here. Port 3306 indicates that the source is MySQL.

I was fairly confident 172.31.2.59 is a MySQL RDS instance, but to be sure I verified it with a DNS query against the RDS endpoint: dig a mysql-foo.1knj23n.eu-central-1.rds.amazonaws.com. This did indeed return the IP I was expecting.

At this point we know the original traffic bill of 3.7 TB, which costs us $333, includes 1.9 TB of cross-zone traffic from MySQL to our instance. This is pretty good information: I know it’s worthwhile to either optimize database access (there may be an app pushing ridiculous amounts of data) or consolidate everything into a single availability zone (with failovers in other zones).

Useful queries

Presented mostly without commentary.

This query will list your application ports and the bytes going to external IPs (users). You will very likely see ports 443 and 80 leading, and then a bunch of random ports with very low amounts of traffic.
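A sketch, again assuming the VPC CIDR 172.31.0.0/16 so that only traffic leaving the VPC is counted:

    SELECT srcport, Sum(bytes) AS total_bytes
    FROM vpc
    WHERE external = 1
      AND srcaddr << inet '172.31.0.0/16'   -- sent from inside the VPC to an external address
    GROUP BY srcport
    ORDER BY total_bytes DESC
    LIMIT 25;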

Check whether there are any outlying external recipients:
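A sketch along the same lines, grouping by the external destination address:

    SELECT dstaddr, Sum(bytes) AS total_bytes
    FROM vpc
    WHERE external = 1
      AND srcaddr << inet '172.31.0.0/16'   -- sent from inside the VPC
    GROUP BY dstaddr
    ORDER BY total_bytes DESC
    LIMIT 25;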

Disclaimer: reproducing these steps will incur additional charges from Amazon. You will pay for the Flow Log as for any other CloudWatch Logs usage. Exporting the logs to S3 and downloading the files will also cost you something.
