Anonymising real-time logs in AWS CloudFront
CloudFront allows you to log requests in two different ways:
- Standard logs
- Real-time logs
Standard logs are great: there is no additional cost for the logging itself, and you only pay for the storage (S3). Unfortunately, they do not allow you to modify the fields that are written to the logs. One of those fields is c-ip, which is the IP address of the person visiting your website. Under the GDPR, an IP address is considered personal data, which must be processed in accordance with EU law.
At the other end of the spectrum, real-time logs give you much more flexibility: you can define which fields are included in each log record. If no requirement obliges you to log IP addresses, consider excluding them from your log configuration to protect your visitors’ privacy.
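As a rough sketch of what excluding the field could look like, the Python snippet below builds the arguments for the boto3 `create_realtime_log_config` call with a field list that omits c-ip. The field list mirrors the sample output shown later in this article. The helper names `build_realtime_log_config` and `create_config`, and the config name, ARNs, and sampling rate passed to them, are hypothetical and only serve the illustration.

```python
# Fields to log; "c-ip" is deliberately left out (assumed field selection).
FIELDS_WITHOUT_IP = [
    "timestamp",
    "cs-method",
    "cs-protocol",
    "cs-host",
    "c-ip-version",
    "ssl-protocol",
    "c-country",
]


def build_realtime_log_config(name, stream_arn, role_arn, sampling_rate=100):
    """Build keyword arguments for CloudFront's create_realtime_log_config."""
    return {
        "Name": name,
        "SamplingRate": sampling_rate,  # percentage of requests to log (1-100)
        "Fields": list(FIELDS_WITHOUT_IP),
        "EndPoints": [
            {
                "StreamType": "Kinesis",
                "KinesisStreamConfig": {
                    "RoleARN": role_arn,
                    "StreamARN": stream_arn,
                },
            }
        ],
    }


def create_config(name, stream_arn, role_arn):
    """Create the configuration in AWS (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("cloudfront")
    return client.create_realtime_log_config(
        **build_realtime_log_config(name, stream_arn, role_arn)
    )
```

Keeping the argument builder separate from the AWS call makes it easy to check the field list before anything is created.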
IP Masking
If, after considering privacy concerns, you still wish to retain IP addresses in the logs, anonymising the data may be a viable option. Let’s take a closer look at the architecture for real-time logging:
CloudFront delivers the logs to a Kinesis data stream, where they are consumed by Firehose and sent on to a destination. Firehose lets you transform incoming data before delivering it to the destination, which is an S3 bucket in this case. We can write a simple Lambda function that masks IP addresses on the fly by setting the last octet of IPv4 addresses and the last 2 octets of IPv6 addresses to zero. Here’s an example of what that function could look like.
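A minimal sketch of such a transformation function in Python follows. It assumes that each record is a single tab-separated line and that c-ip is the second field, which matches the sample output in this article but depends on your own log configuration. The helper names `mask_ip` and `mask_record` and the constant `IP_FIELD_INDEX` are illustrative.

```python
import base64
import ipaddress

# Assumption: c-ip is the second tab-separated field of every record.
IP_FIELD_INDEX = 1

# Prefix lengths that zero the last octet (IPv4) or the last 2 octets (IPv6).
IPV4_PREFIX = 24
IPV6_PREFIX = 112


def mask_ip(value: str) -> str:
    """Return the address with its trailing bits set to zero."""
    try:
        ip = ipaddress.ip_address(value)
    except ValueError:
        return value  # not an IP address (for example "-"), leave unchanged
    prefix = IPV4_PREFIX if ip.version == 4 else IPV6_PREFIX
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def mask_record(line: str) -> str:
    """Mask the IP field of one tab-separated log line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) > IP_FIELD_INDEX:
        fields[IP_FIELD_INDEX] = mask_ip(fields[IP_FIELD_INDEX])
    return "\t".join(fields) + "\n"


def lambda_handler(event, context):
    """Firehose data transformation entry point."""
    output = []
    for record in event["records"]:
        # Firehose passes each record's payload base64-encoded.
        line = base64.b64decode(record["data"]).decode("utf-8")
        masked = mask_record(line)
        output.append(
            {
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(masked.encode("utf-8")).decode("utf-8"),
            }
        )
    return {"records": output}
```

The prefix lengths are kept as constants so that the masking can be made stricter, for instance by zeroing more of an IPv6 address, without touching the rest of the function.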
And here is an example of the log output after masking the IP addresses:
1670495826.812 31.61.160.0 GET https www.example.com IPv4 TLSv1.3 PL
1670495831.636 51.178.130.0 GET http www.example.com IPv4 - FR
1670495849.538 68.235.60.0 GET http www.example.com IPv4 - US
1670495861.480 68.235.60.0 GET https www.example.com IPv4 TLSv1.2 US
1670495871.707 18.215.20.0 GET https www.example.com IPv4 TLSv1.2 US
Deployable stack
And finally, here’s a full CloudFormation template that deploys a CloudFront distribution with real-time logging enabled, an S3 bucket, and a data transformation Lambda function that takes care of the data masking.
Summary
In conclusion, it is important to be mindful of privacy when logging requests to your website. While standard logs from CloudFront are cost-effective and convenient, they do not offer the flexibility to modify the fields being logged. Real-time logs offer more control over the fields being logged, but come at an additional cost. If privacy is a concern and you are not required to comply with regulations that mandate the logging of IP addresses, it is recommended to exclude them from the log configuration or to anonymise the data. The choice between standard logs and real-time logs, and the handling of IP addresses in logs, will depend on your specific privacy requirements and regulations.