Data Governance with Amazon Macie

Gerald Bachlmayr
Sep 17, 2020

Why Do We Need Data Governance?

Every organisation needs to protect its data to avoid leaks of confidential information and to meet regulatory requirements such as the GDPR (General Data Protection Regulation). A data protection strategy describes how the organisation guards against data loss and which processes and controls must be in place for personal data. Ongoing data governance ensures that data is managed securely and in line with that data protection strategy.

If we handle sensitive PII (personally identifiable information), we need more security controls than a solution that does not handle PII, and more comprehensive security controls cost more. We can therefore reduce the cost of security controls wherever we can assure that no PII is needed, for example by anonymising data for development and integration environments.

How can we validate and prove that there is no PII in our data storage, such as an S3 bucket? If there is, how do we know what information it is and where it is? Only once we know this can we tackle the root cause of the problem. This is exactly where Macie can help us.

How Does Macie Work?

Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover sensitive data in Amazon S3.

Macie produces two categories of findings: policy findings and sensitive data findings. A policy finding is a detailed report of a policy violation for an S3 bucket, and a sensitive data finding is a detailed report of sensitive data in an S3 object (a file). This blog post focuses on the second type, sensitive data findings.

For sensitive data scans we can configure either scheduled jobs or one-time jobs. The Macie dashboard shows an inventory of your S3 buckets and an overview of the sensitive information found.

Macie provides the following types of sensitive data findings:

  • Credentials: e.g. private keys or AWS secret access keys.
  • Custom Identifier: the object contains text that matches a custom data identifier, i.e. a regular expression that you define.
  • Financial: e.g. credit card numbers or bank account numbers.
  • Multiple: the object contains more than one type of sensitive data.
  • Personal: the object contains personal information, such as full names, mailing addresses, or identification numbers.

[Figure: Macie findings overview]
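
As a small illustration of how these categories surface in practice: Macie reports the category as the last part of a finding's type string, for example SensitiveData:S3Object/Financial. The sketch below extracts that category; the function name is hypothetical and chosen only for this example.

```python
from typing import Optional

# Prefix shared by all sensitive data finding types.
SENSITIVE_DATA_PREFIX = "SensitiveData:S3Object/"


def sensitive_data_category(finding_type: str) -> Optional[str]:
    """Return the category (e.g. 'Financial') of a sensitive data finding.

    Returns None for anything else, such as policy findings.
    """
    if not finding_type.startswith(SENSITIVE_DATA_PREFIX):
        return None
    return finding_type[len(SENSITIVE_DATA_PREFIX):]


print(sensitive_data_category("SensitiveData:S3Object/Financial"))
```

A helper like this can be useful later for routing notifications by category, for instance treating Credentials findings differently from Personal ones.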

When you select a finding, Macie shows a detailed view of exactly what information was found in the file. In the example below, Macie found a credit card number, 9 names, 1 address and several phone numbers:

[Figure: Macie findings, detailed view]

Step-by-Step Instructions

We start by configuring a scheduled CloudWatch Events rule that triggers a regular Macie scan. In our example we define a daily job with a cron expression, which is evaluated in UTC (Coordinated Universal Time).

[Figure: Macie scan, scheduled CloudWatch Events rule]

The CloudWatch rule will trigger our Lambda function every 24 hours. The Lambda function will initiate a Macie scan for one or more S3 buckets.
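
A scheduled rule of this kind could be created with boto3 roughly as follows. This is a minimal sketch, not the exact rule used in this post: the rule name, the target ID and the 02:00 UTC start time are assumptions made for illustration.

```python
def daily_cron(hour_utc: int, minute: int = 0) -> str:
    """Build a CloudWatch Events cron expression that fires once a day (UTC)."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Field order: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour_utc} * * ? *)"


def create_daily_scan_rule(lambda_arn: str, rule_name: str = "macie-daily-scan") -> None:
    """Create the scheduled rule and point it at the scan Lambda.

    rule_name and the target Id are hypothetical names.
    """
    import boto3  # imported here so daily_cron can be used without AWS access

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        ScheduleExpression=daily_cron(2),  # assumed schedule: every day at 02:00 UTC
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "macie-scan-lambda", "Arn": lambda_arn}],
    )


print(daily_cron(2))
```

Note that the Lambda function also needs a resource-based permission that allows the events service to invoke it; that step is not shown here.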

[Figure: End-to-end flow]

A scan of an S3 bucket can be started with the create_classification_job method of the Macie API client. In the call we define a one-time job, which is created each time our scheduled CloudWatch rule triggers the Lambda function.

[Figure: Lambda function triggering a Macie scan]
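
A minimal sketch of such a Lambda function, using the boto3 macie2 client, might look like this. The environment variable names (ACCOUNT_ID, BUCKET_NAMES) and the job naming scheme are assumptions for illustration, not details taken from the original setup.

```python
import os
import uuid
from datetime import datetime, timezone


def build_job_request(account_id, bucket_names, now=None):
    """Build the keyword arguments for create_classification_job."""
    now = now or datetime.now(timezone.utc)
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",
        # Hypothetical naming scheme: a timestamp keeps each daily job distinct.
        "name": f"daily-pii-scan-{now:%Y-%m-%d-%H%M%S}",
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(bucket_names)}
            ]
        },
    }


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    macie = boto3.client("macie2")
    request = build_job_request(
        os.environ["ACCOUNT_ID"],               # assumed environment variable
        os.environ["BUCKET_NAMES"].split(","),  # assumed: comma-separated bucket names
    )
    response = macie.create_classification_job(**request)
    return {"jobId": response["jobId"]}
```

Keeping the request construction in its own function makes it possible to check the job definition locally without calling AWS.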

Additionally, you can define a scope in the create_classification_job call, with criteria for the objects that you want to include or exclude.

This comes in handy if you tag resources that you have already scanned: you can then exclude those tagged resources from repeated scans to save time and cost.
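
The tag-based exclusion described above can be sketched as a scoping block inside the job's s3JobDefinition. The tag key and value used here (macie-scanned / true) are hypothetical, and the step that applies the tag to objects after a scan is not shown.

```python
def exclude_tagged_objects(tag_key: str, tag_value: str) -> dict:
    """Return a scoping block that skips S3 objects carrying the given tag."""
    return {
        "excludes": {
            "and": [
                {
                    "tagScopeTerm": {
                        "comparator": "EQ",
                        "key": "TAG",
                        "tagValues": [{"key": tag_key, "value": tag_value}],
                        "target": "S3_OBJECT",
                    }
                }
            ]
        }
    }


def with_scope(s3_job_definition: dict, scoping: dict) -> dict:
    """Return a copy of the job definition with the scoping block added."""
    return {**s3_job_definition, "scoping": scoping}


# Example: a job definition that skips objects tagged macie-scanned=true.
job_definition = with_scope(
    {"bucketDefinitions": [{"accountId": "123456789012", "buckets": ["my-bucket"]}]},
    exclude_tagged_objects("macie-scanned", "true"),
)
```

The resulting dictionary would be passed as the s3JobDefinition argument of create_classification_job.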

Once the job is created, the actual scan starts asynchronously and runs in the background.

Therefore we want to intercept the CloudWatch Event that is emitted whenever Macie reports a finding. The severity score for findings has five levels, from 0 (low) to 4 (very high). You can filter notifications based on those scores, as the following CloudWatch rule shows.

[Figure: CloudWatch Events rule for Macie findings]
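
An event pattern for such a rule could be built as in the sketch below. It assumes that the severity score is exposed under detail.severity.score in the finding event and uses the score range 3 to 4 from this post; the function name is hypothetical.

```python
import json


def findings_event_pattern(min_score: int, max_score: int) -> str:
    """Return a JSON event pattern matching Macie findings in a score range."""
    pattern = {
        "source": ["aws.macie"],
        "detail-type": ["Macie Finding"],
        # Assumed field path: detail.severity.score
        "detail": {"severity": {"score": list(range(min_score, max_score + 1))}},
    }
    return json.dumps(pattern)


print(findings_event_pattern(3, 4))
```

The returned string can be passed as the EventPattern argument of the events client's put_rule call, with the notification Lambda registered as the rule's target.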

Once a finding is detected within our defined score range (3 to 4 in our example), the notification Lambda is triggered. The Lambda sends the finding via Amazon SES (Simple Email Service).

In our email subject and body we can use the CloudWatch Event details to provide more information regarding the finding, e.g.:

[Figure: Retrieving Macie finding parameters from the CloudWatch Event]
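
A notification Lambda along these lines might look like the following sketch. The field names follow the structure of Macie finding events as far as known (type, severity, resourcesAffected), while the environment variables SENDER and RECIPIENT and the wording of the email are assumptions for illustration.

```python
import os


def summarise_finding(event: dict) -> tuple:
    """Turn a Macie finding event into an email subject and body."""
    detail = event.get("detail", {})
    affected = detail.get("resourcesAffected", {})
    bucket = affected.get("s3Bucket", {}).get("name", "unknown bucket")
    key = affected.get("s3Object", {}).get("key", "unknown object")
    severity = detail.get("severity", {})
    finding_type = detail.get("type", "unknown type")

    subject = f"Macie finding: {finding_type} in {bucket}"
    body = "\n".join(
        [
            f"Type: {finding_type}",
            f"Severity: {severity.get('description', 'n/a')} (score {severity.get('score', 'n/a')})",
            f"Bucket: {bucket}",
            f"Object: {key}",
        ]
    )
    return subject, body


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    subject, body = summarise_finding(event)
    ses = boto3.client("ses")
    ses.send_email(
        Source=os.environ["SENDER"],  # assumed variable; must be an SES-verified identity
        Destination={"ToAddresses": [os.environ["RECIPIENT"]]},  # assumed variable
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )
```

Using .get with defaults keeps the function from failing if a finding lacks one of the optional fields.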

Key Takeaways and Pricing

Macie is a data protection service that helps you identify PII in your S3 buckets. With this insight you can validate whether the actual data handling matches your design.

Macie can detect sensitive information in text-based files such as TXT or CSV, and in PDF files, even if those files are inside a ZIP archive.

AWS gives you a 30-day free evaluation period. After that you are charged. Pricing for sensitive data scanning is based on the amount of data scanned, and the price per scanned GB drops when you reach the next discount threshold. Prices differ per AWS region.
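
To make the tiered model concrete, here is a small cost calculation. It assumes that each tier's price applies only to the data that falls within that tier, and the tier sizes and prices below are made-up placeholders, not actual AWS prices.

```python
def tiered_cost(gb_scanned: float, tiers) -> float:
    """Compute the scan cost for a tiered price list.

    tiers is a list of (tier_size_gb, price_per_gb) pairs in order;
    a tier_size_gb of None means the tier is unbounded.
    """
    remaining = gb_scanned
    total = 0.0
    for size, price in tiers:
        portion = remaining if size is None else min(remaining, size)
        total += portion * price
        remaining -= portion
        if remaining <= 0:
            break
    return total


# Hypothetical tiers for illustration only: first 1,000 GB at 1.00, the rest at 0.50.
EXAMPLE_TIERS = [(1000, 1.00), (None, 0.50)]

print(tiered_cost(1500, EXAMPLE_TIERS))  # 1000 * 1.00 + 500 * 0.50 = 1250.0
```

Replace the placeholder tiers with the current values from the AWS pricing page for your region before using the result for any estimate.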

The following table from the AWS website gives a price overview for the us-east-1 (N. Virginia) region in USD:

[Figure: Macie pricing table]