Optimise Data Governance: Automatic Data Deletion

To help with best-practice data governance, here are some mechanisms AWS provides for deleting data automatically.

S3

Amazon Simple Storage Service (S3) is a highly scalable and durable object storage service. For most of us, it was our first introduction to AWS. From an integration perspective, your application will mostly be reading objects from and writing objects to S3.

One of the easiest ways to make sure that objects get deleted automatically is via lifecycle rules, a feature provided by Amazon S3. You can specify when the deletion should occur based on the object’s age (a number of days since creation) or a specific date.

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket-name \
  --lifecycle-configuration file://lifecycle-config.json

Here is an example of a lifecycle configuration file that deletes objects after 1 day:

{
  "Rules": [
    {
      "ID": "rule-1",
      "Status": "Enabled",
      "Expiration": {
        "Days": 1
      },
      "Filter": {
        "Prefix": ""
      }
    }
  ]
}
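
To confirm what was actually applied to the bucket, you can read the configuration back:

aws s3api get-bucket-lifecycle-configuration \
  --bucket my-bucket-name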

So ensure you have a lifecycle rule for deletion. S3 also supports more granular configurations, such as multiple rules per bucket and filtering by prefix or tags, as sketched below.
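
As an illustration, here is a minimal sketch of a rule that expires only objects carrying a specific tag (the tag key and value are made up for this example):

{
  "Rules": [
    {
      "ID": "expire-temp-objects",
      "Status": "Enabled",
      "Expiration": {
        "Days": 1
      },
      "Filter": {
        "Tag": {
          "Key": "temporary",
          "Value": "true"
        }
      }
    }
  ]
}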

DynamoDB

DynamoDB is a managed NoSQL database service provided by AWS.

Typically, you would use DynamoDB to support operational and/or transactional data access patterns, including conditional writes, atomic counters, and batch operations.

You can use the Time to Live (TTL) feature to automatically delete data in Amazon DynamoDB. TTL lets you designate a timestamp attribute on your items; once the current time is past an item’s timestamp, DynamoDB deletes the item in the background. Note that deletion is not instantaneous, so expired items may still show up in reads until the background process removes them.

The TTL attribute’s value must be a timestamp in Unix epoch time format, in seconds, as of this writing.

aws dynamodb update-time-to-live \
  --table-name <table-name> \
  --time-to-live-specification Enabled=true,AttributeName=<ttl-attribute-name>

So make sure to include a time-to-live attribute in every item you write into your table (that is, on every PutItem), as sketched below.
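
As a minimal sketch, assuming a table named my-table with a TTL attribute called expiresAt, a write could look like this (the key, payload, and timestamp are illustrative):

# 1735689600 is an epoch-seconds timestamp for 2025-01-01 00:00:00 UTC
aws dynamodb put-item \
  --table-name my-table \
  --item '{"id": {"S": "order-123"}, "payload": {"S": "example"}, "expiresAt": {"N": "1735689600"}}'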

SQS

Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables decoupled communication between distributed application components.

In a decoupled distributed application, one component puts a message onto the queue and another reads, processes, and then deletes the message from the queue. If you are using Spring, this is typically handled with an annotation such as @SqsListener(value = "${queue.name}", deletionPolicy = ON_SUCCESS)
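
Outside a framework, the same read-process-delete cycle looks roughly like this with the CLI (the queue URL and receipt handle are placeholders; the receipt handle comes from the receive-message response):

aws sqs receive-message \
  --queue-url <queue-url>

# after processing succeeds, delete the message using its receipt handle
aws sqs delete-message \
  --queue-url <queue-url> \
  --receipt-handle <receipt-handle>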

The default retention period for a message in a queue is 4 days. However, you can set the retention period to a minimum of 60 seconds and a maximum of 14 days.

aws sqs set-queue-attributes \
  --queue-url <queue-url> \
  --attributes "MessageRetentionPeriod=<duration-in-seconds>"

In a real-world scenario, you would not retain messages for long. If the application can’t process a message, you can retry or route it to a dead-letter queue (DLQ) for further analysis; as a best practice, keep the number of messages sitting in the DLQ to a minimum. So make sure you use an appropriate retention period for your queue. We went a step further: with the help of JMX, we exposed operations that accept DLQ parameters and reprocess those messages once the upstream system is back online or the issue is rectified. If you run Spring Boot Admin, you can view and invoke these JMX operations through its web interface, but that is another blog for another day.
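
For completeness, here is a sketch of attaching a DLQ to a source queue via a redrive policy (the DLQ ARN and maxReceiveCount here are placeholders):

aws sqs set-queue-attributes \
  --queue-url <queue-url> \
  --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\":\"<dlq-arn>\",\"maxReceiveCount\":\"5\"}"}'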

Kinesis

Amazon Kinesis is a cloud-based service for real-time processing of streaming data. It is typically used where you need to ingest, process, and analyse real-time data streams at scale.

Kinesis data retention is the amount of time that data records are stored in a stream. By default, records are stored for 24 hours, but you can increase the retention period up to 8760 hours (365 days). The retention period determines how long you have to process and analyse the data before it becomes unavailable; once it is exceeded, records are automatically deleted.

You can change the data retention period using increase-stream-retention-period and decrease-stream-retention-period. Here’s a full example that lowers the retention period from the 24-hour default to 12 hours:

aws kinesis decrease-stream-retention-period \
  --stream-name my-stream \
  --retention-period-hours 12
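
To confirm the change took effect, you can check the stream summary; RetentionPeriodHours appears in the response:

aws kinesis describe-stream-summary \
  --stream-name my-stream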

Conclusion

The ideal data retention period for your solution should be reviewed regularly.

Data governance is not just a responsibility of a department, it’s your responsibility as well.


