Simplifying DynamoDB Housekeeping tasks at Simplilearn
Developers and solution architects would agree: DynamoDB is unforgiving. The cost implications of a DynamoDB table that needs to be redesigned come up far too frequently, and usually quite late in the lifecycle of the application.
You might have a live production table that has accumulated data, and you realize that the table is costing far too much, or that the partition key and indices are no longer relevant to the current architecture.
Or you might have a table where a high percentage of cold data is adding to your costs, and archiving that data is a struggle because the table design does not allow optimized scanning and querying.
How do you deal with this problem? How do you migrate data from the old table to a new table with the least operational complexity?
Common housekeeping activities
Designing a table on DynamoDB is a different ball game compared to a typical RDBMS design: you first identify the queries the application needs and then model the data around them, not vice versa.
Consider a software lifecycle with growing requirements and features: it is unlikely an architect can look far enough into the future to design a “master” table that can serve everyone and everything!
So what are the common issues architects face when they realize, “Oh no, I need a new table BUT with the same data!”?
Modifying the primary or sort key
DynamoDB does not allow a table's partition key or sort key to be modified after creation. New access patterns can no longer be served with a DynamoDB Query and need a DynamoDB Scan instead, and a scan operation is slower and more expensive than a query.
Alternatively, you can create a GSI (Global Secondary Index) to serve the new access pattern, but it might turn out to be an expensive solution: GSIs come with their own storage and read/write capacity costs.
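For reference, adding a GSI to an existing table is a single UpdateTable call. Here is a minimal boto3 sketch; the “status” attribute and the index name are hypothetical placeholders, not part of the table described in this post:
import boto3

dynamodb = boto3.client("dynamodb")

# Add a hypothetical "status-index" GSI to an existing table
dynamodb.update_table(
    TableName="MyHeavyTable",
    AttributeDefinitions=[
        {"AttributeName": "status", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "status-index",
                "KeySchema": [{"AttributeName": "status", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "KEYS_ONLY"},
                # Required only when the table uses provisioned capacity
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
)
The index maintains its own projected copy of the data and its own throughput, which is exactly where the extra cost comes from.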
Modifying Local Secondary Indices (LSI)
A DynamoDB table allows LSIs to be configured only during creation. Once the table is created, an LSI can neither be modified nor deleted. LSIs consume the read/write capacity units of the main table, and the additional storage they occupy is billed as well. Imagine you have a 1 TB DynamoDB table and realize that the LSIs created years ago are no longer required!
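In other words, the only way to get a different set of LSIs is to declare them on a brand new table. A minimal boto3 sketch of that, with hypothetical key and index names (mainID as the partition key; createdAt and category are made up for illustration):
import boto3

dynamodb = boto3.client("dynamodb")

# LSIs can only be declared here, at create_table time
dynamodb.create_table(
    TableName="MyHeavyTable_v2",
    AttributeDefinitions=[
        {"AttributeName": "mainID", "AttributeType": "S"},
        {"AttributeName": "createdAt", "AttributeType": "N"},
        {"AttributeName": "category", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "mainID", "KeyType": "HASH"},
        {"AttributeName": "createdAt", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "mainID-category-index",
            "KeySchema": [
                {"AttributeName": "mainID", "KeyType": "HASH"},
                {"AttributeName": "category", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    # Bump WriteCapacityUnits ahead of any bulk copy into the table
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)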
Data cleanup
Consider a table with 90% cold data, where the requirement is to delete the data between the years 2010 and 2020.
Assuming you never knew this would happen, your table may not have the appropriate keys to allow an optimized query operation.
The sample code would look something like this:
const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
  TableName: "MyHeavyTable",
  ProjectionExpression: "#yr, mainID",
  FilterExpression: "#yr between :start_yr and :end_yr",
  ExpressionAttributeNames: {
    "#yr": "year", // "year" is a reserved word in DynamoDB
  },
  ExpressionAttributeValues: {
    ":start_yr": 2010,
    ":end_yr": 2020,
  },
};

docClient.scan(params, (error, result) => {
  if (error) {
    console.log(error, "error scan");
    return;
  }
  result.Items.forEach((item) => {
    docClient.delete(
      { TableName: "MyHeavyTable", Key: { mainID: item.mainID } },
      (err) => {
        if (err) {
          console.log("Delete fail");
          return;
        }
        console.log("Delete Success");
      }
    );
  });
});
This would require the read capacity units of the table to be increased for the duration of the archiving operation. The deletion would also run into hours, because a scan operation can return at most 1 MB of data per request, which means you’ll need to paginate through the results.
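For completeness, a paginated version in Python looks something like the sketch below, walking the table with LastEvaluatedKey and issuing the deletes through a batch writer (same hypothetical table, with mainID as its only key attribute):
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("MyHeavyTable")

scan_kwargs = {
    # Project only the key we need for the delete
    "ProjectionExpression": "mainID",
    "FilterExpression": Attr("year").between(2010, 2020),
}

start_key = None
with table.batch_writer() as batch:
    while True:
        if start_key:
            scan_kwargs["ExclusiveStartKey"] = start_key
        page = table.scan(**scan_kwargs)
        for item in page.get("Items", []):
            batch.delete_item(Key={"mainID": item["mainID"]})
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break
Even with pagination and batched deletes, every page is still a full read of that slice of the table, so the capacity and time costs described above do not go away.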
Adding a TTL field to an existing table
AWS released Time to Live (TTL) for DynamoDB in 2017, which allows developers to configure a TTL attribute on a table so that items are deleted automatically when they expire. But the caveat is that only new items will carry the attribute going forward; how do you update the TTL value for the items that already exist?
If you are working with billions of records, you would need to run a batch operation over all of them to populate the TTL value, which comes with its own added costs and operational complexity.
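Enabling TTL itself is a one-off call; the painful part is backfilling the attribute onto existing items. A minimal boto3 sketch, assuming the attribute is called item_ttl as in the solution described later:
import boto3

dynamodb = boto3.client("dynamodb")

# Turn on TTL for an existing table. The attribute must hold an epoch
# timestamp in seconds; items without it are simply never expired.
dynamodb.update_time_to_live(
    TableName="MyHeavyTable",
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "item_ttl",
    },
)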
The options that the engineering team at Simplilearn explored to solve this problem were:
- AWS DMS
- DynamoDB Table Export and Import
- AWS Batch
But all of them involved high operational complexity or cost. Then we came across DynamoDB Streams!
DynamoDB Streams to the rescue!
A DynamoDB stream is an ordered flow of information about changes to items in a DynamoDB table. When you enable a stream on a table, DynamoDB captures information about every modification to data items in the table.
DynamoDB Streams helps ensure the following:
- Each stream record appears exactly once in the stream.
- For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.
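Turning the stream on is a single UpdateTable call; a boto3 sketch:
import boto3

dynamodb = boto3.client("dynamodb")

# Enable a stream that carries the full new item image on every write
dynamodb.update_table(
    TableName="MyHeavyTable",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
NEW_AND_OLD_IMAGES (or NEW_IMAGE) puts the complete new item on every stream record, which is what the copy Lambda shown later relies on.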
How did DynamoDB streams save the day?
- We wanted to change the LSI of the DynamoDB Table
- We wanted to archive 90% of the data
- We wanted to enable DynamoDB TTL and populate the TTL attribute for all items
The solution using DynamoDB Streams:
1. Create a brand new DynamoDB table “MyHeavyTable_v2”. During table creation configure the required LSIs as per new requirements.
2. Enable DynamoDB TTL on a new attribute “item_ttl”
3. Configure a high write capacity in preparation for the copy operation
4. Enable DynamoDB Streams on “MyHeavyTable”
Enabling DynamoDB Streams will capture only new items and changes to existing items made after the feature has been enabled. The old data will not be available on the stream.
5. Add a trigger on the DynamoDB table to invoke a Lambda function. This Lambda function is responsible for performing the copy operation (see the wiring sketch after this list)
6. Allow the data to flow into the new table for the period of time specified by the required “TTL” value
7. The new table starts getting data for all new items and for items updated after the DynamoDB stream was enabled. If the requirement is to store data only for the last 30 days, you allow data to flow into this new table for 30 days, after which it is ready to be switched.
8. Switch the configuration of the application to point to the new table “MyHeavyTable_v2”
9. And voila! The table has been successfully migrated.
Note: The entire table will not be migrated; only items added or modified after enabling DynamoDB Streams will be migrated. If you want to migrate the entire table, all the items of the previous table will have to be updated (“touched”) so that a change is captured on the DynamoDB Stream, as sketched below.
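Here is a sketch of the remaining wiring, assuming the copy Lambda shown below is already deployed (the function name and the migrated_marker attribute are placeholders), together with the optional “touch” pass from the note above:
import boto3

dynamodb = boto3.client("dynamodb")
lambda_client = boto3.client("lambda")

# Step 5: wire the stream to the copy Lambda. The stream ARN is available
# from describe_table once the stream has been enabled.
stream_arn = dynamodb.describe_table(TableName="MyHeavyTable")["Table"]["LatestStreamArn"]
lambda_client.create_event_source_mapping(
    EventSourceArn=stream_arn,
    FunctionName="copy-to-myheavytable-v2",
    StartingPosition="TRIM_HORIZON",
    BatchSize=100,
)

# Optional backfill: "touch" every existing item so it appears on the
# stream and gets copied. This still scans the whole old table, so only
# do it if you need more history than the TTL window provides.
table = boto3.resource("dynamodb").Table("MyHeavyTable")
start_key = None
while True:
    kwargs = {"ProjectionExpression": "mainID"}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    page = table.scan(**kwargs)
    for item in page.get("Items", []):
        # A trivial update is enough to emit a MODIFY record on the stream
        table.update_item(
            Key={"mainID": item["mainID"]},
            UpdateExpression="SET migrated_marker = :m",
            ExpressionAttributeValues={":m": True},
        )
    start_key = page.get("LastEvaluatedKey")
    if not start_key:
        break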
Code for the Lambda function:
import datetime
import calendar
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

dynamodb_client = boto3.client('dynamodb')


def lambda_handler(event, context):
    process_records(event)


def getTTL(days=15):
    # Epoch timestamp (in seconds) "days" from now, as DynamoDB TTL expects
    future = datetime.datetime.utcnow() + datetime.timedelta(days=days)
    return str(calendar.timegm(future.timetuple()))


def process_records(event):
    try:
        logger.info("Processing records...")
        for record in event['Records']:
            # REMOVE events carry no NewImage, so skip them
            new_image = record['dynamodb'].get('NewImage')
            if not new_image:
                continue
            # Populate the TTL attribute if the item does not already have one
            if 'item_ttl' not in new_image:
                new_image['item_ttl'] = {"N": getTTL()}
            add_to_dynamodb(new_image)
        logger.info("Processing complete...")
    except Exception as ex:
        logger.fatal(ex)
        raise ex


def add_to_dynamodb(item):
    try:
        dynamodb_client.put_item(TableName='MyHeavyTable_v2', Item=item)
    except Exception as ex:
        logger.fatal(ex)
        raise ex
We would love to hear from you on any other methods we could have used to solve this problem. Please feel free to leave your comments below.
Meanwhile, if you found this interesting, I would encourage you to read up further on DynamoDB through the many helpful resources available online. You could also check out our well-structured AWS Cloud Architect Certification course at Simplilearn, which enables you to master the core skills required for designing and deploying cloud solutions on the Amazon Web Services platform.