How to delete huge amounts of data from a DynamoDB table?

Satya Thirumani
Published in Analytics Vidhya · 8 min read · Apr 30, 2020

When I was tasked with deleting terabytes of data from AWS DynamoDB tables, I tried the approaches below.

1) Drop the existing table & re-create it

2) Updating TTL (Time-To-Live) column

3) Delete using delete-item

4) Delete using batch-write-item

Deleting records with the delete-item command, bulk deletion of items with the batch-write-item API, and updating TTL (Time-To-Live)

In Amazon DynamoDB, every item in a table is uniquely identified by its primary key, so the primary key must be included with every item written to a DynamoDB table. In my case, it is a composite primary key, where the first attribute is the partition/hash key and the second attribute is the sort/range key.
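For illustration, the key of such an item (using the attribute names assumed throughout this post, primary_key and time_stamp, with made-up values) looks like this in DynamoDB JSON:

{ "primary_key": {"S": "device-0001"}, "time_stamp": {"N": "1588204800"} }

Every delete discussed below has to supply a key in exactly this form.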

1) Drop the existing table & re-create it

If you need to delete all the items, you may consider dropping and re-creating the DynamoDB table.

This is the fastest and simplest method when every item in the table has to go, since no time is spent scanning and deleting each item.

DeleteTable command/API deletes the table and all of its items. Table deletion may take around 2-4 minutes.

If you want to keep the settings you had before, capture the existing table definition first, for example with describe-table, or generate a create-table skeleton using the --generate-cli-skeleton flag, and reuse it when re-creating the table.
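A minimal sketch of capturing the current definition before dropping the table (describe-table returns the key schema, attribute definitions, indexes and throughput settings, which you can mirror in the create-table call; the output file name is arbitrary):

aws dynamodb describe-table --table-name test_data > test_data_definition.json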

aws dynamodb delete-table --table-name test_data

aws dynamodb create-table --table-name test_data --attribute-definitions AttributeName=primary_key,AttributeType=S AttributeName=time_stamp,AttributeType=N --key-schema AttributeName=primary_key,KeyType=HASH AttributeName=time_stamp,KeyType=RANGE --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1

But there will be downtime for the table. The downtime duration depends on the table deletion time and creation time, plus the time to change/update DynamoDB table properties.

I tried this approach on non-production databases, but such a downtime approach cannot be followed in production.

If you can afford to use a new table name, which involves an application code change, create a brand-new table, start writing to it, and remove the old table later. There is no direct way of renaming a DynamoDB table.

aws dynamodb create-table --table-name test_data_new --attribute-definitions AttributeName=primary_key,AttributeType=S AttributeName=time_stamp,AttributeType=N --key-schema AttributeName=primary_key,KeyType=HASH AttributeName=time_stamp,KeyType=RANGE --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1

aws dynamodb delete-table --table-name test_data

2) Updating TTL column to delete items from DynamoDB

The TTL (Time-To-Live) attribute is the expiry timestamp of a DynamoDB item. The TTL attribute must be of the Number data type.
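TTL also has to be enabled on the table before DynamoDB acts on the attribute. If it is not already enabled, it can be turned on with update-time-to-live; a minimal sketch, assuming the TTL attribute is named ttl (the name used in the commands below):

aws dynamodb update-time-to-live --table-name test_data --time-to-live-specification "Enabled=true, AttributeName=ttl"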

In our case, data was inserted into the DynamoDB table, without populating the TTL column.

Changing the TTL of items in the table requires updating each item’s TTL attribute. Therefore, you have to scan and filter the items you want to delete, and then run update-item on each of them to set the TTL value.

We cannot update items with batch-write-item; it supports only put and delete requests.

I tried updating the TTL column (where it was zero) with a current/past epoch timestamp (e.g. 1684042494). The epoch time should not be more than five years in the past, otherwise DynamoDB ignores it.

The epoch time is the number of seconds that have elapsed since January 1, 1970.

To get the current epoch time, we can run the “date +%s” command on any UNIX-flavored machine.
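If you need the epoch value for a specific cut-off date rather than the current time, date can convert it as well; a small sketch (the -d flag is GNU date, BSD/macOS date uses -j -f instead):

date -d "2023-02-01 00:00:00 UTC" +%s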

aws dynamodb scan --table-name test_data --projection-expression "primary_key, time_stamp" --expression-attribute-names '{"#ttlive":"ttl"}' --max-items 1000 --filter-expression "#ttlive = :ttzero AND time_stamp < :oldest_date" --expression-attribute-values '{":ttzero":{"N":"0"}, ":oldest_date":{"S":"2023-02-01"}}' > $SCAN_OUTPUT_FILE

aws dynamodb update-item --table-name test_data --key file://$SCAN_OUTPUT_FILE --update-expression "SET #ttlive = :ttvalue" --condition-expression "#ttlive = :ttzero" --expression-attribute-names '{"#ttlive":"ttl"}' --expression-attribute-values '{":ttvalue":{"N":"1684042494"},":ttzero":{"N":"0"}}'

Before running the scan operation, you have to increase RCUs (Read Capacity Units) if sufficient RCUs are not available and the table is not in on-demand capacity mode.
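A minimal sketch of bumping provisioned throughput with update-table before a heavy scan (the capacity numbers are placeholders, size them to your workload; the same command with a higher WriteCapacityUnits value applies before the delete steps later in this post):

aws dynamodb update-table --table-name test_data --provisioned-throughput ReadCapacityUnits=500,WriteCapacityUnits=100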

TTL compares the current time with the time stored in the TTL attribute of an item.

DynamoDB deletes expired items on a best-effort basis to ensure the availability of throughput (WCUs) for other data operations.

After the TTL expires, the record is marked for deletion and is typically deleted within 48 hours (as per AWS). I observed that it can take even longer, sometimes more than 15 days, depending on table size and workload.

TTL deletes are not counted toward capacity units or request units; they are free of cost.

If you need to reduce table size quickly or get rid of older data right away, updating TTL is not the recommended approach.

3) Deleting records from DynamoDB using delete-item command/API

In most RDBMS databases, delete statements work with conditions that accept exact values or patterns, and we can run delete queries using non-primary-key columns in the WHERE clause.

But in NoSQL databases like DynamoDB, we have to provide exact values in the delete condition, and we must pass both the partition key and the sort key value of each item.

In Amazon DynamoDB, the DeleteItem API deletes one item at a time.

With delete-item, conditional deletes (using the --condition-expression argument) are useful for deleting an item only if specific conditions are met.
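A minimal sketch of a conditional single-item delete (the key values are made up; the condition lets the delete succeed only if the item’s ttl attribute is still zero, mirroring the TTL filter above):

aws dynamodb delete-item --table-name test_data --key '{"primary_key":{"S":"device-0001"},"time_stamp":{"N":"1588204800"}}' --condition-expression "#ttlive = :ttzero" --expression-attribute-names '{"#ttlive":"ttl"}' --expression-attribute-values '{":ttzero":{"N":"0"}}'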

As items are deleted, they are removed from any local secondary indexes and global secondary indexes, if there are any.

I was performing scans and filtering items from the table to delete them.

aws dynamodb scan --table-name "$TABLE_NAME" --projection-expression "primary_key, time_stamp" --filter-expression "time_stamp < :oldest_date" --expression-attribute-values '{":oldest_date":{"S":"2023-02-01"}}' --max-items 1000 --total-segments "$TOTAL_SEGMENT" --segment "$SEGMENT_NUMBER" > $SCAN_OUTPUT_FILE

cat $SCAN_OUTPUT_FILE | jq -r ".Items[] | tojson" | tr '\n' '\0' | xargs -0 -I keyItem aws dynamodb delete-item --table-name "$TABLE_NAME" --key=keyItem

A single scan operation reads up to the maximum number of items specified by --max-items (the Limit parameter).

Scan operations proceed sequentially. For faster performance on larger tables, we can request a parallel scan by providing the Segment and TotalSegments parameters.

TotalSegments denotes the total number of workers that will access the table concurrently.

Segment denotes an individual segment of the table to be accessed by the calling worker/script. The segment must be greater than or equal to 0, and less than the value provided for TotalSegments.

I was using segmentation/pagination with 10 segments, each deleting 1,000 items. Deleting those 10,000 items took the script around 20-30 minutes.

4) Bulk Deletion of DynamoDB items using (AWS CLI) batch-write-item

The BatchWriteItem operation puts or deletes multiple items in one or more tables.

To speed up the process, instead of deleting items one by one, you can delete up to 25 items in one BatchWriteItem call.

batch-write-item in the CLI requires a RequestItems JSON document that specifies the individual delete requests.

By using the jq utility, we can convert each scan result page into the RequestItems format.
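For reference, the RequestItems file that the pipeline below produces looks roughly like this (the key values are made up; a real file holds up to 25 DeleteRequest entries per table):

{ "test_data": [
{"DeleteRequest": {"Key": {"primary_key":{"S":"device-0001"},"time_stamp":{"N":"1588204800"}} }},
{"DeleteRequest": {"Key": {"primary_key":{"S":"device-0002"},"time_stamp":{"N":"1588204860"}} }}
] }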

aws dynamodb scan --table-name "test_data" --projection-expression "primary_key, time_stamp" --filter-expression "time_stamp < :oldest_date" --expression-attribute-values '{":oldest_date":{"S":"2021-04-01"}}' --max-items 25 --total-segments "$TOTAL_SEGMENT" --segment "$SEGMENT_NUMBER" > $SCAN_OUTPUT_FILE

cat $SCAN_OUTPUT_FILE | jq -r ".Items[] | tojson" | awk '{ print "{\"DeleteRequest\": {\"Key\": " $0 " }}," }' | sed '$ s/.$//' | sed '1 i { "test_data": [' | sed '$ a ] }' > $INPUT_FILE

aws dynamodb batch-write-item --request-items file://$INPUT_FILE

Before running the scan command, increase RCUs (Read Capacity Units), and before running the batch-write-item command, increase WCUs (Write Capacity Units).

I used more segments, which reduces the result set of each scan segment and allows one parallel delete process per segment. This increases the parallelism of deletes and speeds up the removal of items from the table.

The individual DeleteItem operations within a BatchWriteItem call are atomic, but the batch as a whole is not. If some requests fail (for example, because of throttling), only the failed requests are returned, in the UnprocessedItems field of the response.
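The scripts below do not retry those failed requests; a minimal sketch of a retry loop, assuming the batch-write-item output is saved to a file (the UnprocessedItems map uses the same format as RequestItems, so it can be fed straight back in; $BATCH_OUTPUT_FILE and $RETRY_FILE are placeholder names):

aws dynamodb batch-write-item --request-items file://$INPUT_FILE > $BATCH_OUTPUT_FILE
# resubmit until UnprocessedItems comes back empty (ideally with a short sleep between retries)
while [ "$(jq '.UnprocessedItems | length' $BATCH_OUTPUT_FILE)" -gt 0 ]
do
jq '.UnprocessedItems' $BATCH_OUTPUT_FILE > $RETRY_FILE
aws dynamodb batch-write-item --request-items file://$RETRY_FILE > $BATCH_OUTPUT_FILE
done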

As of May 2023, 25 requests per batch-write-item call is a hard limit in DynamoDB and cannot be increased.

Obviously, we will have far more than 25 records in our list of records to delete, so we have to use loops to iterate over all of them.

Scanning using loops (for each segment):

vi scan_dynamo_table.sh

TOTAL_SEGMENTS=$1
SEGMENT_NUMBER=$2
SCAN_OUTPUT="scan-output-segment${SEGMENT_NUMBER}.json"
SCAN_AGGREGATE="scan-agg-segment${SEGMENT_NUMBER}.json"

# first scan page for this segment
aws dynamodb scan --table-name "test_data" --projection-expression "primary_key, time_stamp" --filter-expression "time_stamp >= :start_date and time_stamp <= :end_date" --expression-attribute-values '{":start_date":{"S":"2023-01-01"}, ":end_date":{"S":"2023-03-31"}}' --max-items "1000" --total-segments "${TOTAL_SEGMENTS}" --segment "${SEGMENT_NUMBER}" > ${SCAN_OUTPUT}

NEXT_TOKEN="$(cat ${SCAN_OUTPUT} | jq -r '.NextToken')"
cat ${SCAN_OUTPUT} | jq -r ".Items[] | tojson" > ${SCAN_AGGREGATE}

# keep paginating while the CLI returns a NextToken (jq -r strips the quotes so the token can be passed back as-is)
while [ -n "$NEXT_TOKEN" ] && [ "$NEXT_TOKEN" != "null" ]
do
aws dynamodb scan --table-name "test_data" --projection-expression "primary_key, time_stamp" --filter-expression "time_stamp >= :start_date and time_stamp <= :end_date" --expression-attribute-values '{":start_date":{"S":"2023-01-01"}, ":end_date":{"S":"2023-03-31"}}' --max-items "1000" --total-segments "${TOTAL_SEGMENTS}" --segment "${SEGMENT_NUMBER}" --starting-token "${NEXT_TOKEN}" > ${SCAN_OUTPUT}

NEXT_TOKEN="$(cat ${SCAN_OUTPUT} | jq -r '.NextToken')"
cat ${SCAN_OUTPUT} | jq -r ".Items[] | tojson" >> ${SCAN_AGGREGATE}
done

Do the scanning using N segments with the required selection criteria, which will generate N files (scan-agg-segment*.json) containing all the records to be deleted.

vi run_scan_table.sh

TOTAL_SEGMENTS=100          # N, total number of segments

for SEGMENT in `seq 0 $((${TOTAL_SEGMENTS}-1))`
do
nohup sh scan_dynamo_table.sh ${TOTAL_SEGMENTS} ${SEGMENT} &
done

Deleting using loops (for each segment):

Then you can run N batch-write-item scripts, as shown below; each reads its respective file (generated above) and deletes 25 records at a time in a loop. Here N can be 10, 1000, or any number (depending on your server).

vi batch_write_item_delete.sh

SEGMENT_NUMBER=$1
SCAN_AGGREGATE="scan-agg-segment${SEGMENT_NUMBER}.json"
SEGMENT_FILE="delete-segment${SEGMENT_NUMBER}.json"
MAX_ITEMS=25 # maximum number of items batch-write-item accepts
printf "starting segment - ${SEGMENT_NUMBER} \n" > ${SEGMENT_FILE}

# loop until a pass produces an empty request file (no records left in this segment's aggregate)
until [ ! -s ${SEGMENT_FILE} ]
do
awk "NR>${CNT:=0} && NR<=$((CNT+MAX_ITEMS))" ${SCAN_AGGREGATE} | awk '{ print "{\"DeleteRequest\": {\"Key\": " $0 " }}," }' | sed '$ s/.$//' | sed '1 i { "test_data": [' | sed '$ a ] }' > ${SEGMENT_FILE}

# only call batch-write-item when the request file actually has content
[ -s ${SEGMENT_FILE} ] && aws dynamodb batch-write-item --request-items file://${SEGMENT_FILE}
CNT=$((CNT+MAX_ITEMS))
done

Run the above delete script.

vi run_batch_delete.sh

TOTAL_SEGMENTS=100        # N, total number of segments/files

for SEGMENT in `seq 0 $((${TOTAL_SEGMENTS}-1))`
do
nohup sh batch_write_item_delete.sh ${SEGMENT} &
done

By using the delete-item command, with 100 segments, the scripts were able to delete 2,500-2,700 items per minute (on a c5.4xlarge EC2 machine).

By using the batch-write-item command, with 100 segments, the scripts were able to delete 28,000-45,000 items per minute (on a c5.4xlarge EC2 machine).

I tried the same delete operation with r5 (r5.4xlarge) and m5 (m5.4xlarge) EC2 machines.

batch-write-item helped me delete more than 5 million records per hour, with m5/r5/c5 instance types.

Summary

jq, a JSON query processor, is very helpful for manipulating and formatting the JSON output returned by DynamoDB queries.

For pre-production databases or any non-critical AWS DynamoDB tables, it is better to follow the downtime approach: drop the existing table, recreate a blank table and start using it.

For production databases and critical Amazon DynamoDB tables, the recommendation is to use batch-write-item to purge terabytes of data.

batch-write-item (with DeleteRequest) is 10 to 15 times faster than delete-item.
