Loading Large Datasets into DynamoDB Tables Efficiently

Abhisek Roy
Credit Saison (India)
5 min read · Jun 27, 2023

One of the easiest databases to use in the AWS ecosystem is DynamoDB. It’s a fully managed solution, so you just need to create tables and start using them. You pay only for what you use: reads, writes, and storage.

An everyday use case is moving offline files such as CSVs into a DynamoDB table for quick access from EC2 instances or Lambdas. However, there is no built-in DynamoDB functionality to do this. Most articles you will find on the internet will tell you to set up the DynamoDB table in On-Demand mode and let it handle the load whenever you dump a new CSV file into it. However, this is not the most efficient way to go about things.

To understand why using DynamoDB in On-Demand mode is not the most efficient option, let’s go over some figures.

On-Demand vs Provisioned Pricing for DynamoDB

Pricing for on-demand DynamoDB reads and writes:

| On Demand Throughput Type   | Price              |
| --------------------------- | ------------------ |
| Write Request Units | $1.25 per million |
| Read Request Units | $0.25 per million |

Pricing for provisioned DynamoDB reads and writes:

| Provisioned Throughput Type   | Price per hour     |
| ----------------------------- | ------------------ |
| Write Capacity Units | $0.00065 per WCU |
| Read Capacity Units | $0.00013 per RCU |

One mode charges for the exact number of reads and writes you perform, while the other charges per hour for the capacity you provision, regardless of how much of it you use. This can lead to confusion, so let’s simplify it with an example.

Suppose you need to copy a CSV file with 15 million rows into a DynamoDB table, and each row fits within 1 KB, so each insert consumes one write request unit. In On-Demand mode, AWS will charge you 15 * $1.25 = $18.75.

Now, if you instead provision 4,000 WCUs (one WCU lets you write one item of up to 1 KB per second), you can write the data in 15,000,000/4,000 = 3,750 seconds, which is 3,750/(60*60) ≈ 1.04 hours. The total cost would be 1.04 * 4,000 * $0.00065 = $2.704.
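
As a quick sanity check, here is the same arithmetic in a few lines of Python (a rough sketch; the row count, WCU figure, and prices are the ones used above, and each row is assumed to fit within 1 KB):

ON_DEMAND_PRICE_PER_MILLION_WRITES = 1.25  # $ per million write request units
WCU_PRICE_PER_HOUR = 0.00065               # $ per provisioned WCU per hour

rows = 15_000_000        # rows to load, each assumed to be under 1 KB
provisioned_wcu = 4000   # WCUs provisioned for the duration of the load

on_demand_cost = rows / 1_000_000 * ON_DEMAND_PRICE_PER_MILLION_WRITES
hours_needed = rows / provisioned_wcu / 3600  # 1 WCU = 1 write of up to 1 KB per second
provisioned_cost = hours_needed * provisioned_wcu * WCU_PRICE_PER_HOUR

print(round(on_demand_cost, 2), round(provisioned_cost, 2))  # 18.75 2.71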

While the absolute difference may seem small, the critical point is that, with careful planning, we can reduce the cost roughly seven-fold.

Provisioned DynamoDB for On-Demand Use Cases

Now imagine you receive 5 different files (say user data, product data, inventory data, website data, and warehouse data) once a day, every day, at 10 o'clock, and each file contains a few million rows. All you need to do is bump up the WCU count on the 5 DynamoDB tables in question, copy the data from the CSV files into the 5 tables, and then decrease the WCU count again so that you don’t incur massive costs.

To orchestrate this process, we used a Step Function with each stage implemented as a Lambda. In our case, we weren’t receiving the files directly, which is why our first step is RunQuery: we run a query on Athena to generate a CSV file. If you receive the file from an external source, you can start from the second stage, SplitFile.

Fig: Step function for saving large CSV files in DynamoDB tables.
import awswrangler as wr

CHUNK_SIZE = 100_000  # split the source file into chunks of 100,000 rows

# file_location and chunk_path are S3 paths passed in through the Lambda event
for i, chunk in enumerate(wr.s3.read_csv(file_location, path_suffix='csv', chunksize=CHUNK_SIZE, encoding='cp1252')):
    wr.s3.to_csv(chunk, path=chunk_path + '{}.csv'.format(i), index=False)

The code above is the heart of the SplitFile task. Here, we split the source file into multiple files of 100,000 rows each. Why are we doing this? Let that be a surprise till the end.

In the next step, IncreaseDynamoCapacity, we increase the WCU (Write Capacity Units) of the DynamoDB table into which we will be saving the CSV data. We keep a config file that stores the WCU required to process each file; you can do the same. Imagine that for the user data file, we increase the WCU of the user_data DynamoDB table to 2,500 units.
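
We implemented this stage as a Lambda. A minimal sketch of what it could look like with boto3's update_table call (the table name, WCU value, and RCU default here are illustrative; in our case they come from the config file):

import boto3

dynamodb = boto3.client('dynamodb')

def increase_capacity(table_name, wcu, rcu=1):
    # Switch the table to the desired provisioned throughput.
    # The table goes into the UPDATING state until the change takes effect.
    dynamodb.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            'ReadCapacityUnits': rcu,
            'WriteCapacityUnits': wcu,
        },
    )

increase_capacity('user_data', wcu=2500)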

Next, we have a state called CheckIfTableUpdated, which checks whether the DynamoDB table has finished updating. It verifies that the table is in the ACTIVE state (which would mean its new WCU count has been applied) and fails if the table is still updating. We have retry logic configured on the CheckIfTableUpdated stage so that it re-checks every 5 minutes or so until the table is ready. This is needed because a capacity change does not take effect instantly; the table sits in the UPDATING state for a while.
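
A minimal sketch of that check, assuming boto3's describe_table call (raising an exception is simply one way to make the Lambda fail so that the retry policy on the state kicks in):

import boto3

dynamodb = boto3.client('dynamodb')

def check_table_ready(table_name):
    status = dynamodb.describe_table(TableName=table_name)['Table']['TableStatus']
    if status != 'ACTIVE':
        # Failing here triggers the retry configured on CheckIfTableUpdated
        raise RuntimeError('{} is still in state {}'.format(table_name, status))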

Once the table is ready to write to, we move to the AddDatainDynamo stage. This is a map state, and the core logic that goes into it is:

import awswrangler as wr
import numpy as np
from decimal import Decimal

SPLIT_LENGTH = 10000

def float_to_decimal(num):
    # DynamoDB does not accept floats, so convert them to Decimal
    return Decimal(str(num))

def df_to_dynamodb(df, table_name):
    df = df.fillna(0)

    # convert any floats to decimals
    for i in df.columns:
        datatype = df[i].dtype
        if datatype == 'float64':
            df[i] = df[i].apply(float_to_decimal)

    # write to dynamodb
    wr.dynamodb.put_df(df=df, table_name=table_name)

# file_key and table_name come from the map state's input
df = wr.s3.read_csv(path=file_key)
df_length = len(df.index)

number_of_splits = df_length // SPLIT_LENGTH

if number_of_splits > 0:
    frames = np.array_split(df, number_of_splits)
    for frame in frames:
        df_to_dynamodb(frame, table_name=table_name)
else:
    df_to_dynamodb(df, table_name=table_name)

Since this is a map state, it runs once for each of the files we created in the SplitFile stage. So imagine your source file had 10,000,100 rows. It would be broken into 101 files: 100 files of 100,000 rows and 1 file of 100 rows, so this step would run 101 times. Within each run, we further break the file into chunks of 10,000 rows so that we write the data to the DynamoDB table 10,000 rows at a time.

Fig: The process of splitting the source file and saving it to a DynamoDB Table

After this, we have the DecreaseDynamoCapacity stage, where we decrease the WCU of the DynamoDB table back to its original count, which can be as low as 1.

Wrapping up…

In short, we had to copy a massive CSV file into a DynamoDB table, so we a) broke it into chunks of a very specific size, b) increased the WCU of the table, c) copied the data chunk by chunk, and then d) decreased the WCU of the table. The reason we broke the file into smaller files and then copied the data in chunks is two-fold:

  1. We broke the individual files into chunks of 10,000 rows so that we could make do with fewer WCUs and settle on a fixed maximum WCU count to maintain for a limited time, without hitting errors about insufficient WCUs.
  2. We broke the source file into multiple files of 100,000 rows because a Lambda has a timeout of 15 minutes; if we had started from the source file and pulled 10,000-row chunks from it to write to the DynamoDB table, we would have hit that timeout (see the quick calculation below).
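
For a rough sense of the timing, here is a quick back-of-the-envelope calculation (the 2,500 WCU figure is the illustrative one from earlier):

WCU = 2500                    # illustrative value from the IncreaseDynamoCapacity example
LAMBDA_TIMEOUT_SECONDS = 15 * 60

print(100_000 / WCU)          # 40 seconds per 100,000-row split file -- well within the limit
print(15_000_000 / WCU / 60)  # ~100 minutes for the full file -- far beyond a single Lambda's limit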

In this way, we were able to create a low-cost serverless solution that provisions WCUs for a DynamoDB table only for the time it takes to copy the data. This helped us avoid the higher cost of either keeping a DynamoDB table at a very high provisioned WCU or running it in on-demand mode.
