DynamoDB Table Migration Between Accounts

If you’re like me, you have a love-hate relationship with DynamoDB. It’s an excellent and simple service that can scale (to petabytes!) and handle high throughput. The pain that comes with it is that it’s hard to move data around.

Let’s assume you have more than one AWS account, as most big companies do. Now, let’s say you want to move a table, a big one, with more than 1TB of data. What are our options?

The first option is AWS Data Pipeline(1). I’m not a fan of this solution since it doesn’t let me solve the problems myself; if you ask AWS support, they will surely answer, but I prefer something I have more control over. Another option is Glue(2), an excellent service, but it also forces you to use the Glue crawler, which is great but doesn’t do so well on big tables. And recently we got a new option in the API/console(3) that can restore a DDB table from another account (using AWS Backup).

In this blog post, I will show you another way to copy a table using Spark on EMR with a “little” project from AudienceProject, which took AWS’s DynamoDB Spark driver and enhanced it. This gives you much more than simply copying a table, and I will show another usage at the end.

Creating Spark cluster with EMR

First, let’s create a Spark cluster on EMR. You’ll need to pay attention to a few tricks, especially if you don’t have a significant Spark/EMR background (something I don’t have).

(Screenshot: creating the EMR cluster from the AWS console)
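If you prefer scripting this step instead of clicking through the console, a rough boto3 sketch of the same cluster is below. The release label, instance types, counts, key name and instance profile name are all assumptions; swap in your own (the custom instance profile is the role we create in the IAM section towards the end).

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical names and sizes -- tune them for your workload.
# The release label is an assumption; pick one whose Spark build matches
# the Scala 2.12 driver used below (EMR 6.x does).
response = emr.run_job_flow(
    Name="ddb-migration-cluster",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        "Ec2KeyName": "my-ssh-key",            # the key you will SSH in with
        "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster up for interactive work
    },
    JobFlowRole="aws-emr-dynamodb-profile",    # custom EC2 instance profile (replaces EMR_EC2_DefaultRole)
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])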

Now that we have our cluster, we’re almost done. We will need to create the IAM roles & policies (tiny step), but I’ll cover that towards the end.

Spark magic

Since I’m only showing how to copy a table, I won’t manipulate it, but you — well… you can turn it into a water bear if you want.

First, log in to the master node of your EMR cluster (you need the SSH key you used when creating the cluster).

Now run Pyspark with this command:

pyspark --packages com.audienceproject:spark-dynamodb_2.12:1.1.2

Basically, I’m adding the AudienceProject driver to our PySpark session. A little note here: I really recommend working with “screen” (or an equivalent) to keep the session running while you eat, sleep, code, repeat.

Okay, now we’re getting things done. We have a session that won’t close, we have Spark running with the DDB driver (enhanced!), and the last part is the DataFrame code. I don’t know a lot about DataFrames, but I do know they work like magic.

Read DF code

movers = spark.read.option("tableName", "source.ddb.table") \
    .format("dynamodb") \
    .option("targetCapacity", 0.95) \
    .option("bytesPerRCU", 4096) \
    .option("defaultParallelism", 16) \
    .option("maxPartitionBytes", 20971520) \
    .load()

Write DF code

movers.write.option("tableName", "target.ddb.table") \
    .format("dynamodb") \
    .mode("append") \
    .option("roleArn", "arn:aws:iam::123456789012:role/aws-emr-dynamodb-role") \
    .option("targetCapacity", 0.95) \
    .option("throughput", 4194304) \
    .save()
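Once the write finishes, a quick sanity check is to read the target table back through the same driver and compare counts with the source DataFrame. This is a sketch that assumes the driver’s roleArn option works for reads the same way it does for writes; keep in mind that counting both sides is a full scan, so it gets expensive on really big tables.

# Read the freshly written target table back through the assumed role.
copied = spark.read.format("dynamodb") \
    .option("tableName", "target.ddb.table") \
    .option("roleArn", "arn:aws:iam::123456789012:role/aws-emr-dynamodb-role") \
    .load()

print("source rows:", movers.count())
print("target rows:", copied.count())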

The crux of it is the parallelism and the partition size. I’m using as many partitions as I can with a high degree of parallelism, which, in my tests, translated to roughly 1K write objects per worker. The worker machines are rather small, so that’s okay. Make sure to set the write capacity on the target table to provisioned (and provisioned high): in my tests the source table worked fine in “on-demand” mode (though provisioned will save you money there as well), but the target table must use provisioned write capacity.
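Switching the target table to provisioned capacity can be done from the console, or with a small boto3 call like the sketch below. The capacity numbers are placeholders, and the call has to run with credentials in the target table’s account (or after assuming the account B role from the IAM section below).

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Put the target table into provisioned mode before the copy; scale it back
# down (or back to on-demand) once the job is done.
dynamodb.update_table(
    TableName="target.ddb.table",
    BillingMode="PROVISIONED",
    ProvisionedThroughput={
        "ReadCapacityUnits": 100,      # placeholder
        "WriteCapacityUnits": 40000,   # placeholder -- size this for the copy job
    },
)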

Gaining access with IAM role

The last piece of this puzzle is IAM; there are four steps you need to follow (in this setup, account A runs the EMR cluster and holds the source table, while account B holds the target table):

1. Create an EMR EC2 role in account A; this replaces the default one, “EMR_EC2_DefaultRole”

2. Create a role in account B with the needed DynamoDB access

3. Add a trust relationship to the role in account B that allows it to be assumed by the account A EMR EC2 role

4. Add a policy to the account A role that allows it to assume the account B role

Code example

The role in account A has:

  • AWS policy — AmazonElasticMapReduceforEC2Role
  • AWS policy — AmazonKinesisReadOnlyAccess
  • Custom policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Resource": [
        "arn:aws:iam::<Account B ID>:role/<Name of the role in account B>"
      ],
      "Effect": "Allow"
    }
  ]
}
  • Trusted entities — The identity provider(s) ec2.amazonaws.com

The role in account B has:

  • Custom policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "dynamodb:BatchGet*",
        "dynamodb:DescribeStream",
        "dynamodb:DescribeTable",
        "dynamodb:Get*",
        "dynamodb:Query",
        "dynamodb:Scan",
        "dynamodb:BatchWrite*",
        "dynamodb:CreateTable",
        "dynamodb:Delete*",
        "dynamodb:Update*",
        "dynamodb:PutItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:<Account B ID>:table/*"
      ],
      "Effect": "Allow"
    }
  ]
}
  • Trusted entities — arn:aws:iam::<Account A ID>:role/<Account A role name>
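For step 3, the trust relationship on the account B role (the trusted entities entry above) boils down to a policy along these lines, with the account ID and role name as placeholders:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account A ID>:role/<Account A role name>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}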

Thanks to our Security team (Cyber, not the one in the lobby, although we love them too), who helped me figure this out.

Please note that my settings here can probably be improved; feel free to do so.
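Before launching the actual Spark job, it can save time to verify the cross-account trust from the master node with a single STS call. A minimal sketch, reusing the role ARN from the write example above:

import boto3

sts = boto3.client("sts")

# If the trust relationship and policies are wired correctly, this returns
# temporary credentials for the account B role; otherwise it fails with AccessDenied.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/aws-emr-dynamodb-role",  # the role in account B
    RoleSessionName="ddb-migration-check",
)
print(creds["Credentials"]["Expiration"])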

Bonus code

I told you at the beginning that I would show you another usage for Spark and DynamoDB. First, a little background: we needed to update the TTL (Time To Live) on all of our tables, deriving that field from a create-date field in the same table. Each table has a different create-date field.

Without further ado, the code:

import pyspark.sql.functions as F

read_df = spark.read.option("tableName", "production.table.that.need.update") \
    .format("dynamodb") \
    .option("targetCapacity", 0.8) \
    .option("bytesPerRCU", 4096) \
    .option("defaultParallelism", 40) \
    .option("maxPartitionBytes", 20971520) \
    .load()

no_ttl_df = read_df.filter(F.col("ttl").isNull())
miss_data_df = no_ttl_df.filter(F.col("create_ts").isNotNull())
small_id_list_df = miss_data_df.filter(F.size(F.col("id")) < 7000)
write_df = small_id_list_df.withColumn("ttl", (F.col("create_ts") / 1000 + 126227808).cast("long"))

write_df.write.option("tableName", "production.table.that.need.update") \
    .format("dynamodb") \
    .mode("append") \
    .option("targetCapacity", 0.8) \
    .save()

Short translation: for each item in the table that has no value in “ttl”, has a value in “create_ts”, and whose “id” array has fewer than 7,000 elements, take the value in “create_ts”, manipulate it, and write it to “ttl”.
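One assumption worth spelling out: DynamoDB only deletes expired items if TTL is actually enabled on the table and pointed at that attribute, which is a one-time setting. With boto3 that looks roughly like this:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Tell DynamoDB which attribute holds the expiry time (epoch seconds).
dynamodb.update_time_to_live(
    TableName="production.table.that.need.update",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "ttl"},
)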

That’s about it. I did not manage to make it work as expected on large tables (> 1TB), but other than that, you can write a migration flow with it or use it to manipulate DDB table data. However you use it, don’t forget to set write provisioning, and treat yourself to a beer when it’s done.

References:

  1. Data Pipeline blog on how to move a DDB table
  2. Using Glue and DDB
  3. Restore a DDB table from backup
