Syncing DynamoDB between AWS environments

Bo Bracquez
6 min read · Dec 7, 2019


I have to apologise for my lack of activity… This was mainly because I was occupied with looking for partners to work with on my final project for university. I also could not really figure out how I wanted to format this post, and I am still not happy with how it turned out… But here we go!

We have taken a look at setting up a sync between two different endpoints for S3 buckets before, as addressed here and here, and now I would like to continue with DynamoDB. This will probably be the last post in the series specifically about setting up a sync, because all good things come in sets of 3… right? Jokes aside, this is mainly because I want to focus more on integrating the mocking story into a CI/CD environment. I am currently working on making a plugin for Jenkins and I will be documenting that adventure pretty soon, I promise! Enough chit chat, let’s start the p̵a̵r̵t̵y technical part of this post.

A database is not a filesystem

We have already explored the possibilities of syncing S3 buckets. S3 buckets were pretty easy because, let’s face it, it is basically just a filesystem that needed syncing, and a database is a lot different from a filesystem. And as far as I know we cannot just download the database file from AWS and upload it elsewhere…

However, I would still like to build this ‘proof of concept’ in bash because it’s easy and does not require a lot of bells and whistles to set up and test.

A short introduction to DynamoDB

If you’re reading this (if anyone at all is reading this :p), you probably already know what DynamoDB is, but just as a quick recap…

DynamoDB is not your regular SQL-based database like MySQL, MariaDB, MSSQL and the like. DynamoDB is “NoSQL” based. You’re probably wondering “What the hell is a “NoSQL” database? Are we using Excel sheets as a database??”. Yes but no…

  • SQL databases are primarily RDBMS (relational databases), whereas NoSQL databases are non-relational and easier to distribute.
  • SQL databases/tables have a preset layout/schema: think of a table with id/user/password columns, where you cannot just insert something into a column called “ishouldntbehere”. NoSQL has a dynamic layout/schema, so you can insert whatever data in whatever attribute you want. This difference makes NoSQL both powerful and scary.
  • SQL databases use, as the name says, … queries! Think ‘SELECT * FROM MyCatPictures’. This also means that SQL is table based. Meanwhile, NoSQL databases such as DynamoDB are document/key-value based and do not use your traditional SQL queries.

I can go more in depth about the differences and the pros and cons, but then I will probably have lost you… But this should give you a basic view of what a NoSQL database actually is.
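To make the schema point a bit more concrete, here is a small, hedged sketch (the table name MyCatPictures and its partition key PictureId are made up for illustration): two items in the same DynamoDB table can carry completely different attributes, something a fixed SQL schema would reject.

# Two items, same table, different attribute sets
$ aws dynamodb put-item --table-name MyCatPictures \
    --item '{"PictureId": {"S": "cat-001"}, "Caption": {"S": "Sleepy"}}'

$ aws dynamodb put-item --table-name MyCatPictures \
    --item '{"PictureId": {"S": "cat-002"}, "Resolution": {"N": "1080"}, "Tags": {"SS": ["fluffy", "orange"]}}'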

Docs to the rescue?!

Luckily for us, as always, the very nicely documented AWS docs might come in handy. It’s time to explore them a bit, because at this moment I have no clue how we are going to solve this case… I was thinking of doing a big ‘SELECT *’ (note: this is not the correct term for NoSQL, it’s just an example) but I don’t think that scales well.

When exploring the linked docs I noticed something interesting, namely “Using the AWS CLI with Downloadable DynamoDB”. I got my hopes up, the birds started to chirp, the sun started to shine… I had found the solution already! No… sadly enough, I had misunderstood the topic/title a bit. But I did find that we’re able to list all the tables fairly easily, so that’s a win!

$ aws dynamodb list-tables
{
    "TableNames": [
        "Forum",
        "ProductCatalog",
        "Reply",
        "Thread"
    ]
}
// https://docs.aws.amazon.com/cli/latest/reference/dynamodb/list-tables.html#examples

This returns a nice JSON object which could, hopefully, make our work fairly easy. Now we can dynamically iterate through all the tables with ease. We still need to find a way to fetch all our records, though. I started to dig deeper and deeper into the docs until I stumbled upon ‘scan and query’. We seem to be able to create a query with some basic filters (which we do not need in this use case) and get the results back, neat. But what’s the difference between scan & query? It’s rather simple: a query works based on primary key values, while a scan does not require them. However, a scan reads every item in the table and is therefore more costly than a query, and this should be taken into consideration!
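As a quick, hedged illustration (using the Thread table from the AWS docs sample data, whose partition key is ForumName): a query needs the key values, a scan does not.

# query: needs the primary key values
$ aws dynamodb query --table-name Thread \
    --key-condition-expression "ForumName = :name" \
    --expression-attribute-values '{":name": {"S": "Amazon DynamoDB"}}'

# scan: reads every item in the table, no key required (but more expensive)
$ aws dynamodb scan --table-name Thread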

For our use case, scan seems to do the job just right, so let’s go with that. So… we’ve got our table names and the scan ‘utility’, so maybe we can start getting somewhere.

Writing the ‘export’ functionality

Exporting seems rather easy now that we combine the previous information. We are even able to make it a one-liner! We simply call scan with the table-name parameter, supply the region of our instance and the JSON output format, and *poof*, we’ve got our data in a JSON file.

$ aws dynamodb scan --table-name ourname --region us-east-1 --output json > export.json
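And since list-tables already gives us all the table names, we can glue the two together. Here’s a hedged sketch (the region and file layout are just my assumptions) that exports every table to its own JSON file:

#!/usr/bin/env bash
# Export every DynamoDB table in the source environment to <table>.json
set -euo pipefail

REGION="us-east-1"  # placeholder region

for table in $(aws dynamodb list-tables --region "$REGION" --output text --query 'TableNames[]'); do
    echo "Exporting $table..."
    aws dynamodb scan --table-name "$table" --region "$REGION" --output json > "${table}.json"
done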

Writing the ‘import’ functionality

In the code examples from AWS itself there is a nice article about loading data into tables. My brain started thinking that we might be able to use this for our import. The article advertises the functionality as a simple one-liner, damn, this looks promising! As per the demo code…

$ aws dynamodb batch-write-item --request-items file://ProductCatalog.json

I adjusted this command to suit my needs (as in the file name) but noticed that it was not working. Sadly enough, I could not immediately find out why. After some shameful hours of searching and experimenting I found out that our ‘export’ function is the culprit! The scan does not output the same format that batch-write-item needs. This set me back quite a bit; I would need to go back to the drawing board, and it demotivated me somewhat.
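To make the mismatch concrete: scan gives us something shaped like the first snippet below, while batch-write-item wants its request-items shaped like the second (the table name and attributes are just placeholders).

// What `scan` returns (simplified)
{
    "Items": [
        { "Id": { "N": "1" }, "Title": { "S": "Example" } }
    ],
    "Count": 1,
    "ScannedCount": 1
}

// What `batch-write-item --request-items` expects
{
    "my-local-table": [
        { "PutRequest": { "Item": { "Id": { "N": "1" }, "Title": { "S": "Example" } } } }
    ]
}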

Revisiting the ‘export’ functionality

After testing a lot of things, even using external tools and writing dirty hacks, I was almost ready to give up… But I knew this must be possible; I could not be the only one with this exact issue! After some time I found a possible way on StackOverflow. Thank you, stranger on the internet, for sharing your wisdom with me! Kudos to you, you just saved me :)

The StackOverflow answer advises using ‘jq’, a CLI-based JSON processor. Why didn’t I think of this? Well, I sort of did, but I could not figure out how to do it in a feasible way… We are basically going to reformat our output into the JSON schema that batch-write-item expects!

$ aws dynamodb scan --table-name my-prod-table \
  | jq '{"my-local-table": [.Items[] | {PutRequest: {Item: .}}]}' > data.json

Now we adjust the command to our needs, run it, and then run our import function… It works! It’s alive! The behaviour is a bit ugly, but hey, it’s just to show you that it’s easily possible!
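One caveat I should mention: batch-write-item only accepts up to 25 put requests per call, so for anything but tiny tables the scan output has to be split into chunks first. A hedged sketch of how that could look (table names are placeholders, and I have not battle-tested this):

# Turn the scan output into one batch-write-item payload per 25 items (one JSON object per line).
$ aws dynamodb scan --table-name my-prod-table --output json \
  | jq -c '.Items as $items
           | range(0; ($items | length); 25) as $i
           | ($i + 25) as $j
           | {"my-local-table": [$items[$i:$j][] | {PutRequest: {Item: .}}]}' \
  > chunks.ndjson

# Feed each chunk to batch-write-item separately.
$ while IFS= read -r chunk; do
    aws dynamodb batch-write-item --request-items "$chunk"
  done < chunks.ndjson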

This process kind of felt like creating Frankenstein’s monster, and this is mainly why it took me a while to get this post out… I tried to make it as comprehensive as possible, stitching everything together and eventually getting it to work. It is somewhat shameful to say, but I left out a lot of the ‘behind the scenes’ mess I made while doing this, and maybe that’s for the better too.


Next up is some work with Jenkins, stay tuned!

