Getting Started with DynamoDB

Published in Imagine Learning Engineering · Jul 13, 2020 · 11 min read

DynamoDB is a big topic. So we’ll start with a short description of what we’re working with and how to design a schema. Then we’ll run a hands-on example. For this tutorial, you will need Docker installed.

DynamoDB is a fully managed NoSQL database hosted on AWS. It’s designed to scale horizontally while keeping latency low, and you pay for the capacity and storage you use. That’s a simple yet loaded statement, and the AWS docs go into more detail on those terms. Here, we’re going to explain how it works.

When using DynamoDB you need to think about your data a little differently than you would with a relational database. For instance, you can often keep all of your domain-specific data in one table, and that table can hold many different types of data. You don’t use foreign keys or joins. Instead of rows we say items. Every item needs a primary key, which is either a partition key alone or the combination of a partition key and a sort key. DynamoDB groups your items by partition key, and how you choose those partitions will greatly affect performance. For more details on the differences between NoSQL and relational databases, see “From SQL to NoSQL”.

Let’s look at an example.

Say you need to track patients that come to see the doctor. Each patient, doctor and visit is unique. It’s best to structure this data properly before it goes to production. Once the table holds production data, changing the structure means running a data conversion. To that end we’ll list all the ways that our data could be accessed before creating our table. Here’s a short list of questions that our data will need to answer.

How many appointments did a patient have last week, month or year?
How many times has a patient been to the office?
Does the patient make appointments with the same doctors each time?
Are the patients happy with their medical care?
Which doctors see the most patients?

Based on these questions, we could have the following item types:

Patient appointment
Patient cancellation
Patient survey
Doctors appointment

Here’s a diagram of these item types, created with the AWS NoSQL Workbench.

The partition and sort key together make our primary key. In the example above, we’re setting the partition key to a PatientID, but we’re not restricted to this. If we needed an item based on office location, we could have a partition key of OfficeID. We could put that in the same table and it wouldn’t be a problem or a challenge. The sort key describes the item type and then the effective date that applies to the item. The sort key goes from general to specific. For instance Appointment#<AppointmentDate>. All dates here are in UTC. We also could have added a CreatedAt datetime and a ModifiedAt attribute. This would allow us to see when an appointment was made and the last time it was updated.
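To make the key layout concrete, here’s a minimal sketch of how such keys could be assembled in JavaScript. The helper names (buildSortKey, buildKey) and the ISO 8601 timestamp format are assumptions for illustration; PK and SK are the attribute names this tutorial uses for the partition key and sort key.

```javascript
// Build a sort key that runs from general (item type) to specific (UTC timestamp).
function buildSortKey(itemType, date) {
  // toISOString() always renders the date in UTC.
  return `${itemType}#${date.toISOString()}`;
}

// Combine the partition key and sort key into one primary key object.
function buildKey(patientId, itemType, date) {
  return { PK: patientId, SK: buildSortKey(itemType, date) };
}

// Illustrative appointment on Oct 5, 2020 at 14:30 UTC (months are zero-based).
const appointmentKey = buildKey(
  'P4855RW',
  'Appointment',
  new Date(Date.UTC(2020, 9, 5, 14, 30))
);
console.log(appointmentKey);
```

Timestamps in this format sort alphabetically in the same order as they sort chronologically, which is what lets a sort key double as a date range.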

Take a moment to notice the second patient in the list, patient P4855RW. They had two appointments and rescheduled one of them. That leaves us with two appointments and one cancellation. These three items are all on the same partition and all apply to the same patient. A query reads from one partition at a time, so everything a question needs should live in the same partition. This constraint makes schema design very important. There are many videos on this. I like AWS re:Invent 2019: Data modeling with Amazon DynamoDB (CMY304) and AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401).

Let’s start working with DynamoDB!

We’re going to start by running DynamoDB locally in a Docker container. There are many ways you can do this. Today we’re going to run a simple container.

docker run -p 8000:8000 amazon/dynamodb-local

Note: Docker on Windows will leave this running even if you Ctrl+C out. I run docker ps to get a list of running containers and docker kill <container-id> to stop the container.

Now that the container is running, go to http://localhost:8000/shell/. You should see the DynamoDB JavaScript shell, which has a code editor and a console.

There are other GUIs out there for you to use. I like this one because you have control through code and it has autocomplete. The autocomplete will give you all the options for a given operation. It can be overwhelming, but it’s great to know what’s available.

In the editor type list and hit Ctrl+Space. It should pop up an autocomplete window. Choose listTables (all) and click the arrow to run it.

We don’t have any tables yet so TableNames, listed on the Console, is an empty array. Let’s create our DoctorAppointments table. Run:
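Here’s a minimal sketch of what the table definition could look like. It assumes string key attributes named PK and SK (the names used throughout this tutorial), illustrative throughput values, and a shell that exposes a low-level client named dynamodb. All of those are assumptions.

```javascript
const table = 'DoctorAppointments';

const createTableParams = {
  TableName: table,
  // Only the key attributes are declared up front; both are strings here.
  AttributeDefinitions: [
    { AttributeName: 'PK', AttributeType: 'S' },
    { AttributeName: 'SK', AttributeType: 'S' },
  ],
  KeySchema: [
    { AttributeName: 'PK', KeyType: 'HASH' },  // partition key
    { AttributeName: 'SK', KeyType: 'RANGE' }, // sort key
  ],
  // Illustrative values only.
  ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 },
};

// In the shell (assuming it provides a `dynamodb` client), the call would be:
// dynamodb.createTable(createTableParams, (err, data) => console.log(err || data));
console.log(JSON.stringify(createTableParams, null, 2));
```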

Now if we list tables, TableNames contains DoctorAppointments.

Simple CRUD Operations

In DynamoDB we put, get, update and delete items. Each item is uniquely identified by its primary key, which here is the combination of the partition key (PK) and the sort key (SK). So if you put an item whose PK and SK match an existing item, you overwrite that item.

There are many options that we can use in our CRUD operations. For a detailed list, use the autocomplete and see Working with Items and Attributes.

Put

Here’s a long list of items to put. You can paste it in and run it once. Having this data in our table will let us run more operations later. Notice that each operation consists of a params definition followed by a call that uses those parameters.

If you look at the put command, the only params required are TableName and Item. I included the optional param of ReturnConsumedCapacity to show how much each item consumed. Without it, you’d see a blank box for each put command and I like more of a visual confirmation. Try it without ReturnConsumedCapacity declared and see what it does. Also notice that I’m referencing a common variable of table for the TableName. That’s what I’d be doing in my applications and I’m mirroring that here.
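As a sketch of the shape described above, here’s a single put. The attribute values (the DoctorID and the date) are made up for illustration, and docClient is assumed to be a document client that the shell provides.

```javascript
const table = 'DoctorAppointments';

// Build put params for one item. TableName and Item are the required parts.
function buildPutParams(item) {
  return {
    TableName: table,
    Item: item,
    ReturnConsumedCapacity: 'TOTAL', // optional, shows what the put consumed
  };
}

const putParams = buildPutParams({
  PK: 'P4855RW',
  SK: 'Appointment#2020-10-05T14:30:00.000Z',
  DoctorID: 'D100', // hypothetical doctor ID
});

// With a document client (assumed to be available as `docClient`):
// docClient.put(putParams, (err, data) => console.log(err || data));
console.log(putParams);
```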

Get

Now, get one of the items we’ve added. First try the autocomplete for get, then I’ll post an example below. TableName and Key are required for a get command.
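A get could look something like this minimal sketch. The key values are illustrative, and docClient is again an assumed name for the shell’s document client.

```javascript
const table = 'DoctorAppointments';

// A get needs the full primary key: both the partition key and the sort key.
function buildGetParams(pk, sk) {
  return {
    TableName: table,
    Key: { PK: pk, SK: sk },
  };
}

const getParams = buildGetParams('P4855RW', 'Appointment#2020-10-05T14:30:00.000Z');

// With a document client (assumed to be available as `docClient`):
// docClient.get(getParams, (err, data) => console.log(err || data.Item));
console.log(getParams);
```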

Update

Here’s an update example. We’re changing the cancellation reason from Conflict to Vacation. We’re also adding a new attribute called Staff, the person who took the cancellation or moved the appointment to a different time. If we needed the Staff attribute on every CancelledAppointment item, we’d have two options. We could handle it in code, covering both the case where Staff is present and the case where it isn’t. Or we could backfill every CancelledAppointment item using a table scan and an update. If you go the table scan route, run it after production starts recording the new data. If you run it before then, you’ll have to run it again to catch anything entered between your update and the release.

It’s also good to know that updating a key value, either the partition key or the sort key, requires creating a new item with all its attributes. After which, you would remove the old item. We are doing this when an Appointment is cancelled and becomes a CancelledAppointment item.

In the following example the ReturnValues parameter is optional.
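Here’s a minimal sketch of update params for that change. The attribute name Reason, the staff member’s name and the key values are assumptions for illustration; Staff and the Conflict to Vacation change come from the description above.

```javascript
const table = 'DoctorAppointments';

const updateParams = {
  TableName: table,
  Key: { PK: 'P4855RW', SK: 'CancelledAppointment#2020-10-05T14:30:00.000Z' },
  // SET overwrites Reason and creates Staff if it does not exist yet.
  UpdateExpression: 'SET #reason = :reason, #staff = :staff',
  // Name placeholders sidestep any clash with DynamoDB reserved words.
  ExpressionAttributeNames: { '#reason': 'Reason', '#staff': 'Staff' },
  ExpressionAttributeValues: { ':reason': 'Vacation', ':staff': 'Jordan' },
  ReturnValues: 'ALL_NEW', // optional, returns the item as it looks after the update
};

// With a document client (assumed to be available as `docClient`):
// docClient.update(updateParams, (err, data) => console.log(err || data.Attributes));
console.log(updateParams);
```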

Delete

Here’s how you’d run a delete. ReturnValues, ReturnConsumedCapacity and ReturnItemCollectionMetrics are all optional.
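A delete could be sketched like this, with all three optional parameters filled in. The key values are illustrative and docClient is an assumed name for the shell’s document client.

```javascript
const table = 'DoctorAppointments';

// Only TableName and Key are required; the rest control what comes back.
function buildDeleteParams(pk, sk) {
  return {
    TableName: table,
    Key: { PK: pk, SK: sk },
    ReturnValues: 'ALL_OLD',             // return the item that was deleted
    ReturnConsumedCapacity: 'TOTAL',     // report the capacity consumed
    ReturnItemCollectionMetrics: 'SIZE', // report item collection size, if any
  };
}

const deleteParams = buildDeleteParams('P4855RW', 'CancelledAppointment#2020-10-05T14:30:00.000Z');

// With a document client (assumed to be available as `docClient`):
// docClient.delete(deleteParams, (err, data) => console.log(err || data));
console.log(deleteParams);
```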

Now is a good time to talk about cost. Each operation you send has a cost associated with it. DynamoDB charges for reading, writing and storing data. If you specify ReturnConsumedCapacity: 'TOTAL', the response shows the capacity units that the operation consumed.

You can reserve read/write capacity, and you’re charged for that capacity regardless of how much of it you use. This is called provisioned capacity. If you go over it, requests can be throttled, so some writes or reads could fail. The other option is on-demand, where you pay per read and write request instead of reserving capacity. There are more costs involved, so I recommend looking at the pricing calculator for more information.

One of the more expensive things you can do is run a table scan. This will look at every item you have across all your partitions. The more items you have the more capacity you’ll use. We only have 9 items right now, so this isn’t a good example of the cost that could be incurred. Here, I’m including ReturnConsumedCapacity to show the cost involved.

You can also add a FilterExpression to your table scan. The filter is applied after the items are read, so the scan consumes the same capacity, but you only get back the results you want, which reduces the amount of data sent to you. For more on table scans see Scan in the AWS documentation.
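Here’s a minimal sketch of a filtered scan, plus a small local stand-in that mimics why the filter doesn’t reduce the cost. The simulateScan helper and the sample items are hypothetical; Count and ScannedCount are the names a real scan response uses for the returned and examined item counts.

```javascript
const table = 'DoctorAppointments';

// Scan params that keep only survey results (the prefix is illustrative).
const scanParams = {
  TableName: table,
  FilterExpression: 'begins_with(SK, :prefix)',
  ExpressionAttributeValues: { ':prefix': 'SurveyResult' },
  ReturnConsumedCapacity: 'TOTAL',
};

// Local stand-in for a filtered scan: every item is read first,
// then the filter trims what is returned.
function simulateScan(items, prefix) {
  const matches = items.filter((item) => item.SK.startsWith(prefix));
  return { Items: matches, Count: matches.length, ScannedCount: items.length };
}

const sampleTable = [
  { PK: 'P283GH', SK: 'Appointment#2020-11-03T16:00:00.000Z' },
  { PK: 'P283GH', SK: 'SurveyResult#2020-11-04T10:00:00.000Z' },
  { PK: 'P4855RW', SK: 'Appointment#2020-10-05T14:30:00.000Z' },
];

const scanResult = simulateScan(sampleTable, scanParams.ExpressionAttributeValues[':prefix']);
console.log(scanResult.Count, scanResult.ScannedCount); // prints 1 3
```

One item comes back, but all three were examined, and the examined items are what you pay for.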

We can also get more than one item at a time with a Query. We specify a key condition expression that describes what we’re looking for. We’re going to use begins_with to select all the appointments for a patient. For more ways to query our data see Working with Queries in DynamoDB.
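A query like that could be sketched as follows. The helper name buildPrefixQuery is an assumption, and docClient is an assumed name for the shell’s document client.

```javascript
const table = 'DoctorAppointments';

// The partition key must match exactly; the sort key can match by prefix.
function buildPrefixQuery(patientId, skPrefix) {
  return {
    TableName: table,
    KeyConditionExpression: 'PK = :pk AND begins_with(SK, :skPrefix)',
    ExpressionAttributeValues: { ':pk': patientId, ':skPrefix': skPrefix },
  };
}

const appointmentQuery = buildPrefixQuery('P4855RW', 'Appointment');

// With a document client (assumed to be available as `docClient`):
// docClient.query(appointmentQuery, (err, data) => console.log(err || data.Items));
console.log(appointmentQuery);
```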

With this query, we should get two items. Please note that a key condition can’t use an ends_with or a contains. This is why we order the values in our sort keys from general to specific. Try out the query with SurveyResult instead of Appointment. Then try adding more appointment items to patient P283GH, dated December of 2020, and specify the year and month by running begins_with Appointment#2020-12. You won’t see the appointment for November of 2020.

Global Secondary Indexes (GSIs)

Let’s look back on the questions that our DynamoDB table needs to answer.

How many appointments did a patient have last week, month or year?

We will use a query to see this. The PK needs to be the PatientID. The sort key gives us our range. We can get everything for a year by running begins_with and using Appointment#2020. Get everything for a month by adding the month: Appointment#2020-10.

To get a week of data, we need to switch from begins_with to between. In the KeyConditionExpression use SK between :skStart and :skEnd. Update the ExpressionAttributeValues, setting :skStart to Appointment#2020-10-01 and :skEnd to Appointment#2020-10-08. Both ends of between are inclusive, but our sort keys include a timestamp after the date, so every appointment on the 8th sorts after :skEnd and is excluded. We only get seven days of data.
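Here’s a minimal sketch of the between query, along with a local check of how the string comparison plays out. The helper names and the timestamp format of the sample sort keys are assumptions.

```javascript
// Query params for a sort key range within one patient's partition.
function buildRangeQuery(patientId, skStart, skEnd) {
  return {
    TableName: 'DoctorAppointments',
    KeyConditionExpression: 'PK = :pk AND SK BETWEEN :skStart AND :skEnd',
    ExpressionAttributeValues: { ':pk': patientId, ':skStart': skStart, ':skEnd': skEnd },
  };
}

// Local stand-in for the comparison: inclusive on both ends, and strings
// are compared character by character (equivalent to byte order for ASCII keys).
function isBetween(sk, skStart, skEnd) {
  return sk >= skStart && sk <= skEnd;
}

const weekStart = 'Appointment#2020-10-01';
const weekEnd = 'Appointment#2020-10-08';
const weekQuery = buildRangeQuery('P283GH', weekStart, weekEnd);
console.log(weekQuery.KeyConditionExpression);

// Late on the 7th still falls inside the range.
console.log(isBetween('Appointment#2020-10-07T23:59:00.000Z', weekStart, weekEnd)); // true
// Anything on the 8th is longer than the end key, so it sorts after it.
console.log(isBetween('Appointment#2020-10-08T09:00:00.000Z', weekStart, weekEnd)); // false
```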

How many times has a patient been to the office?

Use a query where PK = PatientID and SK begins_with Appointment.

Does the patient make appointments with the same doctors each time?

Here we’d have to look at all the Appointments, for a patient, and see which doctors they’re seeing. It’s the same query where PK = PatientID and SK begins_with Appointment.

Are the patients happy with their medical care?

Use a query where PK = PatientID and SK begins_with SurveyResult.

The table’s primary key doesn’t give us a way to answer the last question.

Which doctors see the most patients?

We could add this information using a PK of DoctorID and then tracking all the appointments from the doctor’s point of view. This would require us to duplicate the appointments every time one is made and update them when one is modified. I’m sure you’ve already guessed that there’s an easier way to do this. We can use a Global Secondary Index or GSI.

A GSI allows us to create an index on an attribute. It basically gives us an additional partition key that we can use in our queries and scans (a get can’t target an index). When we put data, DynamoDB updates the index automatically, although the index can lag slightly behind the table. GSIs have their own read/write costs. As an index, they copy data, so they add to storage cost. Adding a GSI to an existing table also takes time. Wait for it to be created before referencing it. Check out the Best practices for Secondary Indexes on the AWS Documentation for more on GSIs.

We’re going to create a GSI where the partition key is the DoctorID. Normally, you can add an item, with any attributes you need to, without defining all the attributes on the table. When you create a GSI, you need to define the attributes that the GSI will use in its key schema. Notice that, in the following definition, we define the DoctorID attribute before we reference it in the GSI index creation.
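Here’s a minimal sketch of adding such an index to the existing table. It assumes the index has only a partition key (no sort key), uses illustrative throughput values, and assumes the shell exposes a low-level client named dynamodb. The index name index_doctor matches the one we query with later.

```javascript
const addIndexParams = {
  TableName: 'DoctorAppointments',
  // The index key attribute has to be declared before the index can use it.
  AttributeDefinitions: [{ AttributeName: 'DoctorID', AttributeType: 'S' }],
  GlobalSecondaryIndexUpdates: [
    {
      Create: {
        IndexName: 'index_doctor',
        KeySchema: [{ AttributeName: 'DoctorID', KeyType: 'HASH' }],
        // ALL copies every attribute of each item into the index.
        Projection: { ProjectionType: 'ALL' },
        // Illustrative values only.
        ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 },
      },
    },
  ],
};

// In the shell (assuming it provides a `dynamodb` client), the call would be:
// dynamodb.updateTable(addIndexParams, (err, data) => console.log(err || data));
console.log(JSON.stringify(addIndexParams, null, 2));
```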

To make things simple we’ve included all attributes in our index. This was defined in the Projection > ProjectionType setting above. Using all attributes makes a copy of all of our data. It also makes writes more expensive, because each write goes to our table and to our index. If we knew exactly what we needed, we could have limited our GSI projection to specific attributes.

To query the GSI, declare the index you’ll use by adding it to the params for your query. We’ll do this by adding IndexName: 'index_doctor'.

Here’s a query for all records for one of our doctors.
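A query against the index could be sketched like this. The doctor ID is a made-up value, and docClient is an assumed name for the shell’s document client.

```javascript
// Query the GSI by its own partition key instead of the table's PK.
function buildDoctorQuery(doctorId) {
  return {
    TableName: 'DoctorAppointments',
    IndexName: 'index_doctor',
    KeyConditionExpression: 'DoctorID = :doctorId',
    ExpressionAttributeValues: { ':doctorId': doctorId },
  };
}

const doctorQuery = buildDoctorQuery('D100'); // hypothetical doctor ID

// With a document client (assumed to be available as `docClient`):
// docClient.query(doctorQuery, (err, data) => console.log(err || data.Items));
console.log(doctorQuery);
```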

A quick word on Local Secondary Indexes (LSIs). You may have seen a different kind of index in the autocomplete or documentation. A local secondary index lets you use a different sort key with your existing partition key. You can do the same thing with a GSI whose partition key is set to your table’s partition key. In our case that would be PK. The problem with LSIs is that you must declare them when your table is created, and they limit each partition key value to 10 GB of stored data. For those reasons, a GSI is usually the safer choice.

This is also a good time to talk about limits. Every operation has a limit associated with it. A query or scan returns at most 1 MB of data per request. If there’s more data available, the response includes a LastEvaluatedKey, which tells you that there are more results to get. To get the next page, run the same query again and pass that value as ExclusiveStartKey, which is an optional parameter. We can demonstrate how this works by setting our own limit on a query. We’ll add the optional param of Limit to restrict how many items we get back from this query.

The result is

Take the LastEvaluatedKey, rename it to ExclusiveStartKey and add it to the query.

You would keep going with this until the results don’t include a LastEvaluatedKey. For more on limits see Service, Account, and Table Quotas in Amazon DynamoDB.
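The paging loop can be sketched as follows. To keep it runnable without a database, fakeQuery is a hypothetical stand-in that pages through an in-memory array; a real query call would be asynchronous and go to DynamoDB.

```javascript
// Stand-in for a real Query call (hypothetical helper for this sketch).
function fakeQuery(allItems, params) {
  const limit = params.Limit || allItems.length;
  // Resume right after the item named by ExclusiveStartKey, if there is one.
  const start = params.ExclusiveStartKey
    ? allItems.findIndex((item) => item.SK === params.ExclusiveStartKey.SK) + 1
    : 0;
  const items = allItems.slice(start, start + limit);
  const result = { Items: items };
  // Only include LastEvaluatedKey when more items remain.
  if (start + items.length < allItems.length) {
    const last = items[items.length - 1];
    result.LastEvaluatedKey = { PK: last.PK, SK: last.SK };
  }
  return result;
}

// Keep querying until a page comes back without a LastEvaluatedKey.
function queryAllPages(runQuery, baseParams) {
  const collected = [];
  let params = { ...baseParams };
  while (true) {
    const page = runQuery(params);
    collected.push(...page.Items);
    if (!page.LastEvaluatedKey) break;
    // Feed LastEvaluatedKey back in under its new name.
    params = { ...baseParams, ExclusiveStartKey: page.LastEvaluatedKey };
  }
  return collected;
}

const sampleItems = [1, 2, 3, 4, 5].map((day) => ({
  PK: 'P283GH',
  SK: `Appointment#2020-12-0${day}`,
}));
const allPages = queryAllPages(
  (params) => fakeQuery(sampleItems, params),
  { TableName: 'DoctorAppointments', Limit: 2 }
);
console.log(allPages.length); // prints 5
```

With a Limit of 2, the five sample items arrive in three pages, and the loop stops on the page that has no LastEvaluatedKey.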

Conclusion and a Challenge

I wrote this as a primer to DynamoDB. There’s so much more to learn. We didn’t go over connecting to DynamoDB in your application or all the libraries that are available in the myriad of languages you can use. Here’s a short tutorial that connects DynamoDB to a .NET Core app.

Starting DynamoDB through a simple docker run command is for testing only, because the data isn’t persistent. You can, however, run it through LocalStack or configure the amazon/dynamodb-local image to get persistence. I use the following in my development docker-compose.yml file. The data is persistent, and is stored in a file in the ./data/dynamodb folder.

We learn by doing, so here are some challenges:

We could expand the example doctor visits problem. Think about what data you’d need if you were to schedule an appointment. Would you use the doctor GSI? Or add more data to accomplish this? How would you get the survey results for a given doctor? What other data questions would apply to an app like this one?
Design a schema for an application. You could take your favorite game and determine what you’d need to store in order to make your game data persistent. You can use the NoSQL Workbench (which has a learning curve to be sure) to document your design.
Another schema idea: run an online store. Track your inventory and sales.
Create a simple app to track birthdays. Allow groups of users to share their data.

I hope this has been of value to you. For further reading I recommend the DynamoDB Guide and the AWS DynamoDB Documentation. Thanks for reading and all the best with your data endeavors!