DynamoDB

A First Look

Mauricio Strello
Globant
14 min read · Feb 22, 2022


A view of a library from top
Photo by Tobias Fischer on Unsplash

DynamoDB, a NoSQL Data Engine

AWS defines DynamoDB as a fully managed NoSQL database service with great performance and scalability. Sounds cool, but what does all this really mean for a development team facing a NoSQL engine for the first time? In this article I want to explain what it means, and what you should keep in mind to approach DynamoDB correctly and take full advantage of it. To do this, we are going to focus on understanding the advantages of DynamoDB, and why an adequate design is essential to get the most out of the service.

Disclaimer: This is a short article and in no way intends to do more than give you an overview of the most important aspects for designing in DynamoDB and how to dive deeper. Please don’t think that reading a 10-minute article is enough to become a DynamoDB expert!

Let’s start at the beginning

The first thing to understand is that in the NoSQL world in general (and in DynamoDB in particular) SQL is not used as a query language (as we might suspect from the name, right?), and data is not stored the way a relational engine stores it. Within NoSQL engines, DynamoDB is classified as a hybrid key-value / wide-column store. Simplifying a lot, this means you can think of DynamoDB as a big hash table, except that each entry in the table can, if needed, hold a B-tree, so you can look up “similar” records very quickly (we’ll see what this means later).
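To make the hash-table-with-B-trees picture concrete, here is a toy model in Python. It is only an illustrative sketch (all class and method names are my own, and a sorted list stands in for the B-tree), not how DynamoDB is actually implemented:

```python
# Toy model of DynamoDB's hybrid key-value / wide-column layout:
# a hash table keyed by partition key, where each bucket keeps its
# items ordered by sort key (a sorted list standing in for a B-tree).
from bisect import insort

class ToyTable:
    def __init__(self):
        self._partitions = {}  # partition key -> sorted list of (sort_key, item)

    def put(self, pk, sk, item):
        # Insert keeping the bucket sorted by sort key
        insort(self._partitions.setdefault(pk, []), (sk, item))

    def get(self, pk, sk):
        # O(1) hash lookup of the bucket, then a walk over the sorted items
        for s, item in self._partitions.get(pk, []):
            if s == sk:
                return item
        return None

    def query_prefix(self, pk, prefix):
        # "Similar" records: items in one bucket whose sort key shares a
        # prefix, returned in sort-key order
        return [item for s, item in self._partitions.get(pk, [])
                if s.startswith(prefix)]
```

The key property to notice is that locating a bucket never requires scanning the rest of the data, which is where the consistent performance comes from.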

But why should we bother learning a new paradigm if relational engines have served us well for over 50 years [1]? There are several advantages (and problems), but if I had to use only one word, it would be scalability.

[1]: Since 1970, to be more exact! You can read some history here: https://twobithistory.org/2017/12/29/codd-relational-model.html

In NoSQL (and in DynamoDB!) you design with the goal of performing the most common queries as quickly and efficiently as possible. And if you do it right, the infrastructure will maintain consistent performance, with response times of less than ten milliseconds, regardless of the amount of data in your database!

Why am I talking about design? You have surely heard that NoSQL is schemaless, right? And therefore you can put whatever you can think of in each record, right? That’s a little (and slightly dirty) secret; although there is effectively no schema per table, the reality is that you do need to think and design carefully.

Finally, what does “fully managed service” mean? It means that once you design your tables and make some initial definitions, you basically won’t have to carry out administration tasks: no taking care of servers, no scaling (or downscaling) your infrastructure with demand, no worrying about disks filling up, taking backups, or applying updates, among many other mundane but necessary chores. All this comes as part of the service, so you can dedicate yourself to doing what really adds value to your customers.

Fundamental concepts

There are two fundamental concepts that you should always keep in mind when thinking about the design of your DynamoDB database:

  1. Understanding how the data is going to be queried is essential (i.e., getting the data access patterns). Thus, the first step is to understand the business problems and the use cases that our engine is going to solve.
  2. As few tables as possible should be maintained (in fact, some of the top DynamoDB gurus advocate single-table designs, where all data resides in the same table!). This has a lot to do with a basic principle when modeling databases, and that is to keep related data “close together” [2].

[2]: As we will see most of the time “close together” in DynamoDB means in the same table!

In order to understand how to put these concepts into practice, we need to review how data is organized in DynamoDB:

  • Table: is simply a group of records that conceptually go together, modeling one or more entities of our domain. Contrary to what we are used to in a relational engine, different entities can be inside the same table, if it makes sense to do this to speed up common searches.
  • Item: is a single record in a table.
  • Attributes: are the individual characteristics within an item. For example, in a users table, an attribute could be the username, email, etc. There are scalar attributes (string, number, binary, boolean, and null), complex attributes (lists and maps), and sets (string sets, number sets, and binary sets).
  • Primary Key: Although DynamoDB is schemaless, there is structure. Every table must declare its primary key, and this is by far the most important design decision in DynamoDB.
  • Secondary Indexes: A table can have secondary indexes if there are access patterns that are not satisfied by the primary key alone. A secondary index, at least in the most typical case, is a copy (partial or total) of the original table, where the secondary index’s key acts as the primary key of this copy.

A very simple table would look like this:

A simple DynamoDB table

A couple of things to note:

  • As I mentioned before, most of the time a secondary index is a copy of the original table. A frequent observation is that, in the cloud, storage is cheap and computing is expensive, and therefore it is more profitable to store redundant information than to compute it every time.
  • And this also brings up the join issue: there are no joins in DynamoDB! [3] According to the engine designers, joins are one of the main reasons relational engines cannot have scale and performance at the same time, which is why they do not exist in DynamoDB. So what do we do instead? Since storage is cheap, why not save the data pre-joined? But what about data normalization? Well, DynamoDB makes us forget about the holy normal forms (with a few exceptions).

[3]: At least from a service point of view, you can always simulate a join at the client level, but it’s definitely not an operation you can ask DynamoDB to do.

Primary Keys

There are two types of primary keys in DynamoDB:

  • Simple primary keys: Consisting of a single element called a partition key (previously known as a hash key). The users table shown above has a simple primary key, corresponding to the username.
  • Composite primary keys: Consisting of two elements, called a partition key and a sort key (formerly known as a range key).

Choosing a partition key means partitioning all the items of a table into groups, and this idea is central to the scalability of DynamoDB. Here again it’s worth thinking of DynamoDB as a big hash table, with our partition key as the input to the hash function used by DynamoDB.

Likewise, composite primary keys (that is, the combination of a partition key plus a sort key) allow us to satisfy a “get many related records” access pattern in a single database query. With the partition key we get the collection of items that share that value, and with conditions on the sort key we can specify the particular range of items we want within that collection. While we’re at it, note that in DynamoDB the set of all items that have the same value in their partition key is called an item collection.

Let’s see a concrete example. To do this, we will do some iterative design: in the first iteration we are going to use a simpler design, which we will extend later since we are going to change the type of query that we want to answer.

To do this, suppose that we are working on a SaaS application: clients of this service are organizations, and each organization has registered users. Suppose we want to query for the users that belong to an organization; for this we could have a table like the following:

A first design for our second DynamoDB table

In this case, to obtain all the users that belong to an organization, we only have to obtain the item collection associated with that specific organization. For example, if we use the DynamoDB Query API, indicating that the partition key is “MolinosSanAlfonso”, the call returns all the corresponding items (in this case, our users Fabiola Salazar and Pedro Salas).
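The Query call above can be sketched as the request parameters you would send to the low-level DynamoDB API. The table name (“Users”) and attribute name (“OrgName”) are assumptions for this example, not something fixed by DynamoDB:

```python
# Sketch of the low-level DynamoDB Query request for "all users of an
# organization". Table and attribute names are assumed for illustration.
def build_users_query(org_name: str) -> dict:
    return {
        "TableName": "Users",                       # assumed table name
        "KeyConditionExpression": "OrgName = :org", # equality on the partition key
        "ExpressionAttributeValues": {
            ":org": {"S": org_name},                # "S" marks a string value
        },
    }

request = build_users_query("MolinosSanAlfonso")
```

Note that the only condition is equality on the partition key; that is exactly what makes the call return the whole item collection.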

Note: one thing we haven’t mentioned yet is that, in practice, all the conditions used to filter and search must be expressed over primary keys or secondary indexes. Although we can filter by normal attributes, those filters act on the service side after the information is read from storage and before it is sent to the client, so they are not mechanisms we really want to rely on (at least not if we want to be efficient in time and money!). Therefore, everything that has to do with efficient search is done at the level of primary keys and secondary indexes.
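A filter on a normal attribute shows up in the Query request as a FilterExpression, separate from the key condition. A minimal sketch (table and attribute names assumed for illustration):

```python
# Sketch: Query with a FilterExpression on a non-key attribute.
# IMPORTANT: the filter is applied AFTER items are read from storage,
# so you still pay (in read capacity and latency) for every item in the
# collection, not just the ones that pass the filter.
def query_with_filter(org: str, plan: str) -> dict:
    return {
        "TableName": "Users",                        # assumed table name
        "KeyConditionExpression": "PK = :pk",        # efficient: uses the key
        "FilterExpression": "SubscriptionPlan = :p", # not efficient: post-read filter
        "ExpressionAttributeValues": {
            ":pk": {"S": org},
            ":p": {"S": plan},
        },
    }
```

This is why the article insists that every access pattern you care about should be resolvable through keys or indexes, with filters reserved for trimming small result sets.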

This is our first design, however, let’s extend the example a bit (and change the design accordingly), to see how we can work without joins.

So, let’s suppose we now know that the most frequent query of our application asks for the organization’s information, which must include a list of all its users. One way to do this is to add a table of organizations alongside the one we already have for users. However, remembering that we do not have joins, this would force us to obtain the organization’s information and its users separately, that is, two independent queries linked through logic in our application. To improve this, we can introduce another strategy, common in DynamoDB, called key and index overloading. If we place several different entities in the same table (and thus make a single query to the database), the common attributes (that is, those that make up our keys and indexes) will have different meanings depending on the entity (that’s why we talk about overloading).

Let’s see what this means in practice, using the following design (where PK and SK are generic names for partition key and sort key, a widely used convention by the way):

Second design for our second DynamoDB table

Wow! We have some changes, so let’s review them in a little more detail:

  • First of all, we see that the table has several entities, since it contains not only the users, but we also store information about the organizations themselves. Each type of entity has its own attributes, and the common attributes (PK and SK in our case) have different meanings for each type of entity.
  • What is marked in blue is the item collection corresponding to the value “ORG#MOLINOSSANALFONSO”, or, in other words, all the items that have the same value for the partition key. However, in this case we see that we have now managed to associate different entities: on the one hand we have the organization’s own information (what it is called and what plan it has contracted), and on the other hand all the users that belong to that organization.

In this way, we can finally see what we mean by “pre-joining” the associated values. In a relational engine, this would have been modeled as two separate tables, with an SQL query using a join operation to merge them based on the common attribute (the organization name or identifier in this case).

It is fair to ask: what do we gain? We had to redesign the table, and we ended up with one that looks “weird”; isn’t it much easier to have a relational engine take care of everything? Well, as always, it depends on what we are looking for. By doing it this way we gain scalability and consistently good performance, regardless of the size of our database. Also, I promise that as you produce more designs like this one, the tables will stop looking “weird”.

This design allows us to solve several problems, for example:

  • If we want to obtain a particular organization, it is enough to obtain an item via its complete primary key (for example, PK = SK = “ORG#MOLINOSSANALFONSO”). To do this we use the DynamoDB API called GetItem with the appropriate parameters (we use GetItem when we want a specific item and Query when we want a set of related items through their partition key).
  • For the query that we said was the most used (that is, the complete information of the organization, including all its users), it is enough to use the Query DynamoDB API with the PK = “ORG#MOLINOSSANALFONSO” condition.
  • If we wanted a specific user, for example, “Pedro Salas” from “Molinos San Alfonso”, we would use GetItem with the condition PK = “ORG#MOLINOSSANALFONSO” and SK = “USER#PEDRO_SALAS”.
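The three access patterns above can be sketched as the request parameters for the low-level DynamoDB API. The table name (“AppTable”) is an assumption; the PK/SK values follow the design in the figure:

```python
# Sketches of the three requests against the overloaded single table.
TABLE = "AppTable"  # assumed table name

def get_org(org_id: str) -> dict:
    # GetItem: one specific item, addressed by its full primary key
    # (for an organization item, PK and SK carry the same value)
    return {"TableName": TABLE,
            "Key": {"PK": {"S": f"ORG#{org_id}"},
                    "SK": {"S": f"ORG#{org_id}"}}}

def query_org_with_users(org_id: str) -> dict:
    # Query: the whole item collection, i.e. the organization plus all its users
    return {"TableName": TABLE,
            "KeyConditionExpression": "PK = :pk",
            "ExpressionAttributeValues": {":pk": {"S": f"ORG#{org_id}"}}}

def get_user(org_id: str, username: str) -> dict:
    # GetItem again, this time addressing a user item through its sort key
    return {"TableName": TABLE,
            "Key": {"PK": {"S": f"ORG#{org_id}"},
                    "SK": {"S": f"USER#{username}"}}}
```

Notice how all three requests reuse the same two generic attributes, which is the payoff of key overloading.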

Secondary indexes

If we remember, we use secondary indexes for access patterns that we cannot satisfy through the primary key alone; and effectively, a secondary index is, depending on the type of index, a copy of the original table containing only the data necessary to make the query (or queries).

There are two types of secondary indexes in DynamoDB:

  • Local secondary indexes: The same partition key, but a different sort key. In this case there is no copy of the table, but there are certain DynamoDB limits to be aware of (no more detail here, because explaining why those limits exist requires implementation concepts we haven’t covered).
  • Global secondary indexes: Entirely different attributes are used for the partition key and sort key, resulting in a (partial or full) copy of the table. This is another example of the underlying “storage is cheap” philosophy, and they are used far more often than local indexes because of the added flexibility.

The ideas and strategies present for secondary indexes are the same ideas as when we modeled a primary key, but we need a new construct to resolve additional access patterns (over those that are resolved via the use of the primary key).

A recurring strategy is to use secondary indexes to model N-N relationships, viewing the N-N relationship as two 1-N relationships, where we use the primary key for one of the 1-N relationships, and use a global secondary index to model the other 1-N relationship.

An example of an N-N relationship is the one between movies and actors, where one actor can be in many movies, and one movie can feature many actors. If, for example, the queries we must answer are “give me all the actors that participated in a movie” and “give me all the movies an actor participated in”, then we could have a primary key composed of “<movie_id> + <actor_id>” (which returns all the actors that participated in a movie) and a global secondary index “<actor_id> + <movie_id>” (which returns all the movies in which an actor participates).
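A toy model of this pattern, with the base table and the global secondary index represented as two in-memory lookups (class and attribute names are my own; in DynamoDB the service maintains the second copy for you):

```python
# Toy model of an N-N relationship resolved as two 1-N lookups:
# the base table keyed by movie, and a GSI keyed by actor.
from collections import defaultdict

class MoviesActors:
    def __init__(self):
        self.by_movie = defaultdict(set)  # primary key: movie_id -> actor_ids
        self.by_actor = defaultdict(set)  # GSI: actor_id -> movie_ids
                                          # (DynamoDB keeps this copy in sync for us)

    def add(self, movie_id: str, actor_id: str):
        # Writing one relationship item populates both "indexes"
        self.by_movie[movie_id].add(actor_id)
        self.by_actor[actor_id].add(movie_id)

    def actors_in(self, movie_id: str):
        # "Give me all the actors that participated in a movie"
        return sorted(self.by_movie[movie_id])

    def movies_of(self, actor_id: str):
        # "Give me all the movies an actor participated in"
        return sorted(self.by_actor[actor_id])
```

Each direction of the relationship costs one Query against one index, with no join anywhere.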

DynamoDB API

We have already seen some examples of the APIs with which we can access a DynamoDB database, now we are going to list all the available operations:

  • Actions related to a specific item: DynamoDB was born as a key-value data store, and this can be seen in the API. These actions make up the typical CRUD operations on a single item, and they require the complete primary key as a parameter:
    - GetItem: get a particular item.
    - PutItem: Create or overwrite a particular item.
    - UpdateItem: Create or update a particular item.
    - DeleteItem: Delete a particular item.
  • Actions on more than one item:
    - Query: obtains an item collection, that is, a set of items that have the same value of the partition key (optionally filtering via the use of a condition on the sort key).
    - Scan: This action basically goes through the entire table, bringing all those items that meet a certain expression that involves attributes not included in keys or indexes. It should be used only as a last resort or on certain special occasions (for example, on small tables where it doesn’t make sense to create a special index for a very infrequent query).
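To round out the single-item side of the list, here are sketches of the write operations as low-level request parameters. Table and attribute names are assumptions for illustration:

```python
# Sketches of PutItem and UpdateItem requests (names assumed).
TABLE = "AppTable"

def put_user(org: str, username: str, email: str) -> dict:
    # PutItem creates the item, or silently OVERWRITES it if an item
    # with the same primary key already exists
    return {
        "TableName": TABLE,
        "Item": {
            "PK": {"S": f"ORG#{org}"},
            "SK": {"S": f"USER#{username}"},
            "Email": {"S": email},
        },
    }

def update_user_email(org: str, username: str, email: str) -> dict:
    # UpdateItem modifies individual attributes in place
    # (creating the item if it does not exist)
    return {
        "TableName": TABLE,
        "Key": {"PK": {"S": f"ORG#{org}"},
                "SK": {"S": f"USER#{username}"}},
        "UpdateExpression": "SET Email = :e",
        "ExpressionAttributeValues": {":e": {"S": email}},
    }
```

The overwrite-vs-update distinction between PutItem and UpdateItem is worth remembering, since both can create an item that doesn’t exist yet.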

Like all AWS APIs, HTTPS is used as the transport protocol, which is another difference from relational engines and their specialized communication protocols. Of course, there are libraries for the most common languages, so it is not necessary to use the API directly; depending on the language, different constructions and abstractions are offered to ease the developer’s work.

Platform

Until now we have not talked about the technical aspects of the service, and how its implementation allows for scalability and performance.

Like other NoSQL engines (or even services like Apache Kafka, which is not a data engine but does offer a persistence layer), the magic words are scale-out. Horizontal growth is tied to the concept of partitioning that we have seen as central to DynamoDB. Having to choose and use a partition key for almost all operations is what allows DynamoDB to allocate dedicated infrastructure (not just storage, but compute and communications) to distinct subsets of the data.

Again it helps to think of a DynamoDB table as a big hash table: armed with our partition key, we can instantly locate where the corresponding items live, regardless of the size of the database.
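A minimal sketch of this routing idea, assuming a fixed number of partitions (the real service manages partition counts and placement itself, and its internal hashing is not public):

```python
# Illustrative partition routing: hash the partition key, map it to a
# partition. The cost of finding the partition is constant, no matter
# how much data the table holds.
import hashlib

NUM_PARTITIONS = 4  # assumption for the sketch; DynamoDB manages this

def partition_for(partition_key: str) -> int:
    # Deterministic: the same key always routes to the same partition
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

Because the routing step never touches the data itself, adding partitions (and the servers behind them) grows capacity without slowing down individual lookups.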

So now we know that behind DynamoDB there are clusters of servers, each with compute, storage, and communication resources, dedicated to serving specific partitions of our data. If the data grows, DynamoDB adds more nodes to the cluster, allowing linear scaling (remember that accessing a specific partition does not depend on the size of the database, thanks to our partition key).

The final advantage of DynamoDB is that all of this is transparent to the user, and everything is taken care of by the service.

When to use DynamoDB

A very important question is when to use DynamoDB. It is not my intention to say that DynamoDB is a good choice for every problem, although it can be used for most problems with careful design. Rather, I will try to provide some criteria that allow us to decide if it is a good match for our needs:

  1. It is a transactional application. DynamoDB is not well suited for analytics projects.
  2. All data access patterns can be determined from the outset. If there’s a lot of uncertainty in how we’re going to query the data it’s probably not a good use case for DynamoDB.
  3. There is a preference to use serverless services, or at least, it is an acceptable option.
  4. An application that needs to support strong growth, must maintain consistent response times, and has simple data access patterns is ideal for a first project with DynamoDB. Of course, with more experience, almost any problem can be tackled with DynamoDB.

To finish

Wow, we reached the end! While we’ve barely scratched the surface of DynamoDB, I hope I’ve achieved my original intent, and shared with you those fundamentals necessary to successfully understand and use DynamoDB.

As a summary, I will leave the most important ideas that I wanted to convey to you:

  • DynamoDB is a non-relational database service, which scales very well and presents consistent performance, if our design respects certain rules.
  • What you must do to achieve a proper design in DynamoDB is:
    a. Understand all the data access patterns that our application needs.
    b. Decide which tables we will use, and which primary keys (and secondary indexes if necessary) allow us to answer those access patterns.
  • Corollary: don’t expect that if you design with a relational engine in mind, you’re going to get something that works well in DynamoDB!

There are many topics that we have not reviewed: data consistency, backups, monitoring, optimization, streaming, migrations, more advanced design patterns, among many others. However, I hope I have told you enough to pique your curiosity and want to continue investigating. For that, here are some resources I suggest:
