
Serverless 2 — DynamoDB

14 min read · Jan 4, 2024

In the previous blog, I wrote about Lambda, a function service with which you can build almost anything, including backend APIs and jobs. Your code is deployed and production-ready in the cloud in a highly available, scalable, and secure way. You just mind your code and AWS handles the rest.

The next thing you need is permanent data storage, i.e. a database. You can use a relational database, but they are expensive and not as scalable as Lambda. The best fit for Lambda is DynamoDB, which handles trillions of requests a day and is one of the most performant, hyper-scaling databases in the world. As it is serverless, a small application will likely stay within the always-free tier and cost you nothing. If your app gets bigger, you can find investors.

NoSQL

NoSQL means "Not Only SQL", i.e. a non-relational database. It is a broad concept, and there are multiple types of NoSQL databases:

  • Key-value: Key-value databases are highly partitionable and allow horizontal scaling at levels that other types of databases cannot reach. They are the best fit for transactional applications: you can store, update, delete, and get data by key very fast. But they are slow, and not an option, if you want to run advanced queries such as SQL queries with a complex WHERE clause, because a key-value database has to scan every single record to filter them, and scan performance keeps degrading as new records are added every day.
  • Document: Document databases store JSON documents; MongoDB is the best-known example.
  • Graph: A graph database’s purpose is to make it easy to build and run applications that work with highly connected datasets. Typical use cases include social networking, recommendation engines, fraud detection, and knowledge graphs. I once developed a social networking app for fun using DynamoDB, hit its limits, and realized that a general-purpose database is sometimes not enough or not efficient for cases such as finding mutual friends and generating news feeds.
  • In-memory: In-memory databases store data in memory rather than on disk, which makes them more performant. The Amazon ElastiCache service offers Memcached and Redis, two popular caching technologies; at the same time, they are also databases that can store data permanently.
  • Search: You have probably heard of Elasticsearch or OpenSearch. That is what this is: a database purpose-built for near-real-time visualizations and analytics of machine-generated data, indexing, aggregating, and searching semi-structured logs and metrics.

DynamoDB is both a key-value and a document database. It provides consistent single-digit millisecond latency at any scale of workload.

We developers love DynamoDB because of its flexibility. You just define keys, and the rest of the columns are dynamic; you can add a new column at any time, whereas in a relational database you have to write a DDL (Data Definition Language) script to add one. But too much flexibility can be bad: some records may end up with dozens of attributes while others have only two. To address that, validate the request body before storing it. TypeScript helps here, but note that interfaces are erased at compile time, so you still need runtime code that keeps only the attributes defined in your interface and drops the rest.
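
As a sketch of that runtime validation (the Post shape and field names below are made up for illustration):

```typescript
// Hypothetical example: TypeScript interfaces exist only at compile time,
// so extra attributes in a request body survive to runtime unless we
// strip them explicitly before writing the item to DynamoDB.
interface Post {
  postId: string;
  title: string;
}

const ALLOWED_KEYS: (keyof Post)[] = ["postId", "title"];

// Keep only whitelisted attributes from an untrusted request body.
function toPost(body: Record<string, unknown>): Post {
  const item = {} as Post;
  for (const key of ALLOWED_KEYS) {
    if (body[key] !== undefined) {
      (item as any)[key] = body[key];
    }
  }
  return item;
}

const clean = toPost({ postId: "p1", title: "Hello", hack: "dropped" });
// "hack" is not in the interface, so it never reaches the table
```

This way an item can never grow attributes your interface doesn't know about, no matter what clients send.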

Designing a NoSQL database

This is the most critical part of building a software application. With an incorrect design, your app can get expensive and slow. Some settings can only be set at creation time, and it is hard to change the database structure later on: migrating production data in real time without business interruption or data loss is difficult.

Relational databases were invented more than half a century ago, and things were different then. Storage was expensive, and scaling was not a priority because no apps had millions or billions of users. Normalization exists to prevent data redundancy and inconsistency.

In 2023, storage is cheap and scaling matters more. The biggest issue with relational databases is performance: joins make your app slow. How can we address this? Have no joins; store all the data you see on one page in one table. If I asked you to design a Facebook-like app, you might normalize and create tables like posts, comments, reactions, and so on. I assure you that design won’t work at scale. Instead, store the entire thing in one table (no joins) and call just one API endpoint to show the page. You get the idea.

That doesn’t mean you always store everything in one table, though. There is a limit on how much data one record (item) can hold: 400 KB in DynamoDB, versus 16 MB per document in MongoDB.

When you update an item, DynamoDB charges based on the item size. Say you have a 300 KB item and you flip a flag from false to true. The value you actually changed is a single boolean, just a few bytes, but you pay write capacity for the full 300 KB item. Keep that in mind.
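
A quick sketch of that billing math, assuming the standard rule that an update is billed on the full item size (the larger of the item's size before and after the update), at 1 KB per write capacity unit:

```typescript
// One WCU covers a write of up to 1 KB, rounded up to whole units.
// An update is billed on the item's full size, not on the bytes changed.
function wcusForWrite(itemSizeKB: number): number {
  return Math.ceil(itemSizeKB); // 1 WCU per 1 KB of item size
}

// Flipping the same boolean flag on two differently sized items:
const flagFlipOnSmallItem = wcusForWrite(1);   // 1 WCU
const flagFlipOnLargeItem = wcusForWrite(300); // 300 WCUs for the same change
```

The 300-fold difference is why large items with frequently updated attributes are often split into separate items.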

There is no silver bullet in design, and no single right or wrong; it depends on the app. In my opinion, a design is great if it is simple and works at scale.

Here are some principles I follow when designing in DynamoDB. You can apply the same principles to similar databases such as Cassandra.

  • Storage is cheap nowadays and it is okay to have duplicate data in your database.
  • In databases like DynamoDB and Cassandra, you design it based on the data access pattern. It is the most important point in this article. Let me repeat, you design it based on the data access pattern.

Let me give you an example from my real-life projects. Let’s design a couple of apps. The first is a Medium-like app: users can post, and readers should be able to find posts by tag. The access pattern here is ‘by tag’. If the tag is an ordinary column, retrieving posts by tag takes O(n) time; with billions of posts it would take forever, and in DynamoDB you also pay for every single record you scan. Instead, make the tag the partition key. You would never do that in the SQL world, but we do exactly that in the NoSQL world. It is how I did it in a real-life production app, and it works efficiently. Here is my schema.


We still need postId to make a record identifiable. When users search posts by tag, the app shows a list of posts; on the Medium home page, that list could come from one table like PostAndTag. As the tag is the partition key, the lookup takes O(1) time regardless of the number of records, so your Medium-like app will respond in milliseconds even with millions of records. When the user clicks a post to read it, we get the post details by postId. Here is some sample data.
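
The original screenshot of the sample data is not reproduced here, so here is a hypothetical sketch of what those PostAndTag items could look like (all names and values are illustrative):

```typescript
// Illustrative PostAndTag items: tag is the partition key, postId the
// sort key, and the same post is duplicated once per tag it carries.
const postAndTag = [
  { tag: "aws",        postId: "p1", title: "Intro to DynamoDB" },
  { tag: "serverless", postId: "p1", title: "Intro to DynamoDB" },
  { tag: "aws",        postId: "p2", title: "Intro to Lambda" },
];

// Fetching all posts for a tag is a single-partition lookup,
// simulated here with a filter:
const awsPosts = postAndTag.filter((item) => item.tag === "aws");
```

Both "aws" items live in one partition, so the real query touches only that partition rather than scanning the table.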

As you can see, data is duplicated. That is OK; there is always a tradeoff in design. We sacrificed some storage to achieve constant, fast read performance. With this design you probably want to limit the number of tags per post, which is exactly what Medium does: you can have up to 5 tags.

To make it even clearer, here is a second example. Let’s design a chat app; I am sure you use Facebook Messenger. When you open the app, it shows a list of your chats, meaning it pulls chats by your user id: the access pattern is ‘by user id’. Once you click a specific chat with a friend, it shows the messages in that chat: there, the access pattern is ‘by chat id’.


To keep it simple, I will make two assumptions: first, the app supports only one-on-one chat; second, a service that generates chatId is already in place. Some sample data:

When you create a record in the chat table, you duplicate it with the user IDs swapped. The two records share the same chatId.
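
A minimal sketch of that duplication, with a hypothetical item shape:

```typescript
// Sketch: writing one chat creates two items that share the same chatId,
// with ownerId/peerId swapped, so each user can query their chat list
// using their own id as the partition key.
interface ChatItem {
  ownerId: string; // partition key: whose chat list this row belongs to
  peerId: string;  // the other participant
  chatId: string;  // shared by both rows
}

function chatRecords(userA: string, userB: string, chatId: string): ChatItem[] {
  return [
    { ownerId: userA, peerId: userB, chatId },
    { ownerId: userB, peerId: userA, chatId }, // same chatId, ids swapped
  ];
}

const records = chatRecords("alice", "bob", "c-42");
```

Alice's chat list query and Bob's chat list query each hit their own partition, yet both rows point at the same message history via chatId.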


It may not be clear at this moment. Take your time and look at the sample records and schema; they explain themselves.

DynamoDB concepts

DynamoDB is a distributed database: the records you create in a table are stored redundantly across many servers for high availability. In a relational database, you have a single write server that you can only scale up, whereas DynamoDB scales horizontally. AWS decides which server stores a record based on its partition key: DynamoDB uses the partition key’s value as input to an internal hash function, and the output determines the partition (physical storage internal to DynamoDB) in which the item is stored. It works just like the Map data structure in programming languages.
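
A toy sketch of that Map-like routing, using a made-up hash function (DynamoDB's actual internal hash is not public):

```typescript
// Toy illustration of partition routing: hash the partition key's value,
// then map the hash onto one of N physical partitions.
function pickPartition(partitionKey: string, partitionCount: number): number {
  let hash = 0;
  for (const ch of partitionKey) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash % partitionCount;
}

// The same key value always lands on the same partition:
const first = pickPartition("Dog", 8);
const second = pickPartition("Dog", 8);
```

Because routing is deterministic, a read by partition key goes straight to one partition instead of asking every server, which is what makes key lookups O(1).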

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.Partitions.html

In the example above, the partition key is AnimalType, and there could be many dogs. To identify a single record (item), you need another key, called the sort key. As the picture shows, records in the same partition are sorted by the sort key; in this example, Name is the sort key. The sort key is optional and is also used to represent one-to-many relationships. Under the hood, DynamoDB can use the sort key for parallel processing to achieve better performance.

One of the mistakes I made was creating a News table with a unique newsId as PK and insertDate as SK, expecting to get all items in sorted order. That is all wrong: because newsId is unique, there is always exactly one record per newsId, so sorting does not apply. Remember, records are sorted by the sort key only among the items behind the same partition key; see the examples above. If you need such global sorting, consider exporting your data through DynamoDB Streams and using other services for sorting and analysis.

The primary key is the column or columns that identify a single record. If there is no sort key, the partition key alone is the primary key. If there is a sort key, the partition key and sort key together form a composite primary key.

Read/write capacity units and Pricing

Read and write capacity units dictate how much traffic your table can handle and how much it costs. One RCU (Read Capacity Unit) covers one strongly consistent read per second of up to 4 KB, or two eventually consistent reads (8 KB in total). One WCU (Write Capacity Unit) covers one write per second of up to 1 KB.
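
As a sketch of that capacity math (the 4 KB / 8 KB / 1 KB figures from above, with item sizes rounded up per 4 KB chunk):

```typescript
// One RCU reads up to 4 KB with strong consistency per second;
// an eventually consistent read of the same size costs half an RCU.
function rcus(itemSizeKB: number, stronglyConsistent: boolean): number {
  const units = Math.ceil(itemSizeKB / 4); // round up to whole 4 KB chunks
  return stronglyConsistent ? units : units / 2;
}

const strongRead = rcus(6, true);    // 6 KB rounds up to two 4 KB chunks
const eventualRead = rcus(6, false); // eventually consistent is half price
```

So reading a 6 KB item strongly consumes 2 RCUs, while the same read with eventual consistency consumes only 1.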

In on-demand mode, 1 million write request units cost $1.25 and 1 million read request units cost only $0.25. 25 GB of storage is always free. Backups, data restoration, global tables, exports to S3, and DynamoDB Accelerator (DAX) are billed separately.

DynamoDB supports eventually consistent and strongly consistent reads:

  • Eventually Consistent Reads — When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation and may include some stale data. If you repeat the read after a short time, the response returns the latest data. This happens because DynamoDB replicates your data across multiple servers and data centers on your behalf for high availability and redundancy; replication typically completes within milliseconds. Eventually consistent reads are the default and recommended option.
  • Strongly Consistent Reads — DynamoDB supports strongly consistent reads if your business cannot tolerate even a few milliseconds of stale data. When you request a strongly consistent read, DynamoDB always returns the most up-to-date data, reflecting the updates from all prior successful write operations. However, this consistency comes with disadvantages: higher latency and more consumed throughput capacity, which means it is more expensive.

Amazon DynamoDB has two read/write capacity modes for processing reads and writes on your tables:

  • On-Demand Mode — a flexible billing option capable of serving thousands of requests per second without capacity planning. On-demand mode is a good option for unknown workloads and unpredictable traffic.
  • Provisioned Mode — you specify the number of reads and writes per second your application requires, and you can use auto scaling to adjust the table’s provisioned capacity automatically in response to traffic changes. With provisioned capacity, you have more control over your table and its cost, whereas on-demand scales independently and could produce a surprising bill.

Index

Just like any other database, DynamoDB has indexes. An index speeds up searching by a specific column. Under the hood, DynamoDB creates another table with a copy of your data; in that hidden table, the indexed column is the partition key and the actual primary key is carried along. One index can roughly double your storage and write cost, and two indexes can roughly triple it, so it is not advisable to have many indexes on a single table.

Actual Table — https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
Hidden Index Table

In this example, the table's primary key is userId, and the column you want to search quickly by is gameTitle.

Highlighted features

DynamoDB focuses on doing one thing very well: transactions. It is an online transaction processing (OLTP) database and does not have many features. However, there are two powerful features I want to highlight that you must know.

DynamoDB streams

DynamoDB Streams capture data modification events in DynamoDB tables: for example, an item being added, updated, or deleted. Based on those data changes, you can build apps like a real-time dashboard. Stream records are retained for 24 hours, so a consumer such as Lambda must process them within that window.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-streams.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.Tutorial.html

An example payload when there is a new item in the table:

{
  "Records": [
    {
      "eventID": "7de3041dd709b024af6f29e4fa13d34c",
      "eventName": "INSERT",
      "eventVersion": "1.1",
      "eventSource": "aws:dynamodb",
      "awsRegion": "region",
      "dynamodb": {
        "ApproximateCreationDateTime": 1479499740,
        "Keys": {
          "Timestamp": { "S": "2016-11-18:12:09:36" },
          "Username": { "S": "John Doe" }
        },
        "NewImage": {
          "Timestamp": { "S": "2016-11-18:12:09:36" },
          "Message": { "S": "This is a bark from the Woofer social network" },
          "Username": { "S": "John Doe" }
        },
        "SequenceNumber": "13021600000000001596893679",
        "SizeBytes": 112,
        "StreamViewType": "NEW_IMAGE"
      },
      "eventSourceARN": "arn:aws:dynamodb:region:account ID:table/BarkTable/stream/2016-11-16T20:42:48.104"
    }
  ]
}

This concept is similar to BinLogs in relational databases and Change Streams in MongoDB.

Global table

I needed to develop a highly redundant API at Sterling, and the architecture team suggested we create a global table. I was amazed when I saw with my own eyes how it works: a global table replicates data in near real time across multiple regions.

You can enable this option when you create a table by selecting additional regions; in my case, two more, so data was replicated across all three regions automatically. For example, when I store an item in Canada, the same item appears in the same table in Australia in near real time. Under the hood, AWS uses DynamoDB Streams to replicate the data across regions.

Global tables build on the global Amazon DynamoDB footprint to provide you with a fully managed, multi-region, and multi-active database that delivers fast, local, read and write performance for massively scaled, global applications. Global tables replicate your DynamoDB tables automatically across your choice of AWS Regions.

https://aws.amazon.com/dynamodb/global-tables/

Other features

DynamoDB TTL

When you delete an item in DynamoDB, it costs money, because the delete consumes a write capacity unit under the hood; in serverless architecture, every API call incurs a charge. TTL (Time To Live) is a way to delete an item at no cost. I took advantage of TTL in a real-life project when sending an OTP (One-Time Passcode) that expires in 30 minutes. (Do not confuse TTL with the 24-hour retention period of DynamoDB Streams records.)

DynamoDB TTL is a cost-effective method for deleting items that are no longer relevant. TTL allows you to define a per-item expiration timestamp that indicates when an item is no longer needed. DynamoDB automatically deletes expired items within a few days of their expiration time, without consuming write throughput.
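
A minimal sketch of that pattern: the TTL attribute is just a number attribute holding an absolute expiration time in epoch seconds (the attribute name is whatever you configure on the table), so a 30-minute OTP stores "now + 30 minutes":

```typescript
// Compute the epoch-seconds value to store in the item's TTL attribute.
function otpExpiresAt(nowEpochSeconds: number, minutes: number): number {
  return nowEpochSeconds + minutes * 60;
}

const now = Math.floor(Date.now() / 1000); // current time in epoch seconds
const ttl = otpExpiresAt(now, 30);         // eligible for deletion in 30 minutes
```

DynamoDB deletes the item some time after this timestamp passes, so code reading OTPs should still check the expiry itself rather than rely on the item being gone.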

SQL on DynamoDB

You can use PartiQL (a SQL-compatible query language) to query, insert, update, and delete DynamoDB table data. It can improve developer productivity by letting developers use a familiar, structured query language for these operations.

SELECT * FROM Music
WHERE Artist='No One You Know' AND contains("SongTitle", 'Today')
AND Price < 1.00;

Notice that this is not full SQL: you cannot run arbitrary SQL queries on DynamoDB. Rather, it is an abstraction on top of DynamoDB operations that looks like SQL.

DAX

Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available caching service built for Amazon DynamoDB. DAX delivers up to a 10 times performance improvement — from milliseconds to microseconds — even at millions of requests per second.

It is essentially a caching layer in front of a DynamoDB table, and its pricing is similar to EC2: you pay for every hour each node runs.

https://aws.amazon.com/dynamodb/dax/

ACID transactions

Amazon DynamoDB transactions simplify the developer experience of making coordinated, all-or-nothing changes to multiple items both within and across tables.

Transactions provide atomicity, consistency, isolation, and durability (ACID) in DynamoDB, helping you maintain data correctness in your applications. This is very handy in real-life applications. There are two operations:

  • TransactWriteItems — group multiple Put, Update, Delete, and ConditionCheck actions
  • TransactGetItems — multiple Get actions

DynamoDB Backups

There are 2 types of backups in DynamoDB:

  • On-demand backups — You can use the DynamoDB on-demand backup capability to create full backups of your tables for long-term retention and archival for regulatory compliance needs.
  • Continuous backups with point-in-time recovery — DynamoDB maintains incremental backups of your table, and you can restore it to any point in time during the last 35 days.

Analysis

DynamoDB is designed for transactional applications. If you need to do analytics like sorting and searching by multiple columns, consider exporting your data through DynamoDB streams and using other services to do analysis such as OpenSearch. When you design apps, categorize apps into 2 groups, transactional and analytical. Use DynamoDB for transactional apps. Use other services for analytics.


My know-how

  • DynamoDB scans and queries return records 1 MB at a time, together with a LastEvaluatedKey that you pass back as ExclusiveStartKey to fetch the next page. Pagination is supported out of the box.
  • In a scan, the “limit” parameter dictates how many records to scan, not how many records you receive in the result, unlike LIMIT in SQL.
  • There is no date type in DynamoDB; store dates as Unix timestamps. Records then sort by date correctly if you use that timestamp as the sort key.
  • DynamoDB doesn’t have many features; it focuses on hyperscaling transactions. If you want something extra, such as analytics, sorting, or searching, DynamoDB Streams can export your data to analytics services such as S3 and OpenSearch.
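
The 1 MB pagination behavior can be sketched as a loop: fetchPage below is a stand-in for one SDK Scan/Query call, simulated here with an in-memory array and a numeric cursor (the real LastEvaluatedKey is an attribute map, not a number):

```typescript
// One simulated Scan/Query response: a page of items plus an optional
// cursor indicating where the next page starts.
interface Page<T> {
  items: T[];
  lastEvaluatedKey?: number;
}

// Keep calling fetchPage, feeding each page's cursor back in,
// until no lastEvaluatedKey is returned.
function scanAll<T>(fetchPage: (startKey?: number) => Page<T>): T[] {
  const all: T[] = [];
  let startKey: number | undefined = undefined;
  do {
    const page = fetchPage(startKey);
    all.push(...page.items);
    startKey = page.lastEvaluatedKey;
  } while (startKey !== undefined);
  return all;
}

// Fake table of 5 items served 2 at a time:
const data = [1, 2, 3, 4, 5];
const result = scanAll((start = 0) => ({
  items: data.slice(start, start + 2),
  lastEvaluatedKey: start + 2 < data.length ? start + 2 : undefined,
}));
```

With the real SDK, the same loop shape applies: pass the previous response's LastEvaluatedKey as the next request's ExclusiveStartKey until it comes back undefined.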
