What I wish I knew before using DynamoDB

Considering DynamoDB? You should know about these search limitations beforehand.

Shaung Cheng
CodeX
8 min read · Jul 8, 2021


Partitioning data — one of the ways to build a distributed datastore.

DynamoDB’s on-demand billing is very, very tempting. You get charged per request, not for 24/7 server uptime. No need to choose an instance size — you get almost “infinite” scalability for free. As a NoSQL database, DynamoDB is a document-style, schemaless object store. Everything seems much more flexible and simpler.

But I was surprised when I found I couldn’t search data the way I was used to. I came from Postgres, accustomed to linking columns between tables and storing structured, compact data. When I tried to search in DynamoDB, there were so many quirks and limits — why can’t I do something so trivial and common?

Without further ado, let’s look at what to watch out for when you query in DynamoDB — things I wish I’d known beforehand. I’m sure they’ll blow your mind (mine definitely did).

Annoying limitations on query

Spoiler alert: these are more inconveniences than hard limits.

Limit #1: You can’t change a field if you need to search by it

(To be more accurate, by “search by” I mean querying by the field.)

In SQL, you query with WHERE field1=value1 AND field2=value2 AND field3=value3 … with as many fields and conditions as you want, as if it’s entirely free. And of course, you go ahead and update field1 or field2 of that row whenever the app needs it; what’s the big deal? Except for the primary key, I feel entitled to full control over any field’s value!

You can’t do that in DynamoDB. If you’re going to query by field1 , you’d better plan for field1 to be read-only — once the object is created, you cannot update this field! What? Ughh!

Feeling constrained and suffocated? There’s one small window for fresh air: you can remove field1 or add it back. That’s the only mutating operation you can do on a queryable field (I use the term field as it seems more common in SQL; DynamoDB uses the term “attribute” more often, and they’re interchangeable in this post). This is called the “Sparse Index” technique. There are some important implications of a Sparse Index: if the field’s value needs updating, you can only query it in a binary, boolean way — query for either the field being there, or not there.

You ask: “Shouldn’t this be a must-have feature?! I need to update a field, AND I need to query by it.” For example, say you have a field called status whose value can be Running , Failed or Completed . This field must be updatable, and we need to query on status for the app’s search function to work! You can — a Sparse Index can serve this need, but extra work is required. It works like this: you create additional fields called isRunning , isFailed and isCompleted , and you query on these fields to search for the items you want. Of course, you’ll need to maintain these extra fields by removing and adding them whenever you update the status value.
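The bookkeeping for that Sparse Index pattern can be sketched in plain Python. The item is just a dict here; with boto3 you’d do the equivalent with an UpdateExpression that SETs the new flag and REMOVEs the old one. The attribute names (isRunning etc.) are my own, carried over from the example above:

```python
# Sketch of maintaining sparse "flag" attributes alongside a mutable
# status field. A GSI keyed on e.g. isRunning would then only contain
# items where that attribute exists -- the Sparse Index technique.

STATUS_FLAGS = {
    "Running": "isRunning",
    "Failed": "isFailed",
    "Completed": "isCompleted",
}

def set_status(item: dict, new_status: str) -> dict:
    """Update `status` and keep exactly one sparse flag attribute present."""
    # Remove the flag for the old status, if any.
    old_flag = STATUS_FLAGS.get(item.get("status"))
    if old_flag:
        item.pop(old_flag, None)
    # Set the flag for the new status. The value is arbitrary;
    # only the attribute's presence matters to a sparse index.
    item["status"] = new_status
    item[STATUS_FLAGS[new_status]] = "1"
    return item

item = {"id": "task-1", "status": "Running", "isRunning": "1"}
set_status(item, "Completed")
# item now carries isCompleted and no isRunning, so it leaves the
# "running" sparse index and joins the "completed" one.
```

The key point: every status transition is now a two-part write (set the new flag, remove the old one), which is exactly the extra maintenance cost mentioned above.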

Limit #2: Only up to two fields are searchable

What’s even worse: one of them can only do an “exact match” search, and you always have to specify it when querying — it can’t be optional! This is the “partition key”, roughly like a primary key in SQL.

You really only have one field with any flexibility, the “sort key” — you can use all the great = , > , < , BETWEEN , begins_with operators, …, and it’s optional. But remember, only one field like this is allowed per query! Seriously?!
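To make the constraint concrete, here’s a toy in-memory model of what a DynamoDB query lets you express — one mandatory exact match on the partition key, plus at most one optional condition on the sort key. This is pure Python with no AWS involved; the field names are made up:

```python
from typing import Callable, Optional

# Toy model of DynamoDB's query contract: the partition key is always an
# exact match, and at most ONE extra condition (on the sort key) is allowed.
def query(items: list, pk_name: str, pk_value: str,
          sk_name: Optional[str] = None,
          sk_cond: Optional[Callable[[str], bool]] = None) -> list:
    out = [it for it in items if it[pk_name] == pk_value]  # mandatory exact match
    if sk_name and sk_cond:
        out = [it for it in out if sk_cond(it[sk_name])]   # optional sort-key condition
    return out

orders = [
    {"userId": "u1", "createdAt": "2021-07-01T10:00:00Z"},
    {"userId": "u1", "createdAt": "2021-07-05T09:30:00Z"},
    {"userId": "u2", "createdAt": "2021-07-02T08:00:00Z"},
]

# "All of u1's orders in early July". ISO-8601 timestamps sort
# lexicographically, which is why a timestamp sort key works so well.
result = query(orders, "userId", "u1",
               "createdAt", lambda ts: "2021-07-01" <= ts <= "2021-07-04")
```

Anything beyond these two conditions has to happen after the read — which is exactly where the alternatives below come in.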

Because of this huge limitation on query, you are forced to combine various alternative approaches — try everything you can, as long as it gets the data your app needs within a reasonable time. Yes, your DynamoDB schema design now requires brainstorming like hell.

Alternatives I know so far (this list may grow as I keep learning from my journey)

  • After querying by those one or two fields, does the result size go down to just a few items? If so, it’s now appropriate to use a scan (filter) on further fields.
  • Do we really need to query by that many fields? The app probably doesn’t need to filter on everything. Get more details on the app’s access patterns, or even negotiate them, as long as the important user needs are served.
  • Create multiple GSIs (Global Secondary Indexes): even though you can only query on up to two fields per GSI, with multiple GSIs you can hopefully cover most of your app’s access patterns (still, querying on more than two fields at the same time is not possible).
  • Concatenating several fields into one can help you query on more fields. See the example in the “Dilemma in Selecting Partition Key” section.
  • … more ways?

Usually, a timestamp field as the sort key is effective enough for trimming down the query result size. Then using scan or other techniques to further filter on other fields is OK.

More annoying facts…

Let’s talk about tricky topics when designing DynamoDB schema.

Dilemma in Selecting Partition Key

Cardinality — a term describing the degree of variation in a collection. High cardinality means lots of distinct values; low cardinality means only a few.

To avoid a hot key — a data partition that gets accessed far more often than the others, clogging performance — people suggest choosing a field of high cardinality as the partition key. However, the huge drawback is that this can make querying hard: a query always requires you to supply the exact partition key (the “exact match” mentioned above). What’s wrong with that? If a field has high cardinality, chances are it’s not easy to know its value when you need to search. The extreme case is a UUID — maximum cardinality, but almost useless for query or search beyond a single-item Get operation.

You ask: well, can I choose a field as the partition key to aid queries — to help narrow down results? Yes you can, but at a cost. One technique is to concatenate multiple fields into one. For example, say an object has fields property1 , property2 , property3 and property4 . You can concatenate several of them — say the first two — into the partition key, like property1Value-property2Value . Now when you query, you get property1 and property2 for free, achieving WHERE property1=property1Value AND property2=property2Value . Wonderful! Right? Hang on: does the concatenated property1Value-property2Value provide enough cardinality? If not, then uh-uh, hot key problem. If yes, the 1st problem is that you now always have to specify both property1 and property2 in a query; they can’t be optional, so you can’t express only WHERE property1=property1Value or only WHERE property2=property2Value . The 2nd problem is that if you’re selecting the partition key for the base table, these fields must have a non-empty value on every item. This can be annoying: you came to NoSQL prepared to go flexible and schemaless, and now you’re told some fields need to be non-empty?!
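A minimal sketch of that concatenation trick in plain Python — the `#` separator, the helper name, and the field values are all my own choices, not anything DynamoDB prescribes:

```python
# Build a composite partition key from several fields so that a single
# exact-match query effectively matches on all of them at once.
SEP = "#"

def composite_key(*values: str) -> str:
    # A real implementation should escape or reject the separator
    # inside values; here we just guard against it.
    for v in values:
        if SEP in v:
            raise ValueError(f"value {v!r} contains separator {SEP!r}")
    return SEP.join(values)

# WHERE tenant = 'acme' AND region = 'us-east-1' becomes one
# exact match on the concatenated key:
pk = composite_key("acme", "us-east-1")   # "acme#us-east-1"

# The price: through this key you can never query by tenant alone or
# region alone -- both parts are always required.
```

The separator guard matters: if a field value can contain the separator, two different field combinations could produce the same key.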

What people often do is create a GSI to tackle the 2nd problem — in a GSI, the partition key can be empty (items missing the attribute simply don’t appear in the index). So the 2nd problem is solved. To tackle the 1st problem… you just have to live with it; that’s the cost of using the partition key to query. So choose the partition key wisely: think about whether to concatenate, and if so, how many fields to concatenate, because once you concatenate, you’ll always need to specify all of them in a query.

I’d say that when selecting the partition key for the base table, forget about aiding queries: just use a UUID, or a field you’re sure has a non-empty value on every item. Then, since a UUID is useless for search, create GSIs where you select two other fields to query on.
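That setup — an opaque UUID partition key on the base table, plus a GSI for the real access pattern — might look like this as boto3 create_table parameters. This is a sketch under my own assumptions: the table name, index name, and attribute names are all made up, and the GSI keys (status + createdAt) just follow the examples earlier in the post:

```python
# Base table keyed by an opaque id (a UUID in practice), with a GSI
# whose keys (status + createdAt) serve the actual search pattern.
table_params = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "status", "AttributeType": "S"},
        {"AttributeName": "createdAt", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "id", "KeyType": "HASH"},  # UUID: useless for search
    ],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "status-createdAt-index",
            "KeySchema": [
                {"AttributeName": "status", "KeyType": "HASH"},     # exact match
                {"AttributeName": "createdAt", "KeyType": "RANGE"}, # range conditions
            ],
            "Projection": {"ProjectionType": "ALL"},
        },
    ],
    "BillingMode": "PAY_PER_REQUEST",  # the on-demand billing mentioned above
}
# boto3.client("dynamodb").create_table(**table_params)
```

Note that items lacking the status attribute simply never appear in the GSI, which is the sparse-index behavior described earlier.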

Or, say you do want to use the partition key to query — which is inevitable in a GSI, since you always need the partition key to query anyway — so you select some low-cardinality field as the partition key, like a status field whose value is either “provisioned”, “approved” or “published”. Then you worry about hot keys. I found this quite confusing: on one hand, query mandates the partition key and can only do exact match, so you want a field without too many variations to make the app’s search flexible and easy to use; on the other hand, that leads straight to the hot key problem. Boom, is DynamoDB even usable? Well, you might just stop worrying about hot keys. I’ve seen articles on the AWS blog saying DynamoDB will eventually replicate data nodes and balance load for you when it detects that a particular partition is getting lots of attention. This applies to GSIs, but not to LSIs (Local Secondary Indexes).

Not-so-true DynamoDB Best Practices

Now let’s talk about pitfalls — things you thought were “best practices” to follow, yet not necessarily so!

Don’t use scan

Scan is notorious — since it goes through every item in the table, it drags down database performance; it’s costly, slow, and not scalable. Query is fast and efficient.

But that doesn’t mean we should rule out scan entirely, even when we’re seriously thinking about scalability and performance. This is especially true when you have complicated access patterns and data shapes.

So using scan is fine — just make sure the set of items to scan is “not too large”. When it’s in the middle — kind of large, but not huge, or you’re not sure about future growth — you should still try your best to consider query (indexing) or other alternatives first.

The scan behavior shows up not only in Table.scan() , but also when filtering a query with Table.query(FilterExpression=...) . A common approach is to first narrow down the result size by querying on two fields (partition key and sort key), then apply a filter. Note that the filter runs after the items are read, so you still consume read capacity for everything the key condition matched — which is why the result size had better be small after your two-field query. So which two fields to select for querying is crucial; that’s probably where you’ll struggle the most, asking more about the app’s access patterns and going back and forth. Hopefully creating GSIs will give you enough flexibility.
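The query-then-filter pattern, again as an in-memory sketch rather than a real boto3 call (with boto3 this would be one Table.query with both a KeyConditionExpression and a FilterExpression; the field names are made up):

```python
# Two-stage retrieval: key conditions narrow the result first, then a
# filter (the scan-like part) drops non-matching items from what was read.

orders = [
    {"userId": "u1", "createdAt": "2021-07-01", "status": "Failed"},
    {"userId": "u1", "createdAt": "2021-07-02", "status": "Completed"},
    {"userId": "u1", "createdAt": "2021-07-03", "status": "Failed"},
    {"userId": "u2", "createdAt": "2021-07-01", "status": "Failed"},
]

# Stage 1: the cheap, indexed part (partition key + sort-key range).
read = [o for o in orders
        if o["userId"] == "u1" and o["createdAt"] >= "2021-07-02"]

# Stage 2: the filter. In DynamoDB you pay read capacity for all of
# `read`, not just for the items that survive the filter.
result = [o for o in read if o["status"] == "Failed"]
```

The design pressure is visible even in this toy: the smaller stage 1 makes `read`, the less wasted capacity stage 2 throws away.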

And that’s it — the “quirks” of using DynamoDB. They don’t stop you from achieving very complex queries, but there’s extra cognitive load involved, which is the main point I wanted to make in this post. Lastly, it definitely takes time to let go of the SQL way of doing things, like foreign keys and joined tables. DynamoDB doesn’t have the convenient, flexible querying SQL provides, but then we should ask ourselves: do we really need all of it? What can we let go? You’ve probably heard that NoSQL forces you to plan your app’s access patterns, and even the scale of each pattern — because you’re going to be dealing with these big constraints.

In the end, I give in — perhaps these inconveniences are the cost of the scalability and performance that DynamoDB offers out of the box. All of the above is my quick takeaway on DynamoDB for now. This brainstorming process happened when I finally had a chance to design a DynamoDB table from start to finish for a consumer product.

I hope this post gives you an idea of what you’ll need to deal with if you come from SQL and decide to go with DynamoDB. More posts will come as I gain more insight along the way.



(shaungc.com) I'm a Full Stack Engineer passionate about UX. My goal is to bridge the gap between dev and design.