Know how you read: Insights about DynamoDB read consistency models — Part 2

Gabriel Barreras
Published in adidoescode · 8 min read · Apr 26, 2024

Introduction

This is the second article of a two-part series about AWS DynamoDB read consistency models. If you haven’t read the first part, I encourage you to do that here: Know how you read: Insights about DynamoDB read consistency models — Part 1

Based on what we learned in the previous article, we will now analyze a simple application from a DynamoDB operations perspective.

Incident Management System

For this example we are going to build a centralized incident management system to monitor multiple IT services. Our application receives continuous heartbeat events for every monitored service and is responsible for keeping the latest status of each service persisted in a DynamoDB table. A heartbeat is usually an event sent periodically to verify that the service is up and running but, in this case, our system will also receive a “Heartbeat KO” event when a service is not working properly.

The functionality of the application is as follows:

Diagram 1: Incident management system logic as a state machine diagram

Disclaimer: This is not a real application, but rather a fictional example that is useful for reasoning about DynamoDB read consistency models.

You may have inferred the rules from the diagram but I will list them below:

  • If a “Heartbeat OK” event is received, the service will be updated as healthy, regardless of the previous status.
  • If the status is “Service healthy” and a “Heartbeat KO” event is received, the status of the service is updated to “Service not responding”.
  • After receiving two “Heartbeat KO” events in a row, the status will change to “Service unavailable”.
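The rules above can be sketched as a small pure function. This is a hypothetical helper: “HEALTHY” matches the data model used in this article, while the other two status names (NOT_RESPONDING, UNAVAILABLE) are assumptions made for the sketch.

```python
def next_status(current: str, event: str) -> str:
    """Apply the three transition rules to compute the new service status."""
    if event == "Heartbeat OK":
        # Rule 1: a positive heartbeat always marks the service healthy.
        return "HEALTHY"
    if event == "Heartbeat KO":
        # Rule 2: the first KO moves a healthy service to "not responding".
        if current == "HEALTHY":
            return "NOT_RESPONDING"
        # Rule 3: a second KO in a row makes the service unavailable.
        return "UNAVAILABLE"
    raise ValueError(f"unknown event: {event!r}")
```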

Our application’s responsibility is not to store every event because that would be more like a logging system. The goal is to maintain the latest health status for each service. Therefore, the data model of our DynamoDB looks like this:

{
"serviceId": "3a558fc5-aead-48b2-a7dc-2e0d237ad46e",
"status": "HEALTHY",
"updatedTimestamp": "2023-12-30 17:35:22:038"
}

The partition key of the table is serviceId, and there may be thousands of services being monitored.

Then the logic of the application can be broken down into the following steps:

  1. Fetch the record from DynamoDB using the partition key.
  2. Update the status attribute if needed.
  3. Persist the change in DynamoDB if needed or discard it otherwise.
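These three steps can be sketched as follows, assuming a boto3-style DynamoDB table object is passed in. The function name, the status values for the “not healthy” states, and the table name in the comment are illustrative assumptions, not part of the original design:

```python
def process_heartbeat(table, service_id: str, event: str) -> None:
    """Fetch the record, recompute the status, and persist only if it changed."""
    # Step 1: fetch the record from DynamoDB using the partition key.
    response = table.get_item(Key={"serviceId": service_id})
    item = response["Item"]

    # Step 2: update the status attribute if needed.
    current = item["status"]
    if event == "Heartbeat OK":
        new_status = "HEALTHY"
    else:  # "Heartbeat KO"
        new_status = "NOT_RESPONDING" if current == "HEALTHY" else "UNAVAILABLE"

    # Step 3: persist the change, or discard the event otherwise.
    if new_status != current:
        table.put_item(Item={**item, "status": new_status})

# With real AWS credentials, `table` would be something like:
#   table = boto3.resource("dynamodb").Table("service-status")
```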

Disclaimer: For the sake of simplicity, we are assuming events are received only once and arrive in the right order. Otherwise, we would need to use the ID of the event for idempotency handling, just as in the examples from Part 1, and the timestamp of the event for handling ordering issues. This is out of scope for the current analysis.

The following two diagrams show the most common scenarios:

Diagram 2: Service is stable, and Heartbeat OK event is received, so status remains the same.
Diagram 3: Service was healthy, but there is an outage. Then the status is updated to “Service not responding”.

How to write

As previously mentioned, we don’t need to worry about idempotency (receiving the event multiple times), but we may also receive two different events concurrently having the same serviceId.

This is an example of two concurrent events we may receive:

{
"serviceId": "3a558fc5-aead-48b2-a7dc-2e0d237ad46e",
"content": "Heartbeat KO",
"eventTimestamp": "2023-12-30 17:35:22:038"
}
{
"serviceId": "3a558fc5-aead-48b2-a7dc-2e0d237ad46e",
"content": "Heartbeat KO",
"eventTimestamp": "2023-12-30 17:35:23:038"
}

This is not the most common scenario, but it might happen. Let’s see how our system would handle it:

Diagram 4: processing two concurrent events for the same service

Does this sound familiar? Of course! This is the phantom read issue we covered in Part 1. As you should know already, the solution for this issue in DynamoDB is to use conditional writes. For that we will add a new attribute: version.

{
"serviceId": "3a558fc5-aead-48b2-a7dc-2e0d237ad46e",
"status": "HEALTHY",
"updatedTimestamp": "2023-12-30 17:35:22:038",
"version": 0
}

Every record will be created with version set to 0, and it will be incremented with every update. This is the optimistic locking mechanism based on a version number, which was referenced in the first part of this article.

Diagram 5: using conditional write to prevent phantom read issue

Bonus: Another thing to take into account is that we need a reattempt mechanism in place. We want to prevent overwriting an update, but we also want to make sure the event that fails eventually gets processed. This is relevant not only for race conditions but for any unexpected error we may have in our application, such as database unavailability or network issues.
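One way to sketch the conditional write together with a reattempt loop, again assuming a boto3-style table object. With real boto3 the exception to catch would be `table.meta.client.exceptions.ConditionalCheckFailedException`; here a stand-in class keeps the sketch self-contained, and the function name is made up:

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for boto3's ConditionalCheckFailedException (illustrative)."""

def update_with_retry(table, service_id: str, apply, max_attempts: int = 3) -> bool:
    """Read-modify-write with optimistic locking; reattempt on version conflicts."""
    for _ in range(max_attempts):
        item = table.get_item(Key={"serviceId": service_id})["Item"]
        new_item = apply(dict(item))  # caller computes the new record, or None to discard
        if new_item is None:
            return True  # nothing to write; the event is safely discarded
        new_item["version"] = item["version"] + 1
        try:
            table.put_item(
                Item=new_item,
                # The write only succeeds if nobody bumped the version meanwhile.
                ConditionExpression="version = :v",
                ExpressionAttributeValues={":v": item["version"]},
            )
            return True
        except ConditionalCheckFailed:
            continue  # stale read or race detected: re-read and reattempt
    return False  # give up; the event should go to a retry queue instead of being lost
```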

We confirmed what we knew from part 1: if we use conditional writes we will be safe from phantom read issues.

Now let’s jump to the most interesting part: how should we read from DynamoDB?

How to read

All right! We are going to use conditional writes, but how should we read the data? Is it okay to use eventually consistent reads?

Let’s assume the read is done with eventual consistency and we are lucky enough to get the latest data. Then we will update the status and the write operation will be successful.

But what if we are not that lucky and we have read stale data instead?

By the way, what the heck is stale data? According to ChatGPT:

In the context of distributed systems, stale data refers to data that has become outdated or invalid due to being superseded by newer updates or changes. Stale data can occur when there are multiple replicas of data distributed across different nodes or servers within the system, and updates to the data are not immediately propagated to all replicas.

For our use case it simply means our read operation didn’t fetch the latest version persisted in DynamoDB, but we got an older version instead. That may happen because the DynamoDB replica we read from wasn’t up to date with all the changes.

Then we will try to update the status and our write operation will fail. We will need to reattempt it and hopefully next time it will be successful. The latter scenario is represented in the following diagram:

Diagram 6: using eventual consistency read and conditional write

Therefore, even if the read operation returns stale data, it will be detected during the write operation (as long as we write conditionally).

That’s great news, right? We can always read using eventual consistency and write conditionally. In the unlikely scenario in which we have read stale data, our write will fail and we will just need to reattempt.

That’s correct, but is this approach efficient? What happens if the status is healthy and we receive a Heartbeat OK? Do we still need to update the record? The service is currently healthy and will stay healthy, so what’s the point of writing to the database if the status is not going to change?

Eventual consistency pitfalls

The problem is the following: if we read stale data, we may see the record as healthy but the latest persisted status might actually be “service not responding”.

Diagram 7: using eventual consistency read and discarding event because update is not needed

The status of the system should have been changed from “service not responding” to healthy, but it will (incorrectly) stay as “service not responding” instead.

This is a clear example where reading with eventual consistency is creating a data inconsistency issue.

Strong consistency read, are you still there?

To avoid that, there are at least two approaches:

  1. Reading with eventual consistency and always writing back into the table using a conditional write, even if the status doesn’t need to be updated. If the read status differs from the actual persisted value, it will be detected during the write operation, since the condition expression will not match. We can coin a new acronym for this pattern: WCRE, which stands for “write conditionally after reading eventually”. If I may, I would like to compare it with its evil twin, WAR (write after read), a known data hazard in CPU architecture design. WCRE, on the other hand, is a good pattern to use when working with DynamoDB.
  2. Reading with strong consistency so you can safely discard the event. The decision to drop the event is then based on the latest persisted status of the system.
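Approach 2 boils down to passing `ConsistentRead=True` on the read. A sketch of the discard decision, with the same boto3-style table object as before (the helper name is illustrative):

```python
def can_discard(table, service_id: str, event: str) -> bool:
    """Decide whether the event can be safely dropped, based on the latest status."""
    response = table.get_item(
        Key={"serviceId": service_id},
        ConsistentRead=True,  # strongly consistent: reflects all prior successful writes
    )
    status = response["Item"]["status"]
    # Safe to discard only when the heartbeat confirms the already-persisted state.
    return event == "Heartbeat OK" and status == "HEALTHY"
```

Note that strongly consistent reads consume a full read request unit (twice the eventually consistent price) and are not supported on global secondary indexes.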

Bringing our Incident Management System to a real-world scenario, we would expect services to stay healthy most of the time, so the majority of events are likely to be discarded. Performing a strongly consistent read will then keep our system consistent, but which of the two approaches is cheaper?

Diagram 8: using strongly consistent read so we have the confidence to discard the event safely

As you can see in the DynamoDB pricing documentation, a write operation is five times more expensive than a read operation. Considering the pricing model in the Ireland region, here is the cost of the two approaches (for one million executions):

Approach 1:

$0.283/2 (one million eventually consistent reads) + $1.4135 (one million conditional writes) = $1.555

Approach 2:

$0.283 (one million strong consistency reads)

In the first approach we pay for one eventually consistent read and one conditional write (which is charged even if the write is rejected due to a condition failure). In the second we pay for a single strongly consistent read (twice as expensive as an eventually consistent read). The second approach is therefore about 5.5 times cheaper than the first. Assuming this is the most common scenario, using strongly consistent reads reduces costs while keeping the system consistent.
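The arithmetic behind these numbers, using the on-demand prices for the Ireland region quoted above (per one million requests; prices may change over time):

```python
# On-demand prices per one million requests (Ireland region, from the article).
write_price = 1.4135                  # one write request unit per conditional write
strong_read_price = 0.283             # one read request unit per strongly consistent read
eventual_read_price = strong_read_price / 2  # eventually consistent reads cost half

# Approach 1: eventually consistent read + conditional write for every event.
approach_1 = eventual_read_price + write_price

# Approach 2: a single strongly consistent read, then discard the event.
approach_2 = strong_read_price

print(f"approach 1: ${approach_1:.4f}")  # $1.5550
print(f"approach 2: ${approach_2:.4f}")  # $0.2830
print(f"approach 2 is {approach_1 / approach_2:.1f}x cheaper")  # 5.5x
```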

Nevertheless, there may be other business cases where we need to write most of the time. If that’s the case, it is probably worth always performing the write operation so we don’t make decisions based on stale data. Then we can safely replace the strongly consistent read with an eventually consistent read, which will reduce latency and AWS costs.

If our system can afford the risk of having data inconsistencies, we could consider using eventual consistency reads without a further write. But in other business cases where discarding an event means losing a consumer business transaction, using strong consistency reads is totally worth it.

Summary and conclusions

In this second part of the article we have presented an Incident Management System use case, reviewed some edge case scenarios and analyzed the results depending on the consistency model we use.

Finally, I’m listing some interesting conclusions we reached in this second part of the article:

  • Making a decision based on stale data (as a result of an eventual consistency read) might create inconsistencies in the system.
  • It is safe to read with eventual consistency as long as the record is later updated using a conditional write (WCRE pattern).
  • If the record we read is not always going to be written back to DynamoDB, the safest approach is to use a strongly consistent read (if we want to guarantee data consistency).

The views, thoughts, and opinions expressed in the text belong solely to the author, and do not represent the opinion, strategy or goals of the author’s employer, organization, committee or any other group or individual.
