Reducing reads on DynamoDB

János Krnák
Published in uniplacesgeeks
Feb 14, 2018 · 5 min read

At Uniplaces we help students find accommodation while they study abroad. Students can pick from shared rooms, private rooms or entire flats. In our ubiquitous language, we call these offers.

An offer has many attributes that describe it: think of its name, price or availability. Some of these attributes are not in the offer’s domain (in DDD terms), for example the neighbourhood the property is in. When we display an offer on the website we try to show useful information about it to the users, which requires us to gather all this information from the different domains and serve it to the webpage so it can render it.

We have a service for this which aggregates the entities: it connects to different DynamoDB tables, reads all the relevant information, then merges it and returns it as a JSON object. This worked quite well, and the API had reasonably fast response times thanks to fetching the different domain objects concurrently. But it introduced another problem: too many reads on DynamoDB.
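To give an idea of what the aggregator does, here is a minimal Go sketch (the fetcher names and data are made up; the real service reads each domain from its own DynamoDB table and has proper error handling):

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
)

// Hypothetical per-domain fetchers; in the real service each one reads
// from its own DynamoDB table.
func fetchOffer(id string) (map[string]interface{}, error) {
	return map[string]interface{}{"id": id, "name": "Room near campus"}, nil
}

func fetchNeighbourhood(id string) (map[string]interface{}, error) {
	return map[string]interface{}{"name": "Alfama"}, nil
}

func fetchAvailability(id string) (map[string]interface{}, error) {
	return map[string]interface{}{"available_from": "2018-03-01"}, nil
}

// aggregateHandler fetches every domain entity concurrently and merges
// the results into a single JSON document.
func aggregateHandler(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id")

	fetchers := map[string]func(string) (map[string]interface{}, error){
		"offer":         fetchOffer,
		"neighbourhood": fetchNeighbourhood,
		"availability":  fetchAvailability,
	}

	var (
		mu        sync.Mutex
		wg        sync.WaitGroup
		aggregate = map[string]interface{}{}
	)
	for name, fetch := range fetchers {
		wg.Add(1)
		go func(name string, fetch func(string) (map[string]interface{}, error)) {
			defer wg.Done()
			if entity, err := fetch(id); err == nil {
				mu.Lock()
				aggregate[name] = entity
				mu.Unlock()
			}
		}(name, fetch)
	}
	wg.Wait()

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(aggregate)
}

func main() {
	http.HandleFunc("/offer-aggregate", aggregateHandler)
	http.ListenAndServe(":8080", nil)
}
```

Fetching the domains in parallel keeps the response time close to the slowest single read, but every request still hits every table.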

Because we designed one API to serve multiple consumers, we ended up loading too many entities that might not be relevant to each consumer. We sat down and started thinking about how we could reduce the reads. We wrote up a decision log (I’d recommend starting one to document your team’s decisions if you aren’t already doing so) about the different ideas we had:

  • Cache entities inside the aggregator
  • Use DynamoDB DAX to cache
  • Use an HTTP reverse proxy cache

We went through the pros and cons of each of these solutions, compared different HTTP cache providers and calculated the estimated costs. In the end (given we are running on AWS), the cheapest solution was to use CloudFront to cache the HTTP responses.

The first architecture (young and naive)

We drew up a solution where we cache the aggregates for a very long time and invalidate cached items via CDN purge requests. The problem with this is that CloudFront only gives you 1,000 invalidation requests per month for free and charges a lot for any invalidation requests above this limit.

This is not how HTTP caches should work

When you cache via HTTP you can set expiry headers, ETags, etc. The idea is that you cache your objects for as long as you consider them valid.

I’m only going to look at time-based caching here.

  • Use the Expires header when you know when the item will become stale, for example cache today’s weather forecast until midnight
  • Use the max-age directive when the validity is a fixed period, for example cache a news homepage for the next 5 minutes (see the sketch after this list)
  • Use purge when you made a boo-boo and published something to your CDN that shouldn’t have been published
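As an illustration, here is how the two time-based options could look with Go’s net/http (hypothetical endpoints, just to show the headers):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Expires: we know the exact moment the forecast becomes stale
	// (midnight, UTC here for simplicity).
	http.HandleFunc("/weather/today", func(w http.ResponseWriter, r *http.Request) {
		midnight := time.Now().Truncate(24 * time.Hour).Add(24 * time.Hour)
		w.Header().Set("Expires", midnight.UTC().Format(http.TimeFormat))
		fmt.Fprintln(w, `{"forecast":"sunny"}`)
	})

	// max-age: the response is valid for a fixed period (5 minutes).
	http.HandleFunc("/news", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Cache-Control", "public, max-age=300")
		fmt.Fprintln(w, `{"headlines":[]}`)
	})

	http.ListenAndServe(":8080", nil)
}
```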

Immutability to the rescue

We had to go back to the drawing board and redesign the architecture to treat aggregates as immutable objects. But what is an immutable object? It’s just what it says on the tin: an object that, once created, won’t change. It won’t change its content or any attributes attached to it; if you need to change something about it, you have to create a new version of it (which again will be immutable).

We can achieve this by versioning our offers. We cache each of these aggregates for a reasonably long time, for example a day, and if something changes about the offer (like its availability) we create a new version of it and serve that to our API consumers.
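A version store can be as small as a DynamoDB table keyed by offer id where every relevant change bumps a numeric version atomically. A rough sketch with aws-sdk-go, assuming a hypothetical offer-versions table (all names here are illustrative):

```go
package versions

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// BumpOfferVersion atomically increments the version for an offer and
// returns the new value.
func BumpOfferVersion(offerID string) (string, error) {
	svc := dynamodb.New(session.Must(session.NewSession()))

	out, err := svc.UpdateItem(&dynamodb.UpdateItemInput{
		TableName: aws.String("offer-versions"),
		Key: map[string]*dynamodb.AttributeValue{
			"offer_id": {S: aws.String(offerID)},
		},
		// ADD creates the attribute (starting from 0) if it doesn't exist yet.
		UpdateExpression:         aws.String("ADD #v :inc"),
		ExpressionAttributeNames: map[string]*string{"#v": aws.String("version")},
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":inc": {N: aws.String("1")},
		},
		ReturnValues: aws.String(dynamodb.ReturnValueUpdatedNew),
	})
	if err != nil {
		return "", err
	}

	newVersion := out.Attributes["version"]
	if newVersion == nil || newVersion.N == nil {
		return "", fmt.Errorf("version attribute missing from update response")
	}
	return *newVersion.N, nil
}
```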

But how will our consumers know the version they need to fetch?

They could all connect to our version store, fetch the version, then request the offer. That sounds like a lot of work: we have clients in many microservices, single-page applications and web applications, and changing all of them would require a lot of code changes, testing and releases. All of these consumers do share one thing in common: they use an HTTP client to fetch the offer. So why not abstract the version logic into a service that responds with a redirect containing the version number in the resource URL? No changes are needed on our consumers, as they can all follow a redirect response.

The new architecture

This architecture required us to add two more services:

  • One that is listening to changes and updates the offer’s version
  • One that acts as a redirect endpoint

The redirect endpoint was perfect for a Lambda function, and the version change logic was put into one of our container clusters.

Second architecture (a bit more complex, but based on best practices)
  1. When any important attribute of an offer changes we put a message on SQS. This message is read by a service that updates the version for the offer.
  2. Consumer (HTTP Client) sends a GET for an offer aggregate.
  3. API Gateway forwards the request to a Lambda function that fetches the version number for the given offer aggregate.
  4. The client gets back a 302 (temporary redirect) that points to the CDN with a version number appended to the URL (http://<cdn-host>/offer-aggregate/<id>?version=<version-number>). A sketch of this handler follows the list.
  5. The client then follows the redirect URL.
  6. The CDN serves a cached response for that URL if it already has one; if it doesn’t, it goes to the offer aggregator directly and fetches the aggregate.
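Here is a minimal sketch of the redirect Lambda (steps 3–4) using the aws-lambda-go runtime; the version lookup and CDN host are placeholders, not the real implementation:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// lookupVersion is a stand-in for reading the current version of an
// offer aggregate from the version store (e.g. a DynamoDB GetItem).
func lookupVersion(ctx context.Context, offerID string) (string, error) {
	return "42", nil // placeholder
}

func handler(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	offerID := req.PathParameters["id"]

	version, err := lookupVersion(ctx, offerID)
	if err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 500}, err
	}

	// 302 pointing at the CDN; the version in the query string makes the
	// cached object effectively immutable.
	location := fmt.Sprintf("https://cdn.example.com/offer-aggregate/%s?version=%s", offerID, version)
	return events.APIGatewayProxyResponse{
		StatusCode: 302,
		Headers:    map[string]string{"Location": location},
	}, nil
}

func main() {
	lambda.Start(handler)
}
```

Because every consumer’s HTTP client can follow a 302 out of the box, none of them needed any code change.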

Conclusion

We managed to drastically decrease the number of aggregator service instances running on the cluster, which in turn allowed us to decrease the cluster’s size. Our reads on DynamoDB have decreased a lot, which allowed us to decrease our provisioned reads. We saved a lot of money here.

We now have costs on CloudFront and Lambda, but these are much lower than what we saved on DynamoDB and ECS.

How could we improve it further?

This solution was a quick win that saved us a lot on DynamoDB reads, but it is still far from perfect.

In an ideal world we would drop the aggregator service and replace it with GraphQL. Each domain could have the same immutable architecture for its API responses, and the clients could define what they actually want to fetch.

In front of our Lambda function we could cache the version responses in the API Gateway service, making even fewer calls to DynamoDB and getting faster redirect response times.
