Lessons learnt on NoSQL and why we moved away from it

In the project I am working on, we recently decided to move away from a NoSQL datastore, AWS DynamoDB. To explain why, I need to talk about the premise on which DynamoDB was chosen to begin with.

The application was about managing a business entity called a Form. You can imagine it as a JSON blob of key-value pairs, with every key representing a field in the form, henceforth referred to as a FormField, which contains the value of the field along with its properties like color, font and fontSize.

JSON representation of a Form
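For illustration, a minimal sketch of what such a blob might look like (the field keys and property values here are made up):

    {
      "fieldA": {
        "value": "Hello world",
        "color": "#000000",
        "font": "Arial",
        "fontSize": 12
      },
      "fieldB": {
        "value": "42",
        "color": "#ff0000",
        "font": "Helvetica",
        "fontSize": 10
      }
    }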

The following points were considered when choosing a NoSQL datastore for this application (and DynamoDB subsequently):

A form is represented as a key-value structure. Hence, storing it in the same shape has less overhead of rehydrating into objects, compared to using ORMs with a SQL database.

Relational aspects of this data were minimal. A form can have many fields.
A form is managed for a user through other services. Hence, this application did not need to store the relation between form and user.

Now to talk about why we chose DynamoDB. We were already into AWS big time, and the application was to be built on a serverless architecture to support scalability. DynamoDB provided a neat stateless interface to call from a stateless Lambda, which fit into the picture nicely.

It is also very simple to create DynamoDB tables without any networking overhead and to manage access with just IAM policies, unlike having to host an RDS instance inside a private subnet.

More about the nuances of this application.

A user can see this form on a UI and change the field values in real time. A form can also be managed by multiple users at the same time. Imagine two users looking at the same form in their respective browsers: a change made by one user should appear to the other user in real time.

To achieve this, the UI connects to the backend via a WebSocket. Updates made by users are saved to the database and published back on the WebSocket, to be made available to other users subscribed to the same topic, much like a chat application.
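The post doesn't spell out the exact stack for this, but assuming an API Gateway WebSocket API and boto3, the publish step might look like this sketch (the endpoint URL and connection tracking are hypothetical):

    import json
    import boto3

    # Hypothetical API Gateway WebSocket management endpoint
    apigw = boto3.client(
        "apigatewaymanagementapi",
        endpoint_url="https://example.execute-api.us-east-1.amazonaws.com/prod",
    )

    def broadcast_field_update(connection_ids, form_id, field_key, field_value):
        """Push a field update to every connection subscribed to this form."""
        payload = json.dumps({"formId": form_id, "fieldKey": field_key, "value": field_value})
        for connection_id in connection_ids:
            apigw.post_to_connection(ConnectionId=connection_id, Data=payload.encode())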

As you can imagine, one of the common use cases is updating a form without conflicts: User 1 can change fieldA while User 2 changes fieldB simultaneously, and the form should be consistent after both updates.

Also, a thing to note here: the multi-user scenario turned out to be not so common after further user research; nonetheless, product wanted it very particularly in the initial days.

Now, keeping these in mind, the following schema was developed:

PrimaryKey: formId (a uuid)
SortKey: fieldKey (a key representing the field in a form)
Example:
PrimaryKey: c92f5c51-d2c5-4c10-a4f3-0d3cdd1fbdd7
SortKey: fieldA

Each field is stored as a separate document in DynamoDB, which allows it to be updated independently of other fields, avoiding conflicts.
All fields of a form can be queried by the primary key to get the complete form, as sketched below.
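A rough sketch of that read path with boto3 (the table name is an assumption, not our actual setup):

    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical table: partition key "formId", sort key "fieldKey"
    table = boto3.resource("dynamodb").Table("forms")

    def get_form(form_id):
        """Fetch every field document of a form with one query on the partition key."""
        # Pagination (LastEvaluatedKey) omitted for brevity.
        response = table.query(KeyConditionExpression=Key("formId").eq(form_id))
        return {item["fieldKey"]: item for item in response["Items"]}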

A user may want to automatically fill in the form values from data stored in other parts of the application.

We wrote APIs to be consumed by other services to update a form, which make a separate update query to the DynamoDB table for every field.
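Conceptually, the update API boiled down to a per-field loop like this (a sketch; the real APIs had validation and error handling):

    def update_form(table, form_id, fields):
        """One UpdateItem call per field: a big form means a burst of many writes."""
        for field_key, field in fields.items():
            table.update_item(
                Key={"formId": form_id, "fieldKey": field_key},
                UpdateExpression="SET #v = :v",
                ExpressionAttributeNames={"#v": "value"},  # "value" is a DynamoDB reserved word
                ExpressionAttributeValues={":v": field["value"]},
            )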

At this stage, if you are wondering why we didn't choose to store the entire form in one document: it was because of DynamoDB's size limit of 400KB per item. Some forms could be larger than that, which ruled this option out.

Everything was working as expected for about a year, until product decided to introduce new features…

Product wanted a new feature to support extra-large forms containing tens of thousands of fields. The teams charged with building these features started consuming our application to manage those forms.

When they started hitting our APIs to update such forms, we saw the first cracks in our so-far-solid architecture: write throttles in DynamoDB.

As the partition key for every field in a form is the same (the formId), DynamoDB stores all these documents in a single partition. DynamoDB also has a hard limit of 1000 WCU per second per partition. Assuming a field update takes up to 1 WCU, we were limited to updating only about 1000 fields per second.
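To make that concrete: at 1 WCU per field, a hypothetical form with 30,000 fields would need at least 30 seconds of the partition's entire write budget for a single save, and every other write to that form during that window would be throttled.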

Throwing money at it to increase write capacity wouldn't help in this case, as the per-partition limit applies no matter how much capacity is provisioned on the table.

While we were contemplating this problem and trying to work around it by adding artificial delays to spread the writes out in time, a bug in a consuming service one day caused it to repeatedly call these APIs to update an extra-large form, which throttled the database, essentially cutting off DB writes in production.

This was a wake-up call.

We formed a working group with other lead engineers and architects and came up with several options, including changing the DynamoDB schema and moving away from DynamoDB completely.

We started with a paper-based, back-of-the-envelope model to predict the peak load patterns we should expect. I remember asking ourselves why we didn't start with this a couple of years earlier, while designing the initial architecture :)

We initially pondered the option of appending a random suffix to the partition key of every field, so that a form's fields get distributed across partitions, as explained here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-sharding.html
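A sketch of that write-sharding idea, assuming a fixed number of shards per form (the shard count and key format here are made up, not what AWS prescribes):

    import random
    from boto3.dynamodb.conditions import Key

    NUM_SHARDS = 10  # hypothetical shard count per form

    def sharded_key(form_id):
        """Append a random shard suffix so a form's fields land on several partitions."""
        return f"{form_id}#{random.randint(0, NUM_SHARDS - 1)}"

    def get_form_sharded(table, form_id):
        """Reads pay the price: one query per shard, results merged afterwards."""
        fields = {}
        for shard in range(NUM_SHARDS):
            response = table.query(
                KeyConditionExpression=Key("formId").eq(f"{form_id}#{shard}")
            )
            fields.update({item["fieldKey"]: item for item in response["Items"]})
        return fields

The obvious trade-off is that reading a complete form now costs one query per shard.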

Even though it looked promising, it did not hold up against our model of spiky load patterns during peak times. It meant we would run into high operating costs in DynamoDB to keep our application highly available.

We finally selected the option of using a relational database, where the entire form can be stored as a JSON blob in a single row. We picked Postgres as it was the company standard among relational databases, and chose RDS Aurora, as Aurora promised higher performance in general compared to regular Postgres.
A single column value can be as big as 1GB in Postgres, which is enough to store any form our product could think of.
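A minimal sketch of the new shape, assuming psycopg2 and a jsonb column (the table and column names are illustrative, not our actual schema):

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=formsdb")  # hypothetical connection string

    # Hypothetical schema: one row per form, the whole form in a single jsonb column.
    #   CREATE TABLE forms (
    #       form_id uuid PRIMARY KEY,
    #       data    jsonb NOT NULL,
    #       version integer NOT NULL DEFAULT 0
    #   );

    def save_form(form_id, form):
        """Upsert the entire form as one row: one write instead of one per field."""
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO forms (form_id, data) VALUES (%s, %s) "
                "ON CONFLICT (form_id) DO UPDATE SET data = EXCLUDED.data",
                (form_id, json.dumps(form)),
            )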

An initial POC with RDS Aurora was very promising for the predicted load. The operating cost, even with a replica in every AZ, was just half of our current DynamoDB bill (let alone what we would have had to pay DynamoDB to support the predicted load).

The only thing that didn't sit well initially was the networking around RDS and having to run Lambdas within a private VPC. We had heard stories about the overhead Lambdas incur from having to create ENIs to reach an RDS database, and we were afraid this would become a huge problem.
But fortunately, AWS recently solved it for us: https://aws.amazon.com/blogs/compute/announcing-improved-vpc-networking-for-aws-lambda-functions/

The overhead of creating the network infrastructure was seen as a one-time investment; we shouldn't normally have to meddle with it again in the lifetime of this application.

Parallel form updates became a standard database concurrency problem, which we solved with an optimistic locking mechanism, since we don't usually expect concurrent updates.
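A sketch of that optimistic locking, building on the hypothetical version column from the schema above: a save only succeeds if nobody bumped the version since we read the row; otherwise the caller retries.

    def save_form_optimistic(conn, form_id, form, expected_version):
        """Update only if the row still carries the version we read earlier."""
        with conn, conn.cursor() as cur:
            cur.execute(
                "UPDATE forms SET data = %s, version = version + 1 "
                "WHERE form_id = %s AND version = %s",
                (json.dumps(form), form_id, expected_version),
            )
            if cur.rowcount == 0:
                # Someone else updated the form first; surface a conflict so the caller can retry.
                raise RuntimeError(f"optimistic lock conflict for form {form_id}")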

Redesigned architecture with RDS and Lambdas running in a private subnet

These are the benefits we saw after migrating to RDS:

  1. The latency to read forms dropped by half, as we no longer query individual fields separately.
  2. Saving big forms was no longer a problem, and became a lot faster, as we no longer update separate fields.
  3. The RDS instance always stayed under 30% CPU utilization during our load testing, which gave us further confidence in the availability of our application under the predicted peak load.

Here are the takeaways:

  1. With NoSQL, you should have the full product roadmap before you start designing the database schema, as there is very little room to change partition schemas later.
  2. Once the DB schema is designed, draw strong boundaries around what can and cannot be achieved with it. If product later wants something that does not sit well with the initial assumptions, make it very clear that the cost of achieving it is very high. It could even become a new application, without having to modify the existing one, if that's possible in your use case.
  3. Always have a model to predict the load patterns in production, and do some quick POCs early on to prove or disprove the assumptions in your design. A couple of extra weeks of planning initially can save months of rewriting later.
  4. Relational databases seem to be more generic and offer more flexibility for unforeseen changes compared to NoSQL.

Hope this provides some insight when selecting between NoSQL and relational databases for your project.

