I love relational databases. I love the challenge of designing a well normalised entity relationship model, where the goal is to eliminate data duplication through clever table design and the appropriate use of foreign keys and indexes. It’s like playing a round of golf; the maximum result is achieved by implementing the minimal solution.
At the time of writing, serverless relational databases are in their infancy, and in my previous article, I briefly mentioned the issues with relational databases as they scale. Given that they’ve served me well for so long, I thought the decent thing to do would be to elaborate on this.
Leaving the Factory
Any software solution with a database component is fast and snappy during the development phase. Unfortunately, this is because the database is relatively empty; less data to query yields faster response times, which is in no way indicative of the real world.
We can attempt to predict future performance by artificially loading our development database with years' worth of test data, but this is educated guesswork at best, for two reasons: we have no idea how fast a database will grow, and no way of knowing how the data it stores will vary.
It’s a bit like a brand new car leaving the factory. What will it be like to drive in three years' time? It depends on the mileage, the sort of roads it has travelled on, and whether it has been properly maintained and serviced. Databases are like that too.
Let’s look at an ER model for a simple eCommerce application:
Customers place orders. Orders consist of many items.
If I want to know the total value of an order, I join Order to OrderItem to Product. If I want to know who my most valuable customers are, I’ll join in the Customer table here too. As my business thrives and my database grows, these joins become bigger. This is where partitioning comes in.
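As a rough illustration, here is that order-total join against a tiny in-memory SQLite database. The table and column names follow the ER model above, but the rows and prices are invented for the example:

```python
import sqlite3

# Minimal sketch of the schema from the ER model; data is illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Customer  (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Product   (Id INTEGER PRIMARY KEY, Name TEXT, Price REAL);
CREATE TABLE "Order"   (Id INTEGER PRIMARY KEY, CustomerId INTEGER, OrderDate TEXT);
CREATE TABLE OrderItem (OrderId INTEGER, ProductId INTEGER, Quantity INTEGER);

INSERT INTO Customer  VALUES (1, 'Alice');
INSERT INTO Product   VALUES (1, 'Widget', 2.50), (2, 'Gadget', 10.00);
INSERT INTO "Order"   VALUES (1, 1, '2020-01-15');
INSERT INTO OrderItem VALUES (1, 1, 4), (1, 2, 1);
""")

# Total value of order 1: join Order -> OrderItem -> Product.
cur.execute("""
SELECT SUM(oi.Quantity * p.Price)
FROM "Order" o
JOIN OrderItem oi ON oi.OrderId = o.Id
JOIN Product p    ON p.Id = oi.ProductId
WHERE o.Id = 1
""")
print(cur.fetchone()[0])  # 4 * 2.50 + 1 * 10.00 = 20.0
```

The query is fast here precisely because the tables are nearly empty; the joins themselves don't change as the data grows, only the amount of work behind them.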
OrderItem should be the fastest growing table here; we would expect customers to place multiple orders, and each order to consist of multiple items. When I want to fetch all items in an order, I look for all rows where OrderItem.OrderId = Order.Id. The bigger these tables become, the more data we must wade through. Partitioning allows us to split up our large table thus:
In the figure above, I’ve split the large OrderItem table into a series of smaller partitions based on the value of OrderItem.OrderId; this is known as my partition key. The first partition holds those rows where OrderId is in the range 1–1000; the second holds 1001–2000; the third 2001–3000; and so on. Now when I want to list the items in an Order, I do not need to search all 100k rows in the whole OrderItem table, just those in the relevant partition.
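The database engine does this routing internally, but the range-to-partition mapping described above amounts to something like this sketch (assuming a partition size of 1000, as in the figure):

```python
PARTITION_SIZE = 1000

def partition_for(order_id: int) -> int:
    """Map an OrderId to its range partition:
    1-1000 -> partition 0, 1001-2000 -> partition 1, and so on."""
    return (order_id - 1) // PARTITION_SIZE

print(partition_for(1), partition_for(1500), partition_for(2999))  # 0 1 2
```

Given an OrderId, the engine can jump straight to one partition and ignore the rest, which is why the lookup stays fast as the table grows.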
I could, in theory, partition on the exact value of OrderItem.OrderId. However, SQL Server has a limit of 15,000 partitions per table.
Sorted? Eh, not quite…
What’s the problem with this approach? Well, I can only specify one partition key per table. I made an assumption here that OrderItem.OrderId would be my silver bullet, but it’s only ideal for the angle we looked at, namely getting the contents of an order.
As a business owner, I’ll want to know what my best selling products are each month. I’ll start to do that by joining Product.Id to OrderItem.ProductId. As the OrderItem table gets bigger, this query takes longer for the same reasons given above. If I’ve partitioned my table based on OrderItem.OrderId, I now have to query across multiple partitions. I’ll have to join on the Order table as well to get the OrderDate so I can calculate monthly sales figures.
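To make the problem concrete, here is a sketch of that best-sellers-per-month query, again with invented data in SQLite. Note that it groups OrderItem rows by product and month, so it has to touch rows scattered across every OrderId partition:

```python
import sqlite3

# Illustrative data only; schema names follow the ER model above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Product   (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE "Order"   (Id INTEGER PRIMARY KEY, OrderDate TEXT);
CREATE TABLE OrderItem (OrderId INTEGER, ProductId INTEGER, Quantity INTEGER);

INSERT INTO Product   VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO "Order"   VALUES (1, '2020-01-05'), (2, '2020-01-20'), (3, '2020-02-03');
INSERT INTO OrderItem VALUES (1, 1, 4), (2, 1, 2), (2, 2, 1), (3, 2, 5);
""")

# Best sellers per month: OrderItem joins Product for the name and
# Order for the date, cutting across every OrderId-based partition.
cur.execute("""
SELECT strftime('%Y-%m', o.OrderDate) AS Month, p.Name, SUM(oi.Quantity) AS Sold
FROM OrderItem oi
JOIN Product p ON p.Id = oi.ProductId
JOIN "Order" o ON o.Id = oi.OrderId
GROUP BY Month, p.Id
ORDER BY Month, Sold DESC
""")
rows = cur.fetchall()
for row in rows:
    print(row)  # e.g. ('2020-01', 'Widget', 6)
```

A partitioning scheme keyed on OrderId gives this query no help at all; it would want the data partitioned by date instead.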
I can’t have two different partition keys on the same table. I need both operations to execute quickly in a fully serverless architecture — viewing orders and viewing my sales inventory. Over time, my relational database will weigh me down. What to do?
What’s the harm in repeating myself anyway?
So this goes against everything normal form has taught me.
We have two tables.
Okay, this is NoSQL, so we have collections, not tables:
Each document in the Orders collection contains the OrderDate and the Items in the order. No need to join tables here! We can partition this based on OrderDate and, better yet, archive documents older than a retention period. One idea may be to archive Orders older than one day (or possibly a few hours), as we only need active sale data for the daily running of the site.
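As a sketch, a single Orders document might look like the following; the exact field names (Items, UnitPrice and so on) are my own illustrative choices, not a prescribed schema:

```python
# One self-contained order document: the items are embedded,
# so reading an order never requires a join.
order_doc = {
    "OrderId": 1,
    "CustomerId": 42,
    "OrderDate": "2020-01-15",  # partition / archive key
    "Items": [
        {"ProductId": 1, "Name": "Widget", "Quantity": 4, "UnitPrice": 2.50},
        {"ProductId": 2, "Name": "Gadget", "Quantity": 1, "UnitPrice": 10.00},
    ],
}

# The order total comes straight out of the document.
total = sum(i["Quantity"] * i["UnitPrice"] for i in order_doc["Items"])
print(total)  # 20.0
```

Product names and prices are duplicated into each order, which is exactly the repetition the next section defends.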
The ProductSales collection details when products are purchased, in what quantity, and at what total cost. Again, we can partition / archive based on OrderDate, this time by the month of sale, so that we can easily sum up monthly sales and inventory figures.
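A sketch of what those ProductSales documents and the monthly roll-up might look like; again, the field names are illustrative assumptions:

```python
from collections import defaultdict

# One ProductSales document per product per order, keyed for archiving
# by the month of OrderDate. Data is invented for the example.
product_sales = [
    {"ProductId": 1, "OrderDate": "2020-01-15", "Quantity": 4, "TotalCost": 10.00},
    {"ProductId": 1, "OrderDate": "2020-01-20", "Quantity": 2, "TotalCost": 5.00},
    {"ProductId": 2, "OrderDate": "2020-02-03", "Quantity": 1, "TotalCost": 10.00},
]

# Monthly sales per product: no joins, just a scan of one month's partition.
monthly = defaultdict(float)
for doc in product_sales:
    month = doc["OrderDate"][:7]  # e.g. "2020-01"
    monthly[(doc["ProductId"], month)] += doc["TotalCost"]

print(dict(monthly))
```

Because the collection is keyed the way the report is read, each month's figures come from a single partition, and older months can be archived wholesale.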
Okay, we’re using more storage space, but storage is cheap! If we define clear archiving policies, it’s even cheaper.
The principle here is that if data is rarely used, it shouldn’t clutter up our active data store.
Yes, this isn’t normal form. Yes, we’re repeating ourselves. But in breaking apart these close-knit relationships, we can archive / partition and optimise each collection for speed, which is what a serverless architecture depends upon.
High performance leads to a better user experience, which leads to repeat business and a profitable store.