Database Evolution Part 2: Ways to evolve schema to handle scale

Abdullah Jaffer · Published in Bazaar Engineering · Jul 11, 2022


A while back at Bazaar, we ran into an interesting problem: we needed to introduce a new level of granularity into our existing database schema to support a new feature. Usually this translates to adding another table and introducing a one-to-many relationship to support the new layer of granularity.

But it was not so simple at our scale. We already had a lot of tables, and lookups for this specific use case already required joins across them. Adding another table on top of that schema could become a bottleneck, especially since we were already stretched on latency for the APIs that would use this new feature.

In this article, I'll explain how we evaluated different approaches and which one we decided to go with. Read part 1 of this series to find out how to evolve the indexes in your schema alongside change requests.

Scope

While discussing these approaches, we considered them only through the lens of the database layer, which in our case was a relational database, so the solutions presented here are limited to that scope.

Context and Current Structure

First, I will explain the schema we needed to navigate around. For confidentiality reasons this is not the actual schema, but it has a very similar context.

Image by Author, made with draw.io

Product
Represents global product attributes such as title, size, and type of product.

Category
The category of the product; products are fetched for a specific category.

Batch
Represents batch-level properties such as batch age and quantity.

Batch_City
Represents city-level batch properties, such as pricing.

The parent table in this relationship is Product, while the most granular table is Batch_City, which represents the properties of a product batch at the city level.

Batch_City Schema

Image By Author
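
The diagram itself isn't reproduced here, but as a rough sketch in MySQL DDL, with column names and types that are my assumptions rather than the actual schema, Batch_City might look something like this:

    -- Hypothetical sketch of Batch_City: one row per (batch, city) pair,
    -- carrying the city-level prices for that batch.
    CREATE TABLE batch_city (
        id               BIGINT AUTO_INCREMENT PRIMARY KEY,
        batch_id         BIGINT NOT NULL,  -- references batch(id)
        city_id          BIGINT NOT NULL,
        price_actual     DECIMAL(12, 2) NOT NULL,
        price_discounted DECIMAL(12, 2),
        UNIQUE KEY uk_batch_city (batch_id, city_id)
    );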

Behavior required

  1. We need a further level of segregation called customer_cohort.
  2. Each batch can be in multiple customer_cohorts.
  3. Each batch can have a different price_actual and price_discounted depending on customer_cohort and city.

Approach 1: Split Batch_City into two tables

Updated Batch_City table

Image By Author

New Batch_City_Price table

Image By Author
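
As a hedged sketch of this split, again with assumed names and types, the cohort-level pricing moves out of batch_city into a child table:

    -- Hypothetical Approach 1: batch_city keeps batch/city-level fields,
    -- while per-cohort prices live in a child table.
    CREATE TABLE batch_city_price (
        id               BIGINT AUTO_INCREMENT PRIMARY KEY,
        batch_city_id    BIGINT NOT NULL,       -- references batch_city(id)
        customer_cohort  VARCHAR(32) NOT NULL,  -- e.g. 'GENERIC'
        price_actual     DECIMAL(12, 2) NOT NULL,
        price_discounted DECIMAL(12, 2),
        UNIQUE KEY uk_bc_cohort (batch_city_id, customer_cohort)
    );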

Positives of this approach

  1. Easily searchable and filterable.

Side-effects of this approach

  1. Requires a join every time we need to get prices and cohorts for a product batch (illustrated after this list).
  2. Beyond the additional join, this style of scaling is not ideal: every new type of segregation will require adding yet another table, plus joins whenever filters are needed at that level.
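
For illustration, a price lookup under this split might read as follows; the identifiers are assumptions, but the extra join is the point:

    -- Hypothetical lookup: every price read now pays for one more join.
    SELECT bc.batch_id, bc.city_id, bcp.customer_cohort,
           bcp.price_actual, bcp.price_discounted
    FROM batch_city bc
    JOIN batch_city_price bcp ON bcp.batch_city_id = bc.id
    WHERE bc.batch_id = 42 AND bc.city_id = 7;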

Approach 2: Put Customer_Cohort level information in a JSON field

Image By Author
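
A minimal sketch of this approach, assuming two JSON columns named customer_cohorts and cohort_prices (the actual field names and layout may differ):

    -- Hypothetical Approach 2: cohort data folded into JSON columns
    -- on the existing batch_city table.
    ALTER TABLE batch_city
        ADD COLUMN customer_cohorts JSON,  -- e.g. ["GENERIC", "PREMIUM"]
        ADD COLUMN cohort_prices    JSON;  -- e.g. {"GENERIC": {"price_actual": 100.0, "price_discounted": 90.0}}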

Positives of this approach

  1. Can avoid the extra join.
  2. By keeping separate JSON fields for customer_cohorts and pricing, we can do less JSON parsing and lookup when we only need the available cohorts for a batch.

Side-effects of this approach

  1. Difficult to search and index, since indexes directly on JSON columns are not supported out of the box in MySQL, the database we were using.
  2. We have to parse JSON for every product to get its cohorts and prices on each lookup (illustrated after this list).
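
To make that parsing cost concrete, a hypothetical read for one cohort's price would unpack the JSON document on every matching row (the paths follow the sketch above):

    -- Hypothetical read path: the JSON document is parsed on each lookup.
    SELECT batch_id, city_id,
           cohort_prices->>'$.GENERIC.price_actual'     AS price_actual,
           cohort_prices->>'$.GENERIC.price_discounted' AS price_discounted
    FROM batch_city
    WHERE batch_id = 42 AND city_id = 7;

MySQL can index JSON values indirectly through generated columns, but that is extra schema machinery rather than out-of-the-box support.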

Approach 3: Denormalisation

Some of you might have predicted this as a possible approach, since it's in the title, and this is what we went with. The idea was to denormalize batch_city: for every batch and city combination, a new row is added per customer_cohort.

Image by author
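
A hedged sketch of the denormalized table, with assumed names and types: customer_cohort becomes a plain column, and the unique key widens to cover it:

    -- Hypothetical Approach 3: one row per (batch, city, cohort) combination;
    -- the composite index serves filtered lookups without any joins.
    CREATE TABLE batch_city (
        id               BIGINT AUTO_INCREMENT PRIMARY KEY,
        batch_id         BIGINT NOT NULL,
        city_id          BIGINT NOT NULL,
        customer_cohort  VARCHAR(32) NOT NULL,
        price_actual     DECIMAL(12, 2) NOT NULL,
        price_discounted DECIMAL(12, 2),
        UNIQUE KEY uk_batch_city_cohort (batch_id, city_id, customer_cohort)
    );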

Positives of this approach

  1. No joins.
  2. We can keep adding further segregation levels without much effect on latency, since lookups can still be served by composite indexes (at the cost of increased space).
  3. Ability to roll back in case a segregation level is no longer required. For example, choose a source-of-truth customer_cohort, like GENERIC, delete the rows for all other cohorts, then drop the customer_cohort column (sketched after this list).
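
Under the same assumed schema, the rollback in point 3 could be as simple as:

    -- Hypothetical rollback: keep GENERIC as the source of truth,
    -- drop the other cohorts' rows, then drop the column itself.
    DELETE FROM batch_city WHERE customer_cohort <> 'GENERIC';
    ALTER TABLE batch_city DROP COLUMN customer_cohort;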

Side-effects of this approach

  1. A rapidly growing vertical table: each new customer cohort, or any additional filtering column for a product, multiplies the number of rows, and with it the storage requirements.
  2. Increased complexity in application code, since all updates and insertions into denormalized tables must be handled differently.

Approach 3.1: Division into data and filtering tables

We divide the denormalized batch_city table into a filtering table called batch_city_search and a data table called batch_city_data.

Image by Author, table 1 is batch_city_search and table 2 is batch_city_data
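
Sketched with assumed names and types, the split might look like this; since the data table carries no batch or city identifiers, identical price pairs can share a single data row:

    -- Hypothetical Approach 3.1: filter columns live in batch_city_search,
    -- the payload lives in batch_city_data, joined one-to-one via data_id.
    CREATE TABLE batch_city_data (
        id               BIGINT AUTO_INCREMENT PRIMARY KEY,
        price_actual     DECIMAL(12, 2) NOT NULL,
        price_discounted DECIMAL(12, 2)
    );

    CREATE TABLE batch_city_search (
        batch_id         BIGINT NOT NULL,
        city_id          BIGINT NOT NULL,
        customer_cohort  VARCHAR(32) NOT NULL,
        data_id          BIGINT NOT NULL,  -- references batch_city_data(id)
        PRIMARY KEY (batch_id, city_id, customer_cohort)
    );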

Positives of this approach

  1. Only one-to-one joins.
  2. Easy to add a new batch_city_data row and a new column in the batch_city_search table.
  3. Reusable data rows.
  4. By vertically slicing some columns and then compressing them (through reuse), we can somewhat mitigate the storage requirements.

Side-effects of this approach

  1. A rapidly growing vertical search table as new filtering columns are added for a product; this would still remain a concern.

Approach 3.2: Opus Magnum, a single column

By encoding the data table's columns into a single string, for example by passing them through a hash function and storing the resulting hash string as a single column, we save space even further.

encode(batch_city_data_fields) = hash_result
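
One hypothetical realisation of this encode step in MySQL; SHA2 and the field list are assumptions, not the actual implementation:

    -- Hypothetical encoding: collapse the data fields into one deterministic
    -- digest, so identical payloads always map to the same string.
    SELECT SHA2(CONCAT_WS('|', price_actual, price_discounted), 256) AS hash_result
    FROM batch_city_data;

Note that a hash is one-way: the application would still need the original field values stored somewhere, or a reversible serialisation instead of a hash, to read the data back.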

Positives of this approach

  1. Further mitigates storage issues by keeping only a single column.

Side-effects of this approach

  1. Complex application code, since every field must be encoded on writes and resolved back on reads.

Final Approach

In the end, we decided to go with approach 3, not including 3.1 or 3.2, since storage was not much of a concern for us at the time. We put off the storage optimizations until storage constraints become a legitimate issue.

Disclaimer:

Bazaar Technologies believes in sharing knowledge and freedom of expression, and it encourages its colleagues and friends to share knowledge, experiences, and opinions in written form on its Medium publication, in the hope that some people across the globe might find the content helpful. However, the content shared in this post and other posts on this publication mostly describes and highlights the opinions of the authors, which might or might not be the actual and official perspective of Bazaar Technologies.
