Database Evolution Part 2: Ways to evolve schema to handle scale

Abdullah Jaffer · Published in Bazaar Engineering · Jul 11, 2022


A while back at Bazaar, we ran into an interesting problem: we needed to introduce a new level of granularity into our existing database schema to support a new feature. Usually this translates to adding another table and introducing a one-to-many relationship to support the new layer of granularity.

But it was not so simple at our scale. We already had a lot of tables, and lookups for this specific use case already required joins across them. Adding another table on top of that schema could become a bottleneck, especially since we were already stretched on latency for the APIs that would use this new feature.

In this article, I'll explain how we evaluated different approaches and which one we decided to go with. Read part 1 of this series to find out how to evolve the indexes in your schema alongside change requests.

Scope

While discussing these approaches, we considered them only through the lens of the database layer, which in our case was a relational database, so the solutions presented here are limited to that scope.

Context and Current Structure

First, I will explain the schema we needed to navigate around. For confidentiality reasons this is not the actual schema, but it has a very similar context.

Image by Author, made with draw.io

Product
Represents global product attributes such as title, size, and type of product.

Category
The category of the product; products are fetched for a specific category.

Batch
Represents batch-level properties such as batch age and quantity.

Batch_City
Represents city-level batch properties, such as pricing.

The parent table in this relationship is Product, while the most granular table is Batch_City, which represents the properties of a product batch at the city level.

Batch_City Schema

Image By Author
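
The diagram itself isn't reproduced here, but as a rough sketch in MySQL DDL, with column names and types that are my assumptions rather than the actual schema, Batch_City might look something like this:

    -- Hypothetical sketch of Batch_City: one row per (batch, city) pair,
    -- carrying the city-level prices for that batch.
    CREATE TABLE batch_city (
        id               BIGINT AUTO_INCREMENT PRIMARY KEY,
        batch_id         BIGINT NOT NULL,  -- references batch(id)
        city_id          BIGINT NOT NULL,
        price_actual     DECIMAL(12, 2) NOT NULL,
        price_discounted DECIMAL(12, 2),
        UNIQUE KEY uk_batch_city (batch_id, city_id)
    );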

Behavior required

  1. We need a further level of segregation called customer_cohort.
  2. Each batch can be in multiple customer_cohorts.
  3. Each batch can have a different price_actual and price_discounted depending on customer_cohort and city.

Approach 1: Split Batch_City into two tables

Updated Batch_City table

Image By Author

New Batch_City_Price table

Image By Author
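
As a hedged sketch of this split, again with assumed names and types, the cohort-level pricing moves out of batch_city into a child table:

    -- Hypothetical Approach 1: batch_city keeps batch/city-level fields,
    -- while per-cohort prices live in a child table.
    CREATE TABLE batch_city_price (
        id               BIGINT AUTO_INCREMENT PRIMARY KEY,
        batch_city_id    BIGINT NOT NULL,       -- references batch_city(id)
        customer_cohort  VARCHAR(32) NOT NULL,  -- e.g. 'GENERIC'
        price_actual     DECIMAL(12, 2) NOT NULL,
        price_discounted DECIMAL(12, 2),
        UNIQUE KEY uk_bc_cohort (batch_city_id, customer_cohort)
    );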

Positives of this approach

  1. Easily searchable and filterable.

Side-effects of this approach

  1. Requires a join every time we need to get prices and cohorts for a product batch (illustrated after this list).
  2. Beyond the additional join, this style of scaling is not ideal: every new type of segregation will require adding yet another table, plus joins whenever filters are needed at that level.
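
For illustration, a price lookup under this split might read as follows; the identifiers are assumptions, but the extra join is the point:

    -- Hypothetical lookup: every price read now pays for one more join.
    SELECT bc.batch_id, bc.city_id, bcp.customer_cohort,
           bcp.price_actual, bcp.price_discounted
    FROM batch_city bc
    JOIN batch_city_price bcp ON bcp.batch_city_id = bc.id
    WHERE bc.batch_id = 42 AND bc.city_id = 7;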

Approach 2: Put Customer_Cohort level information in a JSON field

Image By Author
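
A minimal sketch of this approach, assuming two JSON columns named customer_cohorts and cohort_prices (the actual field names and layout may differ):

    -- Hypothetical Approach 2: cohort data folded into JSON columns
    -- on the existing batch_city table.
    ALTER TABLE batch_city
        ADD COLUMN customer_cohorts JSON,  -- e.g. ["GENERIC", "PREMIUM"]
        ADD COLUMN cohort_prices    JSON;  -- e.g. {"GENERIC": {"price_actual": 100.0, "price_discounted": 90.0}}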

Positives of this approach

  1. Can avoid the extra join.
  2. By keeping separate JSON fields for customer_cohorts and pricing, we can do less JSON parsing and lookup when we only need the available cohorts for a batch.

Side-effects of this approach

  1. Difficult to search and index, since indexes directly on JSON columns are not supported out of the box in MySQL, the database we were using.
  2. We have to parse JSON for every product to get its cohorts and prices on each lookup (illustrated after this list).
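
To make that parsing cost concrete, a hypothetical read for one cohort's price would unpack the JSON document on every matching row (the paths follow the sketch above):

    -- Hypothetical read path: the JSON document is parsed on each lookup.
    SELECT batch_id, city_id,
           cohort_prices->>'$.GENERIC.price_actual'     AS price_actual,
           cohort_prices->>'$.GENERIC.price_discounted' AS price_discounted
    FROM batch_city
    WHERE batch_id = 42 AND city_id = 7;

MySQL can index JSON values indirectly through generated columns, but that is extra schema machinery rather than out-of-the-box support.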

Approach 3: Denormalisation

Some of you might have predicted this as a possible approach, since it's in the title, and this is what we went with. The idea was to denormalize batch_city: for every batch and city combination, a new row is added per customer_cohort.

Image by author
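
A hedged sketch of the denormalized table, with assumed names and types: customer_cohort becomes a plain column, and the unique key widens to cover it:

    -- Hypothetical Approach 3: one row per (batch, city, cohort) combination;
    -- the composite index serves filtered lookups without any joins.
    CREATE TABLE batch_city (
        id               BIGINT AUTO_INCREMENT PRIMARY KEY,
        batch_id         BIGINT NOT NULL,
        city_id          BIGINT NOT NULL,
        customer_cohort  VARCHAR(32) NOT NULL,
        price_actual     DECIMAL(12, 2) NOT NULL,
        price_discounted DECIMAL(12, 2),
        UNIQUE KEY uk_batch_city_cohort (batch_id, city_id, customer_cohort)
    );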

Positives of this approach

  1. No joins.
  2. We can keep adding further segregation levels without much effect on latency, since lookups can still be served by composite indexes (at the cost of increased space).
  3. Ability to roll back in case a segregation level is no longer required. For example, choose a source-of-truth customer_cohort, like GENERIC, delete the rows for all other cohorts, then drop the customer_cohort column (sketched after this list).
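
Under the same assumed schema, the rollback in point 3 could be as simple as:

    -- Hypothetical rollback: keep GENERIC as the source of truth,
    -- drop the other cohorts' rows, then drop the column itself.
    DELETE FROM batch_city WHERE customer_cohort <> 'GENERIC';
    ALTER TABLE batch_city DROP COLUMN customer_cohort;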

Side-effects of this approach

  1. A rapidly growing vertical table: each new customer cohort, or any additional filtering column for a product, multiplies the number of rows, and with it the storage requirements.
  2. Increased complexity in application code, since all updates and insertions into denormalized tables must be handled differently.

Approach 3.1: Division into data and filtering tables

We divide the denormalized batch_city table into a filtering table called batch_city_search and a data table called batch_city_data.

Image by Author, table 1 is batch_city_search and table 2 is batch_city_data
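
Sketched with assumed names and types, the split might look like this; since the data table carries no batch or city identifiers, identical price pairs can share a single data row:

    -- Hypothetical Approach 3.1: filter columns live in batch_city_search,
    -- the payload lives in batch_city_data, joined one-to-one via data_id.
    CREATE TABLE batch_city_data (
        id               BIGINT AUTO_INCREMENT PRIMARY KEY,
        price_actual     DECIMAL(12, 2) NOT NULL,
        price_discounted DECIMAL(12, 2)
    );

    CREATE TABLE batch_city_search (
        batch_id         BIGINT NOT NULL,
        city_id          BIGINT NOT NULL,
        customer_cohort  VARCHAR(32) NOT NULL,
        data_id          BIGINT NOT NULL,  -- references batch_city_data(id)
        PRIMARY KEY (batch_id, city_id, customer_cohort)
    );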

Positives of this approach

  1. Only one-to-one joins.
  2. Easy to add a new batch_city_data row and a new column in the batch_city_search table.
  3. Reusable data rows.
  4. By vertically slicing some columns and then compressing them (through reuse), we can somewhat mitigate the storage requirements.

Side-effects of this approach

  1. A rapidly growing vertical search table as new filtering columns are added for a product; this would still remain a concern.

Approach 3.2: Opus Magnum, a single column

By encoding the data table's columns into a single string, for example by passing them through a hash function and storing the resulting hash string as a single column, we save space even further.

encode(batch_city_data_fields) = hash_result
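
One hypothetical realisation of this encode step in MySQL; SHA2 and the field list are assumptions, not the actual implementation:

    -- Hypothetical encoding: collapse the data fields into one deterministic
    -- digest, so identical payloads always map to the same string.
    SELECT SHA2(CONCAT_WS('|', price_actual, price_discounted), 256) AS hash_result
    FROM batch_city_data;

Note that a hash is one-way: the application would still need the original field values stored somewhere, or a reversible serialisation instead of a hash, to read the data back.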

Positives of this approach

  1. Further mitigates storage issues by keeping only a single column.

Side-effects of this approach

  1. Complex application code, since every field must be encoded on writes and resolved back on reads.

Final Approach

In the end, we decided to go with approach 3, not including 3.1 or 3.2, since storage was not much of a concern for us at the time. We put off the storage optimizations until storage constraints become a legitimate issue.

Disclaimer:

Bazaar Technologies believes in sharing knowledge and freedom of expression, and it encourages its colleagues and friends to share knowledge, experiences, and opinions in written form on its Medium publication, in the hope that some people across the globe might find the content helpful. However, the content shared in this post and other posts on this publication mostly describes and highlights the opinions of the authors, which might or might not be the actual and official perspective of Bazaar Technologies.
