How To Rollout Applicative DB Changes — Tales From The Trenches

Stas Wishnevetsky · Wix Engineering
May 2, 2019 · 5 min read

Every new service goes through a design process, and if the service needs to persist data there comes a point where we design its DB model. That model remains a draft idea of the DB structure up until the moment we save the first record to our database.
From then on, our code is committed to a specific DB model that represents the business model. Saved DB records get updated and deleted, and any attempt to change the DB model requires careful handling and potentially a migration procedure.

DB Changes are different from code changes

  1. DB changes take a long time to execute: the bigger the dataset, the longer it takes to rewrite the data on disk.
  2. When saving user-generated data, our DB accumulates many permutations of state over time, some of which might even seem impossible according to the current code. Keep in mind that code evolves but data remains.
  3. Data corruption is very hard to recover from, as it requires understanding the corrupted data and reconciling it back to a valid state that is no longer present.
  4. It’s complicated to roll back a migration process that is already in flight.

DB changes are a big potential point of failure — and so they require extra planning.


A Real-life trench story of modifying column data type

In a billing system I develop, which uses a MySQL database, there is a carts table with an auto-incrementing cart_id counter.
I wanted to change its data type from INT(11) to BIGINT as preparation for future growth, while the table is still relatively small (~50 million records), since it’s easier to do now than later.

The actual impact on users:

  • 2:27 hours of purchase page submit downtime: the page loaded, but users failed to submit purchases
  • 1038 unique users failed to submit a purchase
  • Time to detect the issue by an automated alert: 8 minutes

Table definition before migration:

CREATE TABLE IF NOT EXISTS carts (
  cart_id INT(11) NOT NULL AUTO_INCREMENT,
  date_created DATETIME DEFAULT NULL,
  date_updated TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  user_id VARCHAR(64) NOT NULL,
  description MEDIUMTEXT,
  status SMALLINT(6) NOT NULL,
  amount DECIMAL(10, 2),
  PRIMARY KEY (cart_id),
  KEY user_id (user_id),
  KEY status (status)
);

What was executed:

ALTER TABLE carts MODIFY cart_id BIGINT NOT NULL;

What should have been executed:

ALTER TABLE carts MODIFY cart_id BIGINT NOT NULL AUTO_INCREMENT;
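
One cheap safeguard is to verify the resulting column definition right after a change like this; a minimal sketch using standard MySQL introspection (my suggestion, not part of the original procedure):

SHOW CREATE TABLE carts;

-- Or inspect just the relevant column; EXTRA should still list 'auto_increment':
SELECT COLUMN_TYPE, IS_NULLABLE, EXTRA
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'carts'
  AND COLUMN_NAME = 'cart_id';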

Root Cause Analysis

The root cause of this incident was plain old human error: a mix of missing knowledge and miscommunication. AUTO_INCREMENT was missing from the ALTER TABLE statement, so the generation strategy for the NOT NULL cart_id column was lost. As a result, all insert operations failed with a constraint violation.

The ALTER took 2 hours due to the large number of rows, and the moment it finished every insert failed, because cart_id was no longer generated and the NOT NULL constraint was violated.
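
To make the failure mode concrete, here is a sketch of the kind of insert the application was issuing (the values are made up for illustration); once AUTO_INCREMENT was gone, MySQL had no way to fill in cart_id:

INSERT INTO carts (date_created, user_id, status, amount)
VALUES (NOW(), 'some-user-id', 1, 9.99);
-- cart_id is omitted on purpose: the application relied on the DB to generate it.
-- Without AUTO_INCREMENT, cart_id is NOT NULL with no default, so in strict SQL
-- mode this fails with an error along the lines of:
-- ERROR 1364 (HY000): Field 'cart_id' doesn't have a default value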

Mitigation

To recover, we executed the correct ALTER TABLE command that restored the AUTO_INCREMENT strategy. That took another 2 hours, the same time the original ALTER took.

After the full recovery, we focused on user impact. We identified the 1037 affected users and noticed that 50% of them repurchased within the next 12 hours.
We also launched a targeted campaign to the remaining users that explained the technical issue they hit when they tried to purchase and offered them a complimentary discount coupon. This allowed us to reduce the branding hit and restore our users’ confidence in our system.

Until the singularity replaces us humans as coders, we should learn from incidents caused by human error to prevent them from happening again.

Lessons Learned

  • Prepare: Consult with other developers and/or a DBA about the changes you plan to run, preferably someone not from your team.
  • Tests: Make sure your tests run in the same environment as production and that the DDL is identical to what you plan to run in production. Make sure your DB change is covered by tests; if possible, start with a failing test. Test the possible permutations of backward/forward compatibility of your data before and after the DB change.
  • DB logic: Try to avoid generating values in the DB, such as auto-increment sequences. Such cases are hard to test, and your application code can have better control over them.
  • Audit: If applicable, keep a log of the data being migrated and the original request.
  • Feature toggle: If possible, expose the DB change only behind a feature toggle to reduce the impact in case of failure.
  • Backup: Before starting a migration process, make sure you have a recent backup of your database.
  • Timing: Choose the right time for it. Team availability or traffic trends might affect timing.
  • Notify: Let stakeholders know about the planned change in advance; for example, your change might affect a downstream ETL process.
  • Monitor progress: Make sure you can measure the progress of the change you make (see the sketch after this list).
  • Monitor SLA: Make sure you have alerting in place that will let you know as soon as any SLA in your system is affected.
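
For the “Monitor progress” point, one option (assuming MySQL 5.7+ with the Performance Schema available; a general sketch, not necessarily what we used) is to track InnoDB ALTER TABLE stage progress:

-- One-time setup: enable the relevant instruments and consumers.
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES'
WHERE NAME LIKE 'stage/innodb/alter%';

UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME LIKE 'events_stages_%';

-- While the ALTER runs, poll the current stage and its estimated progress:
SELECT EVENT_NAME, WORK_COMPLETED, WORK_ESTIMATED
FROM performance_schema.events_stages_current;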

Backward / Forward compatibility

The change is not atomic, which complicates the process: while it runs, each record can be in either of the two versions, and that record might be in the middle of another business flow. As a result, everything you know about backward/forward compatibility best practices is taken to the extreme.

In some cases, splitting the change into a few steps can reduce risk: create a new column and copy the migrated data there. This way you don’t lose the old data, and you can control applicatively which column read/write operations go to, until you deprecate the old column completely.
Always stick to the best practices of backward/forward compatibility.
You can also consider running the applicative migration as a lazy / self-healing migration, where each record is migrated the next time it is read or written.
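
A minimal sketch of that split-into-steps (expand/backfill/contract) approach, using a made-up amount_v2 column purely for illustration:

-- 1. Expand: add the new column next to the old one.
ALTER TABLE carts ADD COLUMN amount_v2 DECIMAL(19, 4) DEFAULT NULL;

-- 2. Backfill in small batches so the table stays responsive;
--    repeat until the statement updates 0 rows.
UPDATE carts
SET amount_v2 = amount
WHERE amount_v2 IS NULL
LIMIT 10000;

-- 3. The application dual-writes to both columns and, behind a feature toggle,
--    gradually moves reads over to amount_v2.

-- 4. Contract: once nothing reads the old column anymore, drop it.
ALTER TABLE carts DROP COLUMN amount;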

Summary

In a web-scale billing system “the spice must flow”: it’s like a store that should remain open 24/7. Two hours of purchase page downtime is a very bad hit to the SLA.

DB changes are not like any other code change, they require extra planning as recovery is much more difficult and sometimes not possible at all.

Handle your DB with care…

COMMIT

Stas Wishnevetsky
Wix Engineering

Engineering Manager @ Melio, amateur father. Likes pizza, production incidents and copy editing. Twitter: @wishstas LI: https://www.linkedin.com/in/staswish/