Schema Management at Udemy

Rusty Larner
Udemy Tech Blog · Jul 31, 2020

Managing database schemas while trying to practice CI/CD is difficult. In this blog post, we will explore some of the challenges and some of the ways Udemy is tackling them. If you are attempting to integrate database schema changes into your CI/CD pipeline, or already have, read on!

The prime principle of CI/CD is that changes developers make should be shipped to production easily and quickly, while still allowing your customers to use your service. When you are only shipping code, this has some well-known solutions, such as blue-green or rolling updates and backward-compatible APIs. However, when you need to adjust underlying database schemas, this can quickly become an extremely long process, with increased risk of production problems if things go wrong. Adding or changing fields on large or heavily used tables without triggering production downtime is difficult and time-consuming, and once those changes are complete, ensuring that they are useful and safe is extremely important.

Udemy tries to push the bounds of CI/CD as much as possible. We apply up to 10 database schema changes every day, with a median of about 3. Udemy has gone through several stages of how to apply schema changes: from a bespoke server rotation method, to pt-online-schema-change, to now using gh-ost. (Hint for others who are just starting down this road and don’t know the consequences: just use gh-ost. It has far less production impact at any real scale.) However, tooling alone does not tackle all of the challenges of CI/CD.
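
To make this concrete, here is a minimal sketch of kicking off a gh-ost migration from a Python wrapper. The flags shown are real gh-ost options, but the host, schema, table, and throttling values are placeholders rather than Udemy’s actual configuration.

```python
import subprocess

# Hypothetical gh-ost invocation; connection details and limits are placeholders.
subprocess.run(
    [
        "gh-ost",
        "--host=replica.db.example.internal",  # gh-ost works off a replica by default
        "--database=udemy_example",
        "--table=course",
        "--alter=ADD COLUMN published_at TIMESTAMP NULL",
        "--chunk-size=1000",                   # rows copied per batch
        "--max-load=Threads_running=25",       # throttle when MySQL is under load
        "--execute",                           # omit this flag for a dry run
    ],
    check=True,
)
```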

Challenges

When updating databases, it is critical to ensure that the new schema is backward compatible with previous versions of your code. Both blue-green and rolling releases run multiple versions of your software at the same time, and all of those versions need to be able to talk to the same underlying database. This holds true for monolithic codebases as well as microservices.

Udemy has a combination of a Django-based monolithic codebase and a growing set of microservices, all residing within a monorepo. They are all backed by a set of MySQL clusters (among other data stores). Since services may be written in multiple languages, we standardized on SQL for production schema alters. However, having SQL separate from ORM model definitions can lead to divergence between what the code expects and what the database contains. For example, a Django model may declare a field as a CharField(max_length=60) while the database has a VARCHAR(30). This same divergence can happen with any ORM, in any language, in our infrastructure.
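
As a hypothetical illustration (the model, field, and table names here are invented, not from our codebase), the divergence looks like this:

```python
from django.db import models

# What the code believes: up to 60 characters are allowed.
class Instructor(models.Model):
    display_name = models.CharField(max_length=60)

# What production actually has, created from a separate SQL file:
#   ALTER TABLE instructor ADD COLUMN display_name VARCHAR(30) NOT NULL;
# Values between 31 and 60 characters pass Django validation but are
# rejected (or silently truncated, depending on sql_mode) by MySQL.
```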

In addition, there is a large timing issue around how schemas actually flow through the CI/CD pipeline. If the schema alters live in the same pull request as the code that uses them, we do not want multi-hour (or sometimes even multi-day) schema alters to block our CI/CD pipeline. Even if we added a custom ORM integration with gh-ost, we do not want developers to merge their feature to master and then have the pipeline blocked for hours or days while their schema alter is applied.

If the schema alter is in the same PR as the code and we allow schema alters to be performed from branches prior to merging, a different problem emerges. There is a chance that the developer may abandon their PR before merging their feature to master — thus leaving the production schema diverged from what the master branch expects. Conversely, if schema alters are dealt with in separate PRs from the code that uses them, then it becomes excessively difficult to ensure the schema will actually be useful or even used.

Udemy is also quickly scaling our engineering organization. This gives us two separate impulses: we want to ensure that any ‘standards’ are backed by tests that help guide new engineers towards following them, and we need to increase our automation as more projects being worked on in parallel increase the schema alter workload.

Solutions

Udemy has created several schema standards to ensure that all schemas are backward compatible. These include:

  1. Never rename fields; instead, create new fields. Have code that updates both but reads from the ‘old’ field, then ship code that reads from the ‘new’ field while updating both, then ship code that only updates the ‘new’ field, and finally flag the old field for deletion (see the sketch after this list). Yes, it is a pain (a minimum of 3 PRs, assuming schema alters are allowed in code PRs; 5 if not), but this is similar to the work that would be done to update APIs used by JavaScript, and is one of the known costs of implementing CI/CD.
  2. All clients must specify the list of fields they want to access. Never use ‘SELECT * FROM …’, because the field list may change. If expected fields are not there, you want your code to fail. (Yes, there are ways to work around this. However, it is safest to go this route rather than deal with dynamic field detection and error handling at a higher level.)
  3. If you are adding a field to an existing table, it must either have a DEFAULT or allow NULL. Previous versions of the code will not be populating the field. Once you have deployed the initial code and the field is now updating appropriately, a later PR can remove the DEFAULT or set it to NOT NULL. This follows similar stages to #1 above.
  4. While schema alters that create or update fields or tables are applied before the code that uses them is merged to master, schema alters that delete fields or tables are applied after the code that stops using them is merged to master and shipped. Schema alters that both create/update and delete are not allowed in the same pull request.
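
Here is the sketch promised in rule #1: a minimal Django illustration of the staged rename, with hypothetical model and field names. The real process is spread across at least three PRs, as described above.

```python
from django.db import models

# PR 1: add the 'new' column (nullable, per rule #3), write both, read 'old'.
class Course(models.Model):
    title = models.CharField(max_length=120)                # 'old' field
    headline = models.CharField(max_length=120, null=True)  # 'new' field

    def save(self, *args, **kwargs):
        self.headline = self.title  # double-write keeps the fields in sync
        super().save(*args, **kwargs)

    @property
    def display_title(self):
        return self.title  # still reading the 'old' field

# PR 2: switch reads to 'headline' while continuing to double-write.
# PR 3: stop writing 'title' and flag it for deletion; per rule #4, the
#       DROP COLUMN alter runs only after this code has shipped.
```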

Following these rules, we have not had significant downtime due to production schema alters at Udemy in several years. At first, these standards were enforced only via code review. However, this increased the amount of time and attention needed for reviews, which has become more of a pain point as Udemy grows. We recently introduced a SQL lint-like check to automate checking against these rules and others, and a schema validation check to ensure our code and schema actually agree.

For the lint check, we look not just at the backward compatibility rules but also at several other standards. For example, we have a rule that all fields should be TIMESTAMP instead of DATETIME to avoid problems around timezone translations. The lint check only needs to check newly added files, so it is a very fast check. We can easily run it as part of a pre-commit check, which greatly shortens the feedback loop for developers. (We also run it during normal test runs on Jenkins, just in case someone hasn’t set up their pre-commit checks properly or decides to skip them.) Unfortunately, we found no prebuilt MySQL lint tools, so we added tooling around the sqlparse Python library. sqlparse is designed (and tested) more for ‘SELECT …’ queries than for ‘ALTER TABLE …’ statements; the latter do work, but we experienced many oddities in the parsed structure, and luckily, extensive unit tests for the linter ensure we get it right.
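
A minimal sketch of this kind of check is below; it uses sqlparse’s token API, but it is a simplified illustration, not Udemy’s actual linter.

```python
import sqlparse

def lint_schema_file(sql_text: str) -> list:
    """Flag DATETIME columns; our standard is TIMESTAMP (see above)."""
    errors = []
    for statement in sqlparse.parse(sql_text):
        # flatten() yields the leaf tokens of the parsed statement.
        for token in statement.flatten():
            if token.value.upper() == "DATETIME":
                errors.append(
                    "use TIMESTAMP instead of DATETIME in: "
                    + str(statement).strip()
                )
    return errors

print(lint_schema_file("ALTER TABLE course ADD COLUMN published_at DATETIME;"))
# ['use TIMESTAMP instead of DATETIME in: ALTER TABLE course ADD COLUMN published_at DATETIME;']
```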

To ensure that our code matches the database schema, we introduced a schema validation test that uses Django’s existing tooling to create tables from the model definitions under adjusted table names. We then use the SQL files to create the production-like schema, and use MySQL’s INFORMATION_SCHEMA metadata tables to compare the two. (Sadly, an initial attempt to just compare the table creation strings ran into problems, since there are multiple ways to create tables and columns that are identical in use but have different string representations; for example, column attributes like NOT NULL are not order-sensitive, but a straight string compare will fail if they are not in the same order.) This also required a ‘known-bad’ list of existing models and tables that did not match; however, we are working to reduce this list. Along the way, we ironed out places where engineers followed our SQL rules but did not match them in Django; for example, Django did not have a TIMESTAMP-equivalent model field, and we needed to create one.
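
A condensed sketch of the comparison step follows, assuming a live MySQL connection; the columns selected and the function names are illustrative, not our production test.

```python
import MySQLdb  # mysqlclient, the driver Django's MySQL backend uses

COLUMNS_SQL = """
    SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH,
           IS_NULLABLE, COLUMN_DEFAULT
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s
"""

def column_map(cursor, schema, table):
    """Map column name -> (type, length, nullability, default)."""
    cursor.execute(COLUMNS_SQL, (schema, table))
    return {row[0]: row[1:] for row in cursor.fetchall()}

def diff_tables(cursor, schema, orm_table, sql_table):
    """Compare the ORM-created table (adjusted name) against the table
    built from the SQL files; return per-column mismatches."""
    orm_cols = column_map(cursor, schema, orm_table)
    sql_cols = column_map(cursor, schema, sql_table)
    return {
        name: (orm_cols.get(name), sql_cols.get(name))
        for name in set(orm_cols) | set(sql_cols)
        if orm_cols.get(name) != sql_cols.get(name)
    }
```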

This schema validation step runs directly into the CI/CD timing problem. It requires engineers to put their schema alongside their code (to ensure they match). But then we must either apply the schema from a branch or block the CI/CD pipeline while the schema alter runs. To make this clearer, let’s cover some examples.

Engineer creates PR with code and schema ‘ADD COLUMN’ alter side-by-side

‘Simple’ CI/CD ‘best-practices’ Flow

  1. Schema validation ensures code matches final schema. Schema and code match, so time to apply the schema!
  2. Engineer merges PR to master (following CI/CD best practices).
  3. Release process sees the schema alter and starts updating the database schema.
  4. … 24 hours later, schema alter still in flight. Release queue blocked (as code on master branch cannot be shipped, since it depends on schema alter). Engineering team revolts.

Non-CI/CD ‘best practices’ Flow

  1. Schema validation ensures code matches final schema. Schema and code match, so time to apply the schema!
  2. Engineer notifies schema alter process of available schema alter (after PR “finalized”).
  3. Schema alter process applies the schema alter from the branch.
  4. 24 hours later, schema alter process notifies engineer of completion.
  5. Engineer merges PR to master (following CI/CD best practices).
  6. Note: if other code merged to master conflicts with the engineer’s PR so that the schema no longer makes sense, this is a problem.

Engineer creates PR with code and schema ‘DROP COLUMN’ alter side-by-side

‘Simple’ CI/CD ‘best-practices’ Flow

  1. Schema validation ensures code matches final schema. Schema and code match, so time to apply the schema!
  2. Engineer merges PR to master (following CI/CD best practices).
  3. Release process sees the DROP COLUMN alter, releases the new code, then starts updating the database schema.
  4. … 24 hours later, schema alter still in flight. Release queue blocked (as code on master branch cannot be shipped, since it depends on the schema alter). A bug is discovered in the PR, but the fix cannot be released because the schema alter is still in flight. Customers revolt.
  5. Note: in theory, the schema alter does not need to block the release queue, except a later release may attempt to alter the same table (say, for a bug fix).

Non-CI/CD ‘best practices’ Flow

  1. Schema validation ensures code matches final schema. Schema and code match, so time to apply the schema!
  2. Engineer notifies schema alter process of available schema alter (after PR “finalized”).
  3. Engineer merges PR to master (following CI/CD best practices).
  4. Schema alter process applies the schema alter from the branch.

Engineer creates PR with schema, and separate PR with code (based on schema PR)

‘Simple’ CI/CD ‘best-practices’ Flow

  1. Schema validation for the schema PR fails, because the unchanged code no longer matches the new schema.
  2. Engineer adds the table to the ‘known-bad’ list.
  3. In theory, the schema PR could instead link to the code PR, and the CI pipeline could check out that PR and use its code to compare against the schema PR; we have not yet wanted to add this level of indirection.
  4. Merge schema PR to master (the table is now in the ‘known-bad’ list, so the code may start diverging; alternatively, any future CI runs would need to check the alternate code PR for the given table and ensure the master branch is backward compatible).

Currently, Udemy has come down on the side of applying schemas from combined pull requests once engineers are sure they are done with the schema design. This does have problems when the schema alter is applied but the PR containing it is then abandoned; in practice, this does not happen often. We plan to evolve this into a semi-automated multi-PR process: a ‘submit_schema’ script will extract the schema from a pull request into a separate PR that also adds the adjusted tables to a backward-compatible list (if the alter is a create or update; if a delete, it will remove them from the list). The code PR will then be adjusted to update the backward-compatible list appropriately. We will also have GitHub hooks to ensure that, if the code PR is abandoned, a PR to remove the schema update is generated, and monitors to warn if tables stay on the backward-compatible list for too long.

We are also concerned with spending the right amount of effort on our automation work. For example, we currently only test code against the ‘current’ version of the database schema in the pull request. To absolutely ensure schema backward compatibility, we could also run the current release branch against the updated schema from the pull request. Since this would essentially double the number of integration tests we run for every PR with a schema alter, it would require a noticeable increase in the size of our testing infrastructure. So far, following our backward compatibility rules has been sufficient.

Of course, even with these rules and validations, there is more automation in our future. For example, we currently use GitHub’s CODEOWNERS files to ensure that any pull request containing schema alters needs approval from our database specialist team. This allows those specialists to complete any checks that we haven’t yet automated, as well as actually apply the schema to production. However, we’d like any non-controversial schemas to be automatically queued for production application once their PR is approved. We will also be expanding the schema validation to work for the other languages that our services support. There is always more work to be done in this area; if you’re looking for this type of work, please get in touch with recruiting@udemy.com!
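
For reference, the CODEOWNERS mechanics look roughly like this; the path and team name are assumptions, not our actual repository layout.

```
# Hypothetical path and team name: any change to schema alter files
# requires review from the database specialist team.
schemas/**/*.sql  @udemy/database-specialists
```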
