The Database Design Change Blues

Performing in the Key of Python

Steven F. Lott
Capital One Tech
May 8, 2019


Once upon a time, I played bass in a bar-band. We had some medley transitions between songs that — to us — were signature rock ’n’ roll. For me, however, the medleys were gut-wrenching. Some nights there would be a tiny moment when I was thundering along with one beat, and I suddenly began to wonder where I was in the song and what was going to happen next. Was this the intro to Gloria? Or was this the outro?

To make it a little worse for me, whenever we rearranged the set lists during the first gig I would just draw a total blank. I knew what we were doing, but I couldn’t quite locate where I was on the roadmap from here to the big finish at the end of the set.

But this is really a story about SQL database design and how Python can come to the rescue. For me, Python is the trusty keyboard player who can transpose in her head into whatever key we wound up in. This was a moment of musical chaos that resolved into the right song in spite of the bass player losing his place.

SQL Databases as Songwriting

To me, a database is a song actively being played. The music of the database is a living thing with the performers and audience all bound together. The schema of a database is the chords and lyrics of the song. Imagine the chaos when the band stops, tunes, argues over the solo, and then restarts. That’s fine for rehearsal (or in IT parlance, “development”) but that’s not what the audience came to hear. We can, however, modulate the song as part of the performance.

Soloing on this analogy, I have two distinct views of a schema change:

  • It’s a change to the song. We need to be prepared and shift to the new chord progression smoothly. There’s a longer arc behind the music, and the part we’re playing now fits into a bigger, more complex piece. For me, this is how the long-form music by the old-school progressive rock bands like Yes feels.
  • It’s a new song. We’re going to smoothly move from the old song to the new song even if they’re different. This feels like a jam-band, like the Grateful Dead’s approach to playing a medley of songs with free-form transitions between the songs.

There are tools to help manage schema migration. The essential concept of migrating from one SQL schema to a new schema is something we’ve been doing for decades. We might use a tool like Flyway or Liquibase to help manage this. Since I’m a Python person, using Alembic to manage changes to a SQLAlchemy schema sounds good to me.
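
To set the stage, a single Alembic migration might look something like the following sketch. The revision identifiers, table, and column names are hypothetical; a real script would be generated by alembic revision and filled in:

from alembic import op
import sqlalchemy as sa

# Hypothetical revision identifiers; `alembic revision` generates real ones.
revision = "a1b2c3d4e5f6"
down_revision = None


def upgrade():
    # Modulate the song mid-performance: add a column without stopping play.
    op.add_column("track", sa.Column("status", sa.String(20), nullable=True))


def downgrade():
    # Return to the previous arrangement.
    op.drop_column("track", "status")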

What can we do when we’ve tried really, really hard to keep the database upgrade scripts under configuration control… (drum break) but… (drum break, big chords) We. Still. Have. Doubts?

The song is playing, the database is in use. The drummer has just given me a meaningful look. Something’s supposed to happen, and… Am I supposed to drop the new beat? What song are we transitioning to anyway?

The Essential Problem

Whether or not you need to rebuild a production database from the very beginning depends a lot on the context in which your database is being used. For installable products, the installation script has to build the current release of the database schema. For enterprise information systems, however, a production database at any given moment is the end result of a sequence of schema changes. It's a bit like a very, very long song, in some cases reaching back years.

Our latest epic was to prepare an application for wider internal distribution. This revealed a possible disconnect among the players in my sprint team. The current, production state of the schema and the repository of scripts were plagued with uncertainty, fear, and doubt. The beat was suddenly hard to follow. Most, but not all, of the incremental change scripts had similar-looking names following the date-slug.sql pattern. When there were dates, they followed a year-month-day format. The slugs tended to be descriptive. But when we saw some anomalous names, we weren't sure these scripts would build a production-like clone. Who's taking the solo? If we don't choose in the next three beats, there will be an awkward-sounding silence.

There was a separate repository with a "one-time startup" script, but it was old and appeared to be incomplete. This script muddied the waters. We were blind, and playing by ear. The database had become like a rolling stone with no direction home. We needed to examine the SQL in detail to see what was going on.

Python To The Rescue

We have a sequence of files in our GitHub repo. We can, therefore, make git log requests to find the history of each file. For the most part, the initial commit is the relevant date for keeping things in order. However, sometimes the other dates are helpful, too.

We’ll define a handy Python NamedTuple subclass to hold the dates and the path. Then we can extract the log details for a given path in the git repository:

from typing import NamedTuple
from pathlib import Path
import datetime
import subprocess


class Source(NamedTuple):
    first_date: datetime.datetime
    last_date: datetime.datetime
    path: Path


def git_dates(path):
    """First and last commit dates for a given path."""
    command = [
        'git', '--no-pager', 'log', '--pretty=format:%at|%s', '--',
        str(path)]
    output = subprocess.run(
        command, encoding='utf-8', stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT, check=True
    )

    def parse_log(text):
        # Each line is "<unix timestamp>|<commit message>".
        timestamp_unix, message = text.split('|', 1)
        timestamp = datetime.datetime.fromtimestamp(
            float(timestamp_unix))
        return timestamp, message

    commit_log = list(map(parse_log, output.stdout.splitlines()))
    first = commit_log[-1]  # git log is newest-first, so [-1] is the oldest commit
    last = commit_log[0]
    return Source(first[0], last[0], path)

The NamedTuple subclass, Source, is a handy way to bundle three pieces of data together: the first commit date, the last commit date, and the path. Now that we have Python 3.7, using a frozen dataclass is almost exactly as nice as the NamedTuple.
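
For comparison, here's a sketch of the same bundle as a frozen dataclass; order=True keeps the sort-by-first-commit-date behavior the tuple gives for free:

from dataclasses import dataclass
import datetime
from pathlib import Path


@dataclass(frozen=True, order=True)
class Source:
    first_date: datetime.datetime
    last_date: datetime.datetime
    path: Path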

The git_dates() function runs the git log command for a particular file to give us all of the authorship dates. (The exact definition of the %at format is “author date, UNIX timestamp.”) Each line of the output will have two values, separated by the pipe character. The internal parse_log() function splits the two values apart, converting the timestamp to a proper datetime.datetime object.

Because git log emits the newest commit first, creating a list of the parsed timestamps makes it pleasant to use Python's [-1] index for the last entry in the list, which is chronologically first. The [0] entry is the first in the list, chronologically last in the file's history. While it might make sense to request the dates in proper order, the musical inversion of the date ordering is only a little confusing. The last commit tends to reflect bug fixes and pull-request finalization; it doesn't tell us when the SQL was first designed. The first commit better reflects the design of the SQL.
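
A hypothetical parsed log shows the inversion; the dates and messages here are invented for illustration:

import datetime

# git log emits the newest commit first, so the parsed list is reverse-chronological.
commit_log = [
    (datetime.datetime(2019, 4, 30), "Fix typo in column default"),   # most recent commit
    (datetime.datetime(2018, 11, 2), "Add customer address table"),   # oldest commit
]
last = commit_log[0]    # chronologically last: pull-request cleanup and bug fixes
first = commit_log[-1]  # chronologically first: when the SQL was originally designed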

While there are several libraries for git access from Python, I’m partial to using subprocess.run(). It provides a simple, portable way to do this kind of one-time data collection.

Standardized Names

The various script names can be standardized with a function like the following:

import re

pat_1 = re.compile(r"^(\d{4}-\d\d-\d\d)_?-_?(.*)$")


def preferred_name(source):
    """Pick a (migration_id, slug) pair for a source file."""
    m1 = pat_1.match(source.path.name)
    if m1:
        migration_id, slug = m1.group(1), m1.group(2)
    else:
        migration_id, slug = source.first_date.strftime("%Y-%m-%d"), source.path.name
    return migration_id, slug.replace("_", "-")

The pat_1 regular expression matches the existing file names. There is a mixture of optional _ and - characters in the file name. The regular expression located all the variations properly. I didn't need a pat_2 because, well, odd as the files looked, the regular expression matched them all.

The preferred_name() function selects among two alternatives for the preferred name:

  • If the file had a date-like prefix, this is honored.
  • If the file didn’t have a date-like prefix, the first git authorship date is used.

Other policies could make sense here, depending on a work-group’s use of git. This particular one seemed to produce a correct sequence of updates in spite of irregular names.
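
Here's how the two policies play out on a couple of invented file names, reusing the Source class and imports from earlier (the dates and names are hypothetical):

dated = Source(
    first_date=datetime.datetime(2018, 6, 10),
    last_date=datetime.datetime(2018, 7, 1),
    path=Path("2018-06-04-add_customer_table.sql"),
)
undated = Source(
    first_date=datetime.datetime(2019, 1, 15),
    last_date=datetime.datetime(2019, 2, 1),
    path=Path("fix_index.sql"),
)
preferred_name(dated)    # ('2018-06-04', 'add-customer-table.sql')  # date prefix honored
preferred_name(undated)  # ('2019-01-15', 'fix-index.sql')  # first git authorship date used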

The possibility of complex name changes feels like the clever little chord substitutions that happen in longer songs. A song with more than two or three verses will often have some changes thrown in. A jazz arranger might replace CMaj7 with Em7 to avoid simplistic repetition. It's fun in a jazz club setting. It's not fun when building a database.

Keyboard Solo

This isn’t all that was involved, of course. The SQL scripts were actually part of two schema definitions. There was no easy way to distinguish based on the file names. We had to peer inside each file to look for table names buried in the SQL. This can be decomposed into a pair of regular expressions to check for any of the table names.

schema_a = re.compile(r"\W(TABLE1|TABLE2|TABLE3)\W", re.M)
schema_b = re.compile(r"\W(TABLE4|TABLE5|TABLE6)\W", re.M)

These two patterns become part of a get_schema() function that decides which pattern matched.

There's some subtlety to this. At first, it seemed like literals or comments could happen to contain values that would confuse the table matching. It seemed appropriate to provide a little flexibility by returning a count of tables in each schema. Consider the following:

text = source.path.read_text().upper()

def to_pat(names):
    return re.compile(rf"\W({'|'.join(names)})\W", re.M | re.I)

pa = to_pat(SCHEMA_A_NAMES)
pb = to_pat(SCHEMA_B_NAMES)
pa_tables = set(t.group(1) for t in pa.finditer(text))
pb_tables = set(t.group(1) for t in pb.finditer(text))

This builds a regular expression pattern from a list of table names. We can then apply the pattern with the finditer() method to locate all of the matches. The group(1) value of each match object is a table name; pa_tables and pb_tables are the sets of tables from each schema actually named in the SQL script.

We can use len(pa_tables) and len(pb_tables) to assess the likely content. It turns out all of our files had either len(pa_tables) > 0 and len(pb_tables) == 0, or len(pa_tables) == 0 and len(pb_tables) > 0. Any hand-wringing over false-positive schema matching was needless. The files can be neatly partitioned by schema.
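
Put together, a get_schema() function along these lines would make the final call. This is a sketch built on the counting approach above, and the folder names here anticipate the directories used in the copy script below:

def get_schema(source):
    """Return the target folder for a script, or None if it can't be classified."""
    text = source.path.read_text().upper()
    pa_tables = set(t.group(1) for t in to_pat(SCHEMA_A_NAMES).finditer(text))
    pb_tables = set(t.group(1) for t in to_pat(SCHEMA_B_NAMES).finditer(text))
    if pa_tables and not pb_tables:
        return "schema_a"
    if pb_tables and not pa_tables:
        return "schema_b"
    return None  # ambiguous or unrecognized; flag for a human to review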

It turns out this requires some advance knowledge of the final collection of table names in each schema. This is a query against the production database metadata.
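
As a sketch, SQLAlchemy's inspector can collect those names; the connection string and schema names below are placeholders, not the real production details:

import sqlalchemy as sa

engine = sa.create_engine("postgresql://localhost/production")  # hypothetical DSN
inspector = sa.inspect(engine)
SCHEMA_A_NAMES = [name.upper() for name in inspector.get_table_names(schema="schema_a")]
SCHEMA_B_NAMES = [name.upper() for name in inspector.get_table_names(schema="schema_b")]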

In more complex situations, the table names might not be unique. This can lead to a more subtle set of matching rules to distinguish the intent behind a sequence of SQL statements where a simple peek at the table names isn’t a good indicator. I didn’t have this problem, so I don’t have a clever bit of Python to solve it.

My very first band experience was a keyboard-bass-drums group where almost every song was essentially an extended keyboard solo. There’s a place for virtuosic musicianship, but perhaps it doesn’t belong in every song. It turns out we can easily partition the files with a simple pattern match.

Key Change

Given the schema information, a script can then decide on a preferred directory and a preferred file name for each SQL script. The final decision-making looks like this:

import shutil
from pathlib import Path

base = Path.cwd()
(base/"schema_a").mkdir(exist_ok=True)
(base/"schema_b").mkdir(exist_ok=True)
sql_iter = base.glob('*.sql')
sources = [git_dates(path) for path in sql_iter]
sources.sort()
for source in sources:
    migration_id, name = preferred_name(source)
    folder = get_schema(source)
    if folder is None:
        print(f"*** problem with {source} ***")
        continue
    shutil.copy2(source.path, base/folder/f"{migration_id}_{name}")

The script applies two rules to the collection of SQL scripts. The preferred_name() function provides the date and a normalized slug for each script file. The get_schema() function partitions the flat list of files into two subsets.

How did the repository get like this? I’m reminded of the lyrics in The Revealing Science of God by Yes, “What happened to this song we once knew so well?” The pull-request history involves a large number of people, so it was a team-wide gap in following our norms for schema changes. This is a kind of musicianship issue — we thought we knew the song, but, somehow we turned it into discordant confusion. Once it’s sorted out, however, we can continue the song with a lot more confidence.

Now comes the hard part. As the song evolves, we need to be disciplined about sticking with the band’s style. The SQL script needs to contain the changes in a way we can reproduce without taking heroic measures. We need to listen to each other a little better. The changes to the song shouldn’t be seen as disruptive: the song evolves and changes, and we need to make sure we’re in harmony.


DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2019 Capital One.
