EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Database pointers

Fred Friis
Expedia Group Technology
14 min readJun 29, 2021

--

Messing up a database can mean lights out — let’s avoid that!

Nighttime satellite view of Earth from above. Messing up a database can mean lights out — let’s avoid that!
Photo by NASA on Unsplash

I used to think creating and managing a database (“db”) and a schema for a given service was easy, that I knew all about how to do it.

I was wrong!

There are actually lots of things that need to be kept in mind in order to avoid easily preventable issues to do with databases and how they interact with your code and tests — this post will cover a bunch. Parts of it is pretty technology specific (eg. AWS RDS and MySQL specific), but much of it is universally applicable for all db technologies, so still worth reading even if you use some other db technology. (including noSQL)

I hope you will find it helpful — and that you will comment and point out any errors or things you think are missing that are worth mentioning!

Creating and managing databases

At Expedia Group™, AWS RDS is commonly used.

One of the main points of AWS RDS is to make creating and managing databases easy, but there’s still a surprising amount of complexity and potential pitfalls that should be kept in mind when using RDS, much of which isn’t at all made clear in AWS own (or other) documentation and tutorials.

Entire db deleted by CloudFormation

If your RDS was created using CloudFormation, subsequently tweaking the CloudFormation down the track can result in your RDS being deleted and recreated from scratch, losing all data in the process. It goes without saying this could be disastrous, so be very, very careful with CloudFormation!

Bad defaults for version

In absence of a specified version of MySQL when creating an RDS, AWS will default to MySQL 5.6, which was released in 2013 and is no longer maintained. RDS’s should not be created without specifying a version, or specifying the version as MySQL 5.6 or 5.7 — use MySQL 8. Likewise for other db technologies, make sure to explicitly specify the version and not rely on defaults.

Bad defaults for other settings

Different databases have different defaults (and some have better defaults than others), but make sure to sanity check and tweak the settings according to best practices and your use case.

Eg. in MySQL, the default charset is latin1 (meaning your db will not properly store or render anything other than a weird hodgepodge of latin characters) and the default collation is latin1_swedish_ci (meaning your db will be unable to select and sort in any sensible way) which is more or less guaranteed to not be what you want.

Do not use any of the non-mb4 utf8 charsets or collations. They are not actually UTF-8 — utf8mb4 (which is actually UTF-8) has replaced them. (confusing!)

Do not use the utf8mb4_general_ci collation. It's incorrectly implemented and utf8mb4_unicode_ci has replaced it. (again, confusing!)

Note that the utf8mb4_bin (=binary, exact matching) collation may be more appropriate than utf8mb4_unicode_ci for certain use cases, but not for general purpose use. So using utf8mb4_unicode_ci as the default and specifying utf8mb4_bin for individual columns as required when creating tables is recommended. In depth explanation of charsets and collations are beyond the scope of this post, but as a brief example:

| id | name |
| --- | --- |
| 1 | Zoë |
| 2 | Zoe |

| query | collation | result |
| --- | --- | ---|
| select * from user where name = 'zoë'; | utf8mb4_unicode_ci | both users |
| select * from user where name = 'zoe'; | utf8mb4_unicode_ci | both users |
| select * from user where name = 'Zoë'; | utf8mb4_unicode_ci | both users |
| select * from user where name = 'Zoe'; | utf8mb4_unicode_ci | both users |
| select * from user where name = 'zoë'; | utf8mb4_bin | no user |
| select * from user where name = 'zoe'; | utf8mb4_bin | no user |
| select * from user where name = 'Zoë'; | utf8mb4_bin | only Zoë |
| select * from user where name = 'Zoe'; | utf8mb4_bin | only Zoe |

i.e. utf8mb4_unicode_ci is good for general purpose use (names etc.) and utf8mb4_bin is good when exact matching is required (usernames etc.).

Optimally (from a technical point of view, not necessarily a user experience point of view, but still), things like usernames would be constrained to case-insensitive ASCII, so you don’t end up with a user zoe, another user Zoe etc etc.

But if the cat is already out of the bag and there are already such similar usernames, matching exactly becomes extremely important. Imagine if you implemented a login service (or an email service, or whatever really) that didn't properly disambiguate between the username zoë and the username zoe, or between the username zoe and Zoe!

In AWS RDS, default settings for RDS instances are controlled through parameter groups — more on that will follow.

Parameter groups

Unfortunately, AWS does not make it very clear that creating dedicated parameter groups for exclusive use by each individual db before creating the db is a de facto prerequisite

Parameter groups are a broad range of settings for the db, including things like what the default charset and collation is.

If a db is created without dedicated parameter groups, it gets assigned the default parameter groups. The default parameter groups have the bad default settings described earlier, which you don’t want. And even if the default parameter groups did have better defaults, the default parameter groups still shouldn’t be used, because if other devs may make changes (outside your knowledge and control) to the default parameter groups, those changes will then be reflected in your db irrespective of whether you want them to or not!

This is why creating dedicated parameter groups for every db — before creating the db — is required.

There are two types of parameter group: “DB Parameter Group” and “DB Cluster Parameter Group”- a dedicated one of each is required for every db.

So in case you wanted to create an Aurora MySQL 5.7 db for foo-service, start by creating a dedicated foo-db “Parameter Group” and a dedicated foo-db “DB Cluster Parameter Group”. (Note while Aurora MySQL 5.7 will be used as an example, more or less the same issues are relevant for other dbs as well, so do read this).

Which parameters to override with what depends on the db used. For Aurora MySQL 5.7, some more or less required overrides for the foo-db DB Cluster Parameter Group are

| Parameter | override | default.aurora-mysql5.7 |
| --- | --- | --- |
| binlog_format | MIXED | OFF |
| character_set_client | utf8mb4 | <engine-default> |
| character_set_connection | utf8mb4 | <engine-default> |
| character_set_database | utf8mb4 | <engine-default> |
| character_set_filesystem | binary | <engine-default> |
| character_set_results | utf8mb4 | <engine-default> |
| character_set_server | utf8mb4 | <engine-default> |
| collation_connection | utf8mb4_unicode_ci | <engine-default> |
| collation_server | utf8mb4_unicode_ci | <engine-default> |
| server_audit_events | CONNECT,QUERY_DCL,QUERY_DDL,TABLE | <engine-default> |
| server_audit_logging | 1 | 0 |
| log_bin_trust_function_creators| 1 | <engine-default> |
| server_audit_logs_upload | 1 | 0 |

The foo-db Parameter Group does not need any overrides — but a dedicated foo-db Parameter Group for the exclusive use of the foo-db is still needed for reasons already mentioned.

If you do not create dedicated parameter groups for exclusive use by your db before creating the db, and the db is created using the default parameter groups, you CAN create the desired parameter groups and modify your db to use them even after the db has been created, but it requires a hard restart of the db and the changes in charset, collation etc. will not take effect for already created schemas - the defaults for already created schemas will be whatever they were at the time the schemas were created - so you will have to manually go through each created schema and change them per schema, possibly also by table and column, as required. You don’t want to have do that! It’s always better to create the dedicated parameter groups for exclusive use by your db first and create the db pointing to those parameter groups in the first place.

Creds

Many dbs have common and widely known default creds for the root user, like “root” or “master” or “admin”. Optimally, use a different one when creating the db, and optimally, store it securely and never expose it.

Obviously, also choose a strong password, and obviously, store it securely and never expose it. The AWS reference documentation for creating an RDS using CloudFormation ( https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/quickref-rds.html) contains this unsafe snippet, which leaves the creds (including the password!) in plain text in the CloudFormation file.

"MyDB" : { 
"Type" : "AWS::RDS::DBInstance",
"Properties" : {
"DBSecurityGroups" : [
{"Ref" : "MyDbSecurityByEC2SecurityGroup"},
{"Ref" : "MyDbSecurityByCIDRIPGroup"}
],
"AllocatedStorage" : "5",
"DBInstanceClass" : "db.t2.small",
"Engine" : "MySQL",
"MasterUsername" : "MyName",
"MasterUserPassword" : "MyPassword"
},
"DeletionPolicy" : "Snapshot"
}

RDS’s cannot be created using this method or any method that ever exposes creds for the root user in plain text anywhere. Even if the creds were immediately changed, there’s still a window of time where an attacker could log in using those creds and create another user which can still be used to access the db even after the root user creds are changed.

There are other problems with that snippet as well: in absence of a specified version of MySQL, AWS will default to MySQL 5.6 (which is very old and is no longer maintained), also, it doesn’t use dedicated parameter groups.

Last but not least, while AWS use the term “master”, that doesn’t mean we have to — when referring to the “master” creds, prefer using terms like “admin” or “root” instead in order to avoid needlessly awkward situations in a diverse work environment.

IAM Authentication

The root creds should not be used by your service to connect to the db.

The old fashioned way to wire your service to the db is by logging in to the db using the root user and root user password, creating a dedicated service user for the service and assign it the required (but not more, i.e. no create or drop table permissions) permissions on the relevant schema, and then adding the service user username and password (via Vault, AWS Secrets Manager or something similar) to the service property files.

The new way to wire up your service to the db is by using IAM, which works much the same but doesn’t require storing the password for the service user anywhere.

Database design

These are just some general pointers, I’m sure more could be thought of!

Don’t assume the default types will fit your domain, eg. that the default VARCHAR length of 255 characters will accommodate the range of possible values for a text column — check your domain and tweak the column types accordingly.

Don’t use non-human readable columns (including varbinary for id columns) unless there’s good reason. Trivial performance benefits are not good reason!

Don’t use auto-incremented, sequential, guessable id’s. (at least not if they’re going to be exposed — if they’re merely a db-internal identifier that never gets exposed, it’s not as problematic from a security point of view)

Don’t use prefixes like table_ or tbl_<table_name> — it’s a table, no need to add a redundant, unidiomatic prefix.

Don’t use abbreviations like usr_ppt_mrkt_mgr — figure out proper, idiomatic names for tables. If table names feel like they have to be abbreviated because they’re long combinations of words, that’s a smell that the table probably does too much and can and should be split into individual, atomic, dedicated tables.

Don’t turn tables into a hodgepodge of non-essential columns that don’t really constitute the core thing the table represents (especially if a column is nullable, that’s a strong indication it doesn’t really belong in the table).

Eg. looking at this user table, there’s a strong sensation that the best_friend_user_id and favorite_band don’t really have anything to do with what constitutes a user:

mysql> select * from user; | user_id | username | best_friend_user_id | favorite_band | 
| xxx | foo | null | null |
| yyy | bar | zzz | null |
| zzz | smurf | yyy | The Smurfs |
...

instead, create individual, atomic, dedicated tables for each thing:

mysql> select * from best_friends; 
| first_user | second_user |
| xxx | xxx | ...
mysql> select * from favorite_bands;
| user_id | favorite_band |
| xxx | The Smurfs | ...

This will make your code types cleaner as well — instead of:

data class User ( 
val userId: String,
val username: String,
val bestFriend: String?,
val favoriteBand: String?
)

The User class will now only contain what constitutes a user (and they will be non nullable, i.e. guaranteed to always be present as well!).

data class (
val userId: String,
val username: String
)

You can get the potentially absent best friend and favorite band from dedicated functions that looks in the appropriate tables.

fun findBestFriend (val user: User) : User? 
fun findFavoriteBand (val user: User) : String?

Migrations

There are different tools for managing database migrations and state, and which one you use and how is to some degree a matter of taste. That said, I want to make the case for using Flyway over Liquibase.

Here’s what Flyway migrations look like

$ tree . 
└── migration
├── V1__bootstrap_tables.sql
├── V2__role.sql
├── V3__role_trigger.sql
├── V4__role_insert.sql
├── V5__grant_and_other_tables.sql
├── V6__grant_and_other_triggers.sql
├── V7__insert_types.sql
└── V8__insert_more_member_types.sql

Literally just a simple sequence of scripts that make it immediately obvious what changes will be applied in what order.

Also, Flyway runs all scripts on app boot. This has huge benefits:

  • if any of the scripts fail, the app doesn’t boot and you’ll know immediately. This makes it more or less impossible to end up with non-working changes (including due to incorrect sequence of execution) on the main branch. (because your feature branch won’t build!) If, despite that, non-working changes still made it in to the main branch somehow, non-breaking changes won’t be deployed, because again, the app canary won’t boot in neither test nor prod.
  • it also enables you to know the state of the scripts — any script that has made it to a given environment will have run in that environment! Simple.
  • it enables you to combine code changes that depend on a db migration together with their corresponding db migration in a single cohesive change (vs having to do the db changes first in one change and then do the code changes that depend on the db migration in a subsequent change).

Note also: no folder structure, metafiles, manual script execution etc. necessary, so the possibility of errors to do with any of that is completely eliminated.

Here’s what the Liquibase migrations for this repo looked like before we switched over to Flyway

$ tree . 
└── changelog
├── README.md
├── db.changelog-master.yaml
├── grant
│ ├── changelog.yaml
│ └── v.01
│ ├── 2019-05-02_01_create_grant.sql
│ ├── 2019-05-28_01_Update_index.sql
│ ├── 2019-09-20_01_drop_table.sql
│ └── changelog.yaml
├── portfolio
│ ├── changelog.yaml
│ ├── v.01
│ │ ├── 2019-05-02_01_create_db.sql
│ │ └── changelog.yaml
│ ├── v.02
│ │ ├── 2019-09-26_01_portfolio_idempotent.sql
│ │ └── changelog.yaml
│ └── v.03
│ ├── 2020_11_18_01_portfolio_resource_col.sql
│ └── changelog.yaml
├── ram
│ ├── changelog.yaml
│ └── v.01
│ ├── 2020-11-18_01_create_user_resource.sql
│ ├── 2021-02-17_01_add_deleted_column.sql
│ └── changelog.yaml
├── resourcepolicy
│ ├── changelog.yaml
│ └── v.01
│ ├── 2019-07-23_01_create_schema.sql
│ ├── 2019-07-25_01_add_constraint.sql
│ ├── 2019-07-29_01_add_id.sql
│ ├── 2019-11-09_01_subject_index.sql
│ ├── 2020-09-03_01_nullable_partner.sql
│ └── changelog.yaml
└── role
├── changelog.yaml
└── v.01
├── 2020-12-10_01_create_role.sql
├── 2020-12-10_01_insert_role.sql
├── 2020-12-10_01_role_audit_h2.trg.sql
├── 2020-12-10_01_role_audit_mysql.trg.sql
└── changelog.yaml

Way more complex!

While I can see where wanting to organize things into subfolders according to domain or functionality or whatever comes from, it’s questionable whether the added complexity makes it worth it.

It’s common when using Liquibase to defer execution of scripts, i.e. the scripts are not automatically run during app boot, instead, an endpoint is hit specifying what script to run and when.

Intuitively, this sounds great since it enables complete control and customizability of what to run when. Unfortunately, it quickly turns into a nightmare because:

  • since the scripts aren’t run during app boot, non-working changes can easily make their way into the main branch and all the way to prod without anyone noticing they don’t work… until they’re manually run — only then will you find out there’s something wrong with the script, and now you have to figure out how to deal with the non-working script.
  • It means you don’t know the state of the scripts — a script could have made it out into the main branch and all the way to prod but only have run in test, or not at all, or only in prod… Dangerous! In the case of this app, some of the above scripts had run in all environments, some had not been run in any environment, yet others had only ran in test but not prod… Needlessly and extremely confusing.
  • It makes it impossible to combine code changes that depend on a db migration together with their corresponding db migration in a single cohesive change. If your app boots with code changes that depend on the script having been executed, and the changes haven’t been executed, the app will break. So you need to first make just the db change, execute it everywhere, and only then add the code that depends on it. It needlessly splits things that could be a single cohesive change into needlessly complex and confusing chunks.

Note also all the metafiles etc. There’s just a lot more complexity and different ways to very easily cause problems.

A common counter-argument is that Liquibase is more “powerful”, eg. it supports “rollbacks” etc. out of the box. But, in my opinion, this comparison is a really good example of the power of simplicity — Flyway being less powerful is a feature, not a bug, because its simplicity completely eliminates many sources of easily preventable errors. It’s similar to how Kotlin avoids entire categories of Java defects. (see https://proandroiddev.com/kotlin-avoids-entire-categories-of-java-defects-89f160ba4671)

As for “rollbacks”, there’s really no such thing as a “rollback” when it comes to database schemas. Database migrations are kind of like git — there’s only an ever progressing sequence of changes. A later change can effectively “override” a previous one, but the previous one wasn’t really “undone”. In Flyway, if you need to change some previous change, you simply add a new change that puts you in the desired state.

Check out Flyway here: https://github.com/flyway/flyway

Note that it’s pretty easy to migrate to Flyway from Liquibase even if the db has already been created. Just create a V1__bootstrap.sql file with “create table if not exists” statements, triggers etc. to reflect the current state of the db and that’s it.

H2 vs MariaDB4J

H2 has advantages over connecting to a real db to run tests, including being much faster, less resource intensive, and self contained, as you don’t need an external real db.

Unfortunately, with H2, you quickly run into problems due to differences in supported syntax between whatever db dialect you’re using in production and H2. Eg, custom code-based triggers have to be written for H2, which are not guaranteed to behave the same as the triggers in the real database. In many cases, simple queries have to be changed as well because H2 doesn’t support the same types of JOINs etc as the real dbs.

So at the end of the day, H2 is really fast and efficient at… giving you a potentially false sense of security your changes will work as expected in real environments.

An alternative middle ground between connecting to a real RDS instance to run tests and H2 is MariaDB4J.

Like H2, it’s an embedded in memory ephemeral db used for testing, but unlike H2, it’s a “real” db that supports all the same features as whatever you’re actually using in prod, and doesn’t require workarounds like code-based triggers etc. So while it is a bit slower than H2, it gives much better guarantees that your changes will work as expected in real environments.

Check out MariaDB4J here: https://github.com/vorburger/MariaDB4j

Note that it’s pretty easy to migrate to MariaDB4J from H2 even if the project was based on H2. Just pull in the plugin and change the test config files to point to the MariaDB4J url for the db connections and that’s it.

Why is this all so needlessly complicated

I don’t know! Most of us probably think creating and managing a database and a schema for a given service is an easy, streamlined process (especially with RDS!) that doesn’t require much thought, but unfortunately, there’s quite a few things that need to be kept in mind.

It’s unfortunate that dbs don’t have better setting by defaults, that docs aren’t better, etc. but it is what it is and we just have to work around it, paying attention to detail!

--

--