Linting your database schema

When we think about legacy systems we tend to look from a software developer perspective and focus mostly (if not only) on the code. But what IT is all about is processing the data. Code is just the tool and a way to describe the process.

All mature enough programming languages have a set of style guidelines and linting tools that are there to help us keep the code in good shape. Dead code branches and unused variables are removed during refactors (some IDEs provide great help). Right?

Now, when was the last time you did the same, but with your SQL database schema? Did you remove this foo column that is no longer queried after the recent refactoring? Did you drop an index that consumes gigabytes of database storage, but is no longer needed? Oh, so you’re saying it’s DBA’s duty?

Just before we start NoSQL vs SQL discussion. Source: http://www.commitstrip.com/en/2014/06/03/the-problem-is-not-the-tool-itself/

Introducing index-digest — MySQL database schema linter

Since the end of 2016 I am a member of a team that deals with legacy systems at Wikia. Our job as The Sustaining Team is to make them stable and well documented. Most of the challenges (and success stories that followed them) we faced were related not solely to the code, but rather to the data storage. It can be a MySQL table that keeps track of every single edit action on the site since 2007 (470+ mm of rows, 300+ GiB in size) or Solr index (that grew to 400 GiB with 250 mm documents) storing fields no one really knows.

Code evolves, so should the database schema / data storage.

Common database schema pitfalls

Let’s first define the problem and typical cases that we keep stumbling upon:

  1. outdated SQL schema dumps in the code (solution — introduce a script that can generate them automatically via mysqdump)
  2. redundant indices: e.g. foo index on column A and bar index on columns A and B (yes, there were even the cases of two identical indices defined)
  3. unused columns, indices or even tables
  4. tables keeping the data from the beginning of the universe when only the last N weeks are needed

Solution

Identifying the problems described above requires the following: databases that the code relies on need to be identified, schema of tables used needs to be extracted, SQL logs need to be collected, queries need to be run through EXPLAIN to extract the information on indices usage…

Sounds like a lot of manual work and hence prone to errors? But why not automate the entire process and rely on The Computers Magic performing the database linting?

index-digest example report of redundant index

index-digest (inspired by Percona’s pt-index-usage) was implemented to perform this task. Just provide it with DSN-style URL including database address and credentials. index-digest will take the schema definition from all tables and list all suggestions regarding redundant indices. The goal is to provide the user with actionable reports instead of just a list of boring statistics and schema details.

You can also provide a file with example SQL queries (you do log your SQL queries, right?). They will be run through EXPLAIN — unused columns and indices will be listed in the report.

Here’s a report example from aforementioned production table:

An example report for production table

index-digest checks

As of now the tool performs the following checks:

  1. redundant_indices: reports indices that are redundant and covered by other
  2. not_used_columns: checks which columns were not used by SELECT queries
  3. not_used_indices: checks which indices are not used by SELECT queries
  4. not_used_tables: checks which tables are not used by SELECT queries
  5. queries_not_using_index: reports SELECT queries that do not use any index

Success story

Remember the MySQL table that keeps every single edit action on the site since 2007 (470+ mm of rows, 300+ GiB in size) that I mentioned earlier in this story? For years it was labelled with do not touch this, there be dragons and used to scare fellow employees ;)

Carta Marina (published in 1572) with Legacy Systems Monsters in northern seas surrounding Faroe Islands and Iceland

But thanks to quite a simple tool and fact-based approach (instead of a gut feeling telling I don’t think that we use this column) we found out that:

  • out of 12 (yes, twelve!) indices two were redundant and four more can be removed as no longer used by any query. Obviously, indices weighted three times more (!) than the data itself.
  • two columns were not used by SELECT queries anymore (we first renamed them to __foo, alter table drop column followed after a week — just to make sure nothing really uses them)

Results? The table weighted around 320 GiB before. After we dropped no longer needed indices its size is now less than 160 GiB (50% smaller, index size dropped by 61%).

Similar process was performed on three, quite big shared tables and three per-wiki tables.

As a result, just by removing unused indices, we managed to reclaim around 1.31 TiB of disk space on all MySQL replicas. Thanks to proper tooling and SQL logs it took us only few days.

If you like the tool, spread the word, visit https://github.com/macbre/index-digest/issues and suggest new checks. Enjoy and make your data storage great again!