Rewriting live data with confidence
Data lessons from the trenches
The advice below follows a pattern we’ve used at Lifesum, something suitable for small to medium tech companies with most of their data in a monolithic relational database. If your interest is dev, ops or data, read on.
I’d like to share an experience that’s happened a few times for us at Lifesum. All software has bugs, and often these bugs leave a mark in a company’s database, in the form of missing, incomplete or incorrect data. The data might be:
- History. Then a user’s daily activities are not really affected, but analysis and statistics on top of that data could be quite broken.
- State. Then the problem means the product is not working for affected users. They will keep having problems until the data is corrected.
In either case, you might find that you need to do a bulk update in our production database to fix it. How can we do it in a sane way?
1. Stop the bleeding
Is the problem still ongoing? There’s no point aiming for a moving target. Try to find and fix the source of the bug first before considering patching. This way you will only have to patch once.
Is the bad data contagious? Sometimes a bad internal state—one you never designed for — might cause a torrent of new bad data to be created. This can escalate quickly and even take down part of your product. Since going back in time and coding more defensively is not an option, you might need to temporarily disable the product for affected users, just until you can stop new cases from emerging.
2. Decide if you need to patch
Changing a production dataset means taking some risks. Will you get a payoff that’s worth it from this change? Supposing users aren’t affected, what are your options?
Fix it in an ETL. Let’s say the data is history and you mainly use it for analytics. In a larger company you would use an ETL job to send this data to your data warehouse (for us, Google BigQuery). You may be able to leave the data as is and write an ETL job to clean it instead.
Create a cleaned view. Suppose you don’t have a data warehouse. A common pattern is to do analytics queries on a read replica of your master database. Can you make a materialised view of your data which is clean, and query that instead?
Of course, if users are affected in a noticeable way, you must patch. How can you do it safely?
To do this right, you should:
- Reduce the chance of human error. You’re already fixing one set of mistakes, best not to add a second set on top. This means taking the time to double-check and QA any patch before running it, and running it in a controlled way.
- Ensure you can roll back. In the unlikely event we do make things worse, we should always be able to get back to the current state of things.
With this in mind, let’s talk about making, QAing and applying a patch.
3. Write a query that finds the rows to fix
In order to fix the data, we have to know which rows are wrong. Since it’s likely a lot of data is wrong, we want to identify these rows in an automatic way, with a query. How this query will look will vary in every case.
Some tips when writing it:
- Translate your criteria for “bad” into constraints in the
WHEREclause of your query
- Use positive and negative examples to help hone your query down until it exactly finds the rows you need
- Use the date range of the problem that you’re fixing to constrain the number of rows you look at
- Include the problematic values as well as the unique identifiers for the rows you intend to fix; then save the results of this query as a quick backup of these rows, ideally to a new database table
- Accept 80% coverage if you cannot cover all the bad cases well, then you can repeat the process once these are fixed
At a high level, your query might look like:
SET @problem_onset = '2017-08-27';
SET @problem_offset = '2017-08-29';SELECT
created BETWEEN @problem_onset AND @problem_offset
AND problemfield > 134.2
In reality it may be a horrific spaghetti monster with ten joins. But the important thing is that it only covers bad cases, and that you have verified this by checking a few by hand.
Write a script that dumps this to CSV or JSON. If new cases are no longer emerging, every time you run this query you will get the same output.
4. Propose fixes for all problematic cases
In the best case, inside your query you can also propose the value the field really should have had:
problemfield AS problemfield_existing,
problemfield / 2 AS problemfield_proposed
If you have too much business logic, you should write a second script that works off your CSV/JSON dump and adds the an additional column of fixed values.
5. Autogenerate a literate patch to apply
The third scripted step is to auto-generate a literate SQL patch based on the rows that need updating. You may have a few options on how to write your SQL, but the goal is to make it:
Easy to QA. Include comments in your auto-generated patch that will let you more easily check cases by hand. For example, you might include in comments the old value of the field, or some explanation of why it violated constraints. Leave no doubt about what each
DELETE does and why.
Safe to run. Run the change in a transaction so that you can cancel if something goes wrong. Consider the locking behaviour of your database, and choose a granularity that will let you avoid breaking production loads whilst this change goes through. For example, row level locking might mean per row
UPDATE statements, or else statements that apply to a batch of primary keys at a time. Huge patches should be split into smaller ones to reduce the size of your transaction.
Putting these together, your patch might look like:
BEGIN;-- problemfield = -3415
UPDATE users SET problemfield = 100 WHERE userid = 2343;-- problemfield = -1324
UPDATE users SET problemfield = 50 WHERE userid = 98234;...COMMIT;
Putting it together with the steps we’ve covered so far, you might have pseudocode like:
df = detect_problem_cases()
df_fixed = propose_fixes(df)
patch = generate_sql_patch(df_fixed) save_backup(df_fixed)
6. Manually QA a random subset of changes
Now we have an automatically generated literate SQL patch, and we want to run it. This is the time for a final QA check.
Randomly sweep through a few cases and manually ensure they do what you expect. Check the maximum and minimum values and make sure any boundaries cases make sense too. If you can, get a second pair of eyes to review your work.
7. Apply and confirm
You have your backup and your patch, you’re now ready to apply it live. How you do this is going to depend on the access restrictions you have to your production database. Try to:
Run it in a remote screen. Use the
screen command on a server in your data center to run the change, so that it won’t be interrupted by your internet connection dying or you accidentally putting your laptop to sleep.
Avoid task queues. A once-off task in a task queue like Celery might seem the ticket, but actually it removes control from you, especially in failure scenarios. This kind of thing is just better done by hand, at least at the small-to-medium scale we’re talking about.
Once you’ve applied the patch, re-run the scripts that generated it. You should find, to your joy, that they generate an empty set of things left to fix!
Since bugs are inevitable, so are data issues. We’ve talked about a recipe for correcting these issues in production databases, one we’ve found productive at Lifesum.
When I joined Lifesum I was great at noticing data inconsistencies, but terrible at actually fixing them. In truth, we all were. Over time I realised that we only made fixes in the very worst cases, because we didn’t have a good strategy for reducing the risk of data fixes. The pattern we’ve shared here has changed this a bit, helping us to have higher quality data and a better user experience.
Do you do something different in your org? Have some critique? We’d love to hear about it. Write a comment to us below.