Writing history, at least data history
As I am always NOT finding what I am looking for I start writing it down myself. In today’s episode I will be talking about keeping data history in the classic relational database-world (sorry BigData, this is not for you) — a short list to keep things in mind. I will not discuss the pros and cons of the multiple approaches, people like Ralph Kimball already did that pretty well.
Generating data history is always a good idea as things and the data they generate are changing faster and faster. And if you want to look back to a certain point in the (data-)past you better got your history sorted. Just imagine your biggest customer OmniCorp moving from sales district A to B. Without history all past sales will also move to district B as the customer entry connected to all these orders says the address is in B. Too bad for Jim, district A’s sales manager.
So to know better (and to salvage Jim’s income) we keep track of changes, normally by generating multiple entries/records for a single customer. In this example we keep one with OmniCorp’s address in district A and one in district B.
How do we differentiate between those two? Normally with a validity timeframe set by two dates, like “ valid from” and “valid to”, or at least one date which signals the start or end of the validity period. The second option is not cool as SQL queries using this approach tend to look weird.
And by using this we always get the matching entry for every order (imagine a smiling Jim here).
First step for writing history is noticing that things changed. And there are multiple ways to identify and extract changed data:
- built-in CDC (Change Data Capture)
A very clever database functionality which allows you to query a source database for exactly the changed records. Very helpful, but only available in some databases. Check your documentation (or Google) to see if your sources are capable of this. The syntax hasn’t been unified up to now so there’s no example here.
- technical fields (identifiers, creation and update dates)
The classic CDC — recognising changes in the data itself. First you identify how far you know the data already (maximum date, maximum key-value etc.) and check the incoming data only for values above these. Works on pretty much any data source which provides either usable dates or strictly ascending technical keys.
Or what-to-do-if-the-other-things-won’t-work. Again a technique using the data itself to identify the changes. Simplest approach is concatenating all columns for which you want to track the changes and using a standard hash function on them, e.g. MD5. Do the same for the incoming data and the existing data and then compare the hashes. If they are different for a record it has changed and should be processed.
- actually comparing
This should only be a fallback as it is very work-intensive. In this case you really compare the same record on a field-by-field level and mark/process them if they have changed. Imagine this with a multi million records table and you see the time this takes.
Now we know which records have changed, what to do with this info? Insert the changed ones (and new ones as they changed from non-existing to existing) into our target table.
- Check your target with the identifiers (primary key) you just identified as new and maybe stored in a separate table for existing records. Some use separate key tables for logging and clarity reasons but they are not mandatory. If using an ETL tool this is most likely called a lookup, if using SQL try a left join to the target table and check the joined fields for NULLs and non-NULLs.
- Depending on the result there are 2 ways:
a) the check signals no existing record with this identifier so insert it. Don’t forget to fill the validity dates with content
b) the check signals existing record(s) so we have to update and insert. The last valid records for this identifier has to be updated with a new validity end and a new record has to be inserted like in scenario a).
- bonus step: check source for deleted records
Use the known identifiers from your target to check if all identifiers still exist in the source. This can only be done if you got all source data available, meaning if you are using a delta approach (only gathering new data) you cannot do that.
So where’s the code?
Most of this can be done automatically with your favourite ETL-tool so no need to write some code. If you are more into pure SQL you can start here (yes, that is from 2008).
This article is prone to extension as more noteworthy things come to mind.