S3 and missing rows in Redshift
1. Redshift my love
At Appaloosa we love using Redshift. We wrote two articles about our migration from MongoDB to Redshift. In the second post we mentioned a script that runs against both MongoDB and Redshift to verify data integrity over a time interval. This script helped us find duplicate and missing records in Redshift. The number of problematic rows was very low, but we didn’t want any missing data or duplicates at all.
When we send data to Redshift we add a unique id to each row, generated by our backend from a PostgreSQL sequence. We will see how this helps us find gaps.
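As a rough sketch (the sequence, table, and column names here are made up, not our actual schema), id generation on the backend side looks like this:

```sql
-- Hypothetical names for illustration only.
CREATE SEQUENCE events_id_seq;

-- The backend attaches the next sequence value to each row
-- before the row is shipped toward Redshift.
INSERT INTO events_staging (id, payload)
VALUES (nextval('events_id_seq'), '{"action": "install"}');
```

Because the ids are strictly increasing and never reused, any hole in the sequence on the Redshift side means rows were lost somewhere in transit.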
We had already dealt with duplicate rows, but it turned out that sometimes a whole batch of rows was missing. Our existing script comparing MongoDB and Redshift worked, but we faced two limitations:
- We didn’t validate data integrity using the integer sequence
- We wanted to get rid of MongoDB
So we had to find another way to detect missing rows.
3. But why are rows missing?
Sometimes Firehose fails to create the `*.gz` file on S3; it retries and succeeds, but CloudWatch is not notified of the object creation.
The last Lambda function (the one not triggered in the graph) is in charge of loading S3 objects into Redshift, and it never fires. This happens every 3–4 weeks.
4. Redshift constraint
To detect missing integers in a sequence we need a reference. In PostgreSQL we could have used generate_series, as Dimitri Fontaine shows in “Find the missing integer”, but sadly generate_series is not fully supported in Redshift. If you try to replay Dimitri’s query with a RIGHT JOIN, you get the following error:
```
INFO: Function "generate_series(integer,integer)" not supported.
ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
```
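For reference, the kind of query that triggers this error looks roughly like the following (the `events` table and its `id` column are hypothetical):

```sql
-- Works on PostgreSQL, fails on Redshift: generate_series
-- cannot be joined against a Redshift table.
SELECT s.id AS missing_id
  FROM events e
 RIGHT JOIN generate_series(1, (SELECT max(id) FROM events)) AS s(id)
    ON e.id = s.id
 WHERE e.id IS NULL;
```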
5. Finding missing data
The LAG window function returns the values for a row at a given offset above (before) the current row in the partition.
Here is a usage example on a temporary table:
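A minimal sketch of such an example (the table name and ids are illustrative):

```sql
-- A throwaway table with gaps: ids 4, 7 and 8 are missing.
CREATE TEMPORARY TABLE lag_demo (id INTEGER);
INSERT INTO lag_demo VALUES (1), (2), (3), (5), (6), (9);

-- For each row, fetch the id of the previous row.
SELECT id,
       LAG(id) OVER (ORDER BY id) AS from_id
  FROM lag_demo
 ORDER BY id DESC;
-- id=1 has no previous row, so its from_id is NULL;
-- ordered descending, that row shows up last.
```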
This gives us an example of the lag function. We can also see data that we don’t need, like the last `from_id`, which is NULL: that row has no previous row to look back to.
We deduce from each row whether the previous row is missing. For this we simply use `false` to replace NULL values, which gets rid of that last row in the final query.
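The original snippet was lost, so here is a reconstruction of the idea, reusing the hypothetical `lag_demo` table from above:

```sql
SELECT id,
       LAG(id) OVER (ORDER BY id) AS from_id,
       -- COALESCE turns the NULL comparison on the boundary row
       -- into false, so that row is never reported as a gap.
       COALESCE(id - LAG(id) OVER (ORDER BY id) > 1, false) AS has_gap
  FROM lag_demo;
```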
Now the last step is easy. We use a CTE because we love CTEs.
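Our actual query didn’t survive the export; here is a sketch of a CTE-based version over a hypothetical `events` table (the `created_at` column and the time bounds are assumptions), reporting each missing range:

```sql
WITH neighbours AS (
    SELECT id,
           LAG(id) OVER (ORDER BY id) AS prev_id
      FROM events
     -- Custom interval: e.g. from 3 hours ago to 1 hour ago.
     WHERE created_at BETWEEN getdate() - INTERVAL '3 hours'
                          AND getdate() - INTERVAL '1 hour'
)
SELECT prev_id + 1 AS first_missing_id,
       id - 1      AS last_missing_id
  FROM neighbours
 WHERE COALESCE(id - prev_id > 1, false);
```

Each result row is one hole in the sequence, ready to be formatted and posted to Slack.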
For us, this query is executed from an AWS Lambda and the formatted result is forwarded to Slack. We use a custom time interval to check rows (for example, from 3 hours ago to 1 hour ago). The Lambda is scheduled as a cron task.
This query is probably not the most optimized, but it does exactly what we need and is fast enough: 1.2 s for one month of data. In production we check a couple of hours of data every hour.
Further reading:
You should subscribe to our blog to be notified whenever a new article is available!
Want to be part of Appaloosa? Head to Welcome to the jungle.