Validate Your BigQuery SQL Table Metadata Using Just 1 Word
Leverage BigQuery SQL table metadata to deduplicate, partition and delete data — all using only one word.
There are some queries I write so often that I genuinely believe that if I fell asleep on my keyboard, when I would wake up, they’d be entered and running in my SQL environment. Certain functions, especially datetime and timestamp conversions, fall into this category. As do a few simple quality assurance queries.
One such query that gets row counts by filtering on a given upload date is a go-to to determine how many rows loads to a table in a day.
That looks like this:
SELECT DATE(dt_updated) AS dt_updated, COUNT(1) AS rowCount
FROM `project.dataset.table`
GROUP BY 1 ORDER BY 1 DESC
Executing the above query looks like this:
The reason this works so well and so precisely is because we’re able to filter rows by a provided date. This dt_updated column isn’t part of the data I ingest in this case. It’s calculated by the script at runtime, looking something like this:
import pandas as pd
dt_updated = pd.Timestamp.utcnow()
But if I hadn’t included that, there would be no reliable, dynamic date field available for filtering. If your data pipelines function predictably this is ok, but not ideal.