Finding Gaps with SQL

Published in

Learning SQL

5 min readJun 27, 2022

I’ve spent the last 10 years working with a variety of companies, on a variety of data, and one problem seems to rear its ugly head almost universally: gaps in the data.

A quick google search yields some basic guidance, but nothing directly helpful. It is my intention to walk you through the solution I built in Rasgo, and share the SQL code so that you can do it yourself.

Real life example 1

I was advising a large retail customer with building a time-series model for predicting sales, for each of their product SKUs. Before we could even get to the data science, we had to arrange the data into a proper time-series structure. We realized, through sheer luck, that some SKUs had “test data” that showed up when the online ordering team had done some tests.

I purposely exaggerated the problem in the following chart, so you can see what I mean:

Exaggerated “gap” with test data from the online ordering system

The problem looks quite obvious, but imagine if you had thousands of products. In that case, the problem gets completely masked.

To make matters worse, data scientists are trained to fix gaps in the data, because legitimate gaps do indeed exist when there are 0 sales (or stores are closed on Sundays).

Real life example 2

An energy company was analyzing IOT sensor data, and was getting strange results in their analysis, which appeared to violate the natural laws of physics. It turned out that sensors would occasionally stop reporting metrics like Temperature, while continuing to report things like Voltage. So any analysis that was done during those time periods, had been incorrect due to the missing data.

The correct thing to do, would have been to isolate only the “good” periods of time for the analysis. There needs to be a quick and easy way to find this!

How to identify gaps

One of my core principles is to keep all of the processing inside the data warehouse where the data is. Python is cool and convenient, but SQL is much faster and I don’t have to worry about the data being too big for my laptop. Spark is cool and all, but that requires a separate infrastructure for computation, and companies are starting to consolidate compute into their warehouses. Therefore, I needed to use SQL.

Islands

Before we look for gaps, the best approach is to consolidate the data into islands. The concept of islands also lets us add a “buffer” of time to each observation, so we have the flexibility to define how much time elapses before we consider it a gap.

Example — a buffer of 7 days means that if we ever see 7 days elapse without a valid row, we will consider that 2 islands separated by a gap.

Islands, Step 1 — Identify a valid condition

For most cases, you might be thinking a valid condition is simply that a row exists. But we want flexibility to be able to say, “I expect that a valid row has a voltage reading between 0 and 250.” That way, we can work around situations where the reading might come through as a row, but with a value of something like -99999.

WITH CTE_CONDITION AS (
  SELECT 
    OBS_TIMESTAMP AS dtm, 
    MACHINEID
  FROM 
    {{ table }}
  WHERE 
    OBS_TIMESTAMP is not null
    AND TEMPERATURE BETWEEN 0 AND 250

Islands, Step 2 — Lead and Lag prior observations

Next, we’ll want to calculate a LEAD and LAG for the prior observation. This will give us a basis for consolidating rows together into the same “island.”

CTE_LAGGED AS (
SELECT 
  dtm, 
  MACHINEID, 
  LAG(dtm) OVER (
    PARTITION BY MACHINEID 
    ORDER BY 
      dtm
  ) AS previous_datetime, 
  LEAD(dtm) OVER (
    PARTITION BY MACHINEID 
    ORDER BY 
      dtm
  ) AS next_datetime, 
  ROW_NUMBER() OVER (
    PARTITION BY MACHINEID 
    ORDER BY 
      CTE_CONDITION.dtm
  ) AS island_location 
FROM 
  CTE_CONDITION),

Islands, Step 3/4 — Calculate Island Starts/Ends

Next, we increment an island_number via adding our buffer of time. We must remember to also identify the first observation by the fact that the previous_datetime is NULL.

CTE_ISLAND_START AS (
SELECT 
  ROW_NUMBER() OVER (
    PARTITION BY MACHINEID 
    ORDER BY 
      dtm
  ) AS island_number, 
  MACHINEID, 
  dtm AS island_start_datetime, 
  island_location AS island_start_location 
FROM 
  CTE_LAGGED 
WHERE 
    DATEDIFF(day, previous_datetime, dtm) > 7 
    OR CTE_LAGGED.previous_datetime IS NULL
),
CTE_ISLAND_END AS (
SELECT 
  ROW_NUMBER() OVER (
    PARTITION BY MACHINEID 
    ORDER BY 
      dtm
  ) AS island_number, 
  MACHINEID, 
  dtm AS island_end_datetime, 
  island_location AS island_end_location 
FROM 
  CTE_LAGGED 
WHERE 
  DATEDIFF(day, dtm, next_datetime) > 7 
  OR CTE_LAGGED.next_datetime IS NULL
)

Islands, Final Logic — Calculate Island Starts/Ends

Finally, we can write a query to summarize the islands by joining the starts and ends by the island_id .

SELECT 
  CTE_ISLAND_START.MACHINEID, 
  CTE_ISLAND_START.island_start_datetime, 
  CTE_ISLAND_END.island_end_datetime, 
  DATEDIFF(
    day, CTE_ISLAND_START.island_start_datetime, 
    CTE_ISLAND_END.island_end_datetime
  ) AS ISLAND_DURATION_day
FROM 
  CTE_ISLAND_START 
  INNER JOIN CTE_ISLAND_END 
  ON CTE_ISLAND_END.island_number = CTE_ISLAND_START.island_number 
  AND CTE_ISLAND_START.MACHINEID = CTE_ISLAND_END.MACHINEID

Islands

We’re done! If you look at the example below, you can see how some machines have 2 islands, which indicates that there is a gap!

… but what about Gaps?

Identifying islands was the first step. Once you have identified islands, then the gaps are relatively easy. If you think about it, a gap is really the space between two islands.

Therefore, gaps is simply:
(1) LEAD to find the “next” island, by machine
(2) DATEDIFF between the island_end and the next_island_start

If you got this far, the final step is trivial — so I will spare you the code.

Repeating the Process

The ultimate question is, how can I repeat this process every time I start work on a new project? It turns out, this entire SQL can be used as a template with just a few parameters that change:

table name
buffer date part and size
datetime column
partition/group-by, if any

Making this type of work repeatable is what I specialize in, here at Rasgo. So, last week I built and deployed the template to share with the world.

If you aren’t using Rasgo and you’re just looking to jumpstart your SQL, I also built the function summarize_islands into our free online tool, SQL Generator. (Broken Link — Rasgo has transitioned to using AI instead of these templates for queries)

Always feel free to reach out if you want to chat; you can usually find me in slack. I’m particularly active in Locally Optimistic and DataTalks.Club, and you can always ping me directly at Rasgo.

Go from SELECT * to interview-worthy project. Get our free 5-page guide.

Finding Gaps with SQL

Real life example 1

Real life example 2

How to identify gaps

Islands

Islands, Step 1 — Identify a valid condition

Islands, Step 2 — Lead and Lag prior observations

Islands, Step 3/4 — Calculate Island Starts/Ends

Islands, Final Logic — Calculate Island Starts/Ends

Islands

… but what about Gaps?

Repeating the Process

Written by Josh Berry