Converting Looker PDTs to the Dataform framework

Daniel Lee · Published in Dataform · Apr 24, 2020

Dan is the Founder of Outlier, your fractional data team | weareoutlier.com

Over the last few years Looker, probably more than any other BI tool, has changed the way that analyst teams approach reporting and business intelligence. Looker encourages you to describe the relationships between your data (using explores) and then lets business users freely explore that data themselves: the self-serve analytics we’ve been talking about for years.

On top of this, Looker forces analysts to define those relationships with LookML: a simple markup language for describing dimensions, metrics and relationships between datasets. Instead of using drag-and-drop GUIs, the BI environment is defined as code (using LookML). Analyst teams can then version control their work, and focus on building a product as opposed to responding to ad-hoc requests.

The final piece of the transformational puzzle that Looker has brought to the world of data is PDTs (persistent derived tables). PDTs can be used to quickly and easily set up SQL-based transformations on your raw data. This gives analysts the ability to create their own datasets, removing the dependency on other teams (for example data engineers).

These three changes — enabling self-serve analytics, collaboration through version control, and the reduced dependence on other teams — mean that teams using Looker significantly outperform teams working with BI tools that lack these features.

Whilst this toolkit gives data teams an initial step change in productivity, many growing teams find that PDTs are no longer fit for purpose:

  • Managing performance — both in terms of speed and cost — is difficult, with no visibility into which PDTs are most resource intensive.
  • When there are data quality issues created by PDTs, it’s hard to find the root cause.
  • Tables created by PDTs are poorly suited to use outside of Looker (with obtuse names like tmp.LB_faa_e_flights_pdt), making data discovery challenging for teams that don't access data through Looker.
  • Visibility into which PDTs already exist and how they’re related is lacking, making it difficult for new team members to get up to speed and start contributing.

To address the challenges described above (and more), we built Dataform: an open source framework specifically designed for managing data modelling and transformation with SQL. We share a similar philosophy to Looker: namely that data & analytics teams have lessons to learn from software engineering on how to collaboratively build complex products. We also believe, similarly to Looker, that the most natural environment for analysts to work is a web-based one, so we also offer a fully-integrated web based IDE for managing data transformations with the Dataform framework.

Sessionization with Dataform and Looker

For the rest of this post, I’ll walk through the steps of moving a sessionization model from a Looker PDT to the Dataform framework. By the end, we’ll have a Dataform project running the sessionization pipeline on a regular schedule, and a Looker view pointing to the output of the pipeline.

I won’t go into too much detail on sessionization (there’s a good post about it here). But, very roughly, we’re taking a large table of user events (a user performing a certain action at a certain time) and clustering them into groups of events (sessions) by a particular user that all happen in one continuous period — here, with no gap of more than 30 minutes between consecutive events.

Here’s the sessionization logic written as a PDT:

sessions.view.lookml

view: sessions {
  derived_table: {
    sql_create:
      with events_cleaned as (
        -- filter out invalid / old events and remove duplicates
        select distinct
          timestamp,
          user_id,
          name as event_name,
          id as event_id
        from
          source.events_raw
        where
          name not in ('test_event', 'clicked_logout_button_old')
      ),
      session_starts as (
        -- label the event that starts the session
        select
          *,
          coalesce(
            timestamp_diff(timestamp,
              lag(timestamp) over (partition by user_id order by timestamp asc),
              second) >= 30*60, -- 30 minute session timeout
            true
          ) as session_start_event
        from
          events_cleaned
      ),
      session_index as (
        -- add a session_index (user's first session = 1, user's second session = 2 etc)
        select
          *,
          sum(if(session_start_event, 1, 0)) over (
            partition by user_id
            order by
              timestamp asc
          ) as session_index
        from
          session_starts
      ),
      session_id as (
        -- add a unique session_id to each session
        select
          *,
          farm_fingerprint(
            concat(
              cast(session_index as string),
              "|",
              cast(user_id as string)
            )
          ) as session_id
        from
          session_index
      )
      select
        session_id,
        user_id,
        session_index,
        min(timestamp) as session_start_timestamp,
        max(timestamp) as session_end_timestamp,
        count(distinct event_id) as events_in_session
      from
        session_id
      group by
        session_id, user_id, session_index
    ;;
  }
  dimension: session_id {
    type: string
    sql: ${TABLE}.session_id ;;
  }

  dimension: user_id {
    type: string
    sql: ${TABLE}.user_id ;;
  }

  dimension: session_index {
    type: number
    sql: ${TABLE}.session_index ;;
  }

  dimension: events_in_session {
    type: number
    sql: ${TABLE}.events_in_session ;;
  }

  dimension_group: session_start_timestamp {
    type: time
    sql: ${TABLE}.session_start_timestamp ;;
    timeframes: [
      date,
      time,
      raw
    ]
  }

  dimension_group: session_end_timestamp {
    type: time
    sql: ${TABLE}.session_end_timestamp ;;
    timeframes: [
      date,
      time,
      raw
    ]
  }

  measure: sessions {
    type: count
  }

  measure: users {
    type: count_distinct
    sql: ${user_id} ;;
  }

  measure: events {
    type: sum
    sql: ${events_in_session} ;;
  }
}

Step 1: Splitting sessionization into 3 steps

The sessionization logic is relatively complex: this example has ~70 lines of SQL, but in many cases it can amount to hundreds of lines of code. We’re also doing several different things in one go:

  • Cleaning up our lists of events
  • Finding the session_id and session_index for each event
  • Aggregating the list of events into a sessions table

Long SQL queries doing many things at once are hard to understand and hard to debug. We’ve designed the Dataform framework to make it easy to split your SQL into logically separate pieces that can be chained together. So, let’s split the sessionization SQL into those three steps: cleaning, adding session details, and aggregating.

events_cleaned.sqlx:

config {
  type: "view",
  schemaSuffix: "prelim"
}

select distinct
  timestamp,
  user_id,
  name as event_name,
  id as event_id
from
  source.events_raw
where
  name not in ('test_event', 'clicked_logout_button_old')

sessionized_events.sqlx:

config {
  type: "view",
  schemaSuffix: "prelim"
}

with session_starts as (
  select
    *,
    coalesce(
      timestamp_diff(timestamp,
        lag(timestamp) over (partition by user_id order by timestamp asc),
        second) >= 30*60, -- 30 minute session timeout
      true
    ) as session_start_event
  from
    ${ref("events_cleaned")}
),

session_index as (
  -- add a session_index (user's first session = 1, user's second session = 2 etc)
  select
    *,
    sum(if(session_start_event, 1, 0)) over (
      partition by user_id
      order by
        timestamp asc
    ) as session_index
  from
    session_starts
)

-- add a unique session_id to each session
select
  *,
  farm_fingerprint(
    concat(
      cast(session_index as string),
      "|",
      cast(user_id as string)
    )
  ) as session_id
from
  session_index

sessions.sqlx:

config {
  type: "table"
}

select
  session_id,
  user_id,
  session_index,
  min(timestamp) as session_start_timestamp,
  max(timestamp) as session_end_timestamp,
  count(distinct event_id) as events_in_session
from
  ${ref("sessionized_events")}
group by
  session_id, user_id, session_index

By breaking the SQL up, we’ve achieved a few things:

  • We’ve created datasets for each step in the process. This will help with debugging, and also allow us to test each step (more on that later)
  • It’ll be easier for new team members to understand and contribute to the code
  • We can give the output of each step specific table names, and use schemas to separate intermediate tables (dataform_prelim.table_name) from our final tables (dataform.table_name).

After splitting things up, the ref() function (part of the Dataform framework) lets you express dependencies between transformations. It also allows Dataform to generate a dependency graph, helping you visualise your data modelling layer: in this case, source.events_raw → events_cleaned → sessionized_events → sessions.

Step 2: Improve performance with incremental table builds

The PDT is set up to rebuild the entire sessions table from scratch each time it is updated — this means rescanning the entire events_raw table on every refresh. For a small company the events table may be small, so each refresh is relatively quick. But as organizations grow, the events table can become very large, and rescanning it in full gets slow (and, if you’re using BigQuery or Snowflake, expensive too).

Given that we know event data doesn’t change retroactively, we’d rather only read the data that’s changed since our last refresh. Once we’ve processed the new data and generated new records, we can append these new records to the existing table. By reducing the amount of data processed, the processing time (and processing cost) can be drastically reduced.

The Dataform framework supports this optimised table update methodology with incremental tables. The changes to make events_cleaned incremental are shown below:
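events_cleaned.sqlx (a minimal sketch, assuming BigQuery; when(), incremental() and self() are built into the Dataform framework, and the simple timestamp-based lookback used here is one reasonable choice rather than the only option):

config {
  type: "incremental",
  schemaSuffix: "prelim"
}

select distinct
  timestamp,
  user_id,
  name as event_name,
  id as event_id
from
  source.events_raw
where
  name not in ('test_event', 'clicked_logout_button_old')
  -- on incremental runs, only scan events newer than those already in the table
  ${when(incremental(), `and timestamp > (select max(timestamp) from ${self()})`)}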

Incremental logic for sessionization can be relatively complex: you need to ensure that session indexes increment correctly, and make sure you don’t split sessions up in the wrong place. I won’t go into the details in this post, but the same incremental approach can be applied to sessionized_events and sessions.

Step 3: Monitor data quality with assertions

Now that we’ve converted the models to be incrementally built, they’ll update faster and cost less each time they update. But we have paid a small price in complexity. To give confidence that the sessionization is doing what we think it should, we should write some tests that check the output data meets our expectations.

The Dataform framework makes it easy for analysts to write tests on output data using assertions. Once we’ve added assertions to our transformation pipeline, Dataform will check that the assertions hold on every update. If an assertion fails at any step, an alert (via email or Slack) is sent to the team. This means that analyst teams are the first to find out when there are data quality issues — rather than finding out from business stakeholders!

In the example below we’ve added assertions to check that there is only one row per event_id, and that each of the fields is non-null.

sessionized_events.sqlx:

config {
  type: "incremental",
  schema: "prelim",
  uniqueKey: ["event_id"],
  assertions: {
    uniqueKey: ["event_id"],
    nonNull: ["timestamp", "user_id", "event_name", "event_id", "session_index", "session_id"]
  }
}

with session_starts as (
  select
    ...
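Assertions can be added to the other steps in the same way. As a minimal sketch, the config block of the sessions.sqlx file from step 1 could be extended like this (the specific rowConditions checks are illustrative assumptions rather than part of the original pipeline):

config {
  type: "table",
  assertions: {
    uniqueKey: ["session_id"],
    nonNull: ["session_id", "user_id", "session_index"],
    rowConditions: [
      "events_in_session > 0",
      "session_end_timestamp >= session_start_timestamp"
    ]
  }
}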

Because we’ve split the transformation into 3 steps, and added assertions after each step, debugging data quality issues is far easier: assertion failures pinpoint the exact point in the pipeline that something went wrong.

Now that the tested and incremental sessionization pipeline is set up in Dataform, we can schedule it to be updated at a regular interval. In this example, we’ll schedule the entire pipeline to run every hour. This can be configured in the Dataform web UI, with the configuration version controlled in the schedules.json file:

schedules.json
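A minimal sketch of an hourly schedule; the exact schedules.json schema depends on your Dataform version, so treat the field names below as illustrative assumptions (the cron expression runs at the start of every hour):

{
  "schedules": [
    {
      "name": "sessionization_hourly",
      "cron": "0 * * * *"
    }
  ]
}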

The last thing to do is update the LookML for the sessions view to point at the sessions table we are now updating each hour:

sessions.view.lookml

view: sessions {
  sql_table_name: dataform.sessions ;;

  dimension: session_id {
    type: string
    sql: ${TABLE}.session_id ;;
  }

  ...
}

Dataform + Looker

Looker is a fantastic business intelligence tool that allows analysts to move away from ad-hoc reporting, instead collaboratively developing a data product to enable self-serve analytics across the organization.

PDTs are a great starting point for analysts looking to quickly transform data. They lack a few features, however, which may hold your team back as it starts to scale. With the Dataform framework, we’re combining the ease of use of Looker with data engineering best practices for managing transformation pipelines. Pairing Dataform with Looker gives your data team the best of both worlds: a best-in-class BI platform and a best-in-class data modelling framework.

Originally published at https://dataform.co.
