Introducing Observable, Self-Documenting ELT

Saurabh Bhatnagar
7 min read · Jun 20, 2020


Virevol ai is a collaborative visual shopping app. We run a remote-first, lean team, and have for 2+ years. To do our magic, we swim in a lot of data.

We get millions of rows from hundreds of sources every day, run our models on them, and serve them in a variety of ways. We are on the Google stack and have GCP Cloud Functions, services, containers, you name it. They trigger other events on file drops, queue messages, hooks, and schedules. Everything is event-based and provisioned just in time (we are stingy).

There are too many data paths to keep in our heads.

With observable ELT, we can trace the provenance of data down to each row, with matching logs, inputs, and versions of every job that touched it. We can do ELT accounting in dollars, a pipe dream for most CTOs. This lets us rework the parts of the pipe that cost us the most, instead of reworking all legacy code.

So how do we keep sane? And how do we ship fast while keeping data quality high?

How data moves

ETL is a series of batch jobs. When one finishes, the next one is triggered. Sometimes two, sometimes they are Y-joined back to one.

You get a directed graph. Graphs are easy™.

Graphs go from top to bottom (or left to right if you like).


The complexity arises from handling what happens when some step in the middle breaks, or takes too long. Think Rube Goldberg machine.

Or worse, nothing breaks, and you find the error days down the road while making sense of the data.

To manage this complex pipeline coordination, people use all sorts of great tools — Luigi, dbt, Airflow, etc.

It’s a very top-down way of instrumenting.

The end-user cares only about the last step. If the report needs to go out at 9am and the data isn’t ready, you’ll get a call. That’s your SLA.

If a service needs refreshed recommendations for a user every 10 minutes, that’s your SLA.

The end-user sadly is not impressed by DAGs or fancy tools.

When things break, a tearful junior developer tries to debug ‘why’ at 3am. She joined The Company to get her hands dirty with neural nets, but was bait-and-switched into doing ETL. Worse, some call it ‘Data Janitorial Work’, not ‘Data Lorax’ as it should be called.

She toils through some documentation of what should have happened. If the person who wrote it still works there, a bit of Slack. Then she goes through the logs to make sense of things. If she is really lucky, there might even be a mostly up-to-date runbook for this job. She hops from PagerDuty to JIRA, to the wiki, to GitHub, to the logs.

Birds sing in the distance to welcome the break of dawn.

Observable ELT

We introduce a new concept called a Global Trace Identifier, or tid for short.

Every function, batch job or service that will transform data needs to get a tid for that specific run.

When requesting a tid, it provides all previous tids, its own version, and instance variables (e.g. X-AppEngine-QueueName, X-AppEngine-TaskName). These are logged to the service_tracker table in BigQuery.

Here is what a call looks like in Python.
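A minimal sketch, assuming a hypothetical get_tid Cloud Function endpoint and the requests library (the payload fields are assumptions too):

```python
# A minimal sketch: ask the tid service for a trace id before doing any work.
# GET_TID_URL and the payload shape are assumptions, not the exact API.
import requests

GET_TID_URL = "https://REGION-PROJECT.cloudfunctions.net/get_tid"  # hypothetical endpoint

def new_tid(service, version, prev_tids, headers=None):
    """Request a tid for this run, passing previous tids and our code version."""
    headers = headers or {}
    payload = {
        "service": service,                              # who we are
        "version": version,                              # e.g. a git SHA
        "prev_tids": prev_tids,                          # tids of the runs that triggered us
        "queue": headers.get("X-AppEngine-QueueName"),   # instance variables, if present
        "task": headers.get("X-AppEngine-TaskName"),
    }
    resp = requests.post(GET_TID_URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["tid"]
```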

It has to log that run with this tid.

When done, it has to write this tid to that row’s last_tid column. If it is a new row, it also has to populate the row’s tid column. Scroll down to the “Slowly Changing Dimensions” section for optimizations.
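A minimal sketch of that stamping step, with table and column names assumed:

```python
# Every output row carries an originating tid and this run's tid as last_tid.
# The table name, row shape, and streaming-insert approach are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

def write_rows(rows, tid, origin_tid, table="warehouse.recommendations"):
    """Stamp transformed rows before writing them back to BigQuery."""
    for row in rows:
        row["tid"] = row.get("tid") or origin_tid  # only brand-new rows get a tid
        row["last_tid"] = tid                      # this run touched the row last
    errors = client.insert_rows_json(table, rows)
    if errors:
        raise RuntimeError(f"failed to write rows: {errors}")
```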

Here is the deployed Google Cloud Function that does all we need for this.
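A minimal sketch of what such a function might look like, as an HTTP-triggered Cloud Function writing to an ops.service_tracker table (the dataset name and schema are assumptions):

```python
# Hypothetical get_tid Cloud Function: issue a fresh tid and record who asked
# for it, with which prev_tids and code version, in ops.service_tracker.
import datetime
import uuid

from google.cloud import bigquery

client = bigquery.Client()
TRACKER = "ops.service_tracker"  # assumed dataset.table

def get_tid(request):
    body = request.get_json(silent=True) or {}
    tid = uuid.uuid4().hex
    row = {
        "tid": tid,
        "service": body.get("service"),
        "version": body.get("version"),
        "prev_tids": body.get("prev_tids", []),
        "queue": body.get("queue"),
        "task": body.get("task"),
        "created_ts": datetime.datetime.utcnow().isoformat(),
    }
    errors = client.insert_rows_json(TRACKER, [row])
    if errors:
        return (f"failed to log tid: {errors}", 500)
    return {"tid": tid}
```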

How does it work?

Let’s draw a graph outlining these cases:

Here A, B, C, …, F are all different processes. Each wakes up, takes a thing, sends a thing, and shuts down. But we do not need to have a documented graph like this one anywhere.

So how do we generate it?

From the perspective of the table row that got inserted at the end of D, we have the tid of A and a last_tid of D. D’s tracker record traces back to [B], and B traces back to [A].

So we know: A -> B -> D -> this row. We can trace every row here to the run and the code version that generated it.
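Reconstructing that chain is just a walk backwards through service_tracker. A minimal sketch, assuming the ops.service_tracker schema from the earlier sketch:

```python
# Walk prev_tids backwards from a row's last_tid to rebuild the chain of runs
# that produced it. Table and column names are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

def trace(last_tid):
    """Return the tracker records that led to last_tid, roughly oldest first."""
    chain, seen, frontier = [], set(), [last_tid]
    while frontier:
        rows = list(client.query(
            "SELECT tid, service, version, prev_tids, created_ts "
            "FROM `ops.service_tracker` WHERE tid IN UNNEST(@tids)",
            job_config=bigquery.QueryJobConfig(query_parameters=[
                bigquery.ArrayQueryParameter("tids", "STRING", frontier)
            ]),
        ))
        chain.extend(rows)
        seen.update(r["tid"] for r in rows)
        frontier = [t for r in rows for t in (r["prev_tids"] or []) if t not in seen]
    return list(reversed(chain))  # e.g. A -> B -> D
```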

Let’s do a more complex case.

From the perspective of the table row that got inserted at the end of F, we have the tid of A and a last_tid of F, which traces to the tids of [E, D]. We can follow each tid in that list to reconstruct the graph in reverse.

When something breaks (say F), we won’t get a row for it in the data warehouse. Our fearless developer can look at the previous run that references F as a prev_tid (tracker records are timestamped). She can then pull the log of F’s run directly and will find that the rows it choked on actually broke in E.

She can do all this without asking anyone else. You still need runbooks for service F, right in GitHub.

She goes from PagerDuty to Google Cloud Logging, and to GitHub for code and issues. If she finds an issue with the service, she can update and push code for it right from there, which triggers CI/CD tests.

She can document right in code and point to the tid that helped her find that issue. The code is the best ‘current’ documentation. The table is a time series of pointers to it.

We can do more than that: we can audit how long each table takes to produce and which jobs along the chain drive that time. We can figure out when the chains changed (ML job C got inserted into the mix).
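A minimal sketch of such an audit, assuming each run also logs a finished_ts when it completes (that column is an assumption):

```python
# How long does each service take per day? Aggregate over service_tracker.
from google.cloud import bigquery

client = bigquery.Client()

AUDIT_SQL = """
SELECT service,
       DATE(created_ts) AS day,
       COUNT(*) AS runs,
       AVG(TIMESTAMP_DIFF(finished_ts, created_ts, SECOND)) AS avg_seconds
FROM `ops.service_tracker`
GROUP BY service, day
ORDER BY day DESC, avg_seconds DESC
"""

for row in client.query(AUDIT_SQL):
    print(row["service"], row["day"], row["runs"], row["avg_seconds"])
```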

We can figure out which legacy code needs rework by looking at the cost of each run for each service. This enables a new way to optimize. We optimize not by just ‘feeling’ which code irks us the most, but by an objective function — cost.

Think of it as profiling a program, except you see it in terms of functions and dollars, not subroutine timings.
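One way to get those dollar figures, sketched under the assumption that every BigQuery job gets labeled with its tid (labels and the region qualifier are assumptions, not part of the original scheme):

```python
# Attribute on-demand query cost back to tids via INFORMATION_SCHEMA job metadata.
from google.cloud import bigquery

client = bigquery.Client()

COST_SQL = """
SELECT l.value AS tid,
       SUM(total_bytes_billed) / POW(2, 40) * 5.0 AS approx_usd  -- ~$5/TB on-demand; adjust to your pricing
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(labels) AS l
WHERE l.key = 'tid'
GROUP BY tid
ORDER BY approx_usd DESC
"""

for row in client.query(COST_SQL):
    print(row["tid"], round(row["approx_usd"], 2))
```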

Let’s talk about small but important details to get better results.

ELT not ETL

Since your data is in the cloud, moving data around is the largest share of your bill. Do ELT instead of ETL wherever possible. The data will be transformed right on the server and saved back, with minimal transfers needed.

And it will be faster for another reason: BigQuery and most modern data warehouses are doing map-reduce for you on their large clusters of servers. You would have to spin up a k8s cluster to get the same level of performance on a large load.

To get the most bang for your buck, use database-specific SQL, with no abstractions, to transform data. SQL is the right tool for the job in this case. Get over it.

How complex can SQL get? As an exercise, here is tf-idf being calculated in SQL, just for fun (in practice you’d use something like spaCy, because you probably need more steps). You might also pick up styling tips from the sample that make SQL enjoyable.
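A rough sketch of the idea (not the linked sample), assuming a corpus table with doc_id and text columns and naive whitespace tokenization:

```python
# tf-idf in BigQuery SQL: tokenize, count terms per doc, compute idf, multiply.
# The demo.corpus table and the tokenization are assumptions.
from google.cloud import bigquery

TFIDF_SQL = """
WITH tokens AS (
  SELECT doc_id, word
  FROM `demo.corpus`, UNNEST(SPLIT(LOWER(text), ' ')) AS word
  WHERE word != ''
),
doc_len AS (
  SELECT doc_id, COUNT(*) AS n_tokens FROM tokens GROUP BY doc_id
),
term_counts AS (
  SELECT doc_id, word, COUNT(*) AS n FROM tokens GROUP BY doc_id, word
),
totals AS (
  SELECT COUNT(DISTINCT doc_id) AS n_docs FROM tokens
),
idf AS (
  SELECT word,
         LN((SELECT n_docs FROM totals) / COUNT(DISTINCT doc_id)) AS idf
  FROM term_counts
  GROUP BY word
)
SELECT tc.doc_id, tc.word, tc.n / dl.n_tokens * idf.idf AS tf_idf
FROM term_counts tc
JOIN doc_len dl USING (doc_id)
JOIN idf USING (word)
ORDER BY doc_id, tf_idf DESC
"""

for row in bigquery.Client().query(TFIDF_SQL):
    print(row["doc_id"], row["word"], row["tf_idf"])
```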

For observable ELT, use a Google Cloud Function that fetches the tid and runs parameterized SQL from a dictionary.

Something like this:
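A minimal sketch, with the job dictionary, table names, and the get_tid endpoint all assumed:

```python
# One Cloud Function looks up the SQL for a job, fetches a tid, then runs the
# SQL with the tid as a query parameter. All names here are illustrative.
import requests
from google.cloud import bigquery

GET_TID_URL = "https://REGION-PROJECT.cloudfunctions.net/get_tid"  # hypothetical endpoint

SQL_JOBS = {
    "refresh_recs": """
        INSERT INTO `warehouse.recommendations` (user_id, item_id, score, tid, last_tid, added_ts)
        SELECT user_id, item_id, score, @tid, @tid, CURRENT_TIMESTAMP()
        FROM `staging.scored_candidates`
    """,
}

client = bigquery.Client()

def run_job(request):
    body = request.get_json(silent=True) or {}
    job_name = body["job"]  # e.g. "refresh_recs"
    tid = requests.post(GET_TID_URL, json={
        "service": job_name,
        "version": body.get("version"),
        "prev_tids": body.get("prev_tids", []),
    }, timeout=10).json()["tid"]

    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("tid", "STRING", tid)],
        labels={"tid": tid},  # lets the cost audit above attribute dollars to this run
    )
    client.query(SQL_JOBS[job_name], job_config=job_config).result()
    return {"tid": tid}
```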

If you can do it all in SQL, then use the MERGE statement directly to be faster (see sample).
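A hedged sketch of such a MERGE, with table and key names assumed; matched rows get a new last_tid, brand-new rows get both tid and last_tid:

```python
# Upsert a delta table into the warehouse table in one statement.
from google.cloud import bigquery

MERGE_SQL = """
MERGE `warehouse.products` AS t
USING `staging.products_delta` AS s
ON t.product_id = s.product_id
WHEN MATCHED THEN
  UPDATE SET price = s.price, last_tid = @tid
WHEN NOT MATCHED THEN
  INSERT (product_id, price, tid, last_tid, added_ts)
  VALUES (s.product_id, s.price, @tid, @tid, CURRENT_TIMESTAMP())
"""

client = bigquery.Client()
client.query(MERGE_SQL, job_config=bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("tid", "STRING", "abc123")]  # hypothetical tid
)).result()
```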

Oh, and make sure you have test cases for all this fancy SQL you now have.

Slowly Changing Dimensions

We use three other niceties that make life beautiful.

One is to wrap every table behind a view, so a table named ‘students’ becomes ‘v_students’. This enables easy refactoring. It’s something I came up with at Rent The Runway and have kept since.
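A minimal sketch, with dataset and table names assumed:

```python
# Consumers query the view, so the underlying table can be renamed,
# repartitioned, or restructured without breaking them.
from google.cloud import bigquery

bigquery.Client().query("""
CREATE OR REPLACE VIEW `warehouse.v_students` AS
SELECT * FROM `warehouse.students`
""").result()
```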

The second is to use a version of SCD type 2.

For the most important tables, every row is only ever appended, and the table is partitioned by time. The view on top then gives us the latest version of each row using a rank() over added_ts.
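A hedged sketch of that view, with the key column name assumed:

```python
# The table is append-only; the view exposes only the latest version of each
# row, ranked by added_ts. Column names are assumptions.
from google.cloud import bigquery

bigquery.Client().query("""
CREATE OR REPLACE VIEW `warehouse.v_students` AS
SELECT * EXCEPT (rnk)
FROM (
  SELECT *,
         RANK() OVER (PARTITION BY student_id ORDER BY added_ts DESC) AS rnk
  FROM `warehouse.students`
)
WHERE rnk = 1
""").result()
```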

This might seem wasteful, but it keeps us sane, and the cost of storage isn’t as high as one might think. We can always drop old partitions when a table gets too large.

Lastly, every job is restartable. This is usually a standard for most mature data teams, so I won’t dwell on it.

With observable ELT, we had to invent a couple of new tools, but we do not need any other external tool for coordination. It coexists with Google Cloud Tasks queues, Pub/Sub, and Apache Beam pipelines.

So far we are happy. We will let you know if we still are once we have a slightly larger team servicing a billion customers.

These ideas can be extended to logging as well. We recently came across this thread on logging and are excited to try it out.

We would love to hear whether this gave you any ideas about your own ETL flow. This decentralized approach should help you ship faster.

If you like to work on challenging problems, and are an unconventional thinker, please reach out.
