dbplyr: A Path to More Inclusive Data Transformations at the ACLU
Here at the ACLU analytics team, we have quite a few useRs, including at least one R celebrity whom you may recognize every time you use her package to have Gucci Mane tell you when your R script is done. Because we use R so much, some of us find it much easier to write dplyr than SQL.
But other parts of our organization and team use Python, pure SQL, or BI tools to access our data warehouse. When the task calls for it, we’ll use whatever language we need (no language wars please! Don’t @ me …). Ultimately, we want everyone to be able to use the data easily, no matter what language they’re using. So when data scientists use esoteric programming languages (R or Python) to perform large-scale data transformations or feature engineering entirely in their local environments, they miss an opportunity to be inclusive in their work: they don’t create reusable data workflows that others on the team can benefit from. Transformed model features from data science projects can often be repurposed into more accessible reporting variables and can help contextualize future data projects.
Fortunately, we’ve found a solution that works for us and overcomes the limitations that raw SQL and dplyr each have on their own. The trick is to use dbplyr as a meta-programming language: we use it to generate SQL queries. For those unfamiliar with dbplyr, it is a layer that converts R code written in dplyr into SQL that executes lazily on the server, only once a user decides to pull the data into their local environment, and not before. For the typical R user, this backend works seamlessly in the background to translate commands, with minimal changes to their workflow. We simply copy the dbplyr-generated SQL (modifying it after the fact if needed) and build it into tables in our warehouse. For the remainder of this post, I will walk through some examples of this in our work. The rest of the article assumes a basic familiarity with R, the tidyverse, and/or at least one variant of SQL. If you want to learn more about any of these first, the following resources would be helpful:
TLDR : we use the Tidyverse’s dplyr to generate complex SQL code in our data warehouse.
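To make that workflow concrete, here is a minimal, self-contained sketch: ordinary dplyr code against a lazy table, with `show_query()` revealing the SQL that dbplyr generates. The table and column names are illustrative, not our actual schema; `memdb_frame()` stands up a throwaway in-memory SQLite table purely for demonstration.

```r
library(dplyr)
library(dbplyr)

# A throwaway in-memory SQLite table, standing in for a warehouse table.
# Column names here are illustrative, not our real schema.
donations <- memdb_frame(
  account_id = c(1, 1, 2),
  amount     = c(25, 50, 100)
)

# Ordinary dplyr: nothing is pulled locally yet -- the query is lazy.
totals <- donations %>%
  group_by(account_id) %>%
  summarise(total_given = sum(amount, na.rm = TRUE))

# Print the SQL dbplyr generated (a SELECT ... GROUP BY statement)
# without executing it:
totals %>% show_query()

# Only collect() actually runs the query and brings rows into R:
totals %>% collect()
```

Against a real warehouse, the only change is swapping `memdb_frame()` for `tbl(con, "donations")` on a live DBI connection.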
Often, there are queries that are hard to construct sensibly in either SQL or dplyr alone. In those instances, it’s especially nice to bring the two together. As an added benefit, dbplyr’s linear piping structure can make code review of complicated logic much easier than reviewing the equivalent complex SQL.
Example 1 — Interval Joins — “At the time of a gift, how often did the person give in the prior year?”
In a given year, we can receive hundreds of thousands of individual donations (we are a member-driven organization). Often, staff will want to answer questions like “how much did a donor give in the 3 months before their last gift?”. Our donation table looks approximately like this, with one row per donation from a specific individual:
Answering this question at the individual gift level requires an interval join: we join the donation table back to itself with “between” logic based on gift date. Very feasible in SQL, not so much in dplyr. But when the two are combined, it’s very easy to do.
In the script below, we use the sql_on capability added in version 1.4.0 of dbplyr to perform the self-join, where the left table is the original donations table, and the right is itself limited to all gifts within the specified prior period (here, 48 months), joined by account_id and a “between” statement for the gifts on the right table. We then turn this into a function so we can repeat with whichever rolling periods we need in further analyses.
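That function might be sketched as follows. To be clear, this is an illustration under assumptions, not our production code: the table and column names (`account_id`, `gift_date`, `amount`) are hypothetical, and the `DATEADD` calls assume a Snowflake/SQL Server-style dialect, so adjust the date arithmetic for your warehouse.

```r
library(dplyr)
library(dbplyr)

# Sketch of the rolling-lookback self-join, parameterized by the number
# of months to look back. All table/column names are illustrative.
add_prior_giving <- function(donations, months_back = 48) {
  # Right-hand side: the same table, with renamed columns so the
  # joined result is unambiguous.
  prior <- donations %>%
    select(
      prior_account_id = account_id,
      prior_gift_date  = gift_date,
      prior_amount     = amount
    )

  donations %>%
    # sql_on (dbplyr >= 1.4.0) lets us write the "between" predicate
    # directly; LHS/RHS are the aliases dbplyr gives the two tables.
    left_join(
      prior,
      sql_on = paste0(
        "LHS.account_id = RHS.prior_account_id AND ",
        "RHS.prior_gift_date BETWEEN DATEADD(month, -", months_back,
        ", LHS.gift_date) AND DATEADD(day, -1, LHS.gift_date)"
      )
    ) %>%
    group_by(account_id, gift_date, amount) %>%
    summarise(
      prior_gift_count = sum(if_else(is.na(prior_amount), 0L, 1L), na.rm = TRUE),
      prior_gift_total = sum(prior_amount, na.rm = TRUE)
    ) %>%
    ungroup()
}
```

Calling `add_prior_giving(tbl(con, "donations"), 12)` and then piping the result to `show_query()` shows the self-join SQL for a 12-month window.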
With all that up and running, the data scientist can go on with their analyses, no problem. But, nobody else benefits from their efforts. Fortunately, we can take the final step and turn the dbplyr into a table in our data warehouse that everyone can benefit from! Notice below we can add all types of other conveniences the tidyverse provides us — like re-arranging our columns easily, adding useful commentary to our SQL so no one modifies by hand, etc.
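Concretely, that final step can go one of two ways: capture the generated SQL (with a comment header warning against hand edits) to drop into the warehouse, or materialize the table straight from R with `compute()`. This sketch assumes an existing DBI connection `con`; the table and column names are illustrative.

```r
library(dplyr)
library(dbplyr)

# A lazy query against the warehouse; names are illustrative.
rolling_gifts <- tbl(con, "donations") %>%
  group_by(account_id) %>%
  summarise(lifetime_total = sum(amount, na.rm = TRUE)) %>%
  relocate(lifetime_total, .before = account_id)  # tidyverse conveniences work too

# Option 1: capture the generated SQL, prefixed with a warning comment,
# ready to paste into a warehouse view or table definition.
query_text <- paste(
  "-- Generated by dbplyr; edit the R source, not this SQL.",
  remote_query(rolling_gifts),
  sep = "\n"
)

# Option 2: materialize it server-side as a permanent table.
compute(rolling_gifts, name = "donor_lifetime_totals", temporary = FALSE)
```

Either way, the dplyr pipeline stays the single source of truth, and everyone querying the warehouse gets the result.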
Example 2 — Gift Sequences — “How many gifts had this person given before a particular gift?”
A donor’s “journey” at the ACLU can be complicated: a person may give once when they see clear violations of civil liberties in the news, then later decide they want to make a deeper commitment and start recurring donations. We often want to visualize and understand common paths, and a table of “gift sequences” allows us to answer questions ranging from “For those who convert from one-time to recurring donations, how soon after the first donation does the second occur?” to “If a credit card is about to expire, how long does it take for us to update the payment method?”.
Here, we use window functions. In dplyr, these don’t feel all that different from normal operations thanks to the mutate* functions, but in SQL they can require complicated partitions and, at least intuitively, can be challenging.
We start with the same donations table highlighted above. We then build the sequence logic into a function that we can apply to different types of gifts to create different chronological sequences, e.g., nth transaction, nth completed gift, or nth recurring payment since a parent pledge was made.
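A sketch of what such a function can look like is below, with hypothetical column names (`status`, `gift_type`). dbplyr’s `window_order()` supplies the ORDER BY for the window, so a grouped `mutate()` with `row_number()` translates to `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)`.

```r
library(dplyr)
library(dbplyr)

# Given a lazy donations table and a filter defining which gifts count,
# number each account's qualifying gifts chronologically. Column names
# are illustrative, not our actual schema.
add_gift_sequence <- function(donations, seq_name, ...) {
  donations %>%
    filter(...) %>%                 # e.g. keep only completed gifts
    group_by(account_id) %>%        # PARTITION BY account_id
    window_order(gift_date) %>%     # ORDER BY gift_date in the window
    mutate({{ seq_name }} := row_number()) %>%
    ungroup()
}

# nth completed gift vs. nth recurring payment per account:
completed_seq <- add_gift_sequence(donations, completed_gift_seq,
                                   status == "completed")
recurring_seq <- add_gift_sequence(donations, recurring_payment_seq,
                                   gift_type == "recurring")

# Inspect the window-function SQL dbplyr generates:
completed_seq %>% show_query()
```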
With the above in place, it’s easy to make many different types of sequences, which we can then write to SQL and schedule for regular updates.
I’ve now walked through a couple of examples of writing dplyr to generate SQL code, and hopefully convinced at least some of you that it can be a valuable way to write tricky data manipulations in more readable and reusable ways.
However, if you simply add the final outputs of these queries to your existing ELT/ETL workflows, you may see some slowdowns. dbplyr isn’t necessarily as efficient as hand-tuned code or code built in a workflow engine. For instance, both the interval joins and the gift sequences described above can be parallelized with the right re-shuffling. In a future post, I’ll walk through how we use dbt with dbplyr to make following and engaging with our member/donor journeys both easy and performant.