The Funnel: A Fundamental Abstraction in Designing Autonomous Commerce Marketing Systems

Marcel Schoffelmeer
Bluecore Engineering
7 min readDec 3, 2018

In the past few years, promotional emails have begun to fill my inbox. I’ve started to rigorously unsubscribe from mailing lists that I either didn’t subscribe to in the first place or just wasn’t interested in. A few exceptions for me are technical book publishers that send me emails about the latest publications and offers, which allow me to jump on deals for books on subjects that I like to know more about.

The Funnel

So how does the publisher know that they should be sending an email to me rather than say, to my mother? To effectively send emails, funnels are used to find a targeted audience. Funnels are mainly used by marketers to describe the steps website visitors go through before they reach conversion. For building email audiences, we use a similar approach where we let a set of customers be filtered by steps, like in this example:

With each step the set of customers gets smaller and at the end we get a filtered audience to target.

Bluecore provides marketers a UI to build these funnels as illustrated in the example above:

What happens when the “Preview” button is pressed or when the audience is used in a scheduled email campaign, is what we will be discussing in this article. The backend is responsible for obtaining this audience from our database. To accomplish this, we needed to:

  1. Define an Interface Definition Language (IDL) between the UI (frontend) and backend to convey the funnel.
  2. Convert the IDL to a query for our retail data warehouse.

The Funnel IDL

We defined an IDL for the funnel to serve as a fundamental abstraction for marketing in general. This abstraction makes it possible for marketers to communicate complex behavioral marketing campaigns (in this case, as configured in the Bluecore UI). It also provides isolation from the underlying database. Currently we compile the IDL definition into a SQL query but this may change in the future should the underlying datastore change (e.g. schema layout or even switching from an analytical database to a streaming system).

The IDL has gone through many iterations/evolutions. It can now be used to express audiences for campaigns that:

  • run on an hourly basis/sliding window
  • are based on customer behavior or catalog changes
  • are “run once” campaigns
  • use multiple channels
  • etc.

To continue the example above, the UI funnel steps are mapped onto steps in the IDL definition:

- event: Viewed product
timeframe: last 7 days
filter:
- key: category
value: technical books
- event: Added to cart
timeframe: last 7 days
- event: Purchased
timeframe: last 10 days
inverted: true

The time frame attributes limit the range where we’ll look for events and the inverted marker indicates an exclusion (e.g. customers that have not purchased).

From Funnel to SQL

The customer data needed to build the audience is stored in an analytics database (BigQuery), with a typical schema such as a customers table or tables for actions like viewed product, added to cart etc:

We needed to build a SQL query that would return the required audience using the data in these tables.

Here, we opted to split processing into:

  1. A transformation process to turn the IDL definition into a SQL Abstract Syntax Tree (AST)
  2. A SQL string emitter that spits out the SQL from the AST

SQL AST

Using the IDL funnel definition as input, we then needed to create the corresponding SQL to execute against our database. While it may be tempting to take the IDL and write a complicated string manipulation function, this turned into an unmaintainable mess very quickly. As the IDL increased in complexity, changes became harder to implement and verify given the lack of structure in the code. Instead, we create an intermediate form, an abstract syntax tree (AST), that represents the SQL.

The following example is a syntax tree that represents a funnel with 3 steps, each represented by a sub-select (details left out as indicated by the striped arrows):

Note how the inverted (not) purchased step is translated to a left outer join.

The code transforming the IDL only needs to work with higher level AST objects such as a WhereClause, Select or Table, instead of being concerned with constructing the entire SQL string. Each specific AST class implements its own part of the SQL syntax and delegates complex SQL generation to child objects. Using this class-based approach we can:

  • Automatically give aliases to the sub-selects (as required by BigQuery), by incrementing a counter (e.g. t1, t2, etc).
  • Add behavior to the SQL element classes such as being able to ask a (sub) select what columns are selected (e.g. customer_id). For example, when generating the ON condition for a JOIN clause, the Join object determines this by matching columns selected from each side of the JOIN (e.g. “ON t1.customer_id = t2.customer_id”).
  • Generate aggregation functions for top level selection column references that are not part of the GROUP BY SQL clause.
  • Generate table decorators, a BigQuery feature to optimize query performance, for cases when we only access data that was added in the last 7 days.

Once we have the SQL represented as a tree structure, we can do things such as make modifications on a global scale, alter the tree for performance reasons, etc. These are changes that would be very hard to do correctly without having this form.

Limitations

Note that by specifically dealing with a known table structure and a funnel IDL we can make certain assumptions on the structure of the SQL tree. For a wider range of SQL variety, the ability to auto-generate parts of the SQL will diminish. In fact, for certain data science related scenarios we introduced a “HardCodedSelect” node in the tree that allowed our data science team to create an almost free-form SQL statement, as long as the result of the sub-select would fit the overall SQL structure.

SQL Emitter

Using the Visitor design pattern, the SQL emitter traverses the SQL tree while writing to a string buffer. So in the end, we are indeed piecing strings together! The SQL emitting process applies formatting rules making all the campaign SQL have the same format for easy reading and comparison.

Here is a simplified part of the (BigQuery) SQL statement for our example funnel to find people that viewed products but did not purchase nor have emails delivered for respective timeframes:

SELECT t1.email
FROM (
SELECT email, id
FROM [viewed_product_201806]
WHERE created >= "2018-06-15 12:00:00"
GROUP BY email, id
) t1
JOIN
(
SELECT email, id
FROM [viewed_product_201806]
WHERE created >= "2018-06-15 12:00:00"
GROUP BY email, id
) t2
ON t1.email = t2.email AND t1.id = t2.id
LEFT OUTER JOIN (
SELECT email
FROM [purchase_201708]
WHERE created >= "2018-06-12 12:00:00"
GROUP BY email
) t3
WHERE t3.email IS NULL
GROUP EACH BY t1.email
IGNORE CASE

We found that, on average, we get twice the amount of SQL lines compared to the source IDL. Note that this is a highly simplified example. Our actual production SQL can span as many sub-selects as BigQuery will allow, sub-selects can contain yet more sub-selects for certain scenarios, more WHERE filters etc., all driven from the input IDL.

The Result

The marketing funnel to SQL generation process described here resulted in our customers being able to come up with almost limitless variations in audience definitions while on the back-end we are still able to retrieve this audience with a single BigQuery query. At the time of writing, Bluecore executes more than 50,000 uniquely generated BigQuery queries per day using this method.

Over the last few years, different engineering teams in our company have added new features to the funnel IDL and implemented the SQL generation side with it to support customer needs. A large part of of our organization is familiar with the IDL representation of funnels, without needing SQL expertise while still able to reason about the behavior of funnels if necessary.

Hacking SQL

During one of our recurring hackathon events we implemented an alternative SQL generation module that generates “Standard SQL”, the BigQuery SQL dialect that replaces “Legacy SQL” on which we originally built our system (see https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql).

The separation between funnel definition and SQL generation allowed this to be a fairly simple task even for engineers who have not worked in this area. Our initial tests showed the queries performed better too (they were less BigQuery resource hungry)!

--

--