Designing And Building An Event Tracking Backend with BigQuery

Hayo van Loon · incentro
Feb 16, 2019 · 7 min read

Comprehensive event tracking has become pretty common in business analytics. Where for a long time event data was nearly synonymous with website click-stream data, it has now found uses in plenty of other areas (e-mail tracking, *cough* internet-of-things). There are several off-the-shelf offerings that help you store and prep data for analysis, but building your own solution might be easier than you think.

In this post I will walk through designing and building a real-time solution on top of Google’s Magical Data Warehouse of Wonders, more commonly known as BigQuery. With only (auto-)scalable components, the solution can start small, growing (and shrinking) as volume and throughput demand. The data source can be any application that is able to push data tagged with a user ID; so my scope here is getting those data stored and ready for analysis.

For this exercise we will be helping fictional Bob’s fictional company, “Bob’s Knob Shops”, with handling its click-stream data. Bob’s empire spans several shops, loosely matching knob type families. His customers live on every continent (including Antarctica) and his company never sleeps. Bob craves insight into user behaviour on his web shops. The event data from Bob’s site is limited to a timestamp, id, time zone, URL and referrer URL. These data are always both impeccably well-formed and semantically valid.

Onwards!

Specifying a data schema that matches your data access patterns is important for many database solutions and BigQuery is no exception. BigQuery is a column store and works best when the data are: 1) denormalised and 2) stored in a single table.

Storing all the data in a single table might seem troublesome, but BigQuery’s column types include arrays (lists) and structs (objects). This enables you to include data you would otherwise store in auxiliary tables. In doing so, you avoid the need for JOIN operations on (potentially) billions of rows.

Another best practice is to cluster your data (I strongly recommend reading Felipe Hoffa’s post [1] on it). Functionally transparent, data clustering causes BigQuery to store related rows in close proximity. This can have a profoundly beneficial effect on performance and, thus, on costs [1]. So let’s use that as well.

The resulting table creation query would look something like this. The data will be partitioned by date and clustered by channel (shop type) and visitor, because Bob typically likes his statistics by channel first and then by user. Know your data access patterns.

CREATE TABLE `bobs_knob_shops.sessions` (
  channel_id STRING,
  visitor_id STRING,
  timezone_offset INT64,
  start_time TIMESTAMP,
  end_time TIMESTAMP,
  hit_count INT64,
  hits ARRAY<STRUCT<
    timestamp TIMESTAMP,
    url STRING,
    referrer_url STRING
  >>)
PARTITION BY
  DATE(start_time)
CLUSTER BY
  channel_id,
  visitor_id

Marvellous. And now we hit our first snag. Below is an example event message and if you take a step back, you might spot the problem. (There is nothing wrong with this data or its format.)

{
  "channel_id": "bobs-door-knobs",
  "visitor_id": "visitor1",
  "timestamp": "1544896138528",
  "timezone_offset": -60,
  "url": "https://door.knobsfrombobs.za/product/123456-78",
  "referrer_url": "https://door.knobsfrombobs.za/lister/cellar"
}

The issue is that we receive single events, whereas we designed our rows to be complete sessions. In practice, this would mean that rather than appending a new row, we would need to update part of an existing one (unless it is the first in the session). And updating rows, while possible, is not something BigQuery was designed to do. Updating rows is a rather expensive operation; it runs through all the bytes in the table. To add insult to injury, you cannot run multiple updates on the same table in parallel.

Getting Back On Track

We could use an additional technology, like Apache Beam, to aggregate events into sessions and trigger a write to BigQuery once a session has concluded. For our use case however, this is not needed. We will use an unaggregated flat table for our events and ship them over into the final sessions table every so often.

This ties in nicely with a closing note in Felipe Hoffa’s post [1]. He mentions that inserting data multiple times during a day will weaken the clustering. So using a single table would never have been a good idea (and yes, that would have been an issue for the Beam job as well).

Hence, an additional table it is; we will copy its contents over every night. We will use the UTC time zone, which is the default on Google Cloud, to determine midnight. It is best to be explicit about time zones, but this way the punishment for accidentally omitting one is less severe (if there is any at all). We will skip clustering and partitioning; we do not expect to query this table a lot and it will contain at most a single day of data.

CREATE TABLE `bobs_knob_shops.events` (
  channel_id STRING,
  visitor_id STRING,
  timestamp TIMESTAMP,
  timezone_offset INT64,
  url STRING,
  referrer_url STRING
)

Don’t you love late-night online knob shopping sprees? Alice sure does. Charles not so much, but he lives in a wildly different time zone. You might have already guessed it: a job that simply copies over all the data in the daily table would leave you with broken sessions.

The Devil is in the Details

So we need to be smarter. Also note that while this daily job is running, new events will keep pouring in. And deleting the rows after they have been inserted into the aggregated table? Another expensive, semi-blocking operation for BigQuery.

The solution is to use two unaggregated tables: one for the even days, another for the odd ones. Writing is only done to the active table, which, at the turn of midnight, becomes the inactive table. Then all we need to do is:

  • aggregate the events of concluded sessions into session rows and insert those into our final sessions table,
  • copy the events of sessions that have not yet concluded into the (newly) active table and
  • truncate the inactive table.

Two more queries and Bob’s your uncle. Our first task is to identify closed sessions. We will need to define a rule for this; we will go with “a session is considered closed when no new events have been detected for a user for two hours”.

Applying this rule, we could run the job at midnight and add a clause filtering out all sessions that had events after 22:00 h. While this would work, it would delay sessions ending well before midnight by 24 hours, causing inconsistencies in day-to-day reports. Quick fix: we run the job (at least) two hours after midnight on a union of both tables.

INSERT INTO `bobs_knob_shops.sessions` (
  channel_id, visitor_id, timezone_offset, start_time, end_time,
  hit_count, hits
)
SELECT
  channel_id,
  visitor_id,
  MAX(timezone_offset) timezone_offset,
  MIN(timestamp) start_time,
  MAX(timestamp) end_time,
  COUNT(1) hit_count,
  ARRAY_AGG(STRUCT(
    timestamp AS timestamp,
    url AS url,
    referrer_url AS referrer_url
  ) ORDER BY timestamp) hits
FROM (
  SELECT
    *,
    MAX(timestamp) OVER (PARTITION BY visitor_id) last_timestamp
  FROM
    `bobs_knob_shops.events_*`)
WHERE
  last_timestamp <= TIMESTAMP(CURRENT_DATE())
GROUP BY
  channel_id,
  visitor_id

Of course, the logic in the query for the cross-midnight sessions complements this one. There is, however, a small challenge: we cannot capture our table selection logic in BigQuery SQL (as of the time of writing). So we will leave the destination table as a placeholder in the query and let the controlling application fill it in.

INSERT INTO `{}` (
  channel_id, visitor_id, timestamp, timezone_offset,
  url, referrer_url
)
SELECT
  channel_id,
  visitor_id,
  timestamp,
  timezone_offset,
  url,
  referrer_url
FROM (
  SELECT
    *,
    MAX(timestamp) OVER (PARTITION BY visitor_id) last_timestamp
  FROM
    `bobs_knob_shops.events_*`)
WHERE
  timestamp < TIMESTAMP(CURRENT_DATE())
  AND last_timestamp >= TIMESTAMP(CURRENT_DATE())

Tying It All Together

The final task is to truncate the inactive table. This is handled via two simple API calls: delete and create. We can use a simple AppEngine service to act as our workflow manager; it comes with cron functionality to boot. We will also use it as our event-receiving API back-end.

Given that Google client libraries are available for all standard AppEngine runtimes, you are free to choose your favourite language. Here I will use Python.
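
Before diving into the snippets, here is a minimal sketch of the setup they assume. The naming is my own, not necessarily that of the demo code: the client comes from the google-cloud-bigquery library, and the *_QUERY constants simply hold the SQL statements shown earlier as (template) strings.

# Assumed setup for the snippets below (sketch; names are illustrative).
from google.cloud import bigquery

client = bigquery.Client()  # picks up the AppEngine service account credentials

# The SQL statements from earlier, kept as plain strings. LEFTOVER_QUERY and
# CREATE_EVENTS_QUERY contain a {} placeholder for the target table reference.
AGGREGATION_QUERY = "INSERT INTO `bobs_knob_shops.sessions` (...) SELECT ..."
LEFTOVER_QUERY = "INSERT INTO `{}` (...) SELECT ..."
CREATE_EVENTS_QUERY = "CREATE TABLE `{}` (...)"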

# insert all closed sessions; result() waits for the job to finish
client.query(AGGREGATION_QUERY).result()

# copy the events of still-open sessions to the current event table
current = DAILIES[days_since_epoch() % 2]
client.query(LEFTOVER_QUERY.format(for_query(current))).result()

# truncate the old table by deleting and recreating it
previous = DAILIES[(days_since_epoch() + 1) % 2]
client.delete_table(previous)
client.query(CREATE_EVENTS_QUERY.format(for_query(previous))).result()

The (BigQuery) client calls should speak for themselves. They execute the queries mentioned earlier — after a little bit of string replacement.

I use a two-item list with the event table references (DAILIES) and alternate between them using the number of days since January 1st, 1970. Using the day of the year instead would cause trouble at year boundaries. A sketch of these helpers follows below.
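
As a sketch of what those helpers could look like: the table names events_0 and events_1 are my own placeholders, as is for_query; anything matching the events_* wildcard used in the queries will do.

# Illustrative helpers for the alternating daily tables (sketch).
from datetime import datetime, timezone

DAILIES = ['bobs_knob_shops.events_0', 'bobs_knob_shops.events_1']

def days_since_epoch():
    # Whole days elapsed since 1970-01-01 UTC; its parity selects the table.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - epoch).days

def for_query(table_ref):
    # Fully qualify a dataset.table reference for use inside a query string.
    return '{}.{}'.format(client.project, table_ref)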

Inserting events into the tables uses the same table reference trick and is, by nature, even simpler.

# the request body is a single event message (see the example earlier)
row = request.json
# pick the active daily table and stream the event into it
table = client.get_table(DAILIES[days_since_epoch() % 2])
client.insert_rows(table, [row])  # insert_rows expects a list of rows
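
For completeness, this is roughly how that handler could be wired up as an endpoint. It is a sketch assuming a Flask app (which is what the request object above hints at); the route and response codes are illustrative.

# Hypothetical Flask wiring for the event-receiving endpoint (sketch).
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/events', methods=['POST'])
def receive_event():
    row = request.json  # a single event message, like the example earlier
    table = client.get_table(DAILIES[days_since_epoch() % 2])
    errors = client.insert_rows(table, [row])
    if errors:
        return jsonify(errors), 500
    return '', 204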

Closing Notes

The logic described here should get your little event tracking system running. However, you will need to take additional measures before it can properly be released into production.

On the legal side of things, there are regulations regarding the collection and storage of personal data. To be compliant with Europe’s GDPR [2], you might need to set up a workflow for deleting personal user data.

Continuing with the technical side: I assumed receiving perfectly valid and well-formed data, and the flow in general is pretty happy-sunshine. So you might want to add some checks, logging and retry logic.

More importantly, however, I also skipped over access security and authentication. Though not necessarily an issue for the event-receiving endpoint, allowing unauthenticated users to freely trigger the aggregation workflow would have dire consequences for the system.

Fortunately, there are several ways to secure your AppEngine, depending on your flavour and version. So read up, get coding and, never forget, keep testing.
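
To give one example, and only as a sketch rather than the demo code: on the AppEngine standard environment, requests issued by the built-in cron service carry an X-Appengine-Cron header that is stripped from external traffic, so the workflow endpoint can simply refuse anything without it. The route name and the run_rollover helper are illustrative.

# Hypothetical guard on the nightly workflow endpoint (sketch, reusing the
# Flask app from the earlier snippet).
@app.route('/tasks/rollover', methods=['GET'])
def rollover():
    # AppEngine adds this header to cron requests and strips it from
    # external requests, so outside callers cannot fake it.
    if request.headers.get('X-Appengine-Cron') != 'true':
        return 'Forbidden', 403
    run_rollover()  # runs the aggregate/copy/truncate steps shown earlier
    return '', 204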

P. S.

The full demo code is available on my GitHub (link). Some logic was added for the sake of generalisation. It also includes a basic OAuth2 authentication flow.

Links

[1] Felipe Hoffa, “Optimizing BigQuery: Cluster your tables”, Medium.com, 2018. [Online]. Available: https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b [Accessed Jan. 30, 2019]

[2] European Commission, “Data Protection”, European Commission Website, 2018. [Online]. Available: https://ec.europa.eu/info/law/law-topic/data-protection_en [Accessed Jan. 30, 2019]
