Storing Snowplow bad row events in BigQuery

How to use Cloud Functions and a BigQuery schema generator to make Snowplow bad row schema violation events easily queryable

Jonathan Merlevede
datamindedbe
7 min read · May 11, 2021


Bad rows: when things go awry (source)

Snowplow allows you to redirect rows that do not make it through enrichment to a “bad rows” topic. From there, they usually end up in the proverbial /dev/null or get picked up by a dangling subscription. On GCP, this means they’ll perish at most seven days later.

Although this situation is not uncommon, it is unfortunate. When defining new event types and debugging errors, the information in bad row events is invaluable. Having to pull such events from a subscription means that you’ll have a hard time finding them, as there’s no easy way to query a subscription. Since collector endpoints are internet-facing, your errors may be hidden in a pile of actual bad rows.

The fate of bad rows in many Snowplow ingestion pipelines is lamentable

Ideally, bad rows end up in an easily queryable, low-cost data store, where you can keep them as long as you’d like. A data store like BigQuery! In this story, I’ll explain how to quickly get bad row schema violation events into BigQuery in real time, and without fixed costs!

Schema violation? Indeed, in this post I’m mainly concerned with schema violation events. Although there are many reasons why a row could be marked as bad (such as a size violation or an adapter failure), schema violations are the only bad row events that I’ve found useful to look at in practice on GCP. YMMV!

Bad events are also events

The most obvious way to get schema violation events into BigQuery would be to redirect the bad events to the Snowplow collector. I say “obvious”, because these events have a well-defined JSON schema, which should mean that you can put them in a Snowplow event payload envelope as an unstructured event and ingest them as-is (unverified!). Ingestion of event data with a schema is kind of Snowplow’s thing. Check out my other story if you want some pointers on how to do this:

Schema violations have Iglu URI iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0 and you can find their schema here.
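If you want to experiment with this route, a minimal, untested sketch using the Snowplow Python tracker could look like the following. The collector endpoint and tracker namespace are placeholders, and you’d call this from whatever reads your bad rows subscription:

```python
# Untested sketch: wrap the "data" part of a schema_violations bad row as a
# self-describing event and send it back to the collector with the Snowplow
# Python tracker (pip install snowplow-tracker). Endpoint is a placeholder.
from snowplow_tracker import Emitter, SelfDescribingJson, Tracker

SCHEMA_VIOLATIONS_URI = (
    "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0"
)

emitter = Emitter("collector.example.com")
tracker = Tracker(emitters=emitter, namespace="bad-rows")


def forward_bad_row(bad_row_data: dict) -> None:
    """Send a bad row's data back to the collector as a self-describing event."""
    tracker.track_self_describing_event(
        SelfDescribingJson(SCHEMA_VIOLATIONS_URI, bad_row_data)
    )
```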

I did not end up going this route because…

  • it’s just a little weird to have “bad events” inserted as an event,
  • schema violation events have a lot of properties, and I did not want to pollute my event table’s namespace (mostly personal preference; you also can’t have more than 10,000 columns in a single BigQuery table, but that’s a lot), but mostly
  • I wanted to avoid the possibility of bad events getting stuck in a loop (collector > enrichment > bad rows > function > collector > …).

As I didn’t try this approach, it may or may not “just work”. If you do try this, please share your findings in the comments!

Ingesting bad events using a Cloud Function

An alternative is to register a Cloud Function as a subscriber to your bad rows topic and have it insert bad rows into BigQuery directly. This approach, which I opted for, involves two steps. First, create a table with a schema appropriate for storing schema violations. Then, write the Cloud Function that streaming-inserts events into your newly created table.

Defining a table schema

Using the schema violation’s schema to generate a BigQuery table schema

Creating a table for storing schema violations “from scratch” is tricky, as the schema of schema violations is quite convoluted (can you feel semantic satiation kicking in already?). Luckily, in cases like this one, there’s often great work done by others that can alleviate the pain.

Note: Since doing this, I’ve learned that there’s a Snowplow Incubator project with BigQuery bad row schema definitions. It contains schemata not only for schema violation events but also for other bad row events. However, these schemata omit useful polymorphic fields like failure.error. As it’s an Incubator project, it might also disappear at some point. I’m keeping the explanation below, as I think it’s still useful.

The following tools made creating a BigQuery table schema from a JSON schema a relative breeze:

  1. De-reference the schema violation’s schema using json-dereference-cli (courtesy of David Calley).
  2. Use jsonschema-bigquery (courtesy of Mark Terry) to convert the de-referenced JSON schema to a BigQuery schema.

I had to make some manual adaptations to the schema violations’ schema to get jsonschema-bigquery to accept it, as it did not like its anyOf property (and rightly so).
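Once you have the generated BigQuery schema, creating the table itself is a one-liner with the Python client. A sketch, assuming you saved the generator’s output to a JSON file; the file path and table ID are placeholders:

```python
# Create the bad rows table from the schema produced by jsonschema-bigquery.
# Depending on the generator's output you may need to unwrap a
# {"schema": {"fields": [...]}} envelope first.
import json

from google.cloud import bigquery

client = bigquery.Client()

with open("schema_violations_bigquery.json") as f:
    generated = json.load(f)

if isinstance(generated, list):  # plain list of field definitions
    fields = generated
else:  # wrapped in a table resource
    fields = generated.get("schema", generated)["fields"]

schema = [bigquery.SchemaField.from_api_repr(field) for field in fields]
table = bigquery.Table("my-project.snowplow.bad_rows_schema_violations", schema=schema)
client.create_table(table, exists_ok=True)
```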

Notes

  • You could turn the two “complex” steps listed above into a single easy step by using the schema I constructed for v2.0.0 schema violation events here. Remember that lazy people are more productive. You may consider using Snowplow’s schema definitions instead of mine.
  • JSON schema is more powerful than BigQuery table schemas. The schema violations’ schema contains polymorphic fields (an anyOf property) that cannot be “properly” translated into a BigQuery schema. To accept all possible incoming data, the table schema needs to cater to the lowest common denominator of the possibilities left open by the JSON schema.
  • The above implies that, at least if you use my schema, some rows will still contain JSON data as strings. Luckily, you can use BigQuery’s JSON functions to dig deeper (see the query sketch below)!

The BigQuery table schema caters to the lowest common denominator of possibilities left open by the JSON schema
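For example, with the polymorphic error field stored as a string, you can still pull structured details out at query time. A sketch, assuming column names that mirror the JSON schema (failure.messages with schemaKey and error); the table ID is a placeholder and the JSON paths may need adjusting for your schema:

```python
# Dig into the stringified polymorphic fields with BigQuery's JSON functions.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      failure.timestamp AS failed_at,
      message.schemaKey AS schema_key,
      JSON_EXTRACT_SCALAR(message.error, '$.error') AS error_type
    FROM `my-project.snowplow.bad_rows_schema_violations`,
      UNNEST(failure.messages) AS message
    ORDER BY failed_at DESC
    LIMIT 100
"""
for row in client.query(query).result():
    print(row.failed_at, row.schema_key, row.error_type)
```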

Defining a Cloud Function

Writing a Cloud Function that takes a bad row schema violation event and streaming-inserts a row into BigQuery shouldn’t take a big chunk of time out of your day. Even though it’s not perfect, the code from the story Pub/Sub to BigQuery does the trick with some small adaptations:
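A minimal sketch of what the adapted function can look like — the table ID is a placeholder, and the flattening helper reflects the description in the notes below rather than battle-tested code:

```python
# Pub/Sub-triggered background Cloud Function (Python runtime).
# Configure your own project.dataset.table via the environment variable.
import base64
import json
import os

from google.cloud import bigquery

TABLE_ID = os.environ.get(
    "BAD_ROWS_TABLE", "my-project.snowplow.bad_rows_schema_violations"
)
client = bigquery.Client()


def _stringify_polymorphic(node):
    """JSON-encode values under 'error' and 'json' keys (polymorphic in the
    JSON schema, plain STRING columns in the BigQuery schema)."""
    if isinstance(node, dict):
        return {
            key: json.dumps(value)
            if key in ("error", "json") and not isinstance(value, str)
            else _stringify_polymorphic(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [_stringify_polymorphic(item) for item in node]
    return node


def pubsub_to_bigq(event, context):
    """Entrypoint: decode the Pub/Sub message and streaming-insert it."""
    bad_row = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # Bad rows arrive as self-describing JSON; the table schema covers the
    # "data" part, so unwrap the envelope if it is present.
    data = bad_row.get("data", bad_row)
    errors = client.insert_rows_json(TABLE_ID, [_stringify_polymorphic(data)])
    if errors:
        # Raise so the failure shows up in Cloud Logging and Pub/Sub retries.
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```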

Notes

  • The entrypoint is pubsub_to_bigq, and the only dependency that should be specified in your requirements.txt file is google-cloud-bigquery.
  • Instead of omitting the polymorphic error and json fields (as the Snowplow Incubator schemata do), I flatten them to strings (in line with the schema I defined above).
  • The above code is not perfect. I have not validated that this works for all possible events, but it certainly works for the schema errors we get. Bad rows are non-essential to us, and I did not want to spend too much time. If you improve on the schemata or improve the Cloud Function, please share your improvements in the comments!

That’s it! You’re done! In the section below, I’ll briefly discuss some considerations I had while implementing the above.

Alternate solutions

Dataflow?

For event ingestion into BigQuery, customary practice would be to use Dataflow. While you could use a Dataflow job to stream bad row events to BigQuery, this requires always-on instances that are not necessary for this use case. Often-cited advantages of using Dataflow over Cloud Functions for event ingestion into BigQuery are

  • de-duplication (Pub/Sub has at-least-once delivery guarantees, which means that some of your bad row events might get duplicated) and
  • batching (using Cloud Functions, every bad row event results in a streaming insert into BigQuery; in a Dataflow job you can batch these. If I understand BigQuery’s pricing for streaming inserts correctly, this advantage is not huge, as it does not result in lower costs: you’re billed per unit of volume with a minimum size for every row, not for every insert).

I don’t mind duplicate bad rows. If deduplication is for some reason required, bad row events can easily be deduplicated by their ID (GUID). In practice, Pub/Sub only very rarely delivers the same message more than once. At my current client, the number of bad row events is very small, and we use the information in bad row events only for debugging. Given these circumstances, I considered a Cloud Function more appropriate.

Cloud Storage

The solution mentioned on the Snowplow Incubator GitHub page for bad row events suggests defining BigQuery external tables on top of bad row events stored in JSONL format in Cloud Storage. Defining an external table is not hard; however, the only parts of this story involving real work are defining a correct BigQuery table schema and getting the data out of Pub/Sub. Those you need to do regardless of where you store the data, as Beam Enrich does not allow streaming bad row events directly to Cloud Storage. Depending on your usage, appending rows to files in Cloud Storage might be a bit cheaper to operate than BigQuery streaming inserts.

I work at Data Minded, an independent data engineering and data analytics consultancy based in Leuven, Belgium. Snowplow, Terraform, Airflow, Spark; GCP, AWS or Azure. We’ve struggled with all of them. Contact us if you think we can help you!
