BigQuery lazy data loading: DDL, DML, partitions, and half a trillion Wikipedia pageviews

What’s the simplest way to analyze a 4 terabyte dataset of Wikipedia pageviews made publicly available by the Wikimedia Foundation? In this post, I’ll show you how to take on this challenge using only BigQuery, while showcasing a few of its newest features.

Felipe Hoffa
Apr 19, 2018 · 8 min read

Step 0: Download all the data

Our first step is to transfer all the pageview records from an arbitrary web server to Google Cloud Storage. I ran this transfer with a Cloud Dataflow script written in Python, but we’ll leave the discussion of this and other methods for a later post.

For now, all we need to know is:

  • Wikimedia makes all their hourly pageviews available here.
  • This repository only publishes pageviews since May 2015, so if you’re interested in older data, you’ll need to look elsewhere.
  • Each hour of pageviews is available for download in a .gz file, containing around ~25 million rows of CSV data.

Once these files are copied into Cloud Storage, we’re ready to start analyzing them with BigQuery.

Step 1: Feeling lazy? Just read from Google Cloud Storage

BigQuery can easily analyze files living in Cloud Storage. We don’t even need to decompress them, as these specific files are natively supported by BigQuery (gzip). To tell BigQuery to read these files straight from Cloud Storage, we can create a federated table with the following script:

#!/bin/python
from google.cloud import bigquery
bq_client = bigquery.Client(project='fh-bigquery')
table_ref = bq_client.dataset('views').table('wikipedia_views_gcs')
table = bigquery.Table(table_ref, schema=SCHEMA)
extconfig = bigquery.ExternalConfig('CSV')
extconfig.schema = [bigquery.SchemaField('line', 'STRING')]
extconfig.options.field_delimiter = u'\u00ff'
extconfig.options.quote_character = ''
extconfig.compression = 'GZIP'
extconfig.options.allow_jagged_rows = False
extconfig.options.allow_quoted_newlines = False
extconfig.max_bad_records = 10000000
extconfig.source_uris=["gs://fh-bigquery/wikipedia-pagecounts/page*"]
table.external_data_configuration = extconfig
bq_client.create_table(table)

Notes on the above:

  • As I’m feeling lazy, I didn’t even try to parse each line into columns. I can do that later within a query. This is an easy way to deal with harder-to-parse files, like malformed JSON or CSV files.
  • To avoid BigQuery parsing this file into columns, I specified the weirdest ASCII character I could find as the field delimiter, in this case: u’\u00ff’
  • I could have completed this step with the web UI instead of Python (that would have been a lazier approach), but currently the web UI doesn’t allow me to specify a weird character as a field delimiter.
  • Some lines appear to be corrupted in the original source. We’ll ignore them by allowing 10000000 errors, as specified by setting max_bad_records.

1.1 Query files from Cloud Storage

Once I’ve set up my files in Cloud Storage as a federated table, I can start playing with it. Let’s see:

SELECT * FROM `fh-bigquery.views.wikipedia_views_gcs` LIMIT 10
Image for post
Image for post

BigQuery experts know that SELECT * LIMIT 10 is usually a bad idea with BigQuery, because you'll get billed for the full size of the table. But you won’t encounter this issue while running queries on federated tables! BigQuery stops opening files as soon as it has all the data it needs to answer a query, and in this case it charged us 0 bytes to show us the first 10 lines of one these files.

Which file did it open? Well, there’s a pseudo column for this:

SELECT *, _FILE_NAME fn 
FROM `fh-bigquery.views.wikipedia_views_gcs` LIMIT 10
Image for post
Image for post

Happily, we can see what files these lines are coming from. We can use this pseudo column when querying too. Let’s count the number of rows in a file:

SELECT COUNT(*) rs 
FROM `fh-bigquery.views.wikipedia_views_gcs`
WHERE _FILE_NAME =
'gs://fh-bigquery/wikipedia-pagecounts/pageviews-20150501-010000.gz'
(21.7s elapsed, 152 MB processed)
Image for post
Image for post

Notes on the above:

  • We can see this file (one hour of data) encompasses more than 5 million rows, and 152 MB of data.
  • Even though our table definition encompasses terabytes of data, BigQuery was smart enough to only open and charge us for the files we filtered for.
  • Warning: Don’t run the same query without a file filter! You could end up with a bill for terabytes of data, instead of a mere 152 MB. Remember to set up your cost controls to avoid these charges.

1.2 Parse your files with a view

What if we want to parse each line in a clever way, and save the effort of writing these queries to the rest of our team? We can define a view:

#standardSQL
CREATE VIEW `fh-bigquery.views.wikipedia_views_test_ddl`
AS SELECT
PARSE_TIMESTAMP('%Y%m%d-%H%M%S', REGEXP_EXTRACT(_FILE_NAME, '[0-9]+-[0-9]+')) datehour
, REGEXP_EXTRACT(line, '([^ ]*) ') wiki
, REGEXP_EXTRACT(line, '[^ ]* (.*) [0-9]+ [0-9]+') title
, CAST(REGEXP_EXTRACT(line, ' ([0-9]+) [0-9]+$') AS INT64) views
, CAST(REGEXP_EXTRACT(line, ' ([0-9]+)$') AS INT64) zero
, _FILE_NAME filename
, line
FROM `fh-bigquery.views.wikipedia_views_gcs`WHERE REGEXP_EXTRACT(line, ' ([0-9]+) [0-9]+$') IS NOT NULL # views
AND REGEXP_EXTRACT(line, ' ([0-9]+)$') = '0' # zero

Notes on the above:

  • BigQuery now supports DDL. We can create a view within a query!
  • I can use the _FILE_NAME meta column as data for my own columns.
  • The WHERE statements in this view take care of badly defined rows in the source files.

Let’s query our new view:

#standardSQL 
SELECT *
FROM `fh-bigquery.views.wikipedia_views_test_ddl`
LIMIT 10
(7.8s elapsed, 0 B processed)
Image for post
Image for post

Notes on the above:

  • We are now parsing the underlying files on the fly, and now we can work with well-typed columns.
  • Since this view is loading a federated table from Cloud Storage, the same rules apply to only opening as many tables as needed.

After aggregating our data, let’s see how many pageviews Wikipedia had in total on October 21, 2015 in the hour after 7:00 AM:

#standardSQL
SELECT COUNT(*) n_rows, SUM(views) views
, ARRAY_AGG(DISTINCT filename) files
FROM `fh-bigquery.views.wikipedia_views_test_ddl`
WHERE EXTRACT(YEAR FROM datehour)=2015
AND EXTRACT(MONTH FROM datehour)=10
AND EXTRACT(DAY FROM datehour)=21
AND EXTRACT(HOUR FROM datehour)=7
(75.8s elapsed, 159 MB processed)
Image for post
Image for post

Notes on the above:

  • This file with one hour of pageviews has more than 5 million rows, representing more than18 million views.
  • BigQuery was smart enough to pass our column filters down to Cloud Storage. Only one file was loaded and analyzed.
  • Warning: Don’t run the same query without a file filter! Instead of 159 MB, you could end up being charged for terabytes of data. Remember to set up your cost controls to avoid these pitfalls.

For example, don’t run this query:

SELECT SUM(views) views
, COUNT(DISTINCT filename) files
, COUNT(*) n_rowsFROM `fh-bigquery.views.wikipedia_views_gcs_parsed`
(1540.0s elapsed, 4.24 TB processed)545690866734 24795 153479054298

Step 2: Make BigQuery shine by lazily materializing this view

So far we have been able to read and analyze compressed files straight from Cloud Storage, plus a practical view to simplify our queries, but BigQuery really shines once data is loaded into it. Can we do this lazily?

2.1: Create a partitioned destination table with DDL

CREATE TABLE `fh-bigquery.wikipedia_v2.pageviews_2016`
(datehour TIMESTAMP, wiki STRING, title STRING, views INT64)
PARTITION BY DATE(datehour) OPTIONS(
description = 'Wikipedia pageviews from http://dumps.wikimedia.your.org/other/pageviews/'
, require_partition_filter = true
)

Some notes on the above:

  • BigQuery now supports DDL and partitioned tables by an arbitrary timestamp column!
  • With the previous query we created a partitioned table on an arbitrary column.
  • I chose to create yearly tables, to avoid running into the maximum limit on the number of supported partitions.
  • Using require_partition_filter, we are forcing users of this table to always specify a date range when querying. This filter requirement can be a great cost saver when you use partitions.

2.2: Copy data from the view into the new partitioned table

INSERT INTO `fh-bigquery.wikipedia_v2.pageviews_2016` (datehour, wiki, title, views)
WITH hours_already_loaded as (
SELECT DISTINCT datehour FROM `fh-bigquery.wikipedia_v2.pageviews_2016`
WHERE datehour > '2000-01-01')
SELECT datehour, wiki, title, views
FROM `fh-bigquery.views.wikipedia_views_gcs_parsed` t1
WHERE EXTRACT(YEAR FROM datehour)=2016
AND NOT EXISTS (SELECT * FROM hours_already_loaded t2 WHERE t2.datehour = t1.datehour)
(1082.2s elapsed, 1.46 TB processed)

Some notes on the above:

  • BigQuery now supports DML, and loading data from a view into a table just by using SQL.
  • This operation took more than half an hour, but it seems fair given that we moved more than 1 terabyte of data from compressed files into a native BigQuery table.
  • Usually loading data into BigQuery is free, but in this case we were charged the cost of the query over the view.
  • WHERE EXTRACT(YEAR FROM datehour)=2016 allowed this query to only go over the 2016 files. BigQuery was smart about it.
  • AND datehour NOT IN (...) makes sure that we load new files into this table, but only if we haven't loaded these files already.
  • WHERE datehour > '2000-01-01' doesn't do much - but as we asked BigQuery to force us to add a date filter when querying this partitioned table, this filter does it.

Step 3: Stay curious

I’m making the partitioned table public, so that now you can write queries over the latest and historic Wikipedia pageviews. For example what’s the number of views of Bobsleigh versus Curling? And pageviews from desktop clients versus mobile ones? Check out the results in this query:

#standardSQL
SELECT datehour, title, views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) BETWEEN "2018-02-10" AND "2018-02-28"
AND wiki = 'en'AND title IN ('Curling', 'Bobsleigh')
(23.2s elapsed, 109 GB processed)
Image for post
Image for post
#standardSQL
SELECT datehour, title, views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) BETWEEN "2018-02-10" AND "2018-02-28"
AND wiki = 'en.m'
AND title IN ('Curling', 'Bobsleigh')
(22.2s elapsed, 110 GB processed)
Image for post
Image for post

What’s the difference between these 2 queries? Turns out people used mobile browsers ~10x over desktop ones. Can we optimize that 109 GB query? We’ll try that in a future post.

Remember to set up your cost controls to avoid surprises as you experiment with your own queries, and also remember that you have 1 terabyte of free queries every month. Find me @felipehoffa for more data stories.

Originally published at cloud.google.com.

Google Cloud - Community

Google Cloud community articles and blogs

Felipe Hoffa

Written by

Data Cloud Advocate at SnowflakeDB ❄️. Originally from Chile, now in San Francisco and around the world. (Previously at Google ). Let’s talk data.

Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Felipe Hoffa

Written by

Data Cloud Advocate at SnowflakeDB ❄️. Originally from Chile, now in San Francisco and around the world. (Previously at Google ). Let’s talk data.

Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store