BigQuery lazy data loading: DDL, DML, partitions, and half a trillion Wikipedia pageviews

What’s the simplest way to analyze a 4 terabyte dataset of Wikipedia pageviews made publicly available by the Wikimedia Foundation? In this post, I’ll show you how to take on this challenge using only BigQuery, while showcasing a few of its newest features.

Felipe Hoffa
Apr 19, 2018 · 8 min read

Step 0: Download all the data

Our first step is to transfer all the pageview records from an arbitrary web server to Google Cloud Storage. I ran this transfer with a Cloud Dataflow script written in Python, but we’ll leave the discussion of this and other methods for a later post.

For now, all we need to know is:

  • Wikimedia makes all their hourly pageviews available here.
  • This repository only publishes pageviews since May 2015, so if you’re interested in older data, you’ll need to look elsewhere.
  • Each hour of pageviews is available for download in a .gz file, containing around ~25 million rows of CSV data.

Once these files are copied into Cloud Storage, we’re ready to start analyzing them with BigQuery.

Step 1: Feeling lazy? Just read from Google Cloud Storage

BigQuery can easily analyze files living in Cloud Storage. We don’t even need to decompress them, as these specific files are natively supported by BigQuery (gzip). To tell BigQuery to read these files straight from Cloud Storage, we can create a federated table with the following script:

#!/bin/python
from google.cloud import bigquery
bq_client = bigquery.Client(project='fh-bigquery')
table_ref = bq_client.dataset('views').table('wikipedia_views_gcs')
table = bigquery.Table(table_ref, schema=SCHEMA)
extconfig = bigquery.ExternalConfig('CSV')
extconfig.schema = [bigquery.SchemaField('line', 'STRING')]
extconfig.options.field_delimiter = u'\u00ff'
extconfig.options.quote_character = ''
extconfig.compression = 'GZIP'
extconfig.options.allow_jagged_rows = False
extconfig.options.allow_quoted_newlines = False
extconfig.max_bad_records = 10000000
extconfig.source_uris=["gs://fh-bigquery/wikipedia-pagecounts/page*"]
table.external_data_configuration = extconfig
bq_client.create_table(table)

Notes on the above:

  • As I’m feeling lazy, I didn’t even try to parse each line into columns. I can do that later within a query. This is an easy way to deal with harder-to-parse files, like malformed JSON or CSV files.
  • To avoid BigQuery parsing this file into columns, I specified the weirdest ASCII character I could find as the field delimiter, in this case: u’\u00ff’
  • I could have completed this step with the web UI instead of Python (that would have been a lazier approach), but currently the web UI doesn’t allow me to specify a weird character as a field delimiter.
  • Some lines appear to be corrupted in the original source. We’ll ignore them by allowing 10000000 errors, as specified by setting .

1.1 Query files from Cloud Storage

Once I’ve set up my files in Cloud Storage as a federated table, I can start playing with it. Let’s see:

SELECT * FROM `fh-bigquery.views.wikipedia_views_gcs` LIMIT 10

BigQuery experts know that is usually a bad idea with BigQuery, because you'll get billed for the full size of the table. But you won’t encounter this issue while running queries on federated tables! BigQuery stops opening files as soon as it has all the data it needs to answer a query, and in this case it charged us 0 bytes to show us the first 10 lines of one these files.

Which file did it open? Well, there’s a pseudo column for this:

SELECT *, _FILE_NAME fn 
FROM `fh-bigquery.views.wikipedia_views_gcs` LIMIT 10

Happily, we can see what files these lines are coming from. We can use this pseudo column when querying too. Let’s count the number of rows in a file:

SELECT COUNT(*) rs 
FROM `fh-bigquery.views.wikipedia_views_gcs`
WHERE _FILE_NAME =
'gs://fh-bigquery/wikipedia-pagecounts/pageviews-20150501-010000.gz'
(21.7s elapsed, 152 MB processed)

Notes on the above:

  • We can see this file (one hour of data) encompasses more than 5 million rows, and 152 MB of data.
  • Even though our table definition encompasses terabytes of data, BigQuery was smart enough to only open and charge us for the files we filtered for.
  • Warning: Don’t run the same query without a file filter! You could end up with a bill for terabytes of data, instead of a mere 152 MB. Remember to set up your cost controls to avoid these charges.

1.2 Parse your files with a view

What if we want to parse each line in a clever way, and save the effort of writing these queries to the rest of our team? We can define a view:

#standardSQL
CREATE VIEW `fh-bigquery.views.wikipedia_views_test_ddl`
AS SELECT
PARSE_TIMESTAMP('%Y%m%d-%H%M%S', REGEXP_EXTRACT(_FILE_NAME, '[0-9]+-[0-9]+')) datehour
, REGEXP_EXTRACT(line, '([^ ]*) ') wiki
, REGEXP_EXTRACT(line, '[^ ]* (.*) [0-9]+ [0-9]+') title
, CAST(REGEXP_EXTRACT(line, ' ([0-9]+) [0-9]+$') AS INT64) views
, CAST(REGEXP_EXTRACT(line, ' ([0-9]+)$') AS INT64) zero
, _FILE_NAME filename
, line
FROM `fh-bigquery.views.wikipedia_views_gcs`WHERE REGEXP_EXTRACT(line, ' ([0-9]+) [0-9]+$') IS NOT NULL # views
AND REGEXP_EXTRACT(line, ' ([0-9]+)$') = '0' # zero

Notes on the above:

  • BigQuery now supports DDL. We can create a view within a query!
  • I can use the meta column as data for my own columns.
  • The statements in this view take care of badly defined rows in the source files.

Let’s query our new view:

#standardSQL 
SELECT *
FROM `fh-bigquery.views.wikipedia_views_test_ddl`
LIMIT 10
(7.8s elapsed, 0 B processed)

Notes on the above:

  • We are now parsing the underlying files on the fly, and now we can work with well-typed columns.
  • Since this view is loading a federated table from Cloud Storage, the same rules apply to only opening as many tables as needed.

After aggregating our data, let’s see how many pageviews Wikipedia had in total on October 21, 2015 in the hour after 7:00 AM:

#standardSQL
SELECT COUNT(*) n_rows, SUM(views) views
, ARRAY_AGG(DISTINCT filename) files
FROM `fh-bigquery.views.wikipedia_views_test_ddl`
WHERE EXTRACT(YEAR FROM datehour)=2015
AND EXTRACT(MONTH FROM datehour)=10
AND EXTRACT(DAY FROM datehour)=21
AND EXTRACT(HOUR FROM datehour)=7
(75.8s elapsed, 159 MB processed)

Notes on the above:

  • This file with one hour of pageviews has more than 5 million rows, representing more than18 million views.
  • BigQuery was smart enough to pass our column filters down to Cloud Storage. Only one file was loaded and analyzed.
  • Warning: Don’t run the same query without a file filter! Instead of 159 MB, you could end up being charged for terabytes of data. Remember to set up your cost controls to avoid these pitfalls.

For example, don’t run this query:

SELECT SUM(views) views
, COUNT(DISTINCT filename) files
, COUNT(*) n_rowsFROM `fh-bigquery.views.wikipedia_views_gcs_parsed`
(1540.0s elapsed, 4.24 TB processed)545690866734 24795 153479054298

Step 2: Make BigQuery shine by lazily materializing this view

So far we have been able to read and analyze compressed files straight from Cloud Storage, plus a practical view to simplify our queries, but BigQuery really shines once data is loaded into it. Can we do this lazily?

2.1: Create a partitioned destination table with DDL

CREATE TABLE `fh-bigquery.wikipedia_v2.pageviews_2016`
(datehour TIMESTAMP, wiki STRING, title STRING, views INT64)
PARTITION BY DATE(datehour) OPTIONS(
description = 'Wikipedia pageviews from http://dumps.wikimedia.your.org/other/pageviews/'
, require_partition_filter = true
)

Some notes on the above:

  • BigQuery now supports DDL and partitioned tables by an arbitrary timestamp column!
  • With the previous query we created a partitioned table on an arbitrary column.
  • I chose to create yearly tables, to avoid running into the maximum limit on the number of supported partitions.
  • Using , we are forcing users of this table to always specify a date range when querying. This filter requirement can be a great cost saver when you use partitions.

2.2: Copy data from the view into the new partitioned table

INSERT INTO `fh-bigquery.wikipedia_v2.pageviews_2016` (datehour, wiki, title, views)
WITH hours_already_loaded as (
SELECT DISTINCT datehour FROM `fh-bigquery.wikipedia_v2.pageviews_2016`
WHERE datehour > '2000-01-01')
SELECT datehour, wiki, title, views
FROM `fh-bigquery.views.wikipedia_views_gcs_parsed` t1
WHERE EXTRACT(YEAR FROM datehour)=2016
AND NOT EXISTS (SELECT * FROM hours_already_loaded t2 WHERE t2.datehour = t1.datehour)
(1082.2s elapsed, 1.46 TB processed)

Some notes on the above:

  • BigQuery now supports DML, and loading data from a view into a table just by using SQL.
  • This operation took more than half an hour, but it seems fair given that we moved more than 1 terabyte of data from compressed files into a native BigQuery table.
  • Usually loading data into BigQuery is free, but in this case we were charged the cost of the query over the view.
  • allowed this query to only go over the 2016 files. BigQuery was smart about it.
  • makes sure that we load new files into this table, but only if we haven't loaded these files already.
  • doesn't do much - but as we asked BigQuery to force us to add a date filter when querying this partitioned table, this filter does it.

Step 3: Stay curious

I’m making the partitioned table public, so that now you can write queries over the latest and historic Wikipedia pageviews. For example what’s the number of views of Bobsleigh versus Curling? And pageviews from desktop clients versus mobile ones? Check out the results in this query:

#standardSQL
SELECT datehour, title, views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) BETWEEN "2018-02-10" AND "2018-02-28"
AND wiki = 'en'AND title IN ('Curling', 'Bobsleigh')
(23.2s elapsed, 109 GB processed)
#standardSQL
SELECT datehour, title, views
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE DATE(datehour) BETWEEN "2018-02-10" AND "2018-02-28"
AND wiki = 'en.m'
AND title IN ('Curling', 'Bobsleigh')
(22.2s elapsed, 110 GB processed)

What’s the difference between these 2 queries? Turns out people used mobile browsers ~10x over desktop ones. Can we optimize that 109 GB query? We’ll try that in a future post.

Remember to set up your cost controls to avoid surprises as you experiment with your own queries, and also remember that you have 1 terabyte of free queries every month. Find me @felipehoffa for more data stories.


Originally published at cloud.google.com.

Google Cloud - Community

A collection of technical articles published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Felipe Hoffa

Written by

Developer Advocate @Google. Originally from Chile, now in San Francisco and around the world. Let’s talk data.

Google Cloud - Community

A collection of technical articles published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade