Manual partition tables in BigQuery

David Verdejo
Published in bluekiri · Jun 13, 2018

BigQuery is the data warehouse offering on Google Cloud Platform: a serverless, highly scalable, low-cost service for analyzing your data.

BigQuery cost is based on the amount of data stored and the amount of data processed by your queries. Storage has two tiers: active, for tables modified within the last 90 days ($0.02 per GB in the EU region), and long-term, for tables that have not been modified in the last 90 days ($0.01 per GB in the same region). On-demand queries cost $5 per TB (the first 1 TB per month is free), and if you have a stable analytics spend of at least $40,000 per month, several flat-rate plans are available.
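To put that in perspective: a query that scans a full 100 GB table 30 times a month processes roughly 3 TB, which costs about $10 at on-demand rates ($5/TB on the 2 TB beyond the free first terabyte), so every byte a query avoids scanning is money saved.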

In our case, we are in the normal situation without flat-rate pricing, so we need to control and optimize the data processed by our queries. The key strategy is to split our tables into smaller chunks: partitions. Today BigQuery offers two partitioning strategies:

  • Tables partitioned by ingestion time: tables partitioned based on the data’s ingestion (load) date or arrival date.
  • Partitioned tables (in Beta): tables that are partitioned based on a timestamp or date column (see the sketch below).
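Both kinds can be created from the bq CLI; a minimal sketch, with hypothetical dataset, table, and column names:

# Ingestion-time partitioned table
bq mk --table --time_partitioning_type=DAY mydataset.events_by_ingestion

# Table partitioned on a TIMESTAMP column (the Beta feature)
bq mk --table --time_partitioning_type=DAY --time_partitioning_field=created_at \
  --schema=created_at:TIMESTAMP,payload:STRING mydataset.events_by_column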

Both strategies are based on time, but there are situations where we want to partition by a column whose data type is not a timestamp or date.

The strategy in this case is to partition our data before loading it into BigQuery. To demonstrate, we are going to use the public dataset bigquery-public-data:usa_names.usa_1910_current.

And the table schema is the following:

  • state: STRING
  • gender: STRING
  • year: INTEGER
  • name: STRING
  • number: INTEGER
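You can retrieve it yourself with the bq CLI:

bq show --schema --format=prettyjson bigquery-public-data:usa_names.usa_1910_current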

In our case, we are interested in partitioning based on the state field. If you execute the following query

SELECT state
FROM [bigquery-public-data:usa_names.usa_1910_current]
GROUP BY state
ORDER BY state

you will see that we have 51 distinct values (the 50 states plus the District of Columbia).

First of all, we are going to export the current data from this table. We create a bucket to hold the exported data.
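A sketch with gsutil (the bucket name is hypothetical, and the EU location matches the pricing region above):

gsutil mb -l EU gs://usa-names-demo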

And then we export the table from BigQuery as a compressed file (in CSV format).
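For example, with bq extract and GZIP compression, writing to the hypothetical bucket created above:

bq extract --compression=GZIP 'bigquery-public-data:usa_names.usa_1910_current' gs://usa-names-demo/usa_names.csv.gz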

Next, we download this file to our computer and split the CSV file into one file per state.
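The download is straightforward (again assuming the hypothetical bucket):

gsutil cp gs://usa-names-demo/usa_names.csv.gz .
gunzip usa_names.csv.gz   # leaves usa_names.csv

And the split itself, with awk: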

awk -F ',' 'NR==1{h=$0; next};!seen[$1]++{f="usa_names_"$1".csv"; print h > f};{f="usa_names_"$1".csv"; print >> f; close(f)}' usa_names.csv

(The first rule saves the header line; the !seen[$1]++ rule writes it once per state, keyed on the state column $1; the last rule appends every row to its state's file, closing the file each time to stay under the open-file limit.)

The output is one CSV file per state: usa_names_AK.csv, usa_names_AL.csv, and so on through usa_names_WY.csv.

The next step is to upload the files to a bucket before loading them into BigQuery.
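For instance:

gsutil cp usa_names_*.csv gs://usa-names-demo/partitions/

I have set up a script to load these files into BigQuery. The key command is: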

bq --location=<LOCATION> load --skip_leading_rows 1 --source_format=CSV <DataSet>.<BQTable>_<State> $FileName state:STRING,gender:STRING,year:INTEGER,name:STRING,number:INTEGER
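The script is essentially a loop around that command; a minimal sketch, using the test dataset from later in the article, the hypothetical bucket path above, and EU as the location:

for f in usa_names_*.csv; do
  state=${f#usa_names_}; state=${state%.csv}   # usa_names_AK.csv -> AK
  bq --location=EU load --skip_leading_rows 1 --source_format=CSV \
    test.usa_names_${state} gs://usa-names-demo/partitions/${f} \
    state:STRING,gender:STRING,year:INTEGER,name:STRING,number:INTEGER
done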

The result is the following dataset: one table per state, from usa_names_AK through usa_names_WY.

Now we can query each partition like a normal table, but how can we query all of the tables at once?

The first thought is to use a wildcard table like

SELECT
name,
SUM(number) AS count
FROM
`test.usa_names_*`
GROUP BY
name
ORDER BY
count DESC
LIMIT
10;

It’s a valid solution, but

cached results are not supported for queries against multiple tables using a wildcard function (https://cloud.google.com/bigquery/docs/querying-wildcard-tables#limitations)

…then you are billed again for subsequent runs of the same query over the same data. This is the main reason for writing this article: use wildcard tables wisely.

Views come to the rescue. You can create a script that builds a view over all the related tables:

SELECT
state,
gender,
year,
name,
number
FROM
`tensile-pixel-206016.test.usa_names_AK`
UNION ALL
SELECT
state,
gender,
year,
name,
number
FROM
`tensile-pixel-206016.test.usa_names_AL`
UNION ALL
...
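Writing 51 SELECT statements by hand is tedious, so the view can be generated. A sketch that builds it from the split files (the view name v_usa_names_all is hypothetical, chosen so it does not match the usa_names_* wildcard pattern):

SQL=""
for f in usa_names_*.csv; do
  s=${f#usa_names_}; s=${s%.csv}   # state code, e.g. AK
  SQL="$SQL SELECT state, gender, year, name, number FROM \`tensile-pixel-206016.test.usa_names_${s}\` UNION ALL"
done
SQL=${SQL% UNION ALL}              # drop the trailing UNION ALL
bq mk --use_legacy_sql=false --view="$SQL" test.v_usa_names_all

Unlike a wildcard query, a query against this view is not excluded from the query cache.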

One limitation to take into account is that the maximum number of tables referenced per query is 1000.

Or you can create views for your business requirements (for example, for the East coast: CT, DE, MA, MD, ME, NH, NJ, NY, PA, RI, VT, VA)

SELECT
state,
gender,
year,
name,
number
FROM
`tensile-pixel-206016.test.usa_names_CT`
UNION ALL
SELECT
state,
gender,
year,
name,
number
FROM
`tensile-pixel-206016.test.usa_names_DE`
UNION ALL
...

To wrap up, the main benefit of this pattern comes when you query some subsets of the data far more frequently than others; for example, we review regional sales every day but check overall sales only once a month. The anti-pattern is a dashboard that queries several of these views to extract its information (in that case it is better to use the most general view).

If you have a problem… if no one else can help… and if you can find them… maybe you can hire… The Bluekiri team
