BigQuery Dataset Metadata Queries

Warrick
Google Cloud - Community
3 min read · Jun 8, 2020

When working with tables in BigQuery, you need an understanding of the dataset's structure, whether it is a public dataset or one you set up yourself and want to review. This is a quick post sharing queries you can use to pull metadata on your datasets and tables.

In the following examples, I’m using the BigQuery public Stack Overflow dataset to demonstrate these commands. Swap in the names of the dataset and tables you are working with.

Dataset Metadata

Get a list of all tables in the dataset and the corresponding information.

SELECT *
FROM `bigquery-public-data.stackoverflow.INFORMATION_SCHEMA.TABLES`

The query processed 10 MB when run, and the result columns include:

  • table_catalog (name of catalog)
  • table_schema (name of dataset)
  • table_name
  • table_type
  • is_insertable_into
  • is_typed
  • creation_time (datetime format)
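Since this view supports standard SQL, you can filter and sort it rather than pulling everything. As a minimal sketch using the columns listed above, this returns only base tables, newest first:

SELECT table_name, table_type, creation_time
FROM `bigquery-public-data.stackoverflow.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'BASE TABLE'  -- other values include 'VIEW' and 'EXTERNAL'
ORDER BY creation_time DESC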

For additional table details, including the number of rows and the table data size, use:

SELECT *
FROM `bigquery-public-data.stackoverflow.__TABLES__`

Note: it’s two underscores on each side of TABLES above.

The query processed 0 B when run, and the result columns include:

  • project_id (same as table_catalog in schema)
  • dataset_id (same as table_schema in schema)
  • table_id (same as table_name in schema)
  • creation_time (timestamp format)
  • last_modified_time (timestamp format)
  • row_count
  • size_bytes
  • type

I then modified the above query as follows so the timestamps were in datetime format and size_bytes was converted to GB.

SELECT
  project_id,
  dataset_id,
  table_id AS table_name,
  CAST(TIMESTAMP_MILLIS(creation_time) AS DATETIME) AS creation_time,
  CAST(TIMESTAMP_MILLIS(last_modified_time) AS DATETIME) AS last_modified_time,
  row_count,
  size_bytes / POW(10, 9) AS GB,
  type
FROM `bigquery-public-data.stackoverflow.__TABLES__`

I also aliased table_id as table_name to make it easier to merge with the first query in this post. When merging, I can leave off project_id and table_catalog since they are redundant with each other.
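As a minimal sketch of that merge (reusing the conversions above), you can join the two sources on the table name:

SELECT
  s.table_name,
  s.table_type,
  t.row_count,
  t.size_bytes / POW(10, 9) AS GB,
  CAST(TIMESTAMP_MILLIS(t.last_modified_time) AS DATETIME) AS last_modified_time
FROM `bigquery-public-data.stackoverflow.INFORMATION_SCHEMA.TABLES` AS s
JOIN `bigquery-public-data.stackoverflow.__TABLES__` AS t
  ON s.table_name = t.table_id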

Table Metadata

To get details about a specific table in the dataset, pull the table name and include it in the following query, as in this example using the posts_questions table from the stackoverflow dataset.

SELECT *
FROM `bigquery-public-data.stackoverflow.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'posts_questions'

The query processed 10 MB when run, and the result columns include:

  • table_catalog (name of catalog)
  • table_schema (name of dataset)
  • table_name
  • column_name
  • ordinal_position
  • is_nullable (YES/NO)
  • data_type
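The COLUMNS view also lends itself to quick schema reports. As a minimal sketch using the columns listed above, this counts how many columns each table in the dataset has:

SELECT table_name, COUNT(*) AS column_count
FROM `bigquery-public-data.stackoverflow.INFORMATION_SCHEMA.COLUMNS`
GROUP BY table_name
ORDER BY column_count DESC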

Cost

A side note on cost: BigQuery’s on-demand pricing includes the first 1 TB of query data processed per month for free; beyond that, it depends on your pricing model. On-demand queries against INFORMATION_SCHEMA incur a minimum of 10 MB of data processing charges even if the bytes processed are less, and under flat-rate pricing they consume BQ slots. It’s also worth noting that the results of these queries are not cached, so you are charged each time you run one. That said, the cost of running metadata queries will usually be nominal. If you want to dig deeper into potential query costs in general, check out this BQ pricing resource.
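If you want a rough sense of what a full scan of each table would cost before running anything, here is a minimal sketch built on the __TABLES__ query above; the $5 per TB figure is an assumption based on the on-demand rate at the time of writing, so check current pricing:

SELECT
  table_id,
  size_bytes / POW(10, 12) AS tb,
  -- assumes the $5 per TB on-demand rate at the time of writing
  ROUND(size_bytes / POW(10, 12) * 5, 4) AS est_full_scan_usd
FROM `bigquery-public-data.stackoverflow.__TABLES__`
ORDER BY est_full_scan_usd DESC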

Wrap Up

This post reviewed a couple of key queries to pull dataset and table metadata from BigQuery. There are other views, such as those for jobs, reservations, and routines, and you can get more information and details in the INFORMATION_SCHEMA guide. These views are there to give you a big-picture view of the structure of the datasets and tables you are working with so you can plan how best to engage.
