A guide to BigQuery INFORMATION_SCHEMA

Abhik Saha
Google Cloud - Community
5 min readApr 20, 2023

--

Introduction

In BigQuery, the INFORMATION_SCHEMA is a special schema that contains metadata about datasets, tables, views, columns, and other objects within a dataset. This schema can be queried like any other schema in BigQuery, and it provides a convenient way to explore the structure and properties of datasets and their objects.

Image credits :dataedo.com

The INFORMATION_SCHEMA schema contains a set of tables, each of which corresponds to a different type of metadata. The BigQuery INFORMATION_SCHEMA views are read-only, system-defined views that provide metadata information about your BigQuery objects.

The INFORMATION_SCHEMA schema is read-only, which means you cannot modify the metadata it contains. However, you can use the information it provides to help you create and manage your datasets and tables in BigQuery.

Here are some examples of queries you can run using the INFORMATION_SCHEMA in BigQuery:

  1. To query some key details about a table:
SELECT table_catalog,table_schema,table_name,table_type,cast(creation_time as date) as creation_date,ddl
FROM `plated-field-383807.manual_input.INFORMATION_SCHEMA.TABLES`
WHERE table_name = 'all_orders_table';
INFORMATION_SCHEMA.TABLES

2. To query data about partitioned tables:

SELECT
table_schema,table_name,
partition_id,total_rows,total_logical_bytes,total_billable_bytes,
cast(last_modified_time AS STRING FORMAT 'YYYY-MM-DD HH24:MI:SS') as last_modified_datetime
FROM
`manual_input.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name='all_orders_table';
INFORMATION_SCHEMA.PARTITIONS

3. To query data about procedures, UDFs etc:

SELECT
specific_catalog,specific_schema,specific_name,routine_type,routine_body, ddl,
cast(created AS STRING FORMAT 'YYYY-MM-DD HH24:MI:SS') as created_datetime
FROM
`manual_input.INFORMATION_SCHEMA.ROUTINES`;
INFORMATION_SCHEMA.ROUTINES

Dataset qualifiers

You can use dataset qualifiers with the following views in the INFORMATION_SCHEMA:

  • COLUMNS
  • COLUMN_FIELD_PATHS
  • PARAMETERS
  • PARTITIONS
  • ROUTINES
  • ROUTINE_OPTIONS
  • TABLES
  • TABLE_OPTIONS
  • VIEWS

The dataset qualifiers allow you to specify which dataset the metadata should be retrieved from when querying these views. This can be useful if you have multiple datasets with similar or identical object names and you need to differentiate between them.

Region qualifier

To specify a region in BigQuery, you can use the syntax region-REGION, where REGION is the name of the location. You can use any valid dataset location name as the REGION value, such as region-us, region-asia-east2, or region-europe-north1.

Using a region qualifier in a query restricts the results to the specified location. For example, if you want to retrieve metadata for all datasets in a project in the US multi-region, you can use the following query:

SELECT * FROM `region-asia-east2.INFORMATION_SCHEMA.SCHEMATA`;
your-region.INFORMATION_SCHEMA.SCHEMATA

Region qualifiers cannot be used with the following INFORMATION_SCHEMA views:

  • INFORMATION_SCHEMA.PARTITIONS
  • INFORMATION_SCHEMA.SEARCH_INDEXES
  • INFORMATION_SCHEMA.SEARCH_INDEX_COLUMNS

If you don’t specify either a region qualifier or a dataset qualifier when querying these views, you will receive an error. It’s important to keep this in mind when writing queries that involve these views to avoid encountering errors.

Project qualifier

You can restrict the results of a query to a specific project in BigQuery by using a project qualifier. To use this qualifier, you specify the name of the project before the dataset qualifier (if any) in the query. Here are some examples:

-- Returns metadata for the specified project and region.
SELECT * FROM myProject.`region-us`.INFORMATION_SCHEMA.TABLES;

-- Returns metadata for the specified project and dataset.
SELECT * FROM myProject.myDataset.INFORMATION_SCHEMA.TABLES;

In these examples, the first query returns metadata for the specified project and region, while the second query returns metadata for the specified project and dataset.

All INFORMATION_SCHEMA views support project qualifiers. If you don’t specify a project qualifier, the view will default to the project in which the query is running.

It’s worth noting that specifying a project qualifier for organization-level views, such as STREAMING_TIMELINE_BY_ORGANIZATION, has no impact on the results.

USE CASES

BigQuery’s metadata using INFORMATION_SCHEMA views can be helpful for optimizing your data warehouse queries:

  1. Identifying expensive queries: By querying the JOBS view, you can get detailed information about past queries that have been executed in BigQuery, including their execution times and costs. This can help you identify expensive queries that might be impacting your budget and optimize them for better performance.
  2. Understanding table structures: By querying the COLUMNS view, you can get information about the columns in your tables, including their data types and nullability. This can help you optimize your queries by ensuring that you're using the most appropriate data types and minimizing unnecessary joins and aggregations.
  3. Monitoring table growth: By querying the TABLES view, you can get information about the size and growth of your tables over time. This can help you identify tables that are growing quickly and may need to be partitioned or otherwise optimized for better performance.
  4. Exploring dataset relationships: By querying the TABLES and VIEWS,you can get information about the relationships between your datasets, tables, and views. This can help you better understand your data and optimize your queries by reducing unnecessary joins and subqueries.

Here’s an example query that you could use to get a list of the top ten most expensive queries that have been executed in your project over the past 30 days:

SELECT
project_id,
job_id,
user_email,
creation_time,
start_time,
end_time,
total_slot_ms,
total_bytes_billed,
total_bytes_processed,
--billing_tier,
cache_hit,
query
FROM
`region-us`.INFORMATION_SCHEMA.JOBS
WHERE
creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
ORDER BY
total_bytes_billed DESC
LIMIT
10;

Constraints

When working with BigQuery’s INFORMATION_SCHEMA, keep in mind the following:

  • Queries using INFORMATION_SCHEMA must use GoogleSQL syntax and not legacy SQL, as the latter is not supported.
  • INFORMATION_SCHEMA query results are not cached, so each query will result in a fresh set of data being retrieved from the metadata store.
  • INFORMATION_SCHEMA views cannot be used in Data Definition Language (DDL) statements, so you cannot use them to create, modify, or drop tables or other objects in BigQuery. They are purely for querying metadata about existing objects in the data warehouse.

COST

When it comes to pricing for INFORMATION_SCHEMA queries in BigQuery:

  • If your project uses on-demand pricing, queries against INFORMATION_SCHEMA views will incur a minimum of 10 MB of data processing charges, regardless of the actual bytes processed by the query.
  • This means that even if your query processes less than 10 MB of data, you will still be billed for at least 10 MB worth of data processing.
  • The reason for this is that 10 MB is the minimum billing amount for on-demand queries in BigQuery.
  • To learn more about BigQuery’s on-demand pricing and how it works, you can refer to the official documentation.

--

--

Abhik Saha
Google Cloud - Community

Data Engineer @Accenture India || Writes about Bigquery, Cloud Function, GCP, SQL || LinkedIn ID: https://www.linkedin.com/in/abhik-saha-919646108/