How to keep metadata of all tables in BigQuery?

Erkan Ekser
Published in Trendyol Tech
5 min read · Dec 20, 2021

Hi everyone 👋 In this post we will talk about how to keep the metadata of all tables in BigQuery. Of course there are many ways to do this, but we will focus on doing it with SQL statements. Our content covers:

  • Declaring variables in BigQuery.
  • Using INFORMATION_SCHEMA to get metadata about tables and datasets.
  • Using loops in BigQuery.
  • Using EXECUTE IMMEDIATE to run dynamic SQL statements.
  • Creating a metadata table covering all tables in our BigQuery project.
  • Creating metadata tables via the Python API.

Why do we need metadata?

Metadata is “data that provides information about other data”.

Imagine that you have many tables in BigQuery and many reports generated from them. When a problem arises in a report, you need information about the tables behind it to trace the source of the problem. When was this table created? When was it deleted? When was it last updated? With this information, you can work toward the solution step by step.

Let’s start.

1- Declare variable in BigQuery

We can declare a variable of a specified type. If you want to give it an initial value, use the DEFAULT clause; without it, the variable starts out as NULL. Let's look at a scenario.

For example, you need a constant date variable: the date three days before today, which you will use in many places in your SQL.

declare _date default (select date_sub(current_date(), INTERVAL 3 day));

or

declare _date date;
set _date = date_sub(current_date(), INTERVAL 3 day);
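
Either way, the variable can then be referenced anywhere later in the script. A minimal usage sketch, where the events table and its event_date column are hypothetical placeholders:

select count(*) as recent_rows
from `your-project.your-dataset.events`  -- hypothetical table
where event_date >= _date;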

2- INFORMATION_SCHEMA

INFORMATION_SCHEMA is a series of views that provide access to dataset metadata, job metadata, table metadata, view metadata etc.

  • We can get project-based metadata, which gives us metadata at the dataset level (note the backticks, which hyphenated project IDs require):
select * from `your-project`.INFORMATION_SCHEMA.SCHEMATA;
Datasets metadata
  • We can get dataset-based metadata, which gives us metadata at the table level. You can also see the full list of tables in your dataset there:
select * from `your-project.your-dataset`.INFORMATION_SCHEMA.TABLES;
Tables metadata
  • We can also look at table metadata through the hidden __TABLES__ table, which additionally exposes row_count and size_bytes; we will use it for our project. type=1 means a table and type=2 a view, and its creation_time and last_modified_time columns hold epoch milliseconds, which is why we convert them with timestamp_millis later. See the example after this list.
select * from `your-project.test_dataset.__TABLES__`;
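
For example, you can rank tables by size and convert the millisecond timestamps on the way out:

select table_id,
       row_count,
       round(size_bytes / pow(10, 9), 2) as size_gb,
       timestamp_millis(last_modified_time) as last_modified
from `your-project.test_dataset.__TABLES__`
where type = 1
order by size_bytes desc;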

3- Loops

LOOP executes sql_statement_list until a BREAK or LEAVE statement exits the loop. sql_statement_list is a list of zero or more SQL statements, each ending with a semicolon.

We will use the LOOP statement to iterate over our datasets. Let's look at the first part.

#1 declare the variables used in the loop
declare i int64;
declare dataset_name string;
set i = 0;

#2 temp table with one numbered row per dataset
create temp table dataset_list as (
  select
    schema_name,
    row_number() over() as rn
  from `your-project`.INFORMATION_SCHEMA.SCHEMATA
);

#3 walk through the datasets one by one
loop
  set i = i + 1;
  if i > (select max(rn) from dataset_list) then
    leave;
  end if;
  set dataset_name = (
    select schema_name
    from dataset_list
    where rn = i
  );
  -- dataset_name will feed the EXECUTE IMMEDIATE statement in section 5
end loop;
  • #1 We declare the variables used in the loop.
  • #2 We create a temp table with the dataset list and give each dataset a row number.
  • #3 We set dataset_name on every iteration of the loop. Later we will use this variable in the EXECUTE IMMEDIATE statement.

4- EXECUTE IMMEDIATE

Executes a dynamic SQL statement on the fly.

EXECUTE IMMEDIATE lets you build a SQL statement as a string and run it, passing variables in with USING. You can also store the results in one or more variables with INTO; check the documentation for more information. We will use EXECUTE IMMEDIATE so that the dataset_name variable can be interpolated into our SQL statement.

An example from the official documentation:

-- create some variables
DECLARE book_name STRING DEFAULT 'Ulysses';
DECLARE book_year INT64 DEFAULT 1922;
DECLARE first_date INT64;

-- Create a temporary table called Books.
EXECUTE IMMEDIATE
"CREATE TEMP TABLE Books (title STRING, publish_date INT64)";

-- Add a row for Hamlet (less secure)
EXECUTE IMMEDIATE
"INSERT INTO Books (title, publish_date) VALUES('Hamlet', 1599)";

-- add a row for Ulysses, using the variables declared at the top of this
-- script and the ? placeholder
EXECUTE IMMEDIATE
"INSERT INTO Books (title, publish_date) VALUES(?, ?)"
USING book_name, book_year;
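
Note that first_date is declared but never filled in this excerpt; the documentation goes on to populate it with INTO, which is how EXECUTE IMMEDIATE stores a query result in a variable, along these lines:

-- save the publish date of the first book to the first_date variable
EXECUTE IMMEDIATE
"SELECT publish_date FROM Books LIMIT 1"
INTO first_date;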

5- Create metadata table for all tables

Now we will create a table that contains the metadata of all tables. First, create the table that will hold it:

create table `your-project.your-dataset.tables_last_modified_times` (
  project_id STRING,
  dataset_id STRING,
  table_id STRING,
  last_modified_time TIMESTAMP,
  creation_time TIMESTAMP,
  row_count INT64,
  size_bytes INT64
);

Then write a merge statement with EXECUTE IMMEDIATE to update our table. You should fill in the your-project and your-dataset placeholders in the statement.

EXECUTE IMMEDIATE CONCAT(
'merge into `your-project.your-dataset.tables_last_modified_times` as tbl1 ',
'using (select project_id, dataset_id, table_id, timestamp_millis(last_modified_time) as last_modified_time, timestamp_millis(creation_time) as creation_time, row_count, size_bytes from ',
'`your-project.', dataset_name, '.__TABLES__` where type = 1) tbl2 ',
'on tbl1.project_id = tbl2.project_id and tbl1.dataset_id = tbl2.dataset_id and tbl1.table_id = tbl2.table_id ',
'when matched then update set tbl1.last_modified_time = tbl2.last_modified_time, tbl1.row_count = tbl2.row_count, tbl1.size_bytes = tbl2.size_bytes ',
'when not matched then insert (project_id, dataset_id, table_id, last_modified_time, creation_time, row_count, size_bytes) values (tbl2.project_id, tbl2.dataset_id, tbl2.table_id, tbl2.last_modified_time, tbl2.creation_time, tbl2.row_count, tbl2.size_bytes)');
  • We take only objects with type=1, i.e. tables.
  • If you want to keep a historical record instead of overwriting, you can switch the merge to a plain insert statement, as sketched below.
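
A minimal sketch of that historical variant, assuming you add a snapshot_time column (the tables_history table name is a hypothetical placeholder):

EXECUTE IMMEDIATE CONCAT(
'insert into `your-project.your-dataset.tables_history` ',  -- hypothetical history table
'select project_id, dataset_id, table_id, timestamp_millis(last_modified_time), timestamp_millis(creation_time), row_count, size_bytes, current_timestamp() as snapshot_time from ',
'`your-project.', dataset_name, '.__TABLES__` where type = 1');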

Now we can combine all the statements into a single script, shown below. You can keep the table up to date by scheduling this query at whatever interval you need.

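Assembled from the snippets above (replace your-project and your-dataset with your own names):

declare i int64;
declare dataset_name string;
set i = 0;

create temp table dataset_list as (
  select schema_name, row_number() over() as rn
  from `your-project`.INFORMATION_SCHEMA.SCHEMATA
);

loop
  set i = i + 1;
  if i > (select max(rn) from dataset_list) then
    leave;
  end if;
  set dataset_name = (select schema_name from dataset_list where rn = i);
  -- run the merge from above for the current dataset
  execute immediate concat(
  'merge into `your-project.your-dataset.tables_last_modified_times` as tbl1 ',
  'using (select project_id, dataset_id, table_id, timestamp_millis(last_modified_time) as last_modified_time, timestamp_millis(creation_time) as creation_time, row_count, size_bytes from ',
  '`your-project.', dataset_name, '.__TABLES__` where type = 1) tbl2 ',
  'on tbl1.project_id = tbl2.project_id and tbl1.dataset_id = tbl2.dataset_id and tbl1.table_id = tbl2.table_id ',
  'when matched then update set tbl1.last_modified_time = tbl2.last_modified_time, tbl1.row_count = tbl2.row_count, tbl1.size_bytes = tbl2.size_bytes ',
  'when not matched then insert (project_id, dataset_id, table_id, last_modified_time, creation_time, row_count, size_bytes) values (tbl2.project_id, tbl2.dataset_id, tbl2.table_id, tbl2.last_modified_time, tbl2.creation_time, tbl2.row_count, tbl2.size_bytes)');
end loop;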

6- Create metadata table via python api

Another option for creating the metadata table is the Python API. First, create the table as in step 5.

There are some important points here. Before running the code, check your query limits: the code builds a single SQL statement spanning many datasets' metadata tables, so I run one query per 40 datasets. You can change this batch size to suit your project.

We can also keep a historical table with this approach. You can schedule the code below.

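The original post embeds the script as a gist; here is a minimal sketch of the same idea, assuming the google-cloud-bigquery client library. PROJECT, TARGET_TABLE, and BATCH_SIZE are placeholders, and the original gist may differ in its details:

# Sketch: snapshot every dataset's __TABLES__ metadata into one table,
# querying BATCH_SIZE datasets at a time to stay within query limits.
from google.cloud import bigquery

PROJECT = "your-project"                                        # placeholder
TARGET_TABLE = "your-project.your-dataset.tables_last_modified_times"  # placeholder
BATCH_SIZE = 40  # datasets per query; tune to your limits

client = bigquery.Client(project=PROJECT)

# List every dataset in the project.
dataset_ids = [d.dataset_id for d in client.list_datasets(PROJECT)]

for start in range(0, len(dataset_ids), BATCH_SIZE):
    batch = dataset_ids[start:start + BATCH_SIZE]
    # One SELECT per dataset over its __TABLES__ metadata table,
    # unioned into a single INSERT statement.
    selects = [
        (
            "select project_id, dataset_id, table_id, "
            "timestamp_millis(last_modified_time) as last_modified_time, "
            "timestamp_millis(creation_time) as creation_time, "
            "row_count, size_bytes "
            f"from `{PROJECT}.{ds}.__TABLES__` where type = 1"
        )
        for ds in batch
    ]
    sql = f"insert into `{TARGET_TABLE}` " + " union all ".join(selects)
    client.query(sql).result()  # wait for each job to finish

Because each run appends fresh rows rather than merging, this variant naturally accumulates a history of snapshots.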

Summary

  • We covered DECLARE, INFORMATION_SCHEMA, loops, and EXECUTE IMMEDIATE.
  • We declared variables to use in a loop. Dataset metadata lives in your-project.INFORMATION_SCHEMA.SCHEMATA, and table metadata in your-project.dataset_name.__TABLES__ .
  • We wanted all tables' metadata in one table, so we used EXECUTE IMMEDIATE with CONCAT to inject the dataset_name variable and merge every dataset's metadata into a single table.
  • We can also maintain the metadata table via the Python API.

Thanks for your time. Happy days! 🥳
