BigTips: INFORMATION_SCHEMA Views in BigQuery, Part 2, with extra Scripts and Procedures!
When working with and implementing BigQuery, there are a number of small problems for which I couldn’t find documentation or a working example of a solution. This happens with any database from time to time. While these probably aren’t groundbreaking problems to solve, hopefully the answers will make someone’s day a little easier. Sometimes, it’s the little things.
BigTips: Make working with BigQuery’s INFORMATION_SCHEMA a little bit easier by flattening the tables!
This is a follow-up to my earlier post on flattening the views to get a consolidated view of a dataset’s metadata. Be sure to read that first, as this post assumes you’ve already gone through it and builds upon it.
In the last post, the views that we were looking at joined the TABLES, TABLE_OPTIONS, and COLUMN_FIELD_PATHS views. While that is helpful (or at least I’d like to think so), one of the limitations is that it can only do this within the context of an individual dataset. What happens, though, when you want to create a metadata catalog that collects information for all the datasets and tables within an entire project?
This is where BigQuery Scripting and Procedures, and our friend the EXECUTE IMMEDIATE command, come in handy. What we’re going to do is create a procedure that collects all the datasets to scan, dynamically constructs query strings that collect metadata for each dataset, and stores the results in a single table for querying.
What lets us do this is another view called SCHEMATA. It’s pretty simple, and all it does is return a list of the datasets within a single project. You can query this with SELECT * FROM INFORMATION_SCHEMA.SCHEMATA.
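For example (depending on your region configuration, you may need a region qualifier like region-us.INFORMATION_SCHEMA.SCHEMATA):

-- Returns one row per dataset in the project.
SELECT *
FROM
  INFORMATION_SCHEMA.SCHEMATA;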
It’s pretty straightforward. From here you can just select on the schema_name column to get a list of the datasets, and construct dynamic queries from that. Let’s get a simple example of what this looks like.
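Here’s the whole script in one piece (the breakdown below goes through it snippet by snippet):

DECLARE schema_list ARRAY<STRING>;
DECLARE iter INT64 DEFAULT 0;
DECLARE query_string STRING;

SET schema_list = (
  SELECT
    ARRAY_AGG(schema_name)
  FROM
    INFORMATION_SCHEMA.SCHEMATA
  WHERE
    (schema_name = "hellometadata")
    OR (schema_name = "geo"));

WHILE
  iter < ARRAY_LENGTH(schema_list) DO
  SET query_string = "SELECT * EXCEPT (table_catalog) FROM "
    || schema_list[OFFSET(iter)] || ".INFORMATION_SCHEMA.TABLES";
  EXECUTE IMMEDIATE query_string;
  SET iter = iter + 1;
END WHILE;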
Let’s break this down a bit.
DECLARE schema_list ARRAY<STRING>;
DECLARE iter INT64 DEFAULT 0;
DECLARE query_string STRING;
With BigQuery Scripting, all variables must be declared at the very beginning, no matter where they’re used. It’s nice to have everything in one place so you’re not hunting for variables, but for temporary, context-specific things like loop iterators it can get confusing, especially with super long scripts. You win some, you lose some.
SET schema_list = (
SELECT
ARRAY_AGG(schema_name)
FROM
INFORMATION_SCHEMA.SCHEMATA
WHERE
(schema_name = "hellometadata")
OR (schema_name = "geo"));
This one’s pretty simple. I’m just trying to get a list of all the dataset names in the project, aggregate them into an array, and put that array into our variable. Here, just for illustrative purposes, I’m restricting it to our hellometadata dataset from the previous post, and another one I happen to have around called geo.
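If you want to scan every dataset in the project instead, just drop the WHERE clause:

-- Collects every dataset name in the project into the array.
SET schema_list = (
  SELECT
    ARRAY_AGG(schema_name)
  FROM
    INFORMATION_SCHEMA.SCHEMATA);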
WHILE
iter < ARRAY_LENGTH(schema_list) DO
SET query_string = "SELECT * EXCEPT (table_catalog) FROM "
|| schema_list[OFFSET(iter)] || ".INFORMATION_SCHEMA.TABLES";
EXECUTE IMMEDIATE query_string;
SET iter = iter + 1;
END WHILE;
This loop iterates through the array of dataset names. For each dataset, it constructs a string literal of our SQL query, embedding the dataset name into a SELECT statement that gets all the information from that dataset’s INFORMATION_SCHEMA.TABLES view. The EXECUTE IMMEDIATE statement just runs whatever is in that string. The good thing is that this combination of statements lets you dynamically create SQL queries, which is handy for use cases like this. The downside is that there is no direct query validation in the BigQuery Console, since it treats the query as a string literal.
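For example, on the first pass through the loop, the constructed string would be:

SELECT * EXCEPT (table_catalog) FROM hellometadata.INFORMATION_SCHEMA.TABLES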
You’ll see that it ran three jobs. The first job is the query it runs to set the schema_list variable. The next two are the EXECUTE IMMEDIATE commands it runs, one for each dataset (two in this example).
If you click on the “view results” button for one of the jobs querying the views, you will see the metadata around what happened with that particular step. This will look a lot like the job information output of a normal query. You can click the back button in the query results section and click “view results” on the other queries to see what they look like.
And now you have a quick example of how to dynamically create query strings and run them in sequence!
Using that method, we can now take the query from the last post and combine it with this script to iterate across all datasets and output the metadata results into a single table for querying! You can find the full query here.
Conceptually, it is the same as our example script. It just pulls from the INFORMATION_SCHEMA.SCHEMATA view to collect the list of datasets, then iterates through that array to query the individual dataset-level INFORMATION_SCHEMA views and writes that information into a single table for storage. There are a couple of things to note:
- This SQL query creates a stored procedure which you can call to process all the information on-demand.
- The script will write to a dataset called metadata_catalog, but if you want to store this somewhere else, just be sure to update that.
- This automatically stores everything in a table called metadata_all, and it will run a CREATE OR REPLACE TABLE each time it’s run, so it will always have the latest data, and only the latest data. It will not keep a history of schema changes.
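As a rough sketch of the shape of that procedure (the real query in the repo joins the TABLES, TABLE_OPTIONS, and COLUMN_FIELD_PATHS views; this simplified version only collects from INFORMATION_SCHEMA.TABLES, and the column list here is illustrative):

CREATE OR REPLACE PROCEDURE `metadata_catalog.generateMetadataAll`(metadata_dataset STRING)
BEGIN
  DECLARE schema_list ARRAY<STRING>;
  DECLARE iter INT64 DEFAULT 0;
  -- Fall back to the metadata_catalog dataset if the caller passes NULL.
  DECLARE target STRING DEFAULT IFNULL(metadata_dataset, "metadata_catalog");

  -- Collect every dataset name in the project, no WHERE clause this time.
  SET schema_list = (
    SELECT ARRAY_AGG(schema_name)
    FROM INFORMATION_SCHEMA.SCHEMATA);

  -- Recreate the output table so each run holds only the latest metadata.
  EXECUTE IMMEDIATE "CREATE OR REPLACE TABLE " || target || ".metadata_all ("
    || "table_schema STRING, table_name STRING, table_type STRING, creation_time TIMESTAMP)";

  -- Append each dataset's table metadata into the single catalog table.
  WHILE iter < ARRAY_LENGTH(schema_list) DO
    EXECUTE IMMEDIATE "INSERT INTO " || target || ".metadata_all "
      || "SELECT table_schema, table_name, table_type, creation_time FROM "
      || schema_list[OFFSET(iter)] || ".INFORMATION_SCHEMA.TABLES";
    SET iter = iter + 1;
  END WHILE;
END;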
If you want to run through this as an example, just follow these steps.
- Pull the repository again to get the latest scripts (or just manually download all the files from here if you don’t want to get the entire thing).
- Run ./runmefirst.sh again. Even if you ran it the last time, one of the tables is set with an expiration time, so it may have disappeared on you.
- Run ./runmesecond.sh. This script will generate a second dataset with a table in it, and will also generate a dataset called metadata_catalog.
- Run the sproc_generate_Metadata_All.sql query. This will generate a stored procedure called metadata_catalog.generateMetadataAll.
- In the BigQuery console, when you bring up the details on the stored procedure, click the button that says “Invoke Stored Procedure.” This will automatically populate the query editor with the proper way to invoke it. It will look like:
DECLARE metadata_dataset STRING DEFAULT NULL;
CALL `[proj].metadata_catalog.generateMetadataAll`(metadata_dataset);
From here, you can either change NULL to your dataset so it reads ...STRING DEFAULT 'metadata_catalog'; or you can get rid of that line and just change the input parameter of the call itself to have 'metadata_catalog' in it directly. Either way works.
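For example, the first option would look like this:

DECLARE metadata_dataset STRING DEFAULT 'metadata_catalog';
CALL `[proj].metadata_catalog.generateMetadataAll`(metadata_dataset);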
It might take a minute or so to generate it depending on how many datasets you have, but you should now have a shiny new table which you can query and/or incorporate into other processes!
If you use this, be sure to incorporate it into your deployment workflow so it runs regularly, or whenever you anticipate structures will change, and your table always has the latest data. Hopefully this makes your life a little bit easier when working with the INFORMATION_SCHEMA, and be sure to check out more BigQuery stuff in the Google Cloud Medium Channel! Happy Querying!