Hidden gems of BigQuery — Part 3

Artem Nikulchenko
Google Developer Experts
4 min read · Mar 20, 2023

This series started because of one BQ feature that was not listed in any documentation (but was very useful, in my opinion). That led to the first post, Hidden gems of BigQuery, on the Google Cloud Blog (Transactions and Indexes were also included). Later this feature was replaced with new cool features (Change History, Defaults), which resulted in Hidden gems of BigQuery — Part 2. But BigQuery progress never stops (thanks to the amazing BQ Team), so it is time for Part 3.

  • Data Lineage
  • Routines
  • Authorized routines

Data Lineage

Note: At the time of writing, this feature is in Preview.

If you really use BigQuery as your Data Warehouse, you most likely run an ELT process: first you move lots of data from one or multiple sources into BigQuery, and then you “massage” it inside BigQuery to get the OLAP-style structure you want.

The benefit of ELT versus ETL is that all transformation happens inside BigQuery (and using SQL), which makes it much more transparent and easier to support than traditional ETL processes.

But the BQ Team came up with a way to make it even more transparent: Data lineage!

Data lineage lets you see, in an easy visual format, how data was transformed from source to target:

Of course, the graph for a real use case can look more complicated:
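To give a sense of what gets captured: lineage is recorded automatically for SQL jobs, so a simple ELT step like the one below (all table and dataset names here are hypothetical) would appear in the graph as an edge from the raw staging table to the derived reporting table:

CREATE OR REPLACE TABLE reporting.daily_revenue AS
SELECT
  order_date,
  SUM(amount) AS revenue  -- aggregation step that lineage will attribute to this job
FROM staging.raw_orders
GROUP BY order_date;

No extra configuration is needed in the query itself; the lineage graph is built from the job metadata.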

Routines

While the primary goal of any DBMS is to store and query data, most of them offer additional functionality, including the ability to write functions and stored procedures within the DB itself (usually in SQL). In OLTP databases, those routines can be used to abstract complicated calculations from DB users, ensure consistency, provide an additional layer of abstraction, etc. In OLAP databases, stored procedures can be used to support the ELT process, and functions to encapsulate complicated calculations.

And BigQuery supports it all!

There are several types of routines that BigQuery supports:

  • User-defined functions
  • Table functions
  • Remote functions
  • Stored procedures
  • Stored procedures for Apache Spark

There is excellent documentation on all of them here, so we won’t go into details and will just quickly walk over their use cases.

If you are used to writing UDFs in SQL in your OLTP databases, you will find BigQuery SQL UDFs very similar and easy to use for the same purposes.

CREATE FUNCTION AddFourAndDivide(x INT64, y INT64)
RETURNS FLOAT64
AS (
(x + 4) / y
);

CREATE FUNCTION CountFunction()
AS (
(SELECT COUNT(*) FROM SomeTable)
);
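Once created, a UDF is called like any built-in function. A quick usage sketch, using the functions defined above (the literal arguments are arbitrary):

SELECT AddFourAndDivide(10, 2);  -- (10 + 4) / 2 = 7.0

SELECT CountFunction();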

If, however, you prefer JS — that is also supported!

CREATE TEMP FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js
AS r"""
return x*y;
""";

Table-valued functions (TVF) are also very similar to your regular OLTP TVFs (no JS, only SQL for those :) ).

CREATE OR REPLACE TABLE FUNCTION mydataset.names_by_year(y INT64)
AS (
SELECT year, name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = y
GROUP BY year, name
);
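And, just like in OLTP databases, a TVF is then queried as if it were a table. A usage sketch for the function above:

SELECT name, total
FROM mydataset.names_by_year(1950)
ORDER BY total DESC
LIMIT 5;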

Remote functions can be used when you want to make a call from a BigQuery statement to an HTTP endpoint (Cloud Functions and Cloud Run are currently supported). This gives unlimited flexibility (although it is also a little more work to set up), and you have complete freedom in deciding how to write your code and what it should do. You should, of course, make sure that your code executes in a reasonable time and consumes a reasonable amount of resources (since Cloud Functions and Cloud Run resources are billed separately).
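For reference, wiring up a remote function looks roughly like this (the connection name, project, and endpoint URL below are hypothetical); the endpoint itself is a Cloud Function or Cloud Run service that accepts and returns JSON:

CREATE FUNCTION mydataset.multiply_remote(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
REMOTE WITH CONNECTION `us.my-connection`  -- a BigQuery connection resource you create beforehand
OPTIONS (
  endpoint = 'https://us-central1-my-project.cloudfunctions.net/multiply'
);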

Note: I personally would not use Remote functions to perform calculations that are possible in SQL/JS just because I know Python or .NET Core better. However, when I need to pull some real-time data from an external source (one not supported by Federated Queries) or make calls to ML models outside of BQ, they are a great option.

Finally, BQ has Stored Procedures. At this point, only SQL is supported (aside from the Apache Spark flavor mentioned above), which makes BQ SPs very similar to the OLTP SPs you may be used to (and easy to understand and use). The main difference between UDFs and SPs: UDFs can’t modify data, while SPs can (similar to MS SQL in that regard).

CREATE PROCEDURE dataset_name.procedure_name()
BEGIN
-- statements here
END;
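A minimal sketch of a procedure that actually modifies data, and how to invoke it (dataset and table names are hypothetical):

CREATE OR REPLACE PROCEDURE mydataset.archive_old_orders(cutoff DATE)
BEGIN
  -- Copy rows older than the cutoff into an archive table...
  INSERT INTO mydataset.orders_archive
  SELECT * FROM mydataset.orders WHERE order_date < cutoff;

  -- ...then remove them from the live table (something a UDF could not do).
  DELETE FROM mydataset.orders WHERE order_date < cutoff;
END;

CALL mydataset.archive_old_orders(DATE '2020-01-01');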

While routines are a great BigQuery feature, the Authorized routines feature makes them even more useful!

Authorized routines

For UDFs, TVFs, and now SPs (in Preview), BigQuery supports what is called authorized routines.

Authorized routines let you share query results with specific users or groups without giving those users or groups access to the underlying tables. For example, an authorized routine can compute an aggregation over data or look up a table value and use that value in a computation.

By default, if a user invokes a routine, the user must have access to read the data in the table. As an alternative, you can authorize the routine to access the dataset that contains the referenced table. An authorized routine can query the tables in the dataset, even if the user who calls the routine can’t query those tables directly. (from BigQuery documentation).

This lets you create a layer of abstraction between your data and its consumers. For example, you may collect your customers' data (including personal data) for aggregated analysis. You want to make sure that nobody has access to the raw data, but still give access to aggregated summary data. An authorized UDF would be the solution to that problem.
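A sketch of that scenario (all dataset and table names are hypothetical; the authorization step itself is done in the console, with the bq tool, or via the API rather than in SQL):

-- The raw table lives in private_ds, which analysts cannot read.
-- The UDF lives in shared_ds, which analysts CAN read.
CREATE OR REPLACE FUNCTION shared_ds.avg_order_value()
RETURNS FLOAT64
AS (
  (SELECT AVG(amount) FROM private_ds.orders)
);

-- After shared_ds.avg_order_value is authorized on private_ds,
-- analysts can run this without any access to private_ds.orders:
SELECT shared_ds.avg_order_value();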

And now, with Authorized SPs, you can even give users the ability to modify (or recalculate and cache in temp tables) some data without direct access to the underlying tables.


Chief Software Architect with 10+ years of experience, PhD, Associate Professor, IT Univer co-organizer, GDG Organizer