New in BigQuery: Persistent UDFs

User defined functions are a powerful way to extend BigQuery, but until now it has been a drag having to copy paste them around. Not anymore! Let’s play with this new feature and share some useful ones — including NLP inside BigQuery.

Felipe Hoffa
Jul 2 · 6 min read

With persistent UDF support in BigQuery, now you can:

  • Create user-defined SQL and JavaScript functions.
  • Reference these functions across queries and in logical views.
  • Create org-wide libraries of business logic within shared datasets.

For example, now anyone can call this number parsing function:

SELECT fhoffa.x.parse_number('one hundred fifty seven')
, fhoffa.x.parse_number('three point 5')
, fhoffa.x.parse_number('2 hundred')
, fhoffa.x.parse_number('minus 8')
, fhoffa.x.parse_number('5 million 3 hundred 25 point zero 1')
157, 3.5, 200, -8, 5000325.01

How cool is that? See how to do this below.

Some cool functions I’m releasing now

Radians

Based on a Stack Overflow question:

SELECT fhoffa.x.radians(180) is_this_pi3.141592653589793

To create this function I did:

CREATE OR REPLACE FUNCTION `x.radians`(x ANY TYPE) AS (ACOS(-1) * x / 180);

“Better” cast to integer

My own int() makes me happier than the native CAST(x AS INT64):

SELECT fhoffa.x.int(1) int1
, fhoffa.x.int(2.5) int2
, fhoffa.x.int('7') int3
, fhoffa.x.int('7.8') int4
1, 2, 7, 7

To create this function I did:

CREATE OR REPLACE FUNCTION `x.int`(v ANY TYPE) AS (CAST(FLOOR(CAST(v AS FLOAT64)) AS INT64));

Note that CAST(x AS INT64) rounds the number, but I wish it just truncated it. I don’t have to deal with that behavior anymore, since my own int() takes care of it. Which helps with:

Random integer in a range

Due to the ROUND effects of CAST, getting a random integer isn’t straightforward with the BQ native functions. But now it is, thanks to my own random_int():

SELECT fhoffa.x.random_int(0,10) randint, COUNT(*) c
FROM UNNEST(GENERATE_ARRAY(1,1000))
GROUP BY 1
ORDER BY 1

To create this function I did:

CREATE OR REPLACE FUNCTION `x.random_int`(min ANY TYPE, max ANY TYPE) AS (fhoffa.x.int(min + RAND()*(max-min)));

Note that I’m using my own previously defined int() to cast appropriately!

Medians

To get the median of an array of numbers:

SELECT fhoffa.x.median([1,1,1,2,3,4,5,100,1000]) median_1
, fhoffa.x.median([1,2,3]) median_2
, fhoffa.x.median([1,2,3,4]) median_3
3.0, 2.0, 2.5

To use with real data:

SELECT year
, fhoffa.x.median(ARRAY_AGG(weight_pounds)) as median_weight
FROM `bigquery-public-data.samples.natality`
GROUP BY 1
ORDER BY 1

To create this function Elliott did:

CREATE OR REPLACE FUNCTION `fhoffa.x.median`(arr ANY TYPE) AS ((
SELECT IF (
MOD(ARRAY_LENGTH(arr), 2) = 0,
(arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
)
FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));

Fun with JavaScript

Now it’s easier than ever to encapsulate JavaScript functions and its dependencies. For example, I downloaded the latest NLP compromise library to create the following functions.

Parse numbers from text

The opening example:

SELECT fhoffa.x.parse_number('one hundred fifty seven')
, fhoffa.x.parse_number('three point 5')
, fhoffa.x.parse_number('2 hundred')
, fhoffa.x.parse_number('minus 8')
, fhoffa.x.parse_number('5 million 3 hundred 25 point zero 1')
157, 3.5, 200, -8, 5000325.01

To create this function I did:

CREATE OR REPLACE FUNCTION x.nlp_compromise_number(str STRING)
RETURNS NUMERIC LANGUAGE js AS '''
return nlp(str).values(0).toNumber().out()
'''
OPTIONS (
library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

Note that I’m using a JS library I downloaded straight from the internet — BigQuery is able to work with it seamlessly. But I had to share the copy I stored in GCS with you for this to work:

gsutil acl ch -u AllUsers:R gs://fh-bigquery/js/*

NLP — getting names out of text

Extracting names out of text:

SELECT fhoffa.x.nlp_compromise_people(
"hello, I'm Felipe Hoffa and I work with Elliott Brossard - who thinks Jordan Tigani will like this post?"
) names

Let’s find the top names mentioned on reddit /r/movies last February:

SELECT name, COUNT(*) c
FROM (
SELECT fhoffa.x.nlp_compromise_people(title) names
FROM `fh-bigquery.reddit_posts.2019_02`
WHERE subreddit = 'movies'
), UNNEST(names) name
WHERE name LIKE '% %'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

To create this function I did:

CREATE FUNCTION x.nlp_compromise_people(str STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
return nlp(str).people().out('topk').map(x=>x.normal)
'''
OPTIONS (
library="gs://fh-bigquery/js/compromise.min.11.14.0.js");

Fuzzy matching

Soundex

Based on Brian Suk’s post: fhoffa.x.soundex()

To create this function I did:

CREATE OR REPLACE FUNCTION `fhoffa.x.soundex`(s STRING)
RETURNS STRING LANGUAGE js AS """
// https://gist.github.com/shawndumas/1262659
var a = s.toLowerCase().split(''),
f = a.shift(),
r = '',
codes = {
a: '', e: '', i: '', o: '', u: '',
b: 1, f: 1, p: 1, v: 1,
c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
d: 3, t: 3,
l: 4,
m: 5, n: 5,
r: 6
};

r = f +
a
.map(function (v, i, a) { return codes[v] })
.filter(function (v, i, a) {
return ((i === 0) ? v !== codes[f] : v !== a[i - 1]);
})
.join('');

return (r + '000').slice(0, 4).toUpperCase();
""";

Levenshtein distance

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')

Uses fuzzball.js:

CREATE OR REPLACE FUNCTION fhoffa.x.levenshtein(a string, b string)
RETURNS INT64
LANGUAGE js AS """
return fuzzball.distance(a,b);
"""
OPTIONS (library="gs://fh-bigquery/js/fuzzball.umd.min.js");

Fuzzy choose one

SELECT fhoffa.x.fuzzy_extract_one('jony' 
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
# , ['john', 'johnny', 'jonathan', 'jonas']
)
johnny

Also with fuzzball:

CREATE OR REPLACE FUNCTION `fhoffa.x.fuzzy_extract_one`(a STRING, b ARRAY<STRING>) RETURNS STRING LANGUAGE js
OPTIONS (library=["gs://fh-bigquery/js/fuzzball.umd.min.js"]) AS """
return fuzzball.extract(a,b)[0][0];
""";

Notes

  • Persistent UDFs are in beta today.
  • UDFs are case-sensitive. I’m asking the team if we could relax this. Tell me if you’d like this too.
  • How to version and source these functions? Share your ideas, and stay tuned.
  • BigQuery can run wasm, so you could write these functions in any programming language that compiles to it (pending an async JS issue Myles Borins has been working to fix).
  • You can create persistent UDFs within the BigQuery sandbox without a credit card. They will be persisted indefinitely (beyond the default 60 day storage for tables in the same dataset).
  • To make my UDFs usable by anyone, I shared the dataset containing them with allAuthenticatedUsers

Next steps

Create and share your own UDFs! Post them on reddit.com/r/bigquery if you’d like more people to find them.

Thanks to Elliott Brossard for leading the development of this feature and helping me out with this post.

Want more stories? Check my Medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.

Felipe Hoffa

Written by

Developer Advocate @Google. Originally from Chile, now in San Francisco and around the world. Let’s talk data.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade