BigQuery beyond SQL and JS: Running C and Rust code at scale

Running arbitrary JavaScript code with a SQL query is cool — but you know what’s cooler? BigQuery running arbitrary C and Rust code. Francesc explained how, and in this post we’ll figure out how to make it run at scale.

It all started with Francesc’s demo: a BigQuery UDF that calls a C function compiled to WebAssembly. We were all impressed, even though 3 seconds to add 2 numbers isn’t impressive at all. But a couple of hours later, we were able to run this same function at scale:

1 row in 3 seconds is not impressive. 5 billion rows in 56 seconds is. Let’s see how we went from A to B:

BigQuery strengths: Throughput, not latency

If you are expecting results in less than one second, BigQuery is not the right tool: it will rarely return anything that fast.

This is because BigQuery was built for throughput, not for latency. A race car will always be faster than a heavyweight truck — except when you ask the race car to move 3 tons of cargo from one place to another. That’s when you bring BigQuery into your life: not for sub-second operations, but to analyze tons of data effortlessly.

Nevertheless, scaling up this WebAssembly function in BigQuery took some extra work.

BigQuery UDF tricks: For slow UDFs, batch the requests

In Francesc’s initial UDF, most of the time is spent initializing WebAssembly. Each time we pass a new row to the UDF, that long initialization happens all over again.

To solve this problem we can batch the requests to the UDF: we group rows into big, arbitrary chunks, so the UDF initializes once and processes every row in the chunk without initializing again.
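As a sketch of the idea (runnable in Node.js, which exposes the same `WebAssembly` API as BigQuery's UDF runtime; the tiny hand-assembled module below is a stand-in for the compiled C function in the real queries): compile and instantiate once, then reuse that single instance for every row in the batch.

```javascript
// Minimal wasm module exporting add(i32, i32) -> i32, hand-assembled.
const bytes = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0,              // "\0asm" magic + version 1
  1, 7, 1, 96, 2, 127, 127, 1, 127,         // type section: (i32, i32) -> i32
  3, 2, 1, 0,                               // one function of that type
  7, 7, 1, 3, 97, 100, 100, 0, 0,           // export it under the name "add"
  10, 9, 1, 7, 0, 32, 0, 32, 1, 106, 11     // body: local.get 0/1, i32.add, end
]);

// The expensive part: compile + instantiate, done ONCE per batch, not per row.
const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
const add = instance.exports.add;

// Simulated batched UDF: row.x carries a whole chunk of values,
// so the initialization above is amortized over all of them.
function batchedUdf(row, emit) {
  for (let i = 0; i < row.x.length; i++) {
    emit({ s: add(row.x[i], row.x[i]) });
  }
}

const out = [];
batchedUdf({ x: [1, 2, 3] }, r => out.push(r.s));
console.log(out); // [ 2, 4, 6 ]
```

The names `batchedUdf` and `add` are illustrative, not part of BigQuery's API; the shape of the loop mirrors the UDF in the second query below.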

This means wrapping a source like this:

SELECT requests FROM [fh-bigquery:wikipedia.pagecounts_201205]

into random groups, with code like this:

SELECT FLOOR(RAND()*100000) group, NEST(requests) AS x
FROM (
  SELECT requests
  FROM [fh-bigquery:wikipedia.pagecounts_201205]
)
GROUP BY group

For a table of 5 billion rows like this one, this code creates ~100,000 groups of ~50,000 elements each. We can pass these groups to the UDF, to get to our desired performance.
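To sanity-check that arithmetic with a quick simulation (plain Node.js, scaled-down numbers, not BigQuery itself): `FLOOR(RAND()*K)` spreads N rows over roughly K groups of about N/K rows each.

```javascript
// Simulate FLOOR(RAND()*K) grouping on a small sample.
const K = 10;       // number of random group keys (100,000 in the real query)
const N = 100000;   // rows in the sample (5 billion in the real table)

const counts = new Array(K).fill(0);
for (let i = 0; i < N; i++) {
  counts[Math.floor(Math.random() * K)]++;
}

// Every group lands close to the expected N / K = 10,000 rows.
console.log(Math.min(...counts) > 9000 && Math.max(...counts) < 11000); // true
```

The groups are not exactly equal, but with counts this large the deviation from N/K is tiny, which is all the batching trick needs.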


Code

Adding 2 numbers in 3 seconds

SELECT * FROM
js(
(SELECT 1 x, 2 y)
, x, y
, "[{name:'s', type: 'float'}]",
"function (row, emit) {
const memory = new WebAssembly.Memory({ initial: 256, maximum: 256 });
const env = {
'abortStackOverflow': _ => { throw new Error('overflow'); },
'table': new WebAssembly.Table({ initial: 0, maximum: 0, element: 'anyfunc' }),
'tableBase': 0,
'memory': memory,
'memoryBase': 1024,
'STACKTOP': 0,
'STACK_MAX': memory.buffer.byteLength,
};
const imports = { env };
const bytes = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 139, 128, 128, 128, 0, 2, 96, 1, 127, 0, 96, 2, 127, 127, 1, 127, 2, 254, 128, 128, 128, 0, 7, 3, 101, 110, 118, 8, 83, 84, 65, 67, 75, 84, 79, 80, 3, 127, 0, 3, 101, 110, 118, 9, 83, 84, 65, 67, 75, 95, 77, 65, 88, 3, 127, 0, 3, 101, 110, 118, 18, 97, 98, 111, 114, 116, 83, 116, 97, 99, 107, 79, 118, 101, 114, 102, 108, 111, 119, 0, 0, 3, 101, 110, 118, 6, 109, 101, 109, 111, 114, 121, 2, 1, 128, 2, 128, 2, 3, 101, 110, 118, 5, 116, 97, 98, 108, 101, 1, 112, 1, 0, 0, 3, 101, 110, 118, 10, 109, 101, 109, 111, 114, 121, 66, 97, 115, 101, 3, 127, 0, 3, 101, 110, 118, 9, 116, 97, 98, 108, 101, 66, 97, 115, 101, 3, 127, 0, 3, 130, 128, 128, 128, 0, 1, 1, 6, 147, 128, 128, 128, 0, 3, 127, 1, 35, 0, 11, 127, 1, 35, 1, 11, 125, 1, 67, 0, 0, 0, 0, 11, 7, 136, 128, 128, 128, 0, 1, 4, 95, 115, 117, 109, 0, 1, 9, 129, 128, 128, 128, 0, 0, 10, 196, 128, 128, 128, 0, 1, 190, 128, 128, 128, 0, 1, 7, 127, 2, 64, 35, 4, 33, 8, 35, 4, 65, 16, 106, 36, 4, 35, 4, 35, 5, 78, 4, 64, 65, 16, 16, 0, 11, 32, 0, 33, 2, 32, 1, 33, 3, 32, 2, 33, 4, 32, 3, 33, 5, 32, 4, 32, 5, 106, 33, 6, 32, 8, 36, 4, 32, 6, 15, 0, 11, 0, 11]);
WebAssembly.instantiate(bytes, imports).then(wa => {
const exports = wa.instance.exports;
const sum = exports._sum;
emit({s: sum(row.x, row.y)});
});
}"
)
# Query complete (3.0s elapsed, 0 B processed)
# 3.0

Doubling 5 billion numbers in 48 seconds

SELECT SUM(s)
FROM js(
  (SELECT FLOOR(RAND()*100000) group, NEST(requests) AS x
   FROM (
     SELECT requests
     FROM [fh-bigquery:wikipedia.pagecounts_201205]
   )
   GROUP BY group)
  , group, x
, "[{name:'s', type: 'float'}]",
"function (row, emit) {
const memory = new WebAssembly.Memory({ initial: 256, maximum: 256 });
const env = {
'abortStackOverflow': _ => { throw new Error('overflow'); },
'table': new WebAssembly.Table({ initial: 0, maximum: 0, element: 'anyfunc' }),
'tableBase': 0,
'memory': memory,
'memoryBase': 1024,
'STACKTOP': 0,
'STACK_MAX': memory.buffer.byteLength,
};
const imports = { env };
const bytes = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 139, 128, 128, 128, 0, 2, 96, 1, 127, 0, 96, 2, 127, 127, 1, 127, 2, 254, 128, 128, 128, 0, 7, 3, 101, 110, 118, 8, 83, 84, 65, 67, 75, 84, 79, 80, 3, 127, 0, 3, 101, 110, 118, 9, 83, 84, 65, 67, 75, 95, 77, 65, 88, 3, 127, 0, 3, 101, 110, 118, 18, 97, 98, 111, 114, 116, 83, 116, 97, 99, 107, 79, 118, 101, 114, 102, 108, 111, 119, 0, 0, 3, 101, 110, 118, 6, 109, 101, 109, 111, 114, 121, 2, 1, 128, 2, 128, 2, 3, 101, 110, 118, 5, 116, 97, 98, 108, 101, 1, 112, 1, 0, 0, 3, 101, 110, 118, 10, 109, 101, 109, 111, 114, 121, 66, 97, 115, 101, 3, 127, 0, 3, 101, 110, 118, 9, 116, 97, 98, 108, 101, 66, 97, 115, 101, 3, 127, 0, 3, 130, 128, 128, 128, 0, 1, 1, 6, 147, 128, 128, 128, 0, 3, 127, 1, 35, 0, 11, 127, 1, 35, 1, 11, 125, 1, 67, 0, 0, 0, 0, 11, 7, 136, 128, 128, 128, 0, 1, 4, 95, 115, 117, 109, 0, 1, 9, 129, 128, 128, 128, 0, 0, 10, 196, 128, 128, 128, 0, 1, 190, 128, 128, 128, 0, 1, 7, 127, 2, 64, 35, 4, 33, 8, 35, 4, 65, 16, 106, 36, 4, 35, 4, 35, 5, 78, 4, 64, 65, 16, 16, 0, 11, 32, 0, 33, 2, 32, 1, 33, 3, 32, 2, 33, 4, 32, 3, 33, 5, 32, 4, 32, 5, 106, 33, 6, 32, 8, 36, 4, 32, 6, 15, 0, 11, 0, 11]);
WebAssembly.instantiate(bytes, imports).then(wa => {
const exports = wa.instance.exports;
const sum = exports._sum;
for (var i = 0, len = row.x.length; i < len; i++) {
emit({s: sum(row.x[i], row.x[i])});
}
});
}"
)

Step by step

See Francesc’s step-by-step post: https://blog.sourced.tech/post/calling-c-functions-from-bigquery/.

FAQ

Why not #standardSQL?

BigQuery traditionally used its own variant of SQL, which is what we are using here. Lately BigQuery has taken a huge step toward adopting standard SQL — but we had to use legacy SQL for this, as UDFs in the new model don’t have a way to handle asynchronous returns. Stay tuned.
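To see why asynchrony matters here (a minimal sketch in Node.js, instantiating the smallest valid wasm module): `WebAssembly.instantiate` returns a Promise, so results only become available inside a `.then()` callback — which fits a legacy UDF's `emit()` style, but not a function that must return its value synchronously.

```javascript
// The smallest valid wasm module: just the magic number and version.
const bytes = new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0]);

// WebAssembly.instantiate hands back a Promise — this is why the UDFs
// above emit their rows from inside .then() instead of returning them.
const pending = WebAssembly.instantiate(bytes);
console.log(pending instanceof Promise); // true

pending.then(result => {
  // Only here, asynchronously, do we get an instance to call into.
  console.log(result.instance instanceof WebAssembly.Instance); // true
});
```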

Next steps

Will Francesc be able to run source{d}’s C libraries inside BigQuery? Stay tuned!