How to speed up your BigQuery query 31x by replacing a self-join with two UNNEST() operations

One of my trend calculation queries in the Opinion Analysis project started causing trouble recently. It would run for 270 seconds and break with an error “Query exceeded resource limits for tier 1. Tier 8 or higher required.” Changing the billing tier to the recommended Tier 8 did help, but the query still took 380 seconds to complete.
After some research I found out that the self-joins I use in part of the query were the culprit. Take a look at the join condition at the bottom of this query (in the CalcStatCombiTopics temp table):

INSERT INTO opinions.stattopic (...)
WITH
p AS (
SELECT 20170630 AS SnapshotDateId
),
CalcStatSentiments AS (
SELECT p.SnapshotDateId, t.Tag, ... s.SentimentHash,...
FROM opinions.document d, p
INNER JOIN opinions.sentiment s ON s.DocumentHash = d.DocumentHash, UNNEST(s.Tags) AS t
INNER JOIN opinions.webresource wrOrig ON wrOrig.DocumentHash = d.DocumentHash
INNER JOIN opinions.webresource wrRepost ON wrRepost.DocumentCollectionId = d.DocumentCollectionId
AND wrRepost.CollectionItemId = d.CollectionItemId
WHERE
d.PublicationDateId = p.SnapshotDateId AND s.SentimentTotalScore > 0
),
CalcStatTopics AS (

),
CalcStatCombiTopics AS (
SELECT
css1.SnapshotDateId, CONCAT(css1.Tag,’ & ‘,css2.Tag) AS Topic, [css1.Tag,css2.Tag] AS Tags, true AS GoodAsTopic, 2 AS TagCount,
...
FROM
CalcStatSentiments css1, CalcStatSentiments css2
WHERE
css1.SentimentHash = css2.SentimentHash AND
css1.Tag < css2.Tag
GROUP BY css1.SnapshotDateId, css1.Tag, css2.Tag
),

Basically, I have a table “opinions.sentiment” with an identity column SentimentHash, a repeated field Tags, and a bunch of other columns. The Tags column contains an array of textual tags that I extract from text using the Opinion Analysis Cloud Dataflow IndexerPipeline. In the pre-BigQuery days I would have used a separate table to store the tags and link it to the main Sentiment table via the identity column SentimentHash. However, in BigQuery this is much more easily done with repeated fields.

When I calculate trends, I build frequency statistics for combinations of tags (e.g. how many news articles were about both “climate change” and “G-20”). To do so, I flatten the Tags field and create a temporary table CalcStatSentiments that contains individual records for each tag, with fields like SentimentHash as well as the actual tag. I then do a self-join of the CalcStatSentiments table to build what I call “topics” (in the CalcStatCombiTopics temp table).

It turns out that self-joins are bad for (your health) the performance of your queries, as illustrated in this blog post. It suggests replacing the self-joins with Windowing. I considered doing that, but I actually needed to end up with permutations of tags instead of aggregate statistics, for which Windowing would work great, so I came up with another technique.

Announcing Sergei’s Self-Join Elimination Technique ©

Instead of flattening my Tags field only once (in the CalcStatSentiments part of my query), I now flatten it the first time, carry the Tags array in the result set, and then flatten it the second time to emulate the CROSS JOIN operation.

Here is how it looks in the new query:

WITH 
p AS (
SELECT 20170630 AS SnapshotDateId
),
SentimentTags AS (
SELECT p.SnapshotDateId, s.SentimentHash, t.Tag, t.GoodAsTopic, s.Tags AS Tags
FROM p, opinions.sentiment s, UNNEST(s.Tags) AS t
WHERE
s.DocumentDateId = p.SnapshotDateId AND s.SentimentTotalScore > 0
),
SentimentTagCombos AS (
SELECT st.SnapshotDateId, st.SentimentHash, st.Tag AS Tag1, stt.Tag AS Tag2
FROM SentimentTags st, UNNEST(st.Tags) stt
WHERE st.Tag < stt.Tag
),
...

The inequality filter `WHERE st.Tag < stt.Tag` ensures that I do not get duplicates in my tag combos. It works the same way as the inequality filter in the original version of my query.

WHERE
css1.SentimentHash = css2.SentimentHash AND
css1.Tag < css2.Tag

Once I calculated my Tag1 & Tag2 combinations, I join my result set via the SentimentHash record ID to my main dataset and conclude all the calculations.

The result: instead of a query that takes 380 seconds to complete in billing tier 8, my modified query runs in 12 seconds in billing tier 1.

Conclusion:

Using self-joins = BAD. 
Replacing them with dual UNNEST() = Priceless! (or something like that)

Here are the queries if you want to compare the syntax.

New, performance-optimized query:

Original Query:

Show your support

Clapping shows how much you appreciated Sergei Sokolenko’s story.