Bigquery: 使用 合併 Join和 Array重構程式碼範例

許博淳
數據共筆
Published in
8 min readJun 19, 2022

這邊先講一下我重構的邏輯

  • 執行結果要相同,絕對要,這點沒有妥協餘地
  • 程式碼要更好懂,但不一定要更快,人讀得懂才回好維護。
  • 不一定要一次重構到最好,而是每次修改都比前一次更好,但相同的邏輯我會建議一次改好。

這次的重構是工作上實際用到的程式碼,先來個懶人包講重構前後差異。

1836行 v.s 543行

重複邏輯合併

陣列取代重複程式碼

重複邏輯合併

  ...
_2_up_topic AS (
SELECT
A.content_id AS content_id,
A.content_pretty_name AS content_pretty_name,
A.content_kind AS content_kind,
A.content_live AS content_live,
A.topic_key_name AS _1_up_topic_key_name,
A.topic_title AS _1_up_topic_title,
A.topic_live AS _1_up_topic_live,
A.topic_id AS _1_up_topic_id,
B.topic_id AS _2_up_topic_id,
B.topic_title AS _2_up_topic_title,
B.topic_key_name AS _2_up_topic_key_name,
B.topic_live AS _2_up_topic_live,
A.version_id AS version_id
FROM
topic_content_add_content_info AS A
LEFT JOIN
topic AS B
. . .
),
-- 第3層
_3_up_topic AS (
SELECT
A.content_id AS content_id,
A.content_pretty_name AS content_pretty_name,
A.content_kind AS content_kind,
A.content_live AS content_live,
A._1_up_topic_id AS _1_up_topic_id,
A._1_up_topic_key_name AS _1_up_topic_key_name,
A._1_up_topic_title AS _1_up_topic_title,
A._1_up_topic_live AS _1_up_topic_live,
A._2_up_topic_id AS _2_up_topic_id,
A._2_up_topic_title AS _2_up_topic_title,
A._2_up_topic_key_name AS _2_up_topic_key_name,
A._2_up_topic_live AS _2_up_topic_live,
B.topic_id AS _3_up_topic_id,
B.topic_title AS _3_up_topic_title,
B.topic_key_name AS _3_up_topic_key_name,
B.topic_live AS _3_up_topic_live,
A.version_id AS version_id
FROM
_2_up_topic AS A
LEFT JOIN
topic AS B
ON
. . .
),
...

從上述程式碼我們可以發現,第二層是 topic_content_add_content_info join topic,第三層又拿第二層結果再 join topic。

原始的程式碼之中,這一段長達11層,也就是要反覆做類似工作 10次。

既然每一次都是 Left Join,那我就一次全部 Left Join就好啦!

_2_to_10_up_topic AS (
SELECT
. . .
FROM
topic_content_add_content_info AS A
LEFT JOIN
topic AS B
ON
...
LEFT JOIN
topic AS C
ON
...

陣列取代重複程式碼

layer_cnts AS (
SELECT
*,
CASE
WHEN
_2_up_topic_key_name IS NULL
THEN
1
WHEN
_3_up_topic_key_name IS NULL
THEN
2
...
ELSE
11
END
AS layer_cnt
FROM
_11_up_topic
),
-- 請忽略「知識樹」的那一層
add_level_name AS (
SELECT
...
CASE
WHEN
layer_cnt = 1
THEN
_1_up_topic_title
WHEN
layer_cnt = 2
...
THEN
_11_up_topic_title
END
AS level0_name,
CASE
WHEN
layer_cnt = 1
THEN
_1_up_topic_id
WHEN
layer_cnt = 2
THEN
_2_up_topic_id
...
THEN
_11_up_topic_id
END
AS level0_id,
CASE
WHEN
layer_cnt = 1
THEN
_1_up_topic_live
WHEN
layer_cnt = 2
THEN
_2_up_topic_live
...
THEN
_11_up_topic_live
END
AS level0_live,
...
FROM
layer_cnts

均一目前有10 層,可想而知這樣的判斷式就會非常的長,同時邏輯難以理解。

這邊我理解邏輯後,主要是做幾件事

  1. 要知道有幾層

2. 如果有 A層,那我在 B層的欄位應該要填入什麼值

如果寫過其他語言的小夥伴應該看得出來這可以用陣列解開。

layer_array_of_each_content AS (
SELECT
[
COUNTIF(_1_up_topic_key_name IS NOT NULL) > 0,
COUNTIF(_2_up_topic_key_name IS NOT NULL) > 0,
...
COUNTIF(_11_up_topic_key_name IS NOT NULL) > 0
] AS layer_array
FROM
_2_to_10_up_topic
...
),
topic_array_of_each_content AS (
SELECT
[
IFNULL(_1_up_topic_title, "NULL"),
IFNULL(_2_up_topic_title, "NULL"),
...
IFNULL(_11_up_topic_title, "NULL")
] AS title_array,
[
IFNULL(_1_up_topic_id, "NULL"),
IFNULL(_2_up_topic_id, "NULL"),
...
IFNULL(_11_up_topic_id, "NULL")
] AS id_array,
[
IFNULL(_1_up_topic_live, FALSE),
IFNULL(_2_up_topic_live, FALSE),
...
IFNULL(_11_up_topic_live, FALSE)
] AS live_array,
FROM
_2_to_10_up_topic
),
layer_count_of_each_content AS (
...
),
add_level_name AS (
SELECT
...
IF(A.layer_count > 0, B.title_array[OFFSET(A.layer_count - 1)], NULL) AS level0_name,
IF(A.layer_count > 0, B.id_array[OFFSET(A.layer_count - 1)], NULL) AS level0_id,
IF(A.layer_count > 0, B.live_array[OFFSET(A.layer_count - 1)], NULL) AS level0_live,
IF(A.layer_count > 1, B.title_array[OFFSET(A.layer_count - 2)], NULL) AS level1_name,
IF(A.layer_count > 1, B.id_array[OFFSET(A.layer_count - 2)], NULL) AS level1_id,
IF(A.layer_count > 1, B.live_array[OFFSET(A.layer_count - 2)], NULL) AS level1_live,
...
FROM
...

如果更聰明點的你,應該會覺得這邊還是不夠簡潔。

沒錯,這邊可以用迴圈解開,但我還沒有在 Big query中實作迴圈的經驗,這邊可能要到下次重構才可以實作,別忘了,我們不一定要一次做到一百分,而是比之前更好。

另外補充,如果涉及欄位名稱操作,可能很難用純 SQL完成,但 Python可以協助。

--

--