會員筆記: 補充文件

忘了可以直接問ChatGPT

施昕揚
3 min readJul 26, 2024

註1: 數據分組

如果不太確定目標欄位的分布,可以根據欄位的最小值(min)跟最大值(max)來做組別的上下界,再去分適當的組距來判斷欄位的分布為何。底下是datacamp提供的範例:

-- Bins created
WITH bins AS (
SELECT generate_series(2200, 3050, 50) AS lower,
generate_series(2250, 3100, 50) AS upper),
-- Subset stackoverflow to just tag dropbox (Step 1)
dropbox AS (
SELECT question_count
FROM stackoverflow
WHERE tag='dropbox')
-- Select columns for result
-- What column are you counting to summarize?
SELECT lower, upper, count(question_count)
FROM bins -- Created above
-- Join to dropbox (created above), keeping all rows from the bins table in the join
LEFT JOIN dropbox
-- Compare question_count to lower and upper
ON question_count >= lower
AND question_count < upper
-- Group by lower and upper to count values in each bin
GROUP BY lower, upper
-- Order by lower to put bins in order
ORDER BY lower;

百分位數的話可以用APPROX_QUANTILES 或 PERCENTILE_CONT。當你需要確保查詢的百分位數是資料集中的一個實際值時,PERCENTILE_DISC 是另一個的選擇。底下是範例:

SELECT
APPROX_QUANTILES(`訂單間隔天數`, 2)[OFFSET(1)] AS `中位數`, --取近似值,較高效
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY `訂單間隔天數`) AS `中位數`, -- 精確值
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY `訂單間隔天數`) AS `中位數` -- 確保是資料集中的一個實際值
FROM order_intervals;

-- Q1第一分位數(25%)
SELECT
APPROX_QUANTILES(`訂單間隔天數`, 4)[OFFSET(1)] AS `Q1`,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY `訂單間隔天數`) AS `Q1`,
PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY `訂單間隔天數`) AS `Q1`
FROM order_intervals;

-- Q3第三分位數(75%)
SELECT
APPROX_QUANTILES(`訂單間隔天數`, 4)[OFFSET(3)] AS `Q3`,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY `訂單間隔天數`) AS `Q3,
PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY `訂單間隔天數`) AS `Q3`
FROM order_intervals;

--

--