Hive SQL 基礎語法學習筆記

Bmo Yang
Life In data
Published in
3 min readJan 14, 2019

前言

因為自身工作與專案需求,使用到Hadoop倉儲系統,故本次要來介紹有關Hive SQL的基礎語法。
其中,Hive SQL與SQL的語法大同小異,基本上是相通的,學過SQL的使用者可以無痛使用Hive SQL。
因此,本次的學習筆記僅會介紹常用或需要注意的語法,其他的Hive SQL特點,如partition會另外介紹。

學習重點

  • 了解Hadoop與Hive SQL的基本觀念
  • 學習Hive SQL基礎語法,能夠自行查詢資料、建立資料表、篩選欄位等

Hadoop與Hive SQL簡介

可以發現有關數據分析都和動物類有關…
  1. Apache Hive 是適用於 Apache Hadoop 的資料倉儲系統。
  2. Hive 可執行資料摘要、查詢以及資料分析。
  3. Hive 查詢是以 HiveQL 撰寫而成,這是類似 SQL 的查詢語言。
  4. Hive 可讓您將結構投影在大量非結構化資料上。
  5. 定義結構後,您不需具備 Jave 或 MapReduce 相關知識,即可使用 HiveQL來查詢資料。

Hive SQL基礎語法

檢視資料庫

檢視資料庫有哪些資料表

SHOW DATABASES;

檢視資料庫的位置與底下的資料表

DESCRIBE DATABASE financials

檢視資料庫底下的資料表更多額外的資訊

DESCRIBE DATABASE EXTENDED financials;

檢視資料表

透過DESCRIBE EXTENDED來檢視資料表資訊

DESCRIBE EXTENDED mydb.employees;

DESCRIBE FORMATTED mydb.employees;

Creating Tables

The CREATE TABLE statement follows SQL conventions.
在Hive SQL中,建立資料表的語法和SQL語法一樣。

CREATE TABLE IF NOT EXISTS mydb.employees (…)

  • 建立一個資料表名為mydb.employees。
  • IF NOT EXISTS 是一個可忽略語法,若加入此語法,Hive會辨識資料庫是否有此資料表,若有則不會建立此表,若無則會建立表格。

Select Table

Select

Hive 中的Select的語法跟隨SQL Select語法。
explode:Return 0 to many rows, one row for each element from the input array.

SELECT explode(subordinates) AS sub FROM employees;

Mary Smith
Todd Jones
Bill King

Aggregate functions

SELECT count(*), avg(salary) FROM employees;

Use SET hive.map.aggr=true improve the performance of aggregation.

SET hive.map.aggr=true;
SELECT count(*), avg(salary) FROM employees;

JSON

在Hive中,支援JSON(Java- Script Object Notation)格式。

SELECT name, deductions FROM employees;

John Doe    {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1} 
Mary Smith {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones {"Federal Taxes":0.15,"StateTaxes":0.03,"Insurance":0.1}
Bill King {"Federal Taxes":0.15,"StateTaxes":0.03,"Insurance":0.1}

可以直接利用key value篩選資料。

SELECT name, deductions[“State Taxes”] FROM employees;

John Doe 0.05
Mary Smith 0.05
Todd Jones 0.03
Bill King 0.03

WHERE Clauses

WHERE 語法跟隨SQL語法,寫法無太大差異,僅列出需要注意的地方。

SELECT e.*
FROM
(
SELECT name, salary, deductions[“Federal Taxes”] as ded,
salary * (1 — deductions[“Federal Taxes”]) as salary_minus_fed_taxes
FROM employees
) e
WHERE round(e.salary_minus_fed_taxes) > 70000;

  • 若在SELECT時,有新增欄位名稱的話,必須把WHERE條件寫在外層,並給定是哪一張資料表的欄位,如此WHERE才能明確辨識需要篩選的條件。

GROUP BY Clauses

GROUP BY 語法跟隨SQL語法,寫法無太大差異。

SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange = ‘NASDAQ’ AND symbol = ‘AAPL’
GROUP BY year(ymd);

HAVING Clauses

HAVING clause 可以在GROUP BY 資料時進行子查詢(subquery),使語法更簡易。

SELECT year(ymd), avg(price_close)
FROM stocks
WHERE exchange = ‘NASDAQ’ AND symbol = ‘AAPL’
GROUP BY year(ymd)
HAVING avg(price_close) > 50.0;

若不使用HAVING的話,則需要利用nested SELECT statement與WHERE來完成。

SELECT s2.year, s2.avg
FROM
(
SELECT year(ymd) AS year,
avg(price_close) AS avg
FROM stocks
WHERE exchange = ‘NASDAQ’ AND symbol = ‘AAPL’
GROUP BY year(ymd)
) s2
WHERE s2.avg > 50.0;

ORDER BY and SORT BY

  • ORDER BY 語法跟隨SQL語法,無太大差異。
  • SORT BY可以參考以下英文敘述:
    SORT BY,that orders the data only within each reducer,thereby performing a local ordering,where each reducer’s output will be sorted.
    Better performance is traded for total ordering.

ORDER BY

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
ORDER BY s.ymd ASC, s.symbol DESC;

SORT BY

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
SORT BY s.ymd ASC, s.symbol DESC;

  • The two queries look almost identical, but if more than one reducer is invoked, the output will be sorted differently.
  • While each reducer’s output files will be sorted, the data will probably overlap with the output of other reducers.
  • Because ORDER BY can result in excessively long run times, Hive will require a LIMIT clause with ORDER BY if the property hive.mapred.mode is set to strict. By default, it is set to nonstrict.
  • 雖然在寫法上一樣,只是換個語法,但是當reducer(要依據排序的欄位)增加時,SORT BY的運算效能會比ORDER BY好。

DISTRIBUTE BY with SORT BY

  • DISTRIBUTE BY control show map output is divided among reducers.
  • All data that flows through a MapReduce job is organized into key-value pairs.
  • Hive must use this feature internally when it converts your queries to MapReduce jobs.
  • DISTRIBUTE BY 與 GROUP BY類似,但效能較GROUP BY佳。
  • DISTRIBUTE BY必須在SORT BY之前。

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC;

CLUSTER BY

  • 在DISTRIBUTE BY with SORT BY的例子中,s.symbol是DISTRIBUTE BY的條件,s.symbol ASC, s.ymd ASC是SORT BY的條件。
  • 假如今天DISTRIBUTE BY與SORT BY的條件欄位相同的話,可以使用CLUSTER BY。

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
CLUSTER BY s.symbol;

學習心得

如果熟悉基本的SQL語法,基本上在使用HIVE SQL時,可以很迅速上手。
而HIVE SQL也有幾個特別的語法指令在處理上相較SQL更有效率,多學幾個語法更能夠讓資料處理的流程更順暢。

--

--