MIT 6.006 Introduction to Algorithms 8

8. Hashing with Chaining

Kuan-Yu Chen

8 min readFeb 27, 2017

概要

從上一堂課程 7 ，你可以知道 Search Cost Time：

Comparison Model ( AVL Tree )= N Log N
Radom Access Model ( Assume Item is Integer ) = O ( N )

看似已經很快速了，但這裡將介紹 Hashing 的方式，他能達到 Constant Time Search！

結論

藉由 Hashing 的幫忙，當 Table Size m = O( n ) 時，可以達到：

Insert / Delete / Search Time Cost = O ( 1 )

Hashing的應用

先來講講Hashing吧，他其實應用很多：

你會發現 Hashing 是一個最常使用且重要的 Data Structure，就連平時使用的 Dictionary 也是利用 Hashing 來完成的。

我們將利用介紹 Dictionary 是如何做到的，來告訴你 Hashing 的價值在哪，以及為何能快速做到 Search。

如何實作 Dictionary？

Dictionary 的實作，基本建築在 Direct-access Table 的模型上，利用克服一些難處來達到 Dictionary的效果。

Definition of Direct-access Table

Store items in array indexed by key
Search Time Cost = O(1)

Direct-access Table 其實就是一個簡單版的 Dictionary ，但是 Key 只能是數字。利用 RAM 快速存取修改 Array 的特性，可以快速搜尋 / 修改你要的物件

舉個例子：你想要查找一篇文章，想知道文章出現幾個數字 0～ 9
這時你可以使用Counting Sort ，計算數字 0～ 9 出現的次數，並且記錄在 array [0] ~ array[9] 上，遍歷一次文章，依次數增加 array 上的統計數據，得到最後統計結果。
但…
如果我要你統計的是英文 A ~ Z 呢？ (Keys May Not Be non-Negative Integer)
如果我要你統計的是英文單字出現的次數呢？你必須建立一個涵蓋所有英文單字的 Array ？ (Gigantic Array Size)

從以上的例子，你不難看出 Direct-access Table 的缺點：

Keys May Not Be non-Negative Integer
Gigantic Memory Size

接下來我們將利用 Hashing 的方式來解決！

解決『Keys May Not Be non-Negative Integer 』的問題

有時候我會想要儲存的 key 是一個英文單字，這時候該怎麼辦……..那我就把它轉成數字就好啊！

preHashStringToInteger("apple") = 10110001111010010

利用上面這樣的 method，我們將單字轉成 Integer，這樣就可以放到 Direct-access Table 裡面了！

這樣的 method 我們稱為 Pre Hashing

Pre Hashing

Map Keys to Non-negative Integers

In Theory, Keys Are Finite and Discrete

解決『 Gigantic Memory Size 』的問題

如果今天要計算英文單字出現的次數，你需要有一個巨大的 Array 來統計所有的單字，這樣該如何解決…..Hashing！

將這個巨大的 Key Space 再一次的 Hashing，變成比較小的 Key Space

hashToSmallKeySpace(10110001111010010) = 101

Hashing

Reduce Universe of All Keys ( Integers ) Down to Reasonable Size m for Table
In Ideal : m ~= Number of Keys In Dictionary

利用再一次的 Hashing 解決空間過大的問題，同時你也會發現一個問題：有可能不同的 Key，卻 Hashing 出同一個新值

hashToSmallKeySpace(10110001111010010) = 101
hashToSmallKeySpace(00101010000100001) = 101

這樣的情況我們稱之 Collision

所以如果在壓縮 Hashing 的過程中，發現 Table 裡面已經存在一個物件了（Collision），這時候該怎麼辦…….？

那就把它放在一起吧！( Linked List )

把 Table 中的 Slot 都用 Linked List 的方式儲存，這樣就不用擔心 Collision 發生，沒位置放新的物件，這樣的手法稱為 Chaining

Chaining

Linked List of Colliding Elements in Each Slots of Hash Table

但在 Worst Case 的情況（你有一個很爛的 Hashing Method），每次 Hash 出來都到同一個 Slot，使得這個 Slot 的 Linked List 長度為 N。這時你就需要尋著 Linked List 一個一個去查找，Search 速度會變慢…

Worst Case: Search Time Cost = O( |Chaining Linked List| ) = O( N )

從這裡可以看出，一個好的 Hashing method 可以讓你有 Search Time Cost = O(1) 的效果（ Linked List 長度 = 1）

Chaining絕對不是我們想要的，是逼不得已的

In Worst Cast, Search Time Cost = O ( N )
但這其實不太可能發生，因為 Hashing 有 Randomization的特性

Best Case = O( 1 )

當 Chaining 長度少的時候，Search Time Cost 可以達到 O( 1 )的速度。

如何證明？

這邊引用了一個錯誤的假設（雖然錯誤，但是方便解釋）：

Simple Uniform Hashing

Each key is equally to be hashed into any slot of table independent of where other keys hashing

由於這條平整性的假設，可以得知，一個 n 個 keys，m 個 slot的 Table：

Expected Length of Chaining = n / m

當 m = O( n ) 的時候（m 的大小接近 n 時）：

Length of Chaining = O( 1 )

Search Time Cost = O( |Chaining Linked List| ) = O( n / m ) = O( 1 )

Expected Length of Chaining也稱之為 Load Factor
Load Factor = n / m

Hashing Method

一個好的 Hashing method，不容易有 Collision 發生，也意味著更好的搜尋效率（不容易 Chaining），影片中有提到了以下三種 hashing 的方式：

Division Method

hash(k) = k mod m

直接將 k 這個巨大的數字，模除 m （Number of Keys In Dictionary）後，作為新的 key

什麼是模除？請參考：

模除 - 维基百科，自由的百科全书

Edit description

zh.wikipedia.org

Multiplication Method

hash(k) = [ (a * k) mod (2 ^ w) ] shift (w - r)

這種方式，是把自己乘上一個 a，然後在 mod，但只取混合最雜亂的部分，請參考下圖右邊：

你會發現乘上 a 後代表你用自己去拼湊出新的物件，然後再取出疊加次數最多的 r 部分（最混亂的部分）

Universal Hashing

利用 a, b 兩個 random 的參數，來hashing，以這樣 randomize 的方式可以達到很好的 Hash 效果，有點複雜，詳細地描述請看 6.046 的影片囉

實作 Dictionary的過程中，我們知道 Table 的大小需要 m > n，但…如果是動態的情況呢？例如：我一直 insert 新值，Table 大小不夠用了，何時該把 Table 變大，何時要變小？

這在下堂課會提到：

MIT 6.006 Introduction to Algorithms 9

9. Table Doubling, Karp-Rabin

medium.com