Course 2 — Data structure — Part 3: Hash

humaneee
10 min readSep 2, 2017

--

We use hash(dictionary) in our daily work. It’s obviously the most important data structure beside basic data structures array, list(linked-list) and tree. How can we build a hash from these basic data structures?

Let’s start with an example Phone Book. We need a hash to save the mapping from phone number to name and name to phone number:

name_to_number = { 'john' : '0930-345-234', 'alice' : '0934-234-546', ... }
number_to_name = { '0930-345-234' : 'john', '0934-234-546' : 'alice', ... }

For the large Phone book, we need to build a hash with operations add, remove, find as fast as possible(O(1)). We can use Array or List to build “number_to_name” hash.

Using Array — Direct Addressing

To implement a hash using an array, firstly we convert phone number to number:

'0930-345-234' -> 0930345234

Our job is much much simple, we need to allocate big array. The phone number will be the index, the value is name.

number_name = [....................., john, ........, alice, ......]

As you can see, operation add, remove costs O(1). But the operation search costs O(n) and we have allocate a huge space . It’s not efficient in practice.

Using List

Unlike using array, using list will help us to save space. We just new data to list but the operation search still costs O(n) to perform.

So, what is better implementation for hash?

The answer will be easier than you expect, we still use Array and List together. Instead of allocating huge space and use phone number as index, we will transform phone number to smaller number — called hashing and we just need to allocate a small space.

For example, we have 10 phone numbers: A1, A2, …., A10. We will transform it to smaller number in : B = [1, 2, 3, 4, 5].

A1, A3 -> 2
A2, A4 -> 1
A5 -> 5
A6, A10 -> 4
A7, A8, A9 -> 3

If we can do that, we just need allocating array B with 5 elements, each element in B is a List the contain Ai, Aj has the same hashing value.

Now, we need to find out a good hash, a good hash h is:

  • h should be fast to compute.
  • Less collisions. When h(A1) = h(A2) but A1 != A2, this is a collision.

If your hashing function has many collisions, the worst case will be like bad example.

Hash integer

We want to transform phone number into smaller set — 1000 elements.

  • Take phone numbers up to length 7, for example 148–25–67.
  • Convert phone numbers to integers from 0 to 10⁷− 1 = 9 999 999: 148–25–67 → 1 482 567
  • Choose prime number bigger than 10⁷ , e.g. p = 10 000 019
  • Choose hash table size, e.g. m = 1 000

We will make phone number bigger than prime p, and modulo to m, the result will be a set of hashing values that smaller than m.

We just choose a, b number big enough to be multipliers. So ax +b will be larger than p. We modulo p and modulo m to get the result.

Select a = 34, b = 2, so h = h(34,2,p) and consider x = 1 482 567 corresponding to phone number 148–25–67. p = 10 000 019.h(x) = ((34 × 1482567 + 2) mod 10000019) mode 1000 = 185

Hash string

In case we want to compute hash value from name to phone number. We just need to convert name to number by using ASCII or Unicode value…

We use all characters in string S to compute hash value.

def PolyHash(S, p, x) 
hash = 0
for i from |S| − 1 down to 0:
hash = (hash * x + S[i]) mod p
return hash

String search

Searching a pattern in a large text is used in widely applications. Let see how we can use hashing technique to make searching pattern faster, efficient in practice.

Given a text T (book, website, facebook profile) and a pattern P (word, phrase, sentence), find all occurrences of P in T.

The naive algorithm is going through all sub-strings of text and check is it equal to pattern.

def FindPatternNaive(T, P):
result = []
for i from 0 to |T| − |P|:
if AreEqual(T[i:(|P| − 1)], P):
result.Append(i)
return result

Checking AreEqual between 2 strings cost O(n), so FindPatternNaive will cost O(|T|.|P|).

The idea to improve it is we compare hashing value of sub-string and P first, if it has same hash value, we will call AreEqual function. If 2 strings have different hash value, these will be different. It called Rabin-Karp’s Algorithm.

def RabinKarp(T, P):
p = big prime,
x = random(1, p − 1)
result = []
pHash = PolyHash(P, p, x)
for i from 0 to |T| − |P|:
tHash = PolyHash(T[i:(|P|−1)], p, x)
if pHash != tHash:
continue
if AreEqual(T[i..i + |P| − 1], P):
result.Append(i)
return result

You see, we compute PolyHash for all sub-strings |T| — |P| times. With large prime p and x, it’s costly. To reduce that, we will precompute all hash values of all sub-strings. We will find the recurrence of hashes, then we save operation on compute new hash value.

As proof, we can compute H[i] from H[i + 1].

def PrecomputeHashes(T, |P|, p, x):
H = [] * (|T| − |P| + 1)
S = T[|T| − |P|..|T| − 1]
H[|T| − |P|] = PolyHash(S, p, x) #compute H[i+1] first
y = 1
for i from 1 to |P|:
y = (y × x) mod p # compute x^|P|, mode p to get smaller number
for i from |T| − |P| − 1 down to 0:
H[i] = (xH[i + 1] + T[i] − yT[i + |P|]) mod p
return H
def RabinKarp(T, P):
p = big prime
x = random(1, p − 1)
result = []
pHash = PolyHash(P, p, x)
H = PrecomputeHashes(T, |P|, p, x)
for i from 0 to |T| − |P|:
if pHash != H[i]:
continue
if AreEqual(T[i..i + |P| − 1], P):
result.Append(i)
return result

Problem 1: Phone book

In this problem you will implement a simple phone book manager.

Problem Description

Task. In this task your goal is to implement a simple phone book manager. It should be able to process the following types of user’s queries:

  • add number name. It means that the user adds a person with name name and phone number number to the phone book. If there exists a user with such number already, then your manager has to overwrite the corresponding name.
  • del number. It means that the manager should erase a person with number number from the phone book. If there is no such person, then it should just ignore the query.
  • find number. It means that the user looks for a person with phone number number. The manager should reply with the appropriate name, or with string “not found” (without quotes) if there is no such person in the book.

Input Format. There is a single integer 𝑁 in the first line — the number of queries. It’s followed by 𝑁 lines, each of them contains one query in the format described above.

Output Format. Print the result of each find query — the name corresponding to the phone number or “not found” (without quotes) if there is no person in the phone book with such phone number. Output one result per line in the same order as the find queries are given in the input.

Sample

Input:

12
add 911 police add 76213 Mom add 17239 Bob find 76213
find 910
find 911
del 910
del 911
find 911
find 76213
add 76213 daddy
find 76213

Output:

Mom
not found
police
not found
Mom
daddy

Explanation:

76213 is Mom’s number, 910 is not a number in the phone book, 911 is the number of police, but then it was deleted from the phone book, so the second search for 911 returned “not found”. Also, note that when the daddy was added with the same phone number 76213 as Mom’s phone number, the contact’s name was rewritten, and now search for 76213 returns “daddy” instead of “Mom”.

Use the direct addressing scheme.

Solution:

We will use an array — direct addressing table to solve this problem. Because will allocate an array with 10⁷ elements.

Problem 2: Hashing with chains

In this problem you will implement a hash table using the chaining scheme. Chaining is one of the most popular ways of implementing hash tables in practice. The hash table you will implement can be used to implement a phone book on your phone or to store the password table of your computer or web service (but don’t forget to store hashes of passwords instead of the passwords themselves, or you will get hacked!).

Problem Description

Task. In this task your goal is to implement a hash table with lists chaining. You are already given the number of buckets 𝑚 and the hash function. It is a polynomial hash function

  • add string — insert string into the table. If there is already such string in the hash table, then just ignore the query.
  • del string — remove string from the table. If there is no such string in the hash table, then just ignore the query.
  • find string — output “yes” or “no” (without quotes) depending on whether the table contains string or not.
  • check 𝑖 — output the content of the 𝑖-th list in the table. Use spaces to separate the elements of the list. If 𝑖-th list is empty, output a blank line.

When inserting a new string into a hash chain, you must insert it in the beginning of the chain.

Input Format. There is a single integer 𝑚 in the first line — the number of buckets you should have. The next line contains the number of queries 𝑁. It’s followed by 𝑁 lines, each of them contains one query in the format described above.

Output Format. Print the result of each of the find and check queries, one result per line, in the same order as these queries are given in the input.

Sample 1.

Input:

512add world 
add HellO
check 4
find World
find world
del world
check 4
del HellO
add luck
add GooD
check 2
del good

Output:

HellO world 
no
yes
HellO
GooD luck

Explanation:

The ASCII code of ’w’ is 119, for ’o’ it is 111, for ’r’ it is 114, for ’l’ it is 108, and for ’d’ it is 100. Thus, h(“world”) = (119+111×263+114×2632 +108×2633 +100×2634 mod 1 000 000 007) mod 5 = 4. It turns out that the hash value of 𝐻𝑒𝑙𝑙𝑂 is also 4. Recall that we always insert in the beginning of the chain, so after adding “world” and then “HellO” in the same chain index 4, rst goes “HellO” and then goes “world”. Of course, “World” is not found, and “world” is found, because the strings are case-sensitive, and the codes of ’W’ and ’w’ are different. After deleting “world”, only “HellO” is found in the chain 4. Similarly to “world” and “HellO”, after adding “luck” and “GooD” to the same chain 2, first goes “GooD” and then “luck”.

Follow the explanations about the chaining scheme from the lectures. Remember to always insert new strings in the beginning of the chain. Remember to output a blank line when check operation is called on an empty chain.

Some hints based on the problems encountered by learners:

  • Beware of integer over ow. Use long long type in C++ and long type in Java where appropriate. Take everything (mod 𝑝) as soon as possible while computing something (mod𝑝),so that the numbers are always between 0 and 𝑝 − 1.
  • Beware of taking negative numbers (mod 𝑝). In many programming languages, (−2)%5 != 3%5. Thus you can compute the same hash values for two strings, but when you compare them, they appear to be different. To avoid this issue, you can use such construct in the code: 𝑥 ← ((𝑎%𝑝) + 𝑝) % 𝑝 instead of just 𝑥 ← 𝑎 % 𝑝.

Solution:

Problem 3: Find pattern in text

In this problem, your goal is to implement the Rabin–Karp’s algorithm.

Problem Description

Task. In this problem your goal is to implement the Rabin–Karp’s algorithm for searching the given pattern in the given text.

Input Format. There are two strings in the input: the pattern 𝑃 and the text 𝑇 .

Output Format. Print all the positions of the occurrences of 𝑃 in 𝑇 in the ascending order. Use 0-based indexing of positions in the the text 𝑇.

Sample 1.

Input:

aba 
abacaba

Output:

0 4

Explanation:

The pattern 𝑎𝑏𝑎 can be found in positions 0 (abacaba) and 4 (abacaba) of the text 𝑎𝑏𝑎𝑐𝑎𝑏𝑎.

Solution:

Readings:

See the chapter 11 in [CLRS] — Hash

See the chapters 1.5.1 in [DPV08] — Hash

See the chapters 32.1 and 32.2 in [CLRS] — Rabin-Karp’s Algorithm.

References:

[CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

[DPV] Sanjoy Dasgupta, Christos Papadimitriou, and Umesh Vazirani. Algorithms (1st Edition). McGraw-Hill Higher Education. 2008.

--

--