Hash Tables

Sounds familiar? but we’re going to take a closer look.

A hash table is a data structure used to implement symbol table (associative array), a structure that can map keys to values. — Wikipedia.
When a hash table is created, it’s really internally a very array-based data structure, and can usually be created with an initial capacity. The hash table consists of an array of slots, or buckets which contain the values — Diving Into Data Structures.
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. — Wikipedia
A small phone book as a hash table — wikipedia

Hashing

The idea of hashing is fundamental to understand the hash table data structure. So, what’s hashing anyway?

Hashing is a way to take our data and run it through a function, which will return a small, simplified reference generated from that original data. — Diving Into Data Structures
Hashing is one way; you can’t convert the hashed data into it’s original version.

Thus, we can take a complex object, and hash it down to a single integer representation used as an index in the array of buckets. This is what we are going to explore next; hash function.

Hash Function

It takes the key and produce an integer (used as an array index). Computing hash function can be complicated for complicated types of data.

Time and Space Tradeoffs

  • The more space you have, the trivial the hash function it can be. For example, you can use keys itself as an index if they are for example integers, characters, …etc.
  • The more time you have, the trivial the hash function and the collision resolution it can be. For example, you can hash every key to the same integer(index), and do a sequential search.

In real world, we have limitation on time and space.

Collision resolution is handling two different keys that are hashed to the same integer (will be explained later).

Computing The Hash Function

Our goal is to take any key and hash it to an array index. We have three primary requirements in implementing a good hash function for a data type:

  1. Equal keys must produce the same hash value.
  2. Hashing in efficient amount of time.
  3. Each key has the same probability to be hashed into each array index. This is what’s called “Uniform Hashing Assumption” (will be explained later).

In the real world, we will develop a hash function for every key type. In Java, types like Strings, and Integers already have their own hash function implemented.

Java’s Hash Function

All Java classes inherits hashCode() method, which returns 32-bit integer.

Java requires that if x.equals(y), then their hashCode() should be equal too. That is, if a.equals(b) is true, then a.hashCode() must have the same numerical value as b.hashCode().

It’s highly desirable that if !x.equals(y), then their hashCode() is not equal, but, you can’t always guarantee that; collision may occur.

The default implementation returns memory address of the current object. Java has customized implementation for standard data types, such as:

— Integer

It returns the value of the integer itself.

public final class Integer {
private final int value;
// ...
    public int hashCode() { return value; }
}

— Boolean

It returns 1231 if true, and 1237 if false.

public final class Boolean {
private final boolean value;
// ...
    public int hashCode() { 
if (value) return 1231;
else return 1237;
}

}

— Double

It returns a 32-bit integer after alternating the binary representation of the double value (64-bit).

public final class Double {
private final double value;
// ...
    public int hashCode() { 
long bits = doubleToLongBits(value);
return (int) (bits ^ (bits >>> 32));
}

}

— String

Returns a hash code for this string. The hash code for a String object is computed as: s[0]*31^(n-1) + s[1]*31^(n-2) + … + s[n-1].

public final class String { 
private final char[] s;
private int hash = 0; // cache the hashed value
// ...

public int hashCode() {
int h = hash;
if (h != 0) return h;
// return the cached value
for (int i = 0; i < length(); i++)
h = s[i] + (31 * h);
hash = h;
// store the hashed value
return hash;
}

}
The hash value of the empty string is zero.

User-defined Hash Function

We can also implement a hash function for user’s defined data types.

We will use Transaction class as an example, which represents commercial transaction with a customer name, date, and amount.

public final class Transaction { 
private final String who;
private final Date when;
private final double amount;
// ...
    public int hashCode() { 
int hash = 17; // start with a small prime number.
hash = 31*hash + who.hashCode();
hash = 31*hash + when.hashCode();
hash = 31*hash + ((Double) amount).hashCode();
return hash;
}
}

When implementing the hashCode() method for user-defined data types, you need to follow these instructions:

  • Combine the fields using (31 * x) + y rule, where x is the current hash value of data type, and y is hash value of current filed.
  • Include every object’s field in hashing:
  • — If field is reference type, use hashCode().
  • — If field is primitive type, use hashCode() of wrapper types.
  • — If field is null, return 0.
  • — If field is an array, apply it to each entry, or, use Arrays.deepHashCode().

Modular Hashing

It’s converting a hashCode() to an array index between 0 and M-1, since hashCode() returns a value between -2³¹ and (2³¹-1).

M is the size of the array of hash table, it’s typically a prime or power of 2.

— 1st Try

Computing the hash value modulo M may result in a bug, as hashCode() may return a negative integer.

private int hash(Key key) { return key.hashCode() % M; }

— 2nd Try

Even with using Math.abs() it can return a negative integer if hashCode() returned (-2³¹), Why? because Math.abs() returns an integer value, and the absolute value of (-2³¹) is 2³¹, which is outside the range of 32-bit signed integers.

private int hash(Key key) { return Math.abs(key.hashCode()) % M; }

— 3rd Try

This is the correct way. Get a positive hash value by changing the most significant bit to zero (if not already), then modulo M.

private int hash(Key key) { 
return (key.hashCode() & 0x7fffffff) % M;
}
0x7ffffff equals to 2³¹ -1, which has binary representation of: 0111…111 (32 bits), and most significant bit used to indicate whether the number is +ve or -ve.

Uniform Hashing Assumption

To analyze our hashing algorithms and develop hypotheses about their performance, we make the following assumption.

Each key is equally likely(has the same probability) to be hashed to an integer between 0 and M-1. Therefore, keys will be distributed uniformly across the array entries.

Collisions

Collisions is having two distinct keys hashing to same index. Collisions in hash tables are unavoidable. Therefore, almost all hash table implementations have some collision resolution strategy to handle such event.

Collisions in Hash Tables — algs4.cs.princeton.edu

Separate Chaining & Linear Probing are the most common ways to resolve the collision problem. And, we will go through hash table implementation using each of these strategies.


Hash Tables — Using Separate Chaining

We are going to implement the symbol tables using hash tables with separate chaining method.

In Separate Chaining, we maintain for each array entry, a linked list of the key-value pairs, in a hash table of size M, where M< N; the number of key-value pairs (not always the case, for example, initially M has to be greater than N).

Hash Tables With Separate Chaining — algs4.cs.princeton.edu

When we search or insert, we first hash to find the list that could contain the key, then sequentially search through that list for the key, and if we want to insert, we insert to the beginning of the list.

The main operations of a hash table using separate chaining technique would be as usual; get, put, delete, contains & isEmpty.

Implementation

We’ll start by the hash table class. It has an array of linked-lists, each entry is a reference to the first node in the list. It also has an integer N represents the number of key-value pairs, and the hash table size M.

The Key and Value type is any generic type. It’s required for the key type to override the equals() and hashCode() methods.

public class SeparateChainingHashST<Key, Value> {
private static final int INIT_CAPACITY = 4;
    private int N;        // number of key-value pairs
private int M; // hash table size
private Node[] st; // array of linked-list

public SeparateChainingHashST() {
this(INIT_CAPACITY);
}
public SeparateChainingHashST(int capacity) {
this.N = 0;
this.M = capacity;
this.st = new Node[M];
}
}

Then, we have an inner class called “Node” to represent each node within the linked list. Each node has a key, a value, and a reference to the next node.

private static class Node {
private Object key;
private Object val;
private Node next;

public Node(Object key, Object val, Node next) {
this.key = key;
this.val = val;
this.next = next;
}
}
Java doesn’t allow generic array creation; we can’t declare array of Nodes, where each node has properties of generic types. So, use “Object” type instead of “Key” & “Value” generic types, and do casting when you trigger these properties.

— get

It needs to search only the i-th linked list.

// return value associated with key, null if no such key
public Value get(Key key) {
int i = hash(key);
for (Node x = st[i]; x != null; x = x.next)
if (key.equals(x.key))
return (Value) x.val; // do casting to generic type
return null;
}

— put

Insert at the front of the i-th linked list (if not already exists).

// insert key-value pair into the table
public void put(Key key, Value val) {
if (val == null) { // remove key from table if value is null
delete(key);
return;
}
    // double table size if average length of a list (N/M) >= 8
// note: we make sure average length is between 5 & 8
if (N >= 8*M) resize(2*M);
    int i = hash(key);
for (Node x = st[i]; x != null; x = x.next) {
// if key exists, overrides the old value (avoid duplicates)
if (key.equals(x.key)) {
x.val = val;
return;
}
}
// add a new node to the beginning of the list at index i
st[i] = new Node(key, val, st[i]);
N++;
}

— delete

Scan all the keys in the i-th linked list until we find a match, and if there is a match, delete the key and its value from the list. This is done recursively.

Deletion in a separate-chaining hash table — algs4.cs.princeton.edu
// removes the key and its value if the key exists
public void delete(Key key) {
int i = hash(key);
s[i] = delete(s[i], key);
    // halve table size if average length of list (N/M) <= 2
// note: M must be greater than init capacity (small)

if (M > INIT_CAPACITY && N <= 2*M) resize(M/2);
}
// delete key in linked list starting from Node x
private Node delete(Node x, Key key) {
if (x == null) return null;
if (key.equals(x.key)) {
N--;
return x.next;
}
x.next = delete(x.next, key);
return x;
}

— resize

Resize the hash table to given capacity, and rehashing all the keys. Rehashing all the keys is necessary to redistribute the keys over the linked lists.

Hence, both, resizing and rehashing ensure that the average number of keys (N/M) in a list is constant (see analysis below).

Resizing in a separate-chaining hash table — algs4.cs.princeton.edu
private void resize(int capacity) {
SeparateChainingHashST<Key, Value> temp = new SeparateChainingHashST<Key, Value>(capacity);
for (int i = 0; i < M; i++)
for (Node x = st[i]; x != null; x = x.next)
temp.put((Key) x.key, (Value) x.val);
this.M = temp.M;
this.N = temp.N;
this.st = temp.st;
}
When we rehash, the value of hashCode() does not change but hash() can change.

— contains

// Is the given key exists?
public boolean contains(Key key) { return get(key) != null; }

— Is the table empty?

// Is the table empty?
public boolean isEmpty() { return N == 0; }

Analysis (Under Uniform Hashing Assumption)

The number of keys in a list is ~= N/M (load factor). Therefore, the number of compares (equality tests) for search and insert is proportional to N/M.

If M too large, therefore too many empty indexes, and if too small, lists will be too long. Typically the load factor (N/M) should be ~= 5 (constant).

— Load Factor

  • As the load factor grows larger, the hash table becomes slower, and the expected constant time property of a hash table for a fixed size array is not achieved.
  • A low load factor is not especially beneficial. As the load factor approaches 0, this is not necessarily result in any reduction in search cost, and, the number of unused areas in the hash table increases. This results in wasted memory.

That’s why we use a resizeable array to make sure that M is always within a constant factor of N; meaning, N/M is always constant (~=5).

The bottom line is (Under uniform hashing assumption), It takes O(N) in worst case, and 3–5(constant time) in average, and there is no ordered iteration; keys aren’t in order unlike in BSTs.


Hash Tables — Using Linear Probing

This is another method used to resolve the collision problem.

In Linear Probing, we store N key-value pairs in a hash table of size M > N.

Hash Tables With Linear Probing — algs4.cs.princeton.edu

When we search or insert, we first hash to find the index that could contain the key, then there are three possible cases:

  • key equal to search key (search hit).
  • empty position (search miss).
  • key not equal to search key, so we try next entry.

And if we want to insert, we find the next empty slot, and put it there.

The main operations of a hash table using linear probing technique would be as usual; get, put, delete, contains & isEmpty.

Implementation

We’ll start by the hash table class, but, this time using linear probing.

It has two arrays, one for keys, and the other for values. It also has an integer N represents the number of key-value pairs, and the hash table size M.

The Key and Value type is any generic type. It’s required for the key type to override the equals() and hashCode() methods.

public class LinearProbingHashST<Key, Value> {
private static final int INIT_CAPACITY = 4;

private int N; // number of key-value pairs
private int M; // size of hash table
private Key[] keys;
private Value[] vals;

public LinearProbingHashST() {
this(INIT_CAPACITY);
}
public LinearProbingHashST(int capacity) {
this.N = 0;
this.M = capacity;
keys = (Key[]) new Object[M];
vals = (Value[]) new Object[M];
}
}
Array size M must always be greater than number of existing N key-value pairs in the array. So, we need to resize the array as it gets full.

— get

Search at index i, if not free but no match, try i+1, i+2, and so on. Whenever you find an empty index, this means the key doesn’t exist in the array.

// return value associated with key, null if no such key
public Value get(Key key) {
for (int i = hash(key); keys[i] != null; i = (i + 1) % M)
if (keys[i].equals(key))
return vals[i];
return null;
}
In linear probing, whenever you reach the end of the array, start from the beginning.

— put

Insert at index i if free; if not, try i+1, i+2 and so on (if not already exists).

// insert key-value pair into the table
public void put(Key key, Value val) {
if (val == null) { // remove key from table if value is null
delete(key);
return;
}
    // double table size if it's 50% full or more
if (N >= M/2) resize(2*M);
    int i;
for (i = hash(key); keys[i] != null; i = (i + 1) % m) {
// if key exists, overrides the old value (avoid duplicates)
if (keys[i].equals(key)) {
vals[i] = val;
return;
}
}
// add a new key-value pair at index i
keys[i] = key;
vals[i] = val;
N++;
}

— delete

The delete operation is a little bit tricky. You start by checking if key already exists or not. If exists, then find the index of that key in the array.

As you’ve noticed in put, if a key is to be inserted at index i, which is full, it will be inserted at the next empty position in the array. So, that’s why we need to loop starting from index i till we find the key (assuming it exists).

The last thing is, if a key is to be inserted at index i, but it was inserted at index i+2, since indexes i & i+1 were full. Now, whenever we remove the key at index i, we need to re-insert all key-value pairs following the key at index i.

This is because of the phenomena of clustering (see below); having a contiguous block of items, and even worse, when a block gets merged with another.

Deletion in a linear-probing hash table — algs4.cs.princeton.edu
// removes the key and its value if the key exists
public void delete(Key key) {
if (!contains(key)) return;

// find position i of key
int i = hash(key);
while (!key.equals(keys[i])) {
i = (i + 1) % M;
}

// delete key and associated value
keys[i] = null;
vals[i] = null;

// rehash all keys in same cluster (see definition below)
i = (i + 1) % M;
while (keys[i] != null) {
// delete keys[i] an vals[i] and reinsert
Key keyToRehash = keys[i];
Value valToRehash = vals[i];
keys[i] = null;
vals[i] = null;
N--;
put(keyToRehash, valToRehash);
i = (i + 1) % m;
}
N--;
    // halves size of array if it's 12.5% (1/8) full or less
if (N > 0 && N <= M/8) resize(M/2);
}

— resize

Resize the hash table to given capacity, and rehashing all the keys. Rehashing all the keys is necessary to redistribute the keys over the linked lists.

Hence, both, resizing and rehashing ensure that the average number of keys in the array is always ⅛ ≤ (N/M) ≤ ½ (see analysis below).

Resizing in a linear-probing hash table — algs4.cs.princeton.edu
private void resize(int capacity) {
LinearProbingHashST<Key, Value> temp = new LinearProbingHashST<Key, Value>(capacity);
for (int i = 0; i < M; i++)
if (keys[i] != null)
temp.put(keys[i], vals[i]);
this.keys = temp.keys;
this.vals = temp.vals;
this.M = temp.M;
}

— contains

// Is the given key exists?
public boolean contains(Key key) { return get(key) != null; }

— Is the table empty?

// Is the table empty?
public boolean isEmpty() { return N == 0; }

Cluster

It’s a contiguous block of items. It happens when the key finds it’s index is already full, then it search for the next empty array entry. It’s worse when cluster merge with another cluster. So, insertion operation will take more time.

Knuth was able to show if N = M/2 , then mean displacement is ~ 3/2, but, as the array gets full, it’s much higher.

The displacement is the number of steps to find a free index in array starting from index i = hash(key).

Analysis (Under Uniform Hashing Assumption)

The average number of calls to equals() & hashCode() methods for a hash table of size M, and contains N keys in search hit is about 3/2, and search miss is about 5/2 — Proved by Knuth.

If M too large, therefore to many spaces, and If M too small, search will take much time because of clusters. Typically load factor (N/M) should be ~= 1/2 (half full).

In fact, even with good hash functions, their performance dramatically degrades when the load factor grows beyond 0.7 or so— Wikipedia

So, if we can keep the array about half full, we get a constant time for search hit and search miss as just mentioned. That’s why we use a resizing array to make sure that the array size is always about half full.

The bottom line is (Under uniform hashing assumption), It takes O(N) in worst case, and 3–5(constant time) in average, and there is no ordered iteration.


Separate Chaining Vs Linear Probing

Separate Chaining Vs Linear Probing — algs4.cs.princeton.edu

Implementation

Separate chaining is easier to implement insert and delete.

Clustering

Clustering in separate chaining is less problem even with bad hash function.

Space

Linear probing has less wasted space; no linked lists, no pointers to next node as in separate chaining.

But, linear probing only saves memory if the entries are small and the load factor is not too small. If the load factor is close to zero (N/M ~= 0), linear probing is wasteful.

Caching

Linear probing has better cache performance. Traversing a linked list has poor cache performance, making the processor cache ineffective.

The Load Factor (N/M)

In separate chaining as load factor increase, lists will be too long, and as it decrease, too many empty indexes. So, the best value for load factor is to be constant (typically ~= 5).

In linear probing as load factor increase, it gets very slow because clusters will cause you to search through many array entries. And as it decrease, it’s faster than separate chained because you don’t have to follow pointers between list nodes.

Separate Chaining & Linear Probing Variants

Two-probe Hashing

It’s a separate chaining variant, where we hash the key into two positions, and choose insert into the position with list of shorter length.

It reduces the length of longest list to Log(LogN) in worst case.

Double Hashing

It’s a linear probing variant, where i is incremented by skip when inserting, instead of 1. The value of skip is determined by another hash function.

It effectively eliminates clustering, but, more difficult to implement delete.

Cuckoo hashing

It’s a linear probing variant, where we hash the key into two positions, and insert into either position; if not free, try the alternative position, if still not free, try i+1 of the first position, and so on. It gives constant worst case time for search.

Hash Tables Vs BSTs

Hash Tables

  • Simpler to code
  • The keys aren’t ordered
  • Fast for simple keys
  • Better system support. For example, In Java, no need to compute the hash for standard data types.

BSTs

  • Stronger performance guarantee
  • Support for ordered operations
  • Easier to implement compareTo() correctly than equals() and hashCode().

Java Includes …

  • Red-Black BSTs: TreeMap, TreeSet.
     — TreeMap is more space-efficient.
     — TreeMap only works with Comparable objects.
  • Hash Tables: HashMap, IdentityHashMap.
     — HashMap will be more efficient in general, so use it whenever you don’t care about the order of the keys.
     — HashMap is more time-efficient. 
     — HashMap only works with objects with a suitable hashCode() implementation.
     — In HashMap, when there is a collision, a single bucket stores a list of entries.

Summary: Symbol Table Implementations

Symbol Table Implementations Analysis — algs4.cs.princeton.edu