High-Performance Python Part 2: Sets and Dictionaries

Ninad Shukla
4 min read · Sep 21, 2022


Sets and dictionaries are used in Python when we do not need to maintain any ordering in our data (note that since CPython 3.7 dictionaries do preserve insertion order, but sets do not). In both data structures, a reference value called a 'key' references the data. Sets and dictionaries are almost the same, except that sets hold only unique keys and no values.

How do they work?

Both of these data structures use hash tables to achieve O(1) average-case searching and insertion. A hash table uses a function, called a hash function, that converts an arbitrary key into an index into a list.
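The core idea can be sketched in a few lines: hash the key, then mask the hash down to a valid index in a fixed-size table (8 slots here; the key `"hello"` is just an arbitrary example).

```python
# Hash a key and mask the result down to a valid slot index.
table_size = 8          # a power of two, so size - 1 is an all-ones mask
mask = table_size - 1   # 0b111

key = "hello"
index = hash(key) & mask   # an integer in the range 0..7
print(index)
```

String hashes are randomized per process in modern CPython, so the printed index will vary from run to run, but it always lands in a valid slot.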

Inserting

To create a hash table we first allocate a block of memory. To place an incoming key-value pair into one of these allocated buckets of memory, the key is first hashed and then masked. Masking makes sure that the hash, which can be any integer, fits within the allocated memory. Suppose we have 8 blocks of memory and our hash value is 28975; then we use the index 28975 & 0b111 = 7. If we instead had 512 blocks, the mask would become 0b111111111. The next step is to check whether that bucket is empty or in use. If it is empty, the key-value pair is inserted. If it is occupied, the stored key is compared against the key we are inserting (using __eq__). If the keys do not match, we have to find a new bucket for the data; Python does this using a modified version of linear probing.
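The probing scheme above can be sketched as follows. This is a simplified illustration modeled on the perturb-based scheme described in the comments of CPython's dictobject.c source, not the real implementation:

```python
def probe_sequence(hash_value, mask, num_probes):
    """Yield the slot indices a CPython-style open-addressing table
    would visit for a given hash. Simplified sketch, not CPython itself."""
    PERTURB_SHIFT = 5
    perturb = hash_value
    i = hash_value & mask       # first slot: hash masked to table size
    for _ in range(num_probes):
        yield i
        # mix higher bits of the hash into the next probe so that keys
        # sharing low bits do not all walk the same sequence
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask

# With 8 slots (mask 0b111) and hash 28975, the first slot tried is 7,
# matching the worked example above; on collision the walk continues.
slots = list(probe_sequence(28975, 0b111, 4))
print(slots)
```

The point of the perturbation is that two keys colliding on the first slot quickly diverge onto different probe paths, keeping collision chains short.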

Deletion

A deleted slot cannot simply be marked empty, because that would break the probing chain for keys that were inserted after a collision. Instead, Python writes a special sentinel value to the slot to signify that it is deleted but was once in use. These unused slots are cleaned out when the table is resized.
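A toy illustration of why the sentinel is needed (the names EMPTY and DUMMY are illustrative, not CPython's actual identifiers):

```python
# If deletion simply emptied a slot, a lookup probing past it would stop
# early and miss keys that were inserted after a collision.
EMPTY = object()   # slot never used
DUMMY = object()   # slot deleted, but was once in use

table = [EMPTY] * 8

table[7] = "a"     # suppose "a" hashes to slot 7
table[5] = "b"     # suppose "b" also hashed to 7 and was probed to slot 5

# Delete "a": mark the slot DUMMY rather than EMPTY, so that a lookup
# for "b" starting at slot 7 knows it must keep probing.
table[7] = DUMMY
```

A lookup that hits a DUMMY slot continues probing; one that hits an EMPTY slot can safely conclude the key is absent.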

Resizing

When the table is two-thirds full, it is grown: this is the critical point at which space savings are still good and we have a good bound on the number of collisions to expect. Resizing, however, requires allocating a larger table, adjusting the mask to fit the new table, and reinserting all the elements of the old table into the new one; since the mask changes, every index has to be recomputed. This makes resizing an expensive operation.

However, we don’t do resizing very frequently. By default, the size of a dictionary or set is 8 and on resizing it increases by 4X till the size is 50,000 and after that, it increases by 2X. So over time, the size of inserting an element becomes O(1) because there are not many resizes we have to do.

Hash Functions and entropy

Most objects in Python are hashable because they implement the built-in __hash__ and __eq__ methods (Python 2 used __cmp__). For integers and floats, the hash value is based on their numeric value. For tuples and strings, the hash value is based on their contents. But a list is not hashable, as its contents can change.
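This is easy to check interactively:

```python
# Immutable built-ins are hashable; in CPython, small ints hash to themselves.
print(hash(42))            # numeric value based
print(hash((1, 2, "a")))   # tuple: derived from the hashes of its contents

try:
    hash([1, 2, 3])        # list: mutable, therefore unhashable
except TypeError as err:
    print(err)             # explains that 'list' is unhashable
```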

We can also create user-defined classes that are hashable. By default, __hash__ is derived from the object's placement in memory (its id()), and the default equality check compares object identity in the same way. A custom hash function should take care to distribute hash values evenly so that collisions are minimized; having many collisions degrades performance, and in the worst case the time complexity of a search can degrade to O(n).
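A sketch of both cases: a class that hashes on its contents, and a deliberately bad hash that forces every instance into the same bucket. The class names here are illustrative:

```python
import timeit

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        return hash((self.x, self.y))   # spread hashes over the contents
    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

class BadPoint(Point):
    def __hash__(self):
        return 1                         # every instance collides

good = {Point(i, i) for i in range(1000)}
bad = {BadPoint(i, i) for i in range(1000)}

# Lookups in `bad` must fall back to repeated __eq__ calls along the
# collision chain, approaching O(n) per lookup instead of O(1).
t_good = timeit.timeit(lambda: Point(500, 500) in good, number=100)
t_bad = timeit.timeit(lambda: BadPoint(500, 500) in bad, number=100)
print(t_good, t_bad)
```

On a typical run the lookups against `bad` are markedly slower, though exact timings depend on the machine.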

How well a hash function's output is distributed is called the entropy of the hash function. We also need to find a proper mask. To find a mask for a dictionary with N elements, we first find the minimum number of buckets needed for the table to remain no more than two-thirds full (N * 5/3). Then we find the smallest table size that will hold that many elements (8, 32, 128, ...) and the number of bits needed to represent it. For example, if N = 1039, then we must have at least 1,731 buckets, which means we need a dictionary with 2,048 buckets. Thus, the mask is bin(2048 - 1) = 0b11111111111.
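Reproducing the worked example in code, using the growth schedule from the text (8, 32, 128, 512, 2048, ...):

```python
N = 1039
min_buckets = N * 5 // 3        # 1,731: stay no more than two-thirds full

# Smallest table size from the 4x growth schedule that fits min_buckets.
size = 8
while size < min_buckets:
    size *= 4

mask = size - 1
print(size, bin(mask))          # 2048 buckets, mask 0b11111111111
```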

Dictionaries and Namespace

Whenever a variable or function is referenced in Python, the name is looked up hierarchically. Python starts with the locals() array, which has entries for all local variables (this lookup is the fastest). If the name is not found there, the globals() dictionary is searched, and failing that, __builtin__ (named builtins in Python 3) is searched. Here locals() and globals() are dictionaries, but __builtin__ is a module object, so that final step is an attribute lookup.

For this reason, setting a local variable to a global reference before a loop begins can be both faster and more legible. The global lookup still needs to be performed once each time the function is called, but every use inside the loop becomes a fast local lookup. This illustrates how even slight slowdowns in code become noticeable when it executes millions of times: a dictionary lookup might take just a few hundred nanoseconds, but looped over millions of iterations, that time builds up quickly.
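The idiom looks like this, using math.sin as the global being hoisted:

```python
import math
import timeit

def slow(n):
    s = 0.0
    for i in range(n):
        s += math.sin(i)     # global + attribute lookup on every iteration
    return s

def fast(n):
    sin = math.sin           # hoist the lookup into a local, once
    s = 0.0
    for i in range(n):
        s += sin(i)          # fast local lookup on every iteration
    return s

print(timeit.timeit(lambda: slow(10_000), number=10))
print(timeit.timeit(lambda: fast(10_000), number=10))
```

Both functions compute the same sum; the only difference is where the name lookup happens.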

References:

  1. High Performance Python by Micha Gorelick and Ian Ozsvald, O'Reilly (https://www.oreilly.com/library/view/high-performance-python/9781449361747/)

