What are Counting Bloom Filters? — Python Implementation

Kareem Alsayed
Analytics Vidhya
Published in
8 min readFeb 12, 2020
An example of “Counting Bloom Filters”

Before talking about Counting Bloom Filters, I will give a brief overview of Bloom Filters and its main highlights. That’s because Counting Bloom Filters functions deviate from the Bloom Filters algorithm functions. Afterward, I will discuss the main functions of the Counting Bloom Filters in detail and specify how it is different from Bloom Filters.

Bloom Filters

Overview

Bloom Filter is a probabilistic data structure used mainly for checking if an element exists in a set or not. It is known to be space-efficient as it uses only bits for storing the data. Also, one of the Bloom Filter’s advantages is time-efficiency because it takes a constant time complexity to add and look-up an element. Bloom Filters depend on hashing functions to assign elements into slots. For looking up an element, Bloom Filters take an input of the element, and they return if the element in the data structure or not. However, the positive output is not always correct, which means the data structure may return that an element exists, but it doesn’t. There are some versions of Bloom Filters; one of them is the “Counting Bloom Filters” that will be the focus of the article. That article will discuss the main functions of the Counting Bloom Filters and some probabilistic equations related to it.

Main highlights

  • Bloom Filters have no false-negative output which means that it will never say that a specific value exists when it doesn’t really exist.
  • It has false-positive probability which means that it may say that something exists while it doesn’t. That problem happens when the outputs of the hash functions overlap for different elements.
  • Bloom Filters use a size of bit (0 or 1) for storing marking one slot.
  • In initialization, Bloom Filters have a specific size (m) of zeros, and that size affects the probability of getting a false positive.

I would recommend reading that article for more information about Bloom Filters

Counting Bloom Filters

Counting Bloom Filters use the same functions of Bloom Filters, but the main difference is that in Counting Bloom Filters, we add counters in every slot instead of (0 or 1). That means we can delete elements from the data structure which is not possible in Bloom Filters.

Counting Bloom Filters’ functions:

1- Initialization

To initialize Counting Bloom Filters, we should create an array of m slots where every slot will hold the value of a counter. In the beginning, all the slots will equal to zero as there are no elements added in the data structure.

Creating an empty Counting Bloom Filter with the size 10

It is essential to point out here that Counting Bloom Filter initializes a fixed-size array (m). So, it has a constant space complexity O(1). However, if it has flexible sizing as time passes, it will lose the advantage of space efficiency.

2- Hashing

To understand how Counting Bloom Filters store data, you need to know how hashing functions work.

A hash function takes an input of data and returns a unique identifier of fixed length which is used for identification of input.

I would recommend reading that article for more details.

For finding the key to each element in the array, we will use the hash functions. Hashing will be used mainly into three methods, which are adding an element, removing an element, and looking up for elements.

Every element will be assigned to more than one slot in the data structure. Thus, every element will pass through more than one hashing function to find the corresponding hash code in the data structure. The number of hashing functions used will be represented with the letter k. Finding the best value for k will be explained in the “false positive probability” section below.

In the Python implementation, I will use the Python built-in library for hashing (hashlib) since it abstracts effective hashing algorithms.

3- Adding

To add an element, we need first to use the hash functions to find the keys of the element, and then we should get the remainder (modules function) of the output from the array size. So, the formula will be the following: key = hᵢ(element) % m where hᵢ() is a hash function, and m is the size of the array.

After that, we should go to slots where the element keys point, and then we increase the value of their counters by one. Here, we increase the counters’ value by one because we are using Counting Bloom Filters. On the other hand, if we are using Bloom Filters, we would change these slots to be equal to one even if the element is not the first in the slot.

Example:

If we are building a data structure for video games, we will start by initializing the array. Then, we will push the first game “Battlefield” in the data structure as shown in the picture below:

Adding “Battlefield”

After that, we can add the second game “GTA” using the same hash function as shown in the picture below:

Adding “GTA”

Here, we can notice that slot number 3 increases by 1 since the two hash functions output are intersecting there.

4- Remove

Removing an element from the data structure has a similar approach to adding. Hence to remove an element, we first need to find the keys of the element using the hash functions. Afterward, we will go to the slots with the keys found, and then we should decrease these slots values by one.

Example:

Getting back to the previous example. Now if we need to remove the video game “Battlefield”, we will first go to the slots with correspondent keys, and then decrease their values by one as shown in the picture below:

Removing “Battlefield”

5- Lookup:

Looking up an element in the data structure takes a similar approach as the remove function and add functions. First of all, we will use the hashing functions that return the keys of the element using the following formula: key= hᵢ(element) % m where hᵢ() is a hash function, and m is the size of the array.

Then, we will go to every slot with the element keys and check its value. If the value of the counter equals zero, it means that the element doesn’t exist. However, if the counter value is greater than zero, it means that we need to check the other slots. If all the slots with the element keys are greater than zero, then it means that elements exist in the data structure, otherwise the element doesn’t exist. Sometimes the lookup function returns false-positive results; that means the look-up function says that the element exists, but it is not there.

Example :

Returning to our previous example, we will try to check whether the video game “Battlefield” still exists or not. First, we will go to the slots with “Battlefield” keys, which were (0, 3, 6). We will find that the value of (0 and 6) is zero. That means “Battlefield” doesn’t exist in the data structure, as shown in the picture below:

Looking for “Battlefield”

The full python class (CBFs) implementation can be found that gist

False Positive probability:

Not always, the lookup function returns the correct positive output, which means there is a rate for error called false positive probability. The false-positive probability is computed experimentally by dividing the positive false times by the total number of positive results. Also, another way for computing it, theoretically, is by using the following formula: P=(1-[1-(1/m)]ⁿᵏ)ᵏ where P is the false Positive probability, m is the size of the array, n is the expected number of elements, and k is the number of hash functions used.

That’s why some other equations exist that try to find the optimal number of hashing functions (k). One equation is k=(m/n)× ln (2) where k is the optimal number of hashing functions, m is the size of the array, and n is the estimated number of elements.

Also, another equation is used to find the best size for the array given the probability and the number of expected elements. That equation is the following: m=-(n×ln (P))/(ln (2))² where m is the array size, n is the number of the estimated elements, and P is the false positive probability.

Thus, the parameters of the CBFs class will be the estimated number of elements and the false positive probability desired; after that, the code can calculate the best array size and the optimal number of hashing functions.

False-negative:

Counting Bloom Filters don’t have a false-negative result, which means that if the output is false, then it is 100% false. That’s because if one of the slots is zero, it means that no elements touched that counter.

Complexity analysis:

Computational applications:

  • Checking the availability of a unique username:

The Counting Bloom Filters can be used when signing up on a website, and we choose a specific username, the website checks if that the username is taken or not. It is a good application here because if Counting Bloom Filters return that the username is unique, then it is 100% the username is unique as the false-negative rate is zero. Also, if someone is deleting his account on that website, then Counting Bloom Filters will give us the ability to remove that username from the database.

  • Keycard authentication

Many companies are using the keycard entry system for access control. So, allowing the keycard holder inside shouldn’t take time, and the process should be accurate to avoid non-authorized people from getting in. Thus, Counting Bloom Filters would be a great choice for solving that problem as it has constant time-complexity and zero false-negative probability.

Thanks for reading! Open for any feedback :D

--

--