Bloom Filter Mathematical Proof

Humberto Villalta
5 min readApr 5, 2023

--

1. False Positive Rate Proof

1.1 Bit “0” Probability Proof

Let the next expression define the probability of a bit being 0 when having a bit array size of m bits. In contrast, if we only take -1/m, the probability of that bit being 1 will be calculated

When having k hash functions number, the probability of a bit being zero is the product of 1-1/m multiplied k times. Therefore, we get the next two equations:

  • Let k be in [1,~] numbers, therefore:

Samely, if there are n times insertions, the probability of that bit being 0 after k times hash functions is the product of the (1-1/m)^k multiply n times. Therefore, we can conclude the next:

  • Let k and n be in [1,~] numbers, therefore:

1.2 Probability Proof of “False Positive”

The probability of a False Positive according to section 1 probability of having a bit not being 0 can be expressed as 1- “being 0 probability”:

However, k hash functions are implemented. Therefore, the above expression has to be multiplied by k times to calculate the probability of having a False Positive. Next is proven how the false positive rate is being calculated:

  • Let k be in [1,~] numbers, therefore:

$$$$$Final EquationProof Below$$$$$

Therefore, the false positive rate final formula can be expressed as next:

1.3 (1-1/m)^kn and e^(-kn/m) Relationship

Let the next equations be defined as approximate equals:

To calculate their equality, it is important to simplify both equations.

  • Step 1: k and n are canceled on both sides.
  • Step 2: Elevate both sides to m and canceled the Euler expression m to have as a result -1
  • Step 3: Express Euler as 1/e

Thus, if we calculate the limit of the above result is expressed as next:

  • As seen in the graph and equation below, the bigger the bit array size m is, the closer gets to approximate 1/e.

Therefore, the below equation expresses an approximate equality between the two.

2. Optimal Hash Function Number k

According to the false positive proven equation, we can summarize as next:

2.1 Derivating f

To find the optimal k, we need to find the derivate of the next equation:

When derivating ln, the chain rule must be applied because ln(x)’ = 1/x → x = 1-e^(-kn/m)

2.2 Finding the Optimal k

The formula in section 2.1 step 5, after solving k when equalizing to 0, the next equation is the optimal solution to calculate the number of hash functions in the bloom filter, and therefore, k must be cleared to find the optimal solution:

Euler’s number, also known as e, is a mathematical constant that is approximately equal to 2.71828. It is an irrational number, meaning that it cannot be expressed as a finite decimal or a fraction. A natural logarithm is a type of logarithm used to calculate the time it takes for an exponential function to grow or decay. The natural logarithm is written as ln and is the inverse function of the exponential function, which is written as:

The equation in step 10 has natural logarithms on both sides. However, in this equation when evaluated to “0”, the left side is zero but the right size has a mathematical issue because x cannot be equal to or lower than zero in ln(x). Therefore we get 0.5 as a result below and can clear the expression by replacing x with 0.5 because e is a constant.

3. Optimal Bit Array Number m

To find the optimal m, we need to clear m in terms of p and n for the next equation:

References

  1. Logarithm Rules
  2. Bloom Filter Video Math Proof
  3. Bloom Filter Berkeley University Math Proof
  4. Derivate Products
  5. Exponent Rules
  6. Bloom Filter Calculator
  7. Bloom Filter Survey
  8. Bloom Filter John Hopkins School

--

--

Humberto Villalta

I am a BigData Engineer who has a lot of interest in the research and development of data structures for the optimization of NoSQL Databases.