Implementing a Bloom Filter in Java from scratch

Satvik Nema
6 min readJun 8, 2024

--

Continuing from the theoretical aspects of a bloom filter, this write-up talks about implementation of a bloom filter in Java.

Here’s a high level class diagram

Code is available here. Tests can be run here

The BloomFilter class, for starters, should be having these methods

public class BloomFilter<T> {

private final BitArray bitArray;

private final int size;

private final int hashAlgorithmsNumber;

public static <E> BloomFilter<E> create(int numberOfInsertions,
double errorThreshold) {}

private BloomFilter(int size,
int hashAlgorithmsNumber,
BitArray bitArray) {}

public void add(T element) {}

public boolean contains(T element) {}

public void serialise(String filepath) throws IOException {}

public static <E> BloomFilter<E> load(String filepath) {}
}

The create function will calculate how large the bit array will be, and how many hash function will be needed with the given number of insertions and expected error threshold [0, 1)

Will explain about the serialise and load methods in a bit. These are essentially to read and write the current bloom filter’s state to the disk, i.e serialising and deserialising.

Coming to bloom filter creation:

  public static <E> BloomFilter<E> create(int numberOfInsertions, 
double errorThreshold) {
int b = getOptimalBitArraySize(numberOfInsertions, errorThreshold);
int k = getOptimalNumberOfHashFunctions(numberOfInsertions, b);
return new BloomFilter<>(b, k, new BitArray(b));
}

private BloomFilter(int size, int hashAlgorithmsNumber, BitArray bitArray) {
this.size = size;
this.hashAlgorithmsNumber = hashAlgorithmsNumber;
this.bitArray = bitArray;
this.messageDigests = new ArrayList<>();

int added = 0;
for (HashAlgorithm hashAlgorithm : HashAlgorithm.values()) {
try {
messageDigests.add(MessageDigest.getInstance(hashAlgorithm.toString()));
} catch (NoSuchAlgorithmException e) {
System.out.println(
"Hash function "
+ hashAlgorithm
+ " appears to be missing from jdk. Trying out the next available algorithm");
continue;
}
added++;
if (added == hashAlgorithmsNumber) {
break;
}
}
}

where getOptimalBitArraySize and getOptimalNumberOfHashFunctions are defined as follows

  private static int getOptimalBitArraySize(long n, double E) {
return (int) (-n * Math.log(E) / (Math.log(2) * Math.log(2)));
}

private static int getOptimalNumberOfHashFunctions(long n, long b) {
return Math.max(1, (int) Math.round((double) b / n * Math.log(2)));
}

These mathematical derivations are explained in the last blog and can be visualised here.

For hash algorithms, I am using the enum HashAlgorithm which has all available algorithms in the jdk in java.security package

public enum HashAlgorithm {
SHA3_512("SHA3-512"),
SHA_1("SHA-1"),
SHA_384("SHA-384"),
SHA3_384("SHA3-384"),
SHA_224("SHA-224"),
SHA_512_256("SHA-512/256"),
SHA_256("SHA-256"),
MD2("MD2"),
SHA_512_224("SHA-512/224"),
SHA3_256("SHA3-256"),
SHA_512("SHA-512"),
SHA3_224("SHA3-224"),
MD5("MD5");

final String name;

HashAlgorithm(String name) {
this.name = name;
}

public String toString() {
return name;
}
}

Now coming to insertions:

  public void add(T element) {
for (MessageDigest md : messageDigests) {
int bitToSet = getBitPositionAfterHash(element, md);
bitArray.setBit(bitToSet);
}
}

private int getBitPositionAfterHash(T element, MessageDigest messageDigest) {
byte[] hashValue = messageDigest.digest(String.valueOf(element).getBytes());
messageDigest.reset();

int intValueFromBytes = new BigInteger(1, hashValue).intValue();
return Math.abs(intValueFromBytes) % size;
}

We leverage the native java.security.MessageDigest api to calculate the hash values, and then use BigInteger api to convert the byte array to an integer.

To find the index to set, mod it by size and set the corresponding position on the bit array.

contains method is also similar, but instead of setting values, only check for 0s and 1s and terminate as soon as any one of them is a 0.

  public boolean contains(T element) {
boolean result = true;
for (MessageDigest md : messageDigests) {
int bitPosition = getBitPositionAfterHash(element, md);
if (!bitArray.isBitSet(bitPosition)) {
result = false;
break;
}
}
return result;
}

I am not going into the BitArray implementation in this blog. It is simply an array of longs, where, say with 10 elements in the array, we store the data for 10 * 64 bits.

Now there might need to serialise and deserialise the bloom filter if needed. One use case is a LSM database. See appendix on why.

  public void serialise(String filepath) throws IOException {
String bitArrayPath = filepath + "_bit_array";

bitArray.serialise(bitArrayPath);

try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filepath))) {
bos.write(Conversion.toBytes(size));
bos.write(Conversion.toBytes(hashAlgorithmsNumber));
}
}

public static <E> BloomFilter<E> load(String filepath) throws IOException {
String bitArrayPath = filepath + "_bit_array";

BitArray loadedBitArray = BitArray.load(bitArrayPath);

int size;
int numberOfHashes;
try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(filepath))) {
byte[] bytes = bis.readNBytes(Integer.BYTES);
size = Conversion.toInteger(bytes);

bytes = bis.readNBytes(Integer.BYTES);
numberOfHashes = Conversion.toInteger(bytes);
}
return create(size, numberOfHashes, loadedBitArray);
}

First serialise the bit array in a separate location, and then write the size and number of hash function separately. Both the bit array and the int values could be written on the same file, but I wanted the bit array’s implementation to be separate from the bloom filter. You can check the bit array’s serialisation/deserialisation here

What is Conversion.toInteger and Conversion.toBytes ? To write an integer to a file, the most efficient way it to write it as byte array.

Say you need to write the integer 2147483647 in a file. If you write it as:

bos.write(2147483647+"\n");

It writes the value as a String, taking up 1 byte for each character. Coming to 11 * 1 byte = 11 bytes or 88 bits. On the other hand if you convert 2147483647 into bytes and write the bytes on the file, it will write it as 127, -1, -1, -1 which comes out to be 4 * 1 byte = 4 bytes or 24 bits. Thats saving around 64 bits

Here it may not make a difference with only 2 integers to write. But with BitArray, 100s of long values are being written in the file.

Say with the long value 9223372036854775807, when writing as string, it consumes 19 bytes but when writing as byte array, it only consumes8 bytes => 127, -1, -1, -1, -1, -1, -1, -1 . That’s a saving of 11 bytes per element

Bloom filters with moderate sizes may get up to a million bits in length, so a saving of ~11,000,000 bytes. We save ~11 MB with this little optimisation!

Here’s how the conversion functions look like:

  public static byte[] toBytes(Long val) {
int bytesInALong = Long.BYTES;
byte[] result = new byte[bytesInALong];

int mask = (1 << (Byte.SIZE + 1)) - 1; // 8 bit mask with -> 11111111
for (int i = bytesInALong - 1; i >= 0; i--) {
byte end8Bits = (byte) (val & mask);
result[i] = end8Bits;
val = val >> Byte.SIZE;
}
return result;
}

public static Long toLong(byte[] arr) {
int bytesInALong = Long.BYTES;
if (arr.length > bytesInALong) {
throw new ArithmeticException(
"Cannot convert bytes to long as byte array has more elements than 1 long can handle");
}
long result = 0L;
for (int i = 0; i < bytesInALong; i++) {
result = result << Byte.SIZE;

int b = arr[i] & 0xFF;
result = result | b;
}
return result;
}

Similarly for integers, here is the implementation

Time to test the implementation

For testing, the values are taken from https://hur.st/bloomfilter.

With a 100k insertions and expected threshold to be 0.001

  @Test
@DisplayName(
"verify that integer bloom filter works correctly. When elements not present. false positives expected")
void testCreate_1() {
BloomFilter<Integer> bloomFilter = BloomFilter.create(100_000, 0.001);

int toInsert = 100000;
for (int i = 0; i < toInsert; i++) {
bloomFilter.add(i);
}

double errors = 0;
double success = 0;
for (int i = toInsert; i < toInsert * 3; i++) {
boolean result = bloomFilter.contains(i);
if (result) {
errors++;
} else {
success++;
}
}
System.out.println("Success: " + success);
System.out.println("Errors: " + errors);

double errorRate = (double) ((int) ((errors / success) * ROUND_PLACES)) / ROUND_PLACES;
System.out.println("Error rate: " + errorRate);
}

And the output is:

Success: 199809.0
Errors: 191.0
Error rate: 9.5E-4

Error rate is 0.00095 , which matches the expectations!

Similarly, there should be assertions for 100% guaranteed NOs:

  @Test
@DisplayName(
"verify that integer bloom filter works correctly. When elements present. No false positives expected")
void testCreate_2() {
BloomFilter<Integer> bloomFilter = BloomFilter.create(100_000, 0.001);

int toInsert = 100000;
for (int i = 0; i < toInsert; i++) {
bloomFilter.add(i);
}
for (int i = 0; i < toInsert; i++) {
assertTrue(bloomFilter.contains(i));
}
}

You can find the tests for serialisation/deserialisation here

Appendix

Serialising and deserialising of a bloom filter is very much important for a LSM database, where each memtable has it’s own bloom filter.

It’s possible that for a memtable on disk, it’s bloom filter is also on the disk and not loaded on the memory. For querying the memtable, there’s a need to first check it’s bloom filter. To do that, we use the load method.

When writing a fully saturated memtable onto the disk, what will happen to the bloom filter in memory? That will go alongside the memtable on the disk. For that, use the serialise method.

You can check how the interaction between the bloom filter works with a simple LSM database. That integration is part of the same repository here: https://github.com/SatvikNema/SatvikDB/commit/71be22414dc1a159082908d5b383a0862aebc2c5

References

  1. https://medium.com/@satviknema/what-the-heck-is-a-bloom-filter-anyway-ee392818f2b2
  2. https://github.com/SatvikNema/bloom-filter
  3. https://hur.st/bloomfilter/
  4. https://github.com/SatvikNema/SatvikDB/tree/main/src/main/java/com/satvik/satvikdb

--

--