Creating a Bloom Filter with Go

Dylan Meeus
5 min readNov 23, 2019

--

bloom filter example: (https://en.wikipedia.org/wiki/Bloom_filter)

A bloom filter is a set-like data structure that is more space-efficient compared to traditional set-like data structures such as hash tables or trees. The catch is that it is probabilistic, meaning it can not give a 100% certain answer to certain queries.

More specifically, a bloom filter can tell you with 100% certainty that something is not in the set. But it can not tell you with 100% certainty that something is in the set.

Use cases

If you already know why you’d want to use a bloom filter, just skip the next part below 😃. So why would we want to use a set like this? One way bloom filters are used is to serve as a type of ‘cache’ for large data sets, where you expect that most queries return that an item is not in the set.

So instead of querying the large set directly, you first query the bloom filter. In most cases, the bloom filter tells us that an item is absent and that’s what we care about. And when an item might be present you query the actual data set directly.

You probably interact with them quite often. Let’s look at two examples

Medium

Medium is using bloom filters to avoid recommending articles that a user has already read.

Google Bigtable

Google Bigtable uses bloom filters to reduce disk lookups for non-existent rows or columns. If you’re 100% sure an item is not in Bigtable, your query will never need to read the actual data, thus saving (costly) disk lookups.

Google Chrome

Chrome uses a bloom filter to check a URL against a list of known malicious URLs. The first check is a (local) bloom filter, and only when this returns a positive result will an actual full-check be performed. The user only gets a notification if this check is also true.

How do they work?

To avoid using a lot of space, a bloom filter is actually an array of bits. Each item you pass to your bloom filter gets mapped to a location in the bloom filter, by hashing the incoming item. (As in the image at the start of this post)

When you later query your bloom filter, you hash the input and turn the hash into a location inside the bitfield. Next you check if the corresponding bits are set to 1 or 0. If they have been set to 1, than the item is in the set.

The hashing of the input is the reason why you can be certain of absence, but not certain of presence. If a bit is 0 you’re certain that nothing ever ended up on that index. But if it’s 1, it could be due a ‘hash collision’ or simply because multiple hashes mapped to the same position. Don’t worry, this will become clear in the implementation.

Basic Go implementation

For our Go implementation, we’ll steal the example of Chrome given above. We are storing string (urls) and when a user requests a URL we need to make sure it’s not malicious. If the entry is not in our bloom filter, the URL is not malicious and won’t harm our user.

It’s a toy example, so we’ll omit the steps of querying an actual database and just focus on the core part of the bloom filter. As mentioned, often you’d make sure a positive result is not a false positive by performing a query against the actual data.

For the most basic implementation we need a data structure to hold a bitset, a function for setting and getting data. And we’ll need to choose a hash function. Let’s start by setting up our struct and variables, we’ll use sha1 in this example. All code is available on github.

type filter struct {
bitfield [100]bool
}
var hasher = sha1.New()

Next, we’ll want to create a function to take a string input and return the hash from this. We’ll want the hash as a number and not a string representation, as we need to map it to a location in our bitfield.

func createHash(h hash.Hash, input string) int {
h.Write([]byte(input))
bits := h.Sum(nil)
buf := bytes.NewBuffer(bits)
result, _ := binary.ReadVarint(buffer)
return int(result) // cast the int64
}

As it’s an example, let’s ignore the error. 😱
We’ll need to map this result to a location in our bitfield. As we’ll need to get the same location for reads as well as writes, it’s best to create a separate function out of this. (To ensure that the same input always returns the same output).

func (f *filter) hashPosition(s string) int {
hs := createHash(hasher, s)
if hs < 0 {
hs = -hs // ensure a positive index
}
return hs % len(f.bitfield)
}

Great, now we can chain this together with a set and get function.

func (f *filter) Set(s string) {
pos := f.hashPosition(s)
f.bitfield[pos] = true
}
func (f *filter) get(s string) bool {
return f.bitfield[f.hashPosition(s)]
}

That’s all there is to create a basic bloom filter. When we store data in it using f.Set("https://www.abc.com") we’ll flip a bit to 1 to indicate that it has been set. When we run f.Get("https://www.abc.com") we use the same hash to find out if the bit has been set.

Better bloom filter

The above example is an implementation of a quite basic bloom filter, but it’s enough to get the idea of a bloom filter across. But what can we do to make this bloom filter better?

A good bloom filter gives a small amount of false positives, as this avoids the (expensive) lookup in the real data set. One way of doing this is by adding more hash functions, thus decreasing our odds of multiple items mapping to the same location in our bitfield.

A natural question then arises, “how many hash functions do we use?”. It turns out we can calculate how many hash functions we should use. For any given number (m) of bits in the bitfield and number (n) of elements we want to store in the bloom filter, the optimum number of hash functions (k) is given by the formula:

Formula for optimal hash functions

Hence our basic example might be fine for smaller bloom filters, but in the real world you’ll encounter bloom filters that use multiple hash functions.

If you liked this post and 💙 Go as well, consider:

  • Following me here, on Medium
  • Or twitter Twitter
  • Or check out workwithgo.com to find cool Go jobs around the world

--

--