Implementing a Cuckoo Filter in Go

Dylan Meeus
Dec 22, 2019 · 5 min read

In a recent post we have looked at a bloom filter implementation, which is a commonly used probabilistic data structure to provide fast membership tests. A quite similar data structure is the Cuckoo Filter.

A Cuckoo Filter, as opposed to a bloom filter gives us the possibility to not only store items but also delete them from the filter without sacrificing the performance for set membership tests that we get from a traditional Bloom Filter. In addition, a cuckoo filter that stores many items and targets a low false-positive rate will have a lower space overhead than space-optimized bloom filters.

Why would you want to choose a cuckoo filter over a bloom filter?

  • They support adding / removing items dynamically
  • Higher lookup performance, even when 95% space is utilized
  • Easier to implement than alternatives
  • Less space used than bloom filters in *most* cases

Similarly as with the bloom filter, you can not retrieve what data was inserted. A cuckoo filter will only store a hash (fingerprint) of the item that is being inserted rather than the item itself.

In case you don’t want to walk through the entire implementation step by step, the full code can be found on Github.

Setting up the Cuckoo Filter

The high-level idea behind a cuckoo filter is just fingerprints of input are being stored in the filter. These fingerprints are stored in ‘buckets’ similar to how it would happen in a HashMap implementation.

Hence we’ll need to store the bucket but also some metadata for the filter and for the buckets.

// Cuckoo filter based on https://www.pdl.cmu.edu/PDL-FTP/FS/cuckoo-conext2014.pdf
type Cuckoo struct {
buckets []bucket
m uint // buckets
b uint // entries per bucket
f uint // fingerprint length
n uint // filter capacity (rename cap?)
}

In addition, we’ll also create a type alias for buckets and fingerprints. Because fingerprints are generated with a hash function, we’ll import crypto/sha1 as well but you could pick another hash function.

type bucket []fingerprint
type fingerprint []byte
var hasher = sha1.New()

We can tweak our cuckoo filter to change the rate of false-positives and space-time efficiency. This depends on what a user wants from the filter, hence we can use a “constructor” that accepts a requested capacity and a required false-positive rate.

// n = len(items), fp = false positive rate
func NewCuckoo(n uint, fp float64) *Cuckoo {
b := uint(4)
f := fingerprintLength(b, fp)
m := nextPower(n / f * 8)
buckets := make([]bucket, m)
for i := uint(0); i < m; i++ {
buckets[i] = make(bucket, b)
}
return &Cuckoo{
buckets: buckets,
m: m,
b: b,
f: f,
n: n,
}
}

If you’re wondering where the fingerprintLength and nextPower function come from, this paper sheds some light on that. I’d rather focus on the insertion / lookup and deletion algorithms in the remainder of this blog post (they’re in pseudo-code in the paper as well) 😃

Inserting values

When inserting a value, we’re really just inserting a hash of the input. Instead of just creating one hash, we’ll create a second hash based on the initial fingerprint as well. This gives us two buckets where we could store our data.

We’ll try to enter our fingerprint in either of the two buckets as long as the bucket does not overflow. If the bucket is full, we’ll kick an item out the bucket and move it to another one.

The hash function we are using to create our hashes for the first and second bucket is based on the Sha1 hash:

func (c *Cuckoo) hashes(data string) (uint, uint, fingerprint) {
h := hash([]byte(data))
f := h[0:c.f]
i1 := uint(binary.BigEndian.Uint32(h))
i2 := i1 ^ uint(binary.BigEndian.Uint32(hash(f)))
return i1, i2, fingerprint(f)
}

func hash(data []byte) []byte {
hasher.Write([]byte(data))
hash := hasher.Sum(nil)
hasher.Reset()
return hash
}

A high level overview of the insertion algorithm using this hash would be:

// try to store in bucket 1
// if success-> done
// if not -> try to store in bucket 2
// if success -> done
// if not ->
// while retry < retryLimit
// pick random entry (r) from bucket 1
// move entry (r) to alternate location
// try store in new bucket
// if success -> done

More visually:

(source: https://www.pdl.cmu.edu/PDL-FTP/FS/cuckoo-conext2014.pdf)

Translated to code, trying to store the in the first or second bucket this would look like:

i1, i2, f := c.hashes(input)
// first try bucket one
b1 := c.buckets[i1%c.m]
if i, err := b1.nextIndex(); err == nil {
b1[i] = f
return
}

b2 := c.buckets[i2%c.m]
if i, err := b2.nextIndex(); err == nil {
b2[i] = f
return
}

In the best case scenario, it ends here. But in case both our buckets were full already, we’ll start kicking things out of bucket 1 and move the item that got kicked out to its alternate position.

// start relocating items
i := i1
for r := 0; r < retries; r++ {
index := i % c.m
entryIndex := rand.Intn(int(c.b))
// swap
f, c.buckets[index][entryIndex] = c.buckets[index][entryIndex], f
i = i ^ uint(binary.BigEndian.Uint32(hash(f)))
b := c.buckets[i%c.m]
if idx, err := b.nextIndex(); err == nil {
b[idx] = f
c.count++
return
}
}
panic("cuckoo filter full")

I panic if our cuckoo filter is full, but… this should probably be handled gracefully 😅

Membership test

A cuckoo filter would be quite useless if we could only store items but not query it. Just as with our bloom filter, a lookup is actually quite easy. We check if the item’s fingerprint is in bucket 1 or bucket 2, and if so we have found our (potential) match. If not, it’s definitely not stored.

// lookup needle in the cuckoo nest :)
func (c *Cuckoo) lookup(needle string) bool {
i1, i2, f := c.hashes(needle)
return c.buckets[i1%c.m].contains(f) || c.buckets[i2%c.m].contains(f)
}
// returns the index and successful lookup
func (b bucket) contains(f fingerprint) (int, bool) {
for i, x := range b {
if bytes.Equal(x, f) {
return i, true
}
}
return -1, false
}

Deletion

So far our filter has the same functionality as a bloom filter, what will set it apart here is the delete function. Deleting an item makes the false-positive remain unchanged, as long as you delete items that have previous been inserted. When you try to delete an item that has not been inserted — you could end up deleting a fingerprint from a colliding element.

We can create a delete function by stitching together the functions we’ve already written:

// delete the fingerprint from the cuckoo filter
func (c *Cuckoo) delete(needle string) {
i1, i2, f := c.hashes(needle)
// try to remove from f1
b1 := c.buckets[i1%c.m]
if ind, ok := b1.contains(f); ok {
b1[ind] = nil
return
}

b2 := c.buckets[i2%c.m]
if ind, ok := b2.contains(f); ok {
b2[ind] = nil
return
}
}

Resources

The code in this blog was based on this paper: https://www.pdl.cmu.edu/PDL-FTP/FS/cuckoo-conext2014.pdf

The repository with all code can be found on github: https://github.com/DylanMeeus/MediumCode/


If you liked this post and ❤️ Go as well, consider:

  • Following me here, on Medium
  • Or twitter Twitter
  • Or check out workwithgo.com to find cool Go jobs around the world. 😄

Dylan Meeus

Written by

Gopher | workwithgo.com | github.com/DylanMeeus

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade