A quick introduction to Bloom Filters

When designing large scalable systems, this idea will help a ton

Devansh
Geek Culture
5 min readSep 19, 2022

--

How can you check if an element is a member of an extremely large dataset? Take your Medium account for example. There are millions of users on Medium. So when a new user tries to create a username, how can Medium check if that username already exists? If search through all the usernames would be extremely expensive. For smaller sets, you can use a HashSet to search through. However, you would have to load that entire dataset into memory before searching. This is going to be slow and costly. So how can you check if an element is a member of an extremely large dataset, in a simple and cost-effective manner?

Below is a piece from my Daily Newsletter- Technology Made Simple- answering this question. The post has been copied to give you an understanding of what the newsletter is like. If you’re interested in more such pieces, subscribe to the newsletter. Details at the end.

Imagine you had a very large data set. Many many members, with each member being very complex. Maybe your social media platform took off, and you have millions of users with funky usernames (and even more complex user IDs).

Now every time you want to check whether a new ID is a part of the set, what would you do? LinkedList and Arrays would be too slow. Tries would need to be constantly updated and balanced, making them problematic. Why not try Hash maps/sets? Everyone always says that they have O(1) insertion, lookup, etc.

Bloom Filters help you reduce unnecessary disk access. This makes them amazing for applications with lots of data.

However, think back to the piece on hashing we did. Remember that hashes are implemented on top of arrays. If you had to access the underlying storage every time, your costs would shoot through the roof. Remember keeping lots of data stored is cheap. Loading it to work on is much costlier. So how can we test for membership w/o killing our financials? That is what we will learn about.

If you’re looking for the best results in your interviews/tech journey, reach out.

Fruit flies use a modified version of Bloom filters to detect novelty of odors, with additional features including similarity of novel odor to that of previously experienced examples, and time elapsed since previous experience of the same odor.[11]

- This is Wikipedia’s first example of the use of Bloom Filter.

Highlights

  1. What are bloom filters- Bloom filters are probabilistic data structures meant to test for membership in an efficient manner.
  2. How they work (insertion)- Have a bit array of size m (all values initialized to 0). Have k hash functions, all of which map to index values in the array (go from 0-(m-1) ). Given an input, run it through all the hash functions. This gives us k different values. Go over the array, and set all the indices for 1.
  3. Querying- Take input, take it through our k functions. If any of the indices are at value 0, then we haven’t seen this input.
  4. False Positives/Negatives- Bloom filters are probabilistic. They can tell you if an element is probably in the storage. Sometimes they will say yes to elements that don’t exist. They have no false negatives. They will never say an element DNE if it does.
  5. Regeneration- As we start adding more values, the rate of False Positives will go up. Eventually, every query will return true. Once our filter fills up beyond a point, it makes sense to create a bigger one/replace it. Fun fact- a filter of any fixed size can be used for an infinite number of elements. Don’t do this, but keep this as trivia.
  6. Storage- Bloom Filters don’t actually store the elements in the database. You need something else for it. They just make the process of searching much cheaper. The tradeoff is that we will

Bloom Filters are one of the most important data structures being implemented right now. There are many variants of the filter, each specialized for different tasks. For example, Content Delivery Networks use them to optimize the storage used in the cache.

There are over 60 kinds of Bloom Filters. Image Source

Bloom Filters are a great tool to have in your tool kit. Whether you’re a software developer trying to build amazingly scalable systems, a Machine Learning Engineer looking for an easy way to test for similarities between two vectors of user behavior, or a backend bro looking to optimize the search and querying protocols, Bloom Filters will be very useful to your needs.

For more such writeups, check out my newsletter Technology Interviews Made Simple. Tech Made Simple is the best resource for people looking to build amazing careers in Tech. It will help you conceptualize, build, and optimize your solutions. It covers everything from technical aspects from System Design, Computer Science concepts, and Leetcode Problem solving to detailed guides on networking and building your career. Save your time, effort, and money by finding all your needs met in one place. Use the link here to get 20% off, for up to a whole year.

I created Technology Interviews Made Simple using new techniques discovered through tutoring multiple people into top tech firms. The newsletter is designed to help you succeed, saving you from hours wasted on the Leetcode grind. I have a 100% satisfaction policy, so you can try it out at no risk to you. You can read the FAQs and find out more here

Feel free to reach out if you have any interesting jobs/projects/ideas for me as well. Always happy to hear you out.

For monetary support of my work following are my Venmo and Paypal. Any amount is appreciated and helps a lot. Donations unlock exclusive content such as paper analysis, special code, consultations, and specific coaching:

Venmo: https://account.venmo.com/u/FNU-Devansh

Paypal: paypal.me/ISeeThings

Reach out to me

Use the links below to check out my other content, learn more about tutoring, or just to say hi. Also, check out the free Robinhood referral link. We both get a free stock (you don’t have to put any money), and there is no risk to you. So not using it is just losing free money.

Check out my other articles on Medium. : https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

If you’re preparing for coding/technical interviews: https://codinginterviewsmadesimple.substack.com/

Get a free stock on Robinhood: https://join.robinhood.com/fnud75

--

--

Devansh
Devansh

Written by Devansh

Writing about AI, Math, the Tech Industry and whatever else interests me. Join my cult to gain inner peace and to support my crippling chocolate milk addiction