Bloom Filters Explained
Fast lookups with a fraction of the memory, when 'probably yes' is good enough
Hi Friends,
Welcome to the 173rd issue of the Polymathic Engineer newsletter.
In a previous article, we talked about the dictionary problem and discussed different types of data structures that solve it, from unsorted arrays to hash tables and binary search trees. We have seen that hash tables are the best choice when all you need is to check if something is in a collection, since you can look up, add, and remove items in O(1) time.
But we have also seen that hash tables come with a cost: memory. Storing 100K words in a hash table means keeping all those words in memory, plus the extra space that the hash table needs. What if you are working in a context where memory is limited, like a router processing millions of packets per second, or a distributed database that needs to quickly check if an entry exists before writing to disk?
This is where Bloom filters come in. A Bloom filter is a data structure that can tell you if an item is in a set using only a small amount of memory with respect to a hash table. The catch is that it is not always right. A Bloom filter can tell you with certainty that something is not in the set. However, when it indicates something is there, it could be inaccurate. These inaccurate answers are known as false positives, and as we will see, you get to control how often they happen.
In this article, we will look in detail at how Bloom filters work, why they can give false positives but never false negatives, how to tune them to fit your needs, and where they are used in practice. This is the outline:
How Bloom filters differ from hash tables
How Bloom filters work
Why are there no false negatives
Why are there false positives
Tuning the filter
Real-world use cases
Project-based learning is the best way to develop technical skills. CodeCrafters is an excellent platform for tackling exciting projects, such as building your own Redis, Kafka, a DNS server, SQLite, an HTTP server, or Git from scratch using your favorite programming language. Now you can also try to build your own Claude Code (for free, still in beta).
Sign up, and become a better software engineer.
How Bloom filters differ from hash tables
Before we go into the details, it is worth knowing what makes Bloom filters different from hash tables, since both use hashing under the hood. There are three main things to keep in mind.
First, Bloom filters don't store the actual data. A hash table stores the keys (and maybe their values) in memory. A Bloom filter doesn't do that. The sole question it answers is whether or not an item is in the set.
Second, Bloom filters use considerably less memory. This is the main reason why they exist. A hash table for 100,000 strings needs to store all those strings plus the internal structure of the table. A Bloom filter for the same 100K strings only needs a bit array, regardless of how long the strings are.
Third, Bloom filters can give wrong answers. When a Bloom filter says an item is not in the set, it is always right. But it could be inaccurate when it indicates an item is in the set. This is a false positive. There is a trade-off between the memory a Bloom filter employs and how often these false positives happen. The less memory, the more false positives.
There is also one more thing to be aware of: the basic version of Bloom filters doesn’t support deletion. Once you add an item, you can’t remove it. Some advanced variants have been developed to handle removal, but the classic version does not.
If you can live with these constraints, you get a data structure that is as fast as a hash table for lookups and insertions, but uses only a fraction of the memory.
How Bloom filters work
A Bloom filter is made of two things: a bit array with m elements, all initially set to 0, and a set of k hash functions. Each hash function takes a key and returns an index in the range between 0 and m-1.
The most important thing to remember is that there is no direct connection between the bits in the array and the keys you add to the filter. You specify the value of k when you create the filter. Each key is stored using k bits, one for each hash function. This means that every key you add takes up the same amount of memory, k bits, no matter how lengthy it is. A 5-character word and a 500-character URL both use k bits.
Inserting an element
When you add a key to the filter, you give it as an input to all the k hash functions. Each function gives you an index in the bit array, and you set the bits at those k positions to 1. For example, let’s suppose we have a Bloom filter with 15 bits and 3 hash functions, and we want to add the word “binary.” We compute:
h0(”binary”) = 2
h1(”binary”) = 6
h2(”binary”) = 9


