Let's implement a Bloom Filter
I am planning to create a series of blog posts that includes some literature research, implementations of various data structures, and our journey of creating a distributed datastore at distrentic.io.
You might be wondering why I am starting with a blog post explaining the Bloom filter when I don’t have a single clue about how to create a distributed datastore. My answer is simple: “I like the idea behind it”.
Before I get into the details of Bloom filters, I want to give our backstory, which will help you understand why we started building something we’d enjoy in our spare time that will never be production ready.
The backstory
My friend Ibrahim and I have always been fascinated by complex software and distributed systems - we’ve been working together for more than 5 years (we got old, dude) and we were lucky enough to work for the largest e-commerce company in Europe. We battled our way through many of the different problems that distributed systems can offer. We both moved to Cambridge, UK and are still fighting the villains of the distributed world.
Let’s explore the mystic land of probabilistic data structures by implementing a Bloom Filter.
What the hell is a Bloom Filter
You might also want to read Bloom filters debunked.
A Bloom filter is a method for representing a set to support membership queries.
It is a space-efficient probabilistic data structure that is used to answer a very simple question: is this element a member of the set? A Bloom filter does not store the actual elements; it only stores their membership.
False positive matches are possible, but false negatives are not – in other words, a query returns either “possibly in set” or “definitely not in set”. Unfortunately, this also means items cannot be removed from a Bloom filter (some other element or group of elements may have hashed to the same indices).
Because of its probabilistic nature, the Bloom filter trades accuracy for space and performance. Much like with the CAP theorem, we make a trade-off: here, we choose performance over accuracy.
Bloom filters have some interesting use cases. For example, they can be placed on top of a datastore. When a key is queried for its existence and the filter does not have it, we can skip querying the datastore entirely.
Figure 1: Example usage of a Bloom filter.
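That read path can be sketched in a few lines. This is a hypothetical illustration, not the implementation from this post: `MockFilter`, `might_contain`, and `lookup` are made-up names, and the filter is faked with a plain list so the skip-the-datastore logic stands out.

```rust
use std::collections::HashMap;

// Stand-in for a Bloom filter: only has to answer
// "possibly present" / "definitely absent".
struct MockFilter {
    keys: Vec<String>, // stand-in for the filter's bit array
}

impl MockFilter {
    fn might_contain(&self, key: &str) -> bool {
        self.keys.iter().any(|k| k == key)
    }
}

fn lookup(filter: &MockFilter, store: &HashMap<String, String>, key: &str) -> Option<String> {
    if !filter.might_contain(key) {
        return None; // definitely absent: skip the datastore entirely
    }
    store.get(key).cloned() // possibly present: query the datastore
}
```

The point is the early return: a negative answer from the filter is authoritative, so the (expensive) datastore lookup only happens for keys that are possibly present.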
How does it work
The idea behind a Bloom filter is very simple: allocate a bit array of m bits, all initially set to 0, and choose k independent hash functions that map each element to positions in the array.
The Bloom filter has two operations just like a standard set:
Insertion
When an element is added to the filter, it is hashed by each of the k hash functions, and the resulting k array positions are set to 1.
Figure 2: Adding elements to a Bloom filter.
Query
To query the membership of an element, hash it with the same k hash functions and check the corresponding bits. If any of them is 0, the element is definitely not in the set; if all of them are 1, the element is possibly in the set.
A Bloom filter eventually reaches a point where all bits are set, which means every query will indicate membership, effectively making the probability of a false positive equal to 1.
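The mechanics above can be traced by hand. In this sketch the real hash functions are replaced with fixed, hand-picked positions over a 10-bit array - an assumption purely for illustration:

```rust
// Insertion: set each of the k bit positions produced by the hashes.
fn set_bits(bitmap: &mut [bool], positions: &[usize]) {
    for &p in positions {
        bitmap[p] = true;
    }
}

// Query: "possibly in set" only if every one of the k bits is set.
fn query(bitmap: &[bool], positions: &[usize]) -> bool {
    positions.iter().all(|&p| bitmap[p])
}
```

If one element maps to positions {1, 4, 7} and another to {4, 8, 9}, then a third element that happens to map to {1, 8, 9} queries as present even though it was never inserted - exactly the false positive described above.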
Finding optimal m and k
We can derive the optimal bitmap size m and number of hash functions k from the expected number of items n and the desired false-positive rate p:

m = -(n · ln p) / (ln 2)²

k = -ln p / ln 2
If you want to learn about how the above formulae are derived, you might want to pay a visit here.
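As a quick sanity check of these formulae, take n = 1,000,000 expected items and a false-positive rate of p = 0.01. Then m = ⌈(1,000,000 · ln(1/0.01)) / (ln 2)²⌉ = ⌈9,585,058.4⌉ = 9,585,059 bits, and k = ⌈ln(1/0.01) / ln 2⌉ = ⌈6.64⌉ = 7 hash functions.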
Rust implementation
You can find the full implementation here. Huge thanks to @xfix for fixing the hashers initialization with the same seed, and to @dkales for spotting an issue with the ordering of operations in the index calculation.
Finally! It is time to write some Rust. I am implementing the Bloom filter as I write this blog post. If you don’t believe me, check the command below:
cargo new --lib plum
Let’s continue with the dependencies. There is only one dependency, and we will use it to create the underlying bit vector:
[dependencies]
bit-vec = "0.6"
We will declare a new struct, StandardBloomFilter, to encapsulate the required fields. The PhantomData marker makes the compiler act as though the struct “owns” a T, without actually storing one.
extern crate bit_vec;
use bit_vec::BitVec;
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, Hash, Hasher};
use std::marker::PhantomData;
pub struct StandardBloomFilter<T: ?Sized> {
    bitmap: BitVec,              // the underlying bit array
    optimal_m: u64,              // bitmap size in bits
    optimal_k: u32,              // number of (simulated) hash functions
    hashers: [DefaultHasher; 2], // the two base hash functions
    _marker: PhantomData<T>,
}
Careful readers will think that I made a mistake in the declaration of the hashers array, because k hash functions are required but only two are declared.
Why two hash functions? Kirsch and Mitzenmacher demonstrated in their paper that using two hash functions, h1 and h2, to simulate additional hash functions of the form g_i(x) = h1(x) + i · h2(x) can be usefully applied to Bloom filters. This leads to less computation and potentially less need for randomness in practice. This formula may appear similar to the use of pairwise independent hash functions. Unfortunately, there is no formal connection between the two techniques.
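The scheme can be sketched in a few lines. The values h1 = 7, h2 = 3, and m = 13 in the usage below are made-up illustration numbers, not real hash outputs:

```rust
// Kirsch–Mitzenmacher double hashing: derive k simulated hash values
// g_i(x) = h1(x) + i * h2(x) (mod m) from just two base hashes.
fn simulated_indices(h1: u64, h2: u64, k: u32, m: u64) -> Vec<u64> {
    (0..u64::from(k))
        .map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % m)
        .collect()
}
```

For h1 = 7, h2 = 3, k = 4, and m = 13 this yields the indices 7, 10, 0, 3 - four distinct probe positions from only two base hashes. This is the same index calculation our filter will perform later.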
I mentioned earlier that the Bloom filter has two operations like a standard set: insert and query. We will implement those two operations along with a constructor-like new method.
impl<T: ?Sized> StandardBloomFilter<T> {
pub fn new(items_count: usize, fp_rate: f64) -> Self {
// ...snip
}
}
new calculates the size of the bitmap (optimal_m) and the optimal number of hash functions (optimal_k), then uses them to construct a StandardBloomFilter.
impl<T: ?Sized> StandardBloomFilter<T> {
pub fn new(items_count: usize, fp_rate: f64) -> Self {
let optimal_m = Self::bitmap_size(items_count, fp_rate);
let optimal_k = Self::optimal_k(fp_rate);
let hashers = [
RandomState::new().build_hasher(),
RandomState::new().build_hasher(),
];
StandardBloomFilter {
bitmap: BitVec::from_elem(optimal_m as usize, false),
optimal_m,
optimal_k,
hashers,
_marker: PhantomData,
}
}
// ...snip
fn bitmap_size(items_count: usize, fp_rate: f64) -> u64 {
    let ln2_2 = core::f64::consts::LN_2 * core::f64::consts::LN_2;
    ((-1.0f64 * items_count as f64 * fp_rate.ln()) / ln2_2).ceil() as u64
}
fn optimal_k(fp_rate: f64) -> u32 {
((-1.0f64 * fp_rate.ln()) / core::f64::consts::LN_2).ceil() as u32
}
// ...snip
}
Let’s run these calculations on Rust Playground.
bitmap_size: 9585059
optimal_k: 7
This looks really promising! A Bloom filter representing a set of 1,000,000 items with a 1% false-positive rate needs only about 9.6 million bits (roughly 1.2 MB) and 7 hash functions.
We managed to construct a Bloom Filter so far and it is time to implement insert
and contains
methods. Their implementations are dead simple and they share the same code to calculate indexes of the bit array.
impl<T: ?Sized> StandardBloomFilter<T> {
// ...snip
pub fn insert(&mut self, item: &T)
where
T: Hash,
{
let (h1, h2) = self.hash_kernel(item);
for k_i in 0..self.optimal_k {
let index = self.get_index(h1, h2, k_i as u64);
self.bitmap.set(index, true);
}
}
pub fn contains(&mut self, item: &T) -> bool
where
T: Hash,
{
let (h1, h2) = self.hash_kernel(item);
for k_i in 0..self.optimal_k {
let index = self.get_index(h1, h2, k_i as u64);
if !self.bitmap.get(index).unwrap() {
return false;
}
}
true
}
}
The above methods depend on two other methods that we haven’t implemented yet: hash_kernel and get_index. hash_kernel is where the actual “hashing” happens: it returns the item’s two hash values as (hash1, hash2).
impl<T: ?Sized> StandardBloomFilter<T> {
// ...snip
fn hash_kernel(&self, item: &T) -> (u64, u64)
where
T: Hash,
{
let hasher1 = &mut self.hashers[0].clone();
let hasher2 = &mut self.hashers[1].clone();
item.hash(hasher1);
item.hash(hasher2);
let hash1 = hasher1.finish();
let hash2 = hasher2.finish();
(hash1, hash2)
}
}
We could’ve taken a single hash and used the upper half as hash1 and the lower half as hash2 (this is how the Google Guava Bloom filter implementation currently works), but to keep this implementation even simpler and not rely on any additional dependencies, I decided to continue with DefaultHasher - see SipHash.
Now, it is time for the final touch. We are going to implement get_index using the double-hashing formula from above: h1 + k_i · h2 (mod m).
impl<T: ?Sized> StandardBloomFilter<T> {
// ...snip
fn get_index(&self, h1: u64, h2: u64, k_i: u64) -> usize {
(h1.wrapping_add((k_i).wrapping_mul(h2)) % self.optimal_m) as usize
}
// ...snip
}
We are finally there
We’ve just finished implementing a fast variant of a standard Bloom filter, but there is still one thing missing - we didn’t write any tests.
Let’s add two simple test cases and validate our implementation.
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn insert() {
let mut bloom = StandardBloomFilter::new(100, 0.01);
bloom.insert("item");
assert!(bloom.contains("item"));
}
#[test]
fn check_and_insert() {
let mut bloom = StandardBloomFilter::new(100, 0.01);
assert!(!bloom.contains("item_1"));
assert!(!bloom.contains("item_2"));
bloom.insert("item_1");
assert!(bloom.contains("item_1"));
}
}
❯ cargo test
Compiling plum v0.1.2 (/Users/onat.mercan/dev/distrentic/plum)
Finished test [unoptimized + debuginfo] target(s) in 0.99s
Running target/debug/deps/plum-6fc161db530d5b36
running 2 tests
test tests::insert ... ok
test tests::check_and_insert ... ok
test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
I hope you’ve enjoyed reading this post as much as I enjoyed writing it!
If you find anything wrong with the code, you can file an issue or, even better, submit a pull request.
Further Reading
- Writing a full-text search engine using Bloom filters - Stavros’ Stuff
- Bloom Filters by Example
- Serving over 1 billion documents per day with Docstore v2 - Indeed Engineering Blog
- LSMTrees+Bloom - wiredtiger/wiredtiger
- SLSM - A Scalable Log Structured Merge Tree with Bloom Filters for Low Latency Analytics - ScienceDirect
- Optimal Bloom Filters and Adaptive Merging for LSM-Trees
- Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters