# Bloom Filters

## Introduction
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can tell you with certainty that an element is not in the set, but can only tell you that an element is probably in the set — meaning false positives are possible, but false negatives are not. Bloom filters are fundamental building blocks in databases, caches, network routers, and distributed systems where fast, memory-efficient membership testing is critical.
## Core Concepts

### How a Bloom Filter Works
At its heart, a Bloom filter is a bit array of m bits, all initially set to 0, combined with k independent hash functions. Each hash function maps an element to one of the m positions in the bit array uniformly.
Insertion: To add an element, feed it through all k hash functions. Each hash function produces an index. Set the bits at all k indices to 1.
Lookup: To query for an element, feed it through the same k hash functions. If all bits at the k indices are 1, the element is probably in the set. If any bit is 0, the element is definitely not in the set.
Deletion: Standard Bloom filters do not support deletion because clearing a bit could affect other elements that hash to the same position. Counting Bloom filters solve this with counters instead of single bits.
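The three operations above can be sketched in a few lines of Java. This is a toy filter with m = 64 bits and k = 2 deliberately simple hash functions — illustrative only, not the tuned implementation later in this article:

```java
import java.util.BitSet;

public class ToyBloom {
    static final int M = 64;                  // m: size of the bit array
    static final BitSet bits = new BitSet(M);

    // Two cheap hash functions (k = 2); real filters use stronger ones.
    static int h1(String s) { return Math.floorMod(s.hashCode(), M); }
    static int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 7, M); }

    // Insertion: set the bits at all k indices.
    static void insert(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }

    // Lookup: all k bits set -> "probably present"; any 0 bit -> definitely absent.
    static boolean lookup(String s) {
        return bits.get(h1(s)) && bits.get(h2(s));
    }

    public static void main(String[] args) {
        insert("alice");
        System.out.println(lookup("alice")); // always true: no false negatives
        System.out.println(lookup("zed"));   // false here, but could be a false positive
    }
}
```

With only 64 bits, false positives appear quickly as elements are added — which is exactly why the sizing formulas in the next section matter.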
### False Positive Probability
The probability of a false positive after inserting n elements into a Bloom filter with m bits and k hash functions is approximately:
P(false positive) ≈ (1 - e^(-kn/m))^k
This formula drives the design decisions for choosing m and k given a desired false positive rate p and expected number of elements n:
- Optimal number of bits: m = -(n × ln(p)) / (ln(2))²
- Optimal number of hash functions: k = (m / n) × ln(2)
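Plugging in concrete numbers makes the formulas tangible. A quick sketch (helper names are mine) for n = 10,000 elements and p = 1%:

```java
public class BloomSizing {
    // m = -(n * ln(p)) / (ln(2))^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-(n * Math.log(p)) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln(2)
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 10_000;
        double p = 0.01;
        long m = optimalBits(n, p);   // 95,851 bits, roughly 11.7 KB
        int k = optimalHashes(m, n);  // 7 hash functions
        System.out.printf("m = %,d bits (%.1f KB), k = %d%n", m, m / 8.0 / 1024, k);
    }
}
```

So ten thousand keys with a 1% error rate cost under 12 KB — far less than storing the keys themselves.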
### Trade-offs: Space vs. Accuracy
| False Positive Rate | Bits per Element | Hash Functions |
|---|---|---|
| 1% | 9.6 | 7 |
| 0.1% | 14.4 | 10 |
| 0.01% | 19.2 | 13 |
| 0.001% | 24.0 | 17 |
Compared to a `HashSet`, which might use hundreds of bits per element (object overhead, pointers, load factor), a Bloom filter at a 1% FP rate uses fewer than 10 bits per element — a 10–50x reduction in memory.
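The table rows follow directly from the formulas above: bits per element is -ln(p)/(ln 2)², independent of n, and the hash count is that value times ln 2, rounded. A short check (method names are illustrative):

```java
public class BitsPerElement {
    // bits per element = -ln(p) / (ln 2)^2
    static double bitsPerElement(double p) {
        return -Math.log(p) / (Math.log(2) * Math.log(2));
    }

    // hash functions = round(bits per element * ln 2)
    static int hashFunctions(double p) {
        return (int) Math.round(bitsPerElement(p) * Math.log(2));
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.01, 0.001, 0.0001, 0.00001}) {
            // Reproduces the table: 9.6/7, 14.4/10, 19.2/13, 24.0/17
            System.out.printf("p=%g: %.1f bits/element, %d hashes%n",
                    p, bitsPerElement(p), hashFunctions(p));
        }
    }
}
```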
### Bloom Filter vs. Other Structures
## Implementation: Basic Bloom Filter in Java
The following is a complete, runnable Bloom filter implementation using MurmurHash-style double hashing to simulate k independent hash functions from just two base hashes.
```java
import java.util.BitSet;
public class BloomFilter<T> {
private final BitSet bitSet;
private final int bitSize;
private final int hashCount;
private int insertedCount;
/**
* Creates a Bloom filter with optimal parameters.
*
* @param expectedElements expected number of elements
* @param falsePositiveRate desired false positive probability (e.g., 0.01)
*/
public BloomFilter(int expectedElements, double falsePositiveRate) {
this.bitSize = optimalBitSize(expectedElements, falsePositiveRate);
this.hashCount = optimalHashCount(bitSize, expectedElements);
this.bitSet = new BitSet(bitSize);
this.insertedCount = 0;
}
/**
* m = -(n * ln(p)) / (ln(2))^2
*/
private static int optimalBitSize(int n, double p) {
return (int) Math.ceil(-(n * Math.log(p)) / (Math.log(2) * Math.log(2)));
}
/**
* k = (m / n) * ln(2)
*/
private static int optimalHashCount(int m, int n) {
return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
/**
* Double hashing technique: h_i(x) = h1(x) + i * h2(x)
* Kirsch-Mitzenmacher optimization — two hashes suffice.
*/
private int[] getHashIndices(T element) {
int hash1 = element.hashCode();
int hash2 = spread(hash1);
int[] indices = new int[hashCount];
for (int i = 0; i < hashCount; i++) {
long combinedHash = (long) hash1 + (long) i * hash2;
indices[i] = (int) ((combinedHash & Long.MAX_VALUE) % bitSize);
}
return indices;
}
/**
* Additional bit spreading to reduce clustering.
*/
private int spread(int hash) {
hash ^= (hash >>> 16);
hash *= 0x85ebca6b;
hash ^= (hash >>> 13);
hash *= 0xc2b2ae35;
hash ^= (hash >>> 16);
return hash;
}
public void put(T element) {
int[] indices = getHashIndices(element);
for (int index : indices) {
bitSet.set(index);
}
insertedCount++;
}
public boolean mightContain(T element) {
int[] indices = getHashIndices(element);
for (int index : indices) {
if (!bitSet.get(index)) {
return false; // Definitely not in the set
}
}
return true; // Probably in the set
}
public double expectedFalsePositiveRate() {
return Math.pow(
1 - Math.exp(-(double) hashCount * insertedCount / bitSize),
hashCount
);
}
public int getBitSize() { return bitSize; }
public int getHashCount() { return hashCount; }
public int getInsertedCount() { return insertedCount; }
@Override
public String toString() {
return String.format(
"BloomFilter[bits=%d, hashes=%d, elements=%d, estimatedFPR=%.6f]",
bitSize, hashCount, insertedCount, expectedFalsePositiveRate()
);
}
}
```

## Demonstrating Usage
```java
public class BloomFilterDemo {
public static void main(String[] args) {
// Create filter expecting 10,000 elements with 1% false positive rate
BloomFilter<String> filter = new BloomFilter<>(10_000, 0.01);
System.out.println("=== Bloom Filter Configuration ===");
System.out.printf("Bit array size: %,d bits (%.1f KB)%n",
filter.getBitSize(), filter.getBitSize() / 8.0 / 1024);
System.out.printf("Hash functions: %d%n", filter.getHashCount());
// Insert known elements
String[] knownUsers = {"alice", "bob", "charlie", "diana", "eve"};
for (String user : knownUsers) {
filter.put(user);
}
// Test known elements (should all return true)
System.out.println("\n=== Testing Known Elements ===");
for (String user : knownUsers) {
System.out.printf(" mightContain('%s'): %s%n",
user, filter.mightContain(user));
}
// Test unknown elements (should mostly return false)
System.out.println("\n=== Testing Unknown Elements ===");
String[] unknownUsers = {"frank", "grace", "heidi", "ivan", "judy"};
for (String user : unknownUsers) {
System.out.printf(" mightContain('%s'): %s%n",
user, filter.mightContain(user));
}
// Measure actual false positive rate with bulk test
System.out.println("\n=== False Positive Rate Experiment ===");
BloomFilter<String> bulkFilter = new BloomFilter<>(10_000, 0.01);
for (int i = 0; i < 10_000; i++) {
bulkFilter.put("element_" + i);
}
int falsePositives = 0;
int testSize = 100_000;
for (int i = 0; i < testSize; i++) {
if (bulkFilter.mightContain("nonexistent_" + i)) {
falsePositives++;
}
}
System.out.printf("Tested %,d absent elements%n", testSize);
System.out.printf("False positives: %,d%n", falsePositives);
System.out.printf("Actual FP rate: %.4f%%%n",
(double) falsePositives / testSize * 100);
System.out.printf("Expected FP rate: %.4f%%%n",
bulkFilter.expectedFalsePositiveRate() * 100);
System.out.println(bulkFilter);
}
}
```

## Implementation: Counting Bloom Filter
A Counting Bloom filter replaces each bit with a counter, enabling deletion.
```java
public class CountingBloomFilter<T> {
private final int[] counters;
private final int size;
private final int hashCount;
public CountingBloomFilter(int expectedElements, double falsePositiveRate) {
this.size = (int) Math.ceil(
-(expectedElements * Math.log(falsePositiveRate))
/ (Math.log(2) * Math.log(2))
);
this.hashCount = Math.max(1,
(int) Math.round((double) size / expectedElements * Math.log(2))
);
this.counters = new int[size];
}
private int[] getIndices(T element) {
int hash1 = element.hashCode();
int hash2 = (hash1 ^ (hash1 >>> 16)) * 0x85ebca6b; // parentheses matter: fold high bits first, then multiply
int[] indices = new int[hashCount];
for (int i = 0; i < hashCount; i++) {
long combined = (long) hash1 + (long) i * hash2;
indices[i] = (int) ((combined & Long.MAX_VALUE) % size);
}
return indices;
}
public void put(T element) {
for (int index : getIndices(element)) {
counters[index]++;
}
}
public void remove(T element) {
int[] indices = getIndices(element);
// Verify element might exist before decrementing
for (int index : indices) {
if (counters[index] <= 0) {
throw new IllegalArgumentException(
"Element was never added: " + element);
}
}
for (int index : indices) {
counters[index]--;
}
}
public boolean mightContain(T element) {
for (int index : getIndices(element)) {
if (counters[index] <= 0) {
return false;
}
}
return true;
}
public static void main(String[] args) {
CountingBloomFilter<String> filter =
new CountingBloomFilter<>(1000, 0.01);
filter.put("session_abc123");
filter.put("session_def456");
System.out.println(filter.mightContain("session_abc123")); // true
System.out.println(filter.mightContain("session_def456")); // true
System.out.println(filter.mightContain("session_ghi789")); // false
// Remove a session
filter.remove("session_abc123");
System.out.println(filter.mightContain("session_abc123")); // false
System.out.println(filter.mightContain("session_def456")); // still true
}
}
```

## Real-World Use Cases
### Database Query Optimization
Bloom filters are used extensively in databases like Apache Cassandra and HBase to avoid unnecessary disk reads. Before performing an expensive disk I/O to check if a key exists in an SSTable, the system checks a Bloom filter in memory.
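A stripped-down sketch of that read path — `SSTableStub`, `diskReads`, and the single-hash "filter" are stand-ins for illustration; real systems keep a properly sized filter per SSTable:

```java
import java.util.HashSet;
import java.util.Set;

public class SSTableStub {
    private final Set<Integer> bloomBits = new HashSet<>(); // stands in for the bit array
    private final Set<String> onDisk = new HashSet<>();     // stands in for the SSTable
    int diskReads = 0;                                      // counts simulated disk I/O

    void write(String key) {
        onDisk.add(key);
        bloomBits.add(key.hashCode() & 1023); // k = 1 hash, 1024 "bits", for brevity
    }

    String read(String key) {
        // Memory-only check first: a filter miss proves the key is absent, no I/O.
        if (!bloomBits.contains(key.hashCode() & 1023)) {
            return null;
        }
        diskReads++; // possible hit: pay for the disk seek and verify
        return onDisk.contains(key) ? "value:" + key : null;
    }

    public static void main(String[] args) {
        SSTableStub table = new SSTableStub();
        table.write("user:42");
        table.read("user:42");     // filter passes, one disk read
        table.read("no-such-key"); // usually rejected by the filter alone
        System.out.println("disk reads: " + table.diskReads);
    }
}
```

The payoff is that lookups for absent keys — the common case when many SSTables exist — rarely touch disk at all.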
### Web Crawler URL Deduplication
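Crawlers must avoid re-fetching the billions of URLs they have already seen, and an exact set would not fit in memory. A minimal frontier sketch (class name and constants are illustrative; a false positive merely means a URL gets skipped, which crawlers can tolerate):

```java
import java.util.BitSet;

public class CrawlFrontier {
    private static final int M = 1 << 20;  // ~1M bits, illustrative size
    private static final int K = 3;        // hash functions
    private final BitSet seen = new BitSet(M);

    // Cheap double-hash variant to derive K indices from one hash code.
    private int idx(String url, int i) {
        int h = url.hashCode() + i * 0x9e3779b9;
        h ^= (h >>> 16);
        return Math.floorMod(h, M);
    }

    /** Marks the URL as seen; returns true only the first time it appears. */
    public boolean shouldCrawl(String url) {
        boolean allSet = true;
        for (int i = 0; i < K; i++) {
            if (!seen.get(idx(url, i))) allSet = false;
        }
        for (int i = 0; i < K; i++) seen.set(idx(url, i));
        return !allSet; // allSet => probably crawled before, skip it
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier();
        System.out.println(frontier.shouldCrawl("https://example.com/a")); // true
        System.out.println(frontier.shouldCrawl("https://example.com/a")); // false
    }
}
```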
### Cache Penetration Protection
A Bloom filter sits in front of a cache and database to prevent queries for keys that definitely don't exist.
```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
public class CacheWithBloomFilter {
private final BloomFilter<String> bloomFilter;
private final Map<String, String> cache;
public CacheWithBloomFilter(int expectedKeys) {
this.bloomFilter = new BloomFilter<>(expectedKeys, 0.001);
this.cache = new ConcurrentHashMap<>();
}
/**
* Registers a key as a valid key in the system.
*/
public void registerValidKey(String key) {
bloomFilter.put(key);
}
public String get(String key) {
// Step 1: Check Bloom filter
if (!bloomFilter.mightContain(key)) {
System.out.println("[BLOOM] Key '" + key
+ "' definitely doesn't exist — skipping DB");
return null;
}
// Step 2: Check cache
String cached = cache.get(key);
if (cached != null) {
System.out.println("[CACHE HIT] " + key);
return cached;
}
// Step 3: Query database (simulated)
System.out.println("[DB QUERY] " + key);
String value = queryDatabase(key);
if (value != null) {
cache.put(key, value);
}
return value;
}
private String queryDatabase(String key) {
// Simulated database lookup
if (key.startsWith("user_")) {
return "{name: '" + key + "', status: 'active'}";
}
return null;
}
public static void main(String[] args) {
CacheWithBloomFilter service = new CacheWithBloomFilter(100_000);
// Register valid keys
service.registerValidKey("user_1001");
service.registerValidKey("user_1002");
service.registerValidKey("user_1003");
// Valid queries
System.out.println(service.get("user_1001"));
System.out.println(service.get("user_1001")); // cache hit
// Attack: querying non-existent keys
System.out.println(service.get("nonexistent_key_999"));
System.out.println(service.get("random_attack_key"));
}
}
```

## Architecture: Bloom Filter in a Distributed System
## Bloom Filter Variants
### Scalable Bloom Filter
A scalable Bloom filter chains multiple standard Bloom filters together, each with a tighter false positive rate, to accommodate an unknown number of elements.
```java
import java.util.ArrayList;
import java.util.List;
public class ScalableBloomFilter<T> {
private final List<BloomFilter<T>> filters;
private final int initialCapacity;
private final double falsePositiveRate;
private final double tighteningRatio;
private int currentCapacity;
public ScalableBloomFilter(int initialCapacity, double falsePositiveRate) {
this.initialCapacity = initialCapacity;
this.falsePositiveRate = falsePositiveRate;
this.tighteningRatio = 0.5; // Each new filter has half the FPR
this.currentCapacity = initialCapacity;
this.filters = new ArrayList<>();
addNewFilter();
}
private void addNewFilter() {
double currentFPR = falsePositiveRate
* Math.pow(tighteningRatio, filters.size());
filters.add(new BloomFilter<>(currentCapacity, currentFPR));
}
public void put(T element) {
BloomFilter<T> current = filters.get(filters.size() - 1);
if (current.getInsertedCount() >= currentCapacity) {
currentCapacity *= 2; // Double capacity for next filter
addNewFilter();
current = filters.get(filters.size() - 1);
}
current.put(element);
}
public boolean mightContain(T element) {
for (BloomFilter<T> filter : filters) {
if (filter.mightContain(element)) {
return true;
}
}
return false;
}
public static void main(String[] args) {
ScalableBloomFilter<String> filter =
new ScalableBloomFilter<>(100, 0.01);
// Add many elements beyond initial capacity
for (int i = 0; i < 10_000; i++) {
filter.put("item_" + i);
}
// Verify membership
System.out.println(filter.mightContain("item_0")); // true
System.out.println(filter.mightContain("item_9999")); // true
System.out.println(filter.mightContain("missing_key")); // false (usually)
}
}
```

## Memory Comparison Visualization
## Lookup Decision Flow
## Best Practices
- **Choose the right false positive rate:** Start with 1% for general use; use 0.1% for security-sensitive applications where false positives have real cost.
- **Size based on maximum expected elements:** If you underestimate `n`, the false positive rate will degrade exponentially. Prefer overestimating by 20–50%.
- **Use double hashing (Kirsch-Mitzenmacher):** You only need two hash functions to simulate `k` independent hashes — `h_i(x) = h1(x) + i × h2(x)` — which simplifies implementation without sacrificing quality.
- **Prefer scalable Bloom filters for unknown cardinalities:** When you cannot predict the number of elements upfront, use a scalable variant that chains filters rather than risk degraded performance.
- **Never rely on Bloom filters for correctness:** Always treat a positive result as "maybe" and verify against an authoritative data source when the consequence of a false positive is significant.
- **Use counting Bloom filters only when deletion is required:** They consume 3–4x more memory due to counters. If you don't need deletion, stick with the standard variant.
- **Monitor the fill ratio:** Track `bitSet.cardinality() / bitSize`. When this exceeds 50%, false positive rates increase rapidly. Consider rebuilding or switching to a larger filter.
- **Leverage existing libraries in production:** Google Guava's `BloomFilter` class is production-tested and handles hashing, serialization, and union operations. Avoid rolling your own in production systems.
- **Serialize filters for persistence:** Bloom filters are just bit arrays — they serialize trivially. Store them alongside SSTables, indexes, or ship them across the network for distributed membership checks.
- **Combine with write-ahead strategies:** Populate the Bloom filter during write operations. Never reconstruct it from scratch at query time — that defeats its purpose.
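The fill-ratio check above is a one-liner against `BitSet` — a sketch, where the class name is mine and the 50% threshold is the rule of thumb stated in the list:

```java
import java.util.BitSet;

public class FillRatioMonitor {
    /** Fraction of bits set to 1; past ~0.5 the false positive rate climbs sharply. */
    static double fillRatio(BitSet bits, int bitSize) {
        return (double) bits.cardinality() / bitSize;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(1024);
        for (int i = 0; i < 512; i++) bits.set(i); // simulate a half-full filter
        double ratio = fillRatio(bits, 1024);
        System.out.printf("fill ratio = %.2f%n", ratio);
        if (ratio >= 0.5) {
            System.out.println("warning: consider rebuilding with a larger filter");
        }
    }
}
```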
## Related Concepts
- REST HTTP Verbs and Status Codes — Bloom filters help optimize API response times by eliminating unnecessary backend lookups.
- Eventual Consistency — In eventually consistent systems, Bloom filters help detect missing data during anti-entropy synchronization between replicas.
- High-Performance Streaming Operations — Bloom filters are used in stream processing to deduplicate events in real-time pipelines.