Bloom Filters

Introduction

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can tell you with certainty that an element is not in the set, but can only tell you that an element is probably in the set — meaning false positives are possible, but false negatives are not. Bloom filters are fundamental building blocks in databases, caches, network routers, and distributed systems where fast, memory-efficient membership testing is critical.

Core Concepts

How a Bloom Filter Works

At its heart, a Bloom filter is a bit array of m bits, all initially set to 0, combined with k independent hash functions. Each hash function maps an element to one of the m positions in the bit array uniformly.

Insertion: To add an element, feed it through all k hash functions. Each hash function produces an index. Set the bits at all k indices to 1.

Lookup: To query for an element, feed it through the same k hash functions. If all bits at the k indices are 1, the element is probably in the set. If any bit is 0, the element is definitely not in the set.

Deletion: Standard Bloom filters do not support deletion because clearing a bit could affect other elements that hash to the same position. Counting Bloom filters solve this with counters instead of single bits.
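
The three operations above can be sketched on a toy filter — a minimal illustration with m = 64 bits and k = 2 deliberately weak hash functions, not a production design:

```java
import java.util.BitSet;

public class MiniBloomSketch {
    static final int M = 64;                  // m: size of the bit array
    static final BitSet bits = new BitSet(M);

    // k = 2 toy hash functions; real filters use stronger, independent hashes.
    static int h1(String s) { return Math.floorMod(s.hashCode(), M); }
    static int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, M); }

    // Insertion: set the bit at every hash index.
    static void insert(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }

    // Lookup: "probably present" only if every indexed bit is 1;
    // a single 0 bit means "definitely absent".
    static boolean lookup(String s) {
        return bits.get(h1(s)) && bits.get(h2(s));
    }

    public static void main(String[] args) {
        insert("alice");
        System.out.println(lookup("alice"));          // true: no false negatives
        System.out.println(lookup("never-inserted")); // usually false; true only on a collision
    }
}
```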

False Positive Probability

The probability of a false positive after inserting n elements into a Bloom filter with m bits and k hash functions is approximately:

P(false positive) ≈ (1 - e^(-kn/m))^k

This formula drives the design decisions for choosing m and k given a desired false positive rate p and expected number of elements n:

  • Optimal number of bits: m = -(n × ln(p)) / (ln(2))²
  • Optimal number of hash functions: k = (m / n) × ln(2)
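
Plugging these formulas in for, say, n = 10,000 and p = 0.01 (values chosen here purely for illustration):

```java
public class BloomSizing {
    public static void main(String[] args) {
        int n = 10_000;      // expected number of elements
        double p = 0.01;     // desired false positive rate

        // m = -(n * ln p) / (ln 2)^2
        int m = (int) Math.ceil(-(n * Math.log(p)) / (Math.log(2) * Math.log(2)));
        // k = (m / n) * ln 2
        long k = Math.round((double) m / n * Math.log(2));

        System.out.println("m = " + m + " bits");           // 95851 bits, about 11.7 KB
        System.out.println("k = " + k + " hash functions"); // 7
    }
}
```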

Trade-offs: Space vs. Accuracy

False Positive Rate    Bits per Element    Hash Functions
1%                     9.6                 7
0.1%                   14.4                10
0.01%                  19.2                13
0.001%                 24.0                17

Compared to a HashSet, which can consume hundreds of bits per element (object headers, pointers, load-factor slack), a Bloom filter at a 1% FP rate uses fewer than 10 bits per element, a 10–50x reduction in memory.
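
The bits-per-element column above follows directly from the formulas, since m/n = -ln(p) / (ln 2)² does not depend on n:

```java
public class BitsPerElement {
    public static void main(String[] args) {
        double[] rates = {0.01, 0.001, 0.0001, 0.00001};
        for (double p : rates) {
            // bits per element = -ln(p) / (ln 2)^2
            double bitsPerElement = -Math.log(p) / (Math.log(2) * Math.log(2));
            // k = (bits per element) * ln 2
            long k = Math.round(bitsPerElement * Math.log(2));
            // Prints 9.6/7, 14.4/10, 19.2/13, 24.0/17 for the four rates
            System.out.printf("p=%.5f  %.1f bits/element  k=%d%n",
                p, bitsPerElement, k);
        }
    }
}
```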

Implementation: Basic Bloom Filter in Java

The following is a complete, runnable Bloom filter implementation. It uses double hashing with a MurmurHash3-style bit mixer to simulate k independent hash functions from two base hashes (the second derived by mixing the first).

java
import java.util.BitSet;

public class BloomFilter<T> {

    private final BitSet bitSet;
    private final int bitSize;
    private final int hashCount;
    private int insertedCount;

    /**
     * Creates a Bloom filter with optimal parameters.
     *
     * @param expectedElements expected number of elements
     * @param falsePositiveRate desired false positive probability (e.g., 0.01)
     */
    public BloomFilter(int expectedElements, double falsePositiveRate) {
        this.bitSize = optimalBitSize(expectedElements, falsePositiveRate);
        this.hashCount = optimalHashCount(bitSize, expectedElements);
        this.bitSet = new BitSet(bitSize);
        this.insertedCount = 0;
    }

    /**
     * m = -(n * ln(p)) / (ln(2))^2
     */
    private static int optimalBitSize(int n, double p) {
        return (int) Math.ceil(-(n * Math.log(p)) / (Math.log(2) * Math.log(2)));
    }

    /**
     * k = (m / n) * ln(2)
     */
    private static int optimalHashCount(int m, int n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /**
     * Double hashing technique: h_i(x) = h1(x) + i * h2(x)
     * Kirsch-Mitzenmacher optimization — two hashes suffice.
     */
    private int[] getHashIndices(T element) {
        int hash1 = element.hashCode();
        int hash2 = spread(hash1);

        int[] indices = new int[hashCount];
        for (int i = 0; i < hashCount; i++) {
            long combinedHash = (long) hash1 + (long) i * hash2;
            indices[i] = (int) ((combinedHash & Long.MAX_VALUE) % bitSize);
        }
        return indices;
    }

    /**
     * Additional bit spreading to reduce clustering.
     */
    private int spread(int hash) {
        hash ^= (hash >>> 16);
        hash *= 0x85ebca6b;
        hash ^= (hash >>> 13);
        hash *= 0xc2b2ae35;
        hash ^= (hash >>> 16);
        return hash;
    }

    public void put(T element) {
        int[] indices = getHashIndices(element);
        for (int index : indices) {
            bitSet.set(index);
        }
        insertedCount++;
    }

    public boolean mightContain(T element) {
        int[] indices = getHashIndices(element);
        for (int index : indices) {
            if (!bitSet.get(index)) {
                return false;  // Definitely not in the set
            }
        }
        return true;  // Probably in the set
    }

    public double expectedFalsePositiveRate() {
        return Math.pow(
            1 - Math.exp(-(double) hashCount * insertedCount / bitSize),
            hashCount
        );
    }

    public int getBitSize() { return bitSize; }
    public int getHashCount() { return hashCount; }
    public int getInsertedCount() { return insertedCount; }

    @Override
    public String toString() {
        return String.format(
            "BloomFilter[bits=%d, hashes=%d, elements=%d, estimatedFPR=%.6f]",
            bitSize, hashCount, insertedCount, expectedFalsePositiveRate()
        );
    }
}

Demonstrating Usage

java
public class BloomFilterDemo {

    public static void main(String[] args) {
        // Create filter expecting 10,000 elements with 1% false positive rate
        BloomFilter<String> filter = new BloomFilter<>(10_000, 0.01);

        System.out.println("=== Bloom Filter Configuration ===");
        System.out.printf("Bit array size: %,d bits (%.1f KB)%n",
            filter.getBitSize(), filter.getBitSize() / 8.0 / 1024);
        System.out.printf("Hash functions: %d%n", filter.getHashCount());

        // Insert known elements
        String[] knownUsers = {"alice", "bob", "charlie", "diana", "eve"};
        for (String user : knownUsers) {
            filter.put(user);
        }

        // Test known elements (should all return true)
        System.out.println("\n=== Testing Known Elements ===");
        for (String user : knownUsers) {
            System.out.printf("  mightContain('%s'): %s%n",
                user, filter.mightContain(user));
        }

        // Test unknown elements (should mostly return false)
        System.out.println("\n=== Testing Unknown Elements ===");
        String[] unknownUsers = {"frank", "grace", "heidi", "ivan", "judy"};
        for (String user : unknownUsers) {
            System.out.printf("  mightContain('%s'): %s%n",
                user, filter.mightContain(user));
        }

        // Measure actual false positive rate with bulk test
        System.out.println("\n=== False Positive Rate Experiment ===");
        BloomFilter<String> bulkFilter = new BloomFilter<>(10_000, 0.01);
        for (int i = 0; i < 10_000; i++) {
            bulkFilter.put("element_" + i);
        }

        int falsePositives = 0;
        int testSize = 100_000;
        for (int i = 0; i < testSize; i++) {
            if (bulkFilter.mightContain("nonexistent_" + i)) {
                falsePositives++;
            }
        }

        System.out.printf("Tested %,d absent elements%n", testSize);
        System.out.printf("False positives: %,d%n", falsePositives);
        System.out.printf("Actual FP rate: %.4f%%%n",
            (double) falsePositives / testSize * 100);
        System.out.printf("Expected FP rate: %.4f%%%n",
            bulkFilter.expectedFalsePositiveRate() * 100);
        System.out.println(bulkFilter);
    }
}

Implementation: Counting Bloom Filter

A Counting Bloom filter replaces each bit with a counter, enabling deletion.

java
public class CountingBloomFilter<T> {

    private final int[] counters;
    private final int size;
    private final int hashCount;

    public CountingBloomFilter(int expectedElements, double falsePositiveRate) {
        this.size = (int) Math.ceil(
            -(expectedElements * Math.log(falsePositiveRate))
            / (Math.log(2) * Math.log(2))
        );
        this.hashCount = Math.max(1,
            (int) Math.round((double) size / expectedElements * Math.log(2))
        );
        this.counters = new int[size];
    }

    private int[] getIndices(T element) {
        int hash1 = element.hashCode();
        int hash2 = (hash1 ^ (hash1 >>> 16)) * 0x85ebca6b;  // parenthesized: xor-shift first, then multiply
        int[] indices = new int[hashCount];
        for (int i = 0; i < hashCount; i++) {
            long combined = (long) hash1 + (long) i * hash2;
            indices[i] = (int) ((combined & Long.MAX_VALUE) % size);
        }
        return indices;
    }

    public void put(T element) {
        for (int index : getIndices(element)) {
            counters[index]++;
        }
    }

    public void remove(T element) {
        int[] indices = getIndices(element);
        // Verify element might exist before decrementing
        for (int index : indices) {
            if (counters[index] <= 0) {
                throw new IllegalArgumentException(
                    "Element was never added: " + element);
            }
        }
        for (int index : indices) {
            counters[index]--;
        }
    }

    public boolean mightContain(T element) {
        for (int index : getIndices(element)) {
            if (counters[index] <= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CountingBloomFilter<String> filter =
            new CountingBloomFilter<>(1000, 0.01);

        filter.put("session_abc123");
        filter.put("session_def456");

        System.out.println(filter.mightContain("session_abc123")); // true
        System.out.println(filter.mightContain("session_def456")); // true
        System.out.println(filter.mightContain("session_ghi789")); // false

        // Remove a session
        filter.remove("session_abc123");
        System.out.println(filter.mightContain("session_abc123")); // false, barring counter collisions
        System.out.println(filter.mightContain("session_def456")); // still true
    }
}

Real-World Use Cases

Database Query Optimization

Bloom filters are used extensively in databases like Apache Cassandra and HBase to avoid unnecessary disk reads. Before performing an expensive disk I/O to check if a key exists in an SSTable, the system checks a Bloom filter in memory.
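
A hedged sketch of that read path, with a HashMap standing in for the on-disk SSTable and a crude two-hash filter in front of it — none of the names below are Cassandra or HBase API:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class SSTableSketch {
    static final int M = 1 << 16;                     // filter size in bits
    final BitSet bloom = new BitSet(M);
    final Map<String, String> disk = new HashMap<>(); // stands in for the on-disk table
    int diskReads = 0;

    // Two illustrative hash positions per key.
    int[] positions(String key) {
        int h = key.hashCode();
        return new int[] { Math.floorMod(h, M), Math.floorMod(h * 31 + 7, M) };
    }

    void put(String key, String value) {
        for (int i : positions(key)) bloom.set(i);    // update in-memory filter on write
        disk.put(key, value);
    }

    String get(String key) {
        for (int i : positions(key)) {
            if (!bloom.get(i)) return null;           // definite miss: skip the disk entirely
        }
        diskReads++;                                  // filter passed: pay for the disk read
        return disk.get(key);
    }

    public static void main(String[] args) {
        SSTableSketch table = new SSTableSketch();
        table.put("row42", "payload");
        System.out.println(table.get("row42"));       // "payload", costs one disk read
        System.out.println(table.get("absent-row"));  // null, and almost always zero disk reads
        System.out.println("disk reads: " + table.diskReads);
    }
}
```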

Web Crawler URL Deduplication

A crawler must remember every URL it has already fetched so it does not fetch it again. Storing billions of URLs exactly is memory-prohibitive; a Bloom filter of visited URLs answers "seen before?" in a few bits per URL, and the rare false positive merely causes one never-visited page to be skipped.

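A sketch of this idea, assuming a crawler frontier that consults the filter before enqueueing each URL — class and method names here are illustrative, not from any crawler framework:

```java
import java.util.BitSet;

public class CrawlerDedupSketch {
    private static final int M = 1 << 20;   // ~1 Mbit, roughly 128 KB
    private final BitSet seen = new BitSet(M);

    // Two illustrative hash positions per URL; a real deployment would size M
    // and the hash count from the formulas earlier in the article.
    private int[] positions(String url) {
        int h = url.hashCode();
        return new int[] { Math.floorMod(h, M), Math.floorMod(h * 31 + 7, M) };
    }

    /** Marks the URL as seen; returns true only if it looked new. */
    public boolean markIfNew(String url) {
        boolean allSet = true;
        for (int i : positions(url)) {
            if (!seen.get(i)) allSet = false;
            seen.set(i);
        }
        return !allSet;  // a false positive here just skips one never-visited URL
    }

    public static void main(String[] args) {
        CrawlerDedupSketch dedup = new CrawlerDedupSketch();
        System.out.println(dedup.markIfNew("https://example.com/a")); // true: new URL
        System.out.println(dedup.markIfNew("https://example.com/a")); // false: already seen
    }
}
```
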
Cache Penetration Protection

A Bloom filter sits in front of a cache and database to prevent queries for keys that definitely don't exist.

java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheWithBloomFilter {

    private final BloomFilter<String> bloomFilter;
    private final Map<String, String> cache;

    public CacheWithBloomFilter(int expectedKeys) {
        this.bloomFilter = new BloomFilter<>(expectedKeys, 0.001);
        this.cache = new ConcurrentHashMap<>();
    }

    /**
     * Registers a key as a valid key in the system.
     */
    public void registerValidKey(String key) {
        bloomFilter.put(key);
    }

    public String get(String key) {
        // Step 1: Check Bloom filter
        if (!bloomFilter.mightContain(key)) {
            System.out.println("[BLOOM] Key '" + key
                + "' definitely doesn't exist — skipping DB");
            return null;
        }

        // Step 2: Check cache
        String cached = cache.get(key);
        if (cached != null) {
            System.out.println("[CACHE HIT] " + key);
            return cached;
        }

        // Step 3: Query database (simulated)
        System.out.println("[DB QUERY] " + key);
        String value = queryDatabase(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    private String queryDatabase(String key) {
        // Simulated database lookup
        if (key.startsWith("user_")) {
            return "{name: '" + key + "', status: 'active'}";
        }
        return null;
    }

    public static void main(String[] args) {
        CacheWithBloomFilter service = new CacheWithBloomFilter(100_000);

        // Register valid keys
        service.registerValidKey("user_1001");
        service.registerValidKey("user_1002");
        service.registerValidKey("user_1003");

        // Valid queries
        System.out.println(service.get("user_1001"));
        System.out.println(service.get("user_1001")); // cache hit

        // Attack: querying non-existent keys
        System.out.println(service.get("nonexistent_key_999"));
        System.out.println(service.get("random_attack_key"));
    }
}

Bloom Filter Variants

Scalable Bloom Filter

A scalable Bloom filter chains multiple standard Bloom filters together, each with a tighter false positive rate, to accommodate an unknown number of elements.

java
import java.util.ArrayList;
import java.util.List;

public class ScalableBloomFilter<T> {

    private final List<BloomFilter<T>> filters;
    private final int initialCapacity;
    private final double falsePositiveRate;
    private final double tighteningRatio;
    private int currentCapacity;

    public ScalableBloomFilter(int initialCapacity, double falsePositiveRate) {
        this.initialCapacity = initialCapacity;
        this.falsePositiveRate = falsePositiveRate;
        this.tighteningRatio = 0.5; // Each new filter has half the FPR
        this.currentCapacity = initialCapacity;
        this.filters = new ArrayList<>();
        addNewFilter();
    }

    private void addNewFilter() {
        double currentFPR = falsePositiveRate
            * Math.pow(tighteningRatio, filters.size());
        filters.add(new BloomFilter<>(currentCapacity, currentFPR));
    }

    public void put(T element) {
        BloomFilter<T> current = filters.get(filters.size() - 1);
        if (current.getInsertedCount() >= currentCapacity) {
            currentCapacity *= 2; // Double capacity for next filter
            addNewFilter();
            current = filters.get(filters.size() - 1);
        }
        current.put(element);
    }

    public boolean mightContain(T element) {
        for (BloomFilter<T> filter : filters) {
            if (filter.mightContain(element)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ScalableBloomFilter<String> filter =
            new ScalableBloomFilter<>(100, 0.01);

        // Add many elements beyond initial capacity
        for (int i = 0; i < 10_000; i++) {
            filter.put("item_" + i);
        }

        // Verify membership
        System.out.println(filter.mightContain("item_0"));      // true
        System.out.println(filter.mightContain("item_9999"));   // true
        System.out.println(filter.mightContain("missing_key")); // false (usually)
    }
}

Best Practices

  1. Choose the right false positive rate: Start with 1% for general use; use 0.1% for security-sensitive applications where false positives have real cost.

  2. Size based on maximum expected elements: If you underestimate n, the false positive rate climbs rapidly as the filter saturates. Prefer overestimating by 20–50%.

  3. Use double hashing (Kirsch-Mitzenmacher): You only need two hash functions to simulate k independent hashes — h_i(x) = h1(x) + i × h2(x) — which simplifies implementation without sacrificing quality.

  4. Prefer scalable Bloom filters for unknown cardinalities: When you cannot predict the number of elements upfront, use a scalable variant that chains filters rather than risk degraded performance.

  5. Never rely on Bloom filters for correctness: Always treat a positive result as "maybe" and verify against an authoritative data source when the consequence of a false positive is significant.

  6. Use counting Bloom filters only when deletion is required: They consume 3–4x more memory due to counters. If you don't need deletion, stick with the standard variant.

  7. Monitor the fill ratio: Track bitSet.cardinality() / bitSize. When this exceeds 50%, false positive rates increase rapidly. Consider rebuilding or switching to a larger filter.

  8. Leverage existing libraries in production: Google Guava's BloomFilter class is production-tested and handles hashing, serialization, and union operations. Avoid rolling your own in production systems.

  9. Serialize filters for persistence: Bloom filters are just bit arrays — they serialize trivially. Store them alongside SSTables, indexes, or ship them across the network for distributed membership checks.

  10. Combine with write-ahead strategies: Populate the Bloom filter during write operations. Never reconstruct it from scratch at query time — that defeats its purpose.
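
As a sketch of the fill-ratio monitoring suggested above, combined with the standard set-bit cardinality estimate n* ≈ -(m/k) × ln(1 - X/m), where X is the number of set bits. The helper names here are assumptions, and the article's BloomFilter class would need a getter exposing its BitSet for this to plug in:

```java
import java.util.BitSet;
import java.util.Random;

public class FilterHealth {

    // Fill ratio: fraction of bits set; past ~50% the FP rate climbs quickly.
    public static double fillRatio(BitSet bits, int m) {
        return (double) bits.cardinality() / m;
    }

    // Estimate of distinct elements inserted: n* = -(m / k) * ln(1 - X / m),
    // derived from the observed set-bit count X.
    public static double estimatedElements(BitSet bits, int m, int k) {
        double x = bits.cardinality();
        return -((double) m / k) * Math.log(1.0 - x / m);
    }

    public static void main(String[] args) {
        int m = 1_000, k = 3;
        BitSet bits = new BitSet(m);
        Random rnd = new Random(42);
        for (int i = 0; i < 300; i++) bits.set(rnd.nextInt(m)); // simulate insertions

        double fill = fillRatio(bits, m);
        System.out.printf("fill ratio: %.2f, estimated elements: %.0f%n",
            fill, estimatedElements(bits, m, k));
        if (fill > 0.5) {
            System.out.println("WARN: filter past 50% full; consider rebuilding larger");
        }
    }
}
```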