This is the fascinating story of how researchers used Bloom filters cleverly to make SQLite 10x faster for analytical queries. These are my five-minute notes on the paper SQLite: Past, Present, and Future
Basically it’s just an optimization of a double nested for loop. It’s a way to avoid running the inner for loop when it is known there will be no hit.
This is useful when we for example want to find all product orders of customers in a particular country. The way we can do this is to first filter all customers by their country, and then match orders by the remaining customers. The matching step is the double for loop.
Something like this:
for order in orders:
for customer in customers_in_country:
if order.customer_id == customer.id:
…
Many orders won’t match a customer in the above query, so we want to single out these orders before we run the expensive inner for loop. The way they do it is to create a cache using a Bloom filter. I’d recommend looking it up, but it’s a probabilistic cache that’s fast and space efficient, at the cost of letting through some false positives. With this particular use case it’s ok to have some false positives. The worst thing that can happen is that the inner for loop is run more times than necessary.
The final code is something like this:
bloom_filter = create_bloom(customers_in_country)
for order in orders:
if bloom_filter.contains(order.customer_id):
for customer in customers_in_country:
if order.customer_id == customer.id:
…
Edit: this comment probably contain many inaccuracies, as I’ve never done this kind of stuff in practice, so don’t rely too much on it.
You can, of course, feel free to show us how you’d implement this in python. It’s fine to say you would do it differently, but don’t stop there, show how/what you would do differently. Add to the discussion, like the person you were replying to did, don’t detract.
It’s just example code to demonstrate the idea of the optimization explained in the article. I also based my code on the code used in the article (and made some major changes to better fit my attempt of explanation).
I haven’t read the article but I work with Bloom filters at work sometimes.
Bloom filters basically tell you “this thing might be present” or “this thing is definitely not present”.
If you’re looking for a piece of data in a set of large files, being able to say “this data is definitely not in this file” saves you a bunch of time because you can skip over the file instead of searching through the whole thing just to figure out what you’re looking for isn’t there.
Neat. Most of that went over my head but always good to see more performance out of existing tech.
Basically it’s just an optimization of a double nested for loop. It’s a way to avoid running the inner for loop when it is known there will be no hit.
This is useful when we for example want to find all product orders of customers in a particular country. The way we can do this is to first filter all customers by their country, and then match orders by the remaining customers. The matching step is the double for loop.
Something like this:
for order in orders: for customer in customers_in_country: if order.customer_id == customer.id: …
Many orders won’t match a customer in the above query, so we want to single out these orders before we run the expensive inner for loop. The way they do it is to create a cache using a Bloom filter. I’d recommend looking it up, but it’s a probabilistic cache that’s fast and space efficient, at the cost of letting through some false positives. With this particular use case it’s ok to have some false positives. The worst thing that can happen is that the inner for loop is run more times than necessary.
The final code is something like this:
bloom_filter = create_bloom(customers_in_country) for order in orders: if bloom_filter.contains(order.customer_id): for customer in customers_in_country: if order.customer_id == customer.id: …
Edit: this comment probably contain many inaccuracies, as I’ve never done this kind of stuff in practice, so don’t rely too much on it.
That’s certainly not how I would implement any of this in Python.
You can, of course, feel free to show us how you’d implement this in python. It’s fine to say you would do it differently, but don’t stop there, show how/what you would do differently. Add to the discussion, like the person you were replying to did, don’t detract.
It’s just example code to demonstrate the idea of the optimization explained in the article. I also based my code on the code used in the article (and made some major changes to better fit my attempt of explanation).
that’s what’s in the article tho
I haven’t read the article but I work with Bloom filters at work sometimes.
Bloom filters basically tell you “this thing might be present” or “this thing is definitely not present”.
If you’re looking for a piece of data in a set of large files, being able to say “this data is definitely not in this file” saves you a bunch of time because you can skip over the file instead of searching through the whole thing just to figure out what you’re looking for isn’t there.