Using Probabilistic Data Structures to Build Real-Time Monitoring Dashboards
As a provider of Competitive Intelligence as a Service to retailers and consumer brands, DataWeave aggregates and analyses hundreds of millions of products’ data from the Web each day. Once aggregated, this data is fed into a complex process of extraction, transformation, machine learning and analyses. These operations run continuously to provide our customers with easily consumable, actionable insights in near real-time.
However, as we keep increasing our coverage of products, there’s more stress on our already resource-hungry infrastructure. As a result, we’re always on the lookout to improve our data processing efficiency, and ensure the most judicious use of our resources.
One of the ways we achieve this is by making sure only newly acquired data points go through a large part of our data transformation and machine-learning process. This requires de-duplicating newly acquired data points against those already present in our historical datastore. By doing this, we significantly reduce complexity and cost while improving our efficiency.
Identifying newly acquired data points, though, is a challenge in itself.
For smaller datasets, consisting of about 1–2 million data points, a common way of identifying newly acquired data is to index the data in a database and then check whether an element is present. This method, however, fails at the scale we operate at (a hundred million or more data points each day), and the associated costs can overrun in the blink of an eye.
Our approach, instead, is to store only a summary of the data rather than the entire set of data points. This technique is called sketching, and Bloom Filters are one such sketch. While membership checks are reduced to O(1) time, there is a slight trade-off in accuracy.
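To make the idea concrete, here is a minimal Bloom Filter sketch in Python (standard library only; the class, sizes, hash scheme and SKU values are illustrative, not our production code) that passes only unseen data points on to the rest of the pipeline:

```python
import hashlib


class BloomFilter:
    """A minimal Bloom Filter: m bits, k hash positions derived
    from salted SHA-256 digests of the item."""

    def __init__(self, m=1 << 20, k=7):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k independent bit positions by salting the hash input.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # All k bits set => "probably present"; any bit clear => "definitely absent".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


# De-duplication: only data points the filter has not seen go downstream.
seen = BloomFilter()
batch = ["sku-123", "sku-456", "sku-123"]
new_points = []
for point in batch:
    if point not in seen:  # O(1) membership check
        seen.add(point)
        new_points.append(point)
print(new_points)  # ['sku-123', 'sku-456']
```

The summary here is just the bit array: the original data points are never stored, which is what keeps the memory footprint constant per element.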
Probabilistic Data Structures — A Deep Dive into Bloom Filters
Bloom Filters are space-efficient data structures used to determine whether a given element is present in a set, with constant time and space complexity for operations on the data. Since only a summary of the data is stored, an error rate is implicit in Bloom Filters.
(Fun fact: the Bloom Filter was conceived by Burton Howard Bloom in 1970.)
The user typically has to make two crucial design decisions when initializing a Bloom Filter:
- Cardinality: the maximum number of elements the Bloom Filter is expected to hold. A higher cardinality results in a bigger memory footprint.
- Accuracy: the ideal scenario is 100% accuracy, but that would require a space complexity of O(n). The alternative is to decide on a tolerable rate of error: the lower the accuracy, the lower the memory footprint and the faster the algorithm.
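These two choices fully determine the filter's size. A quick sketch of the standard sizing formulas (the example numbers are illustrative, not our production configuration):

```python
import math


def bloom_parameters(n, p):
    """Standard Bloom Filter sizing: for n expected elements and a
    tolerated false-positive rate p,
        m = -n * ln(p) / (ln 2)^2   bits
        k = (m / n) * ln 2          hash functions."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k


# e.g. a hundred million elements at a 1% error rate
m, k = bloom_parameters(100_000_000, 0.01)
print(m, k)  # ~958.5 million bits (about 115 MiB) and 7 hash functions
```

Note how relaxing the error rate shrinks the filter: at 5% instead of 1%, the same hundred million elements need roughly 45% fewer bits.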
Properties of Bloom Filters
- Set operations like the intersection and union of two or more Bloom Filters can be performed.
- False positives (the Bloom Filter indicates an element is present when it is not) can occur, but never false negatives (the Bloom Filter indicates an element is absent when it is actually in the set).
- Operations for element lookup can be parallelized.
- If the number of elements inserted exceeds the configured cardinality, the error rate increases.
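As an illustration of the first property, assuming two Bloom Filters built with the same size and the same hash functions, union reduces to a bitwise OR of their bit arrays, and intersection to a bitwise AND (the AND-ed filter only approximates a filter built directly from the intersection, with a somewhat higher error rate):

```python
# Bit arrays of two Bloom Filters that share the same size m and the
# same k hash functions (illustrative 8-bit values for readability).
bits_a = 0b0110_1001
bits_b = 0b0100_0011

union = bits_a | bits_b         # behaves like a filter built from A ∪ B
intersection = bits_a & bits_b  # approximates a filter for A ∩ B

print(bin(union))         # 0b1101011
print(bin(intersection))  # 0b1000001
```

This is also why lookups parallelize well: each of the k bit probes is independent, and merging per-shard filters afterwards is a single OR pass.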