Scaling Elasticsearch Percolation for Government Data

Jake Sager
Published in FiscalNoteworthy
Oct 22, 2021 · 6 min read

At FiscalNote, we help connect people and organizations to their governments. We do this by collecting public information about the entire policy-making process, from grassroots advocacy campaigns to state legislation to the federal regulatory rulemaking process. We have teams of engineers building scrapers responsible for the ingestion of terabytes of unstructured government data in real-time.

Our team — the platform team, responsible for our backend Python services — then makes all these bills, regulations, and documents highly searchable and alert-capable. Building a search and alerting system on top of messy government data turns out to be a pretty big challenge, even with powerful tools like Elasticsearch at our disposal.

In this post, I’ll walk through how we use Elasticsearch’s percolate functionality to give our clients the most up-to-date and targeted alerts possible.

Government data can be tricky

Common Data Models

Government data is tricky, especially in the U.S. If you don’t believe me, go check out your local regime’s website. Governments simply aren’t in the business of slick websites with clean, normalized data; we have to create common data models on top of this diverse legislation. Furthermore, U.S. and international legislatures follow different processes that we need to account for in our collection and processing efforts.

Scale

FiscalNote ingests data from governments all over the world — not just the U.S. — and so we need to bake foreign language support into our architecture. This is non-trivial, since our clients want to work in the language they’re most fluent in. We might have a Canadian client who wants to write their alerts in French, but use them to discover new U.S. bills, written in English, about nuclear energy.

We also ingest a lot of data, across thousands of jurisdictions. On any given day, we ingest tens of thousands of updates to bills across 30 countries.

Alerting at Scale

Clients also have a ton of alerts (over 100k and growing), since they generally optimize for detection and can accept having some false positives (so long as they don’t miss the one bill that will have existential implications for their field).

Our engineering team quickly realized that we needed a way to cross-check all those alerts against every new thing we ingest.

Enter Elasticsearch.

What is Percolation?

A brief primer on Elasticsearch

It turns out that building fast and accurate searches at scale is hard. So hard that unless your core product is a search engine, you’re almost certainly better off using something someone else has already built to power your search engine. We use Elasticsearch (ES), a popular search engine built on top of the even more popular, open-source Lucene library.

At its core, Elasticsearch lets you pretty easily index documents and search over them with the Lucene query language. It works well for us:

  • Our datastore is constantly growing in total size, which is easily addressable in ES by simply adding more hardware (data nodes)
  • Our datastore also grows more horizontally, as we add more varied datasets. We can easily create new indices (think database tables) in the same cluster for data that might need its own store.
  • ES supports analysis plugins across a ton of different languages (which is important because operations like stemming are fundamental to search engines, but vary based on language)
  • ES provides a user-friendly JSON API on top of the more complex Lucene layer, so it’s just easier to use overall
  • The underlying Lucene library lets us index text in a smart way, supporting more advanced search logic such as fuzzy matching, word proximity searches, and custom search result scoring (there’s a quick sketch of basic indexing and searching just below)
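To make that concrete, here’s a minimal sketch of indexing a bill and searching it back, assuming the official elasticsearch Python client (8.x-style API); the bills index, fields, and values are illustrative, not our actual schema:

```python
from elasticsearch import Elasticsearch

# Hypothetical local cluster and index; names are illustrative only.
es = Elasticsearch("http://localhost:9200")

# Index a document (think: one row in a "bills" table).
es.index(
    index="bills",
    id="us-hr-1234",
    document={
        "title": "Nuclear Energy Modernization Act",
        "body": "A bill to modernize licensing of advanced reactors...",
        "jurisdiction": "US",
    },
    refresh=True,  # only so the example is immediately searchable
)

# Search it back with a simple full-text query.
resp = es.search(
    index="bills",
    query={"match": {"body": "nuclear reactors"}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```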

Alerting

What about alerting? Enter Elasticsearch’s concept of percolation. There’s a common search architecture pattern: Build a query, use that query to search over some indexed data, and return a list of “hits” that match the query.

There’s a common alerting pattern, too: alerts are just queries, so you could run a list of queries on a schedule against your search index, treat any matches as valid alerts, and send them off to the user. There are two main problems here. First, this design makes it difficult to keep track of what you’ve already been alerted about, since the schedule hits the same data over and over. Second, it scales poorly when you have growing alerting requirements like adding text highlights or trying to alert in near real time. Running a few thousand searches with full highlights every few minutes can warm up your cluster fast.

Percolation flips the order. Instead of taking one query and hitting a giant datastore of documents, percolation takes one document and cross-references it against a store of queries. Much like water percolates through a filter filled with coffee, documents percolate through an index filled with queries. Instead of returning a list of documents that match the query, percolation returns a set of queries that match the document. It’s flipped!
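In ES terms, the store of queries is just another index: its mapping has a field of type percolator alongside the document fields the stored queries target, and a query is indexed like any other document. Here’s a minimal sketch of the flip, continuing the hypothetical bill example above (index, field, and alert names are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The queries live in their own index. The mapping needs a `percolator`
# field for the stored query, plus the document fields those queries target.
es.indices.create(
    index="alert-queries",
    mappings={
        "properties": {
            "query": {"type": "percolator"},
            "body": {"type": "text"},
        }
    },
)

# Register an alert: the query is indexed like any other document.
es.index(
    index="alert-queries",
    id="alert-nuclear",
    document={"query": {"match": {"body": "nuclear energy"}}},
    refresh=True,
)

# Percolate a freshly ingested document: which stored queries match it?
resp = es.search(
    index="alert-queries",
    query={
        "percolate": {
            "field": "query",
            "document": {"body": "A bill concerning nuclear energy licensing."},
        }
    },
)
print([hit["_id"] for hit in resp["hits"]["hits"]])  # -> ["alert-nuclear"]
```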

As a side note, percolation shines when your ratio of queries to documents is high. If you only have a few queries and a high volume of documents, it might make more sense to just run scheduled queries instead of adding the overhead of a separate percolation pipeline.

Percolation at FiscalNote

Our implementation

This concept works brilliantly for our use case. All those alerts our users rely on? They’re just “percolators” (queries) stored in their own ES index. All those languages we need to support? We can just use the Google Translate API to translate the queries into whatever language we want, and ES language-specific analyzers will do the heavy lifting. When we ingest a new document, we just need to “percolate” it over our query datastore. Should that document match a percolator query, we know we’ve found a matched alert!
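A rough sketch of how those pieces could fit together is below. The index layout, field names, and the idea of storing one translated copy of each query per language are assumptions for illustration rather than our exact schema, but the built-in english and french analyzers really do handle the language-specific stemming:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One percolator index, with per-language text fields so each stored
# query can target the analyzer that matches its language.
es.indices.create(
    index="alerts",
    mappings={
        "properties": {
            "query": {"type": "percolator"},
            "text_en": {"type": "text", "analyzer": "english"},
            "text_fr": {"type": "text", "analyzer": "french"},
        }
    },
)

# A client's alert, stored once per language they care about. The French
# copy could come from a translation API such as Google Translate.
es.index(index="alerts", id="acme-nuclear-en", refresh=True,
         document={"query": {"match": {"text_en": "nuclear energy"}}})
es.index(index="alerts", id="acme-nuclear-fr", refresh=True,
         document={"query": {"match": {"text_fr": "énergie nucléaire"}}})

def matched_alerts(doc: dict) -> list[str]:
    """Percolate one ingested document and return the ids of matching alerts."""
    resp = es.search(
        index="alerts",
        query={"percolate": {"field": "query", "document": doc}},
    )
    return [hit["_id"] for hit in resp["hits"]["hits"]]

# An English U.S. bill and a French Canadian document each find their alert.
print(matched_alerts({"text_en": "A bill on nuclear energy licensing."}))
print(matched_alerts({"text_fr": "Projet de loi sur l'énergie nucléaire."}))
```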

Tips for percolation

Over the years, we’ve refined our implementation into something a little more complicated than the above, but not by much. Here are some lessons we learned the hard way:

Depending on traffic, percolation probably deserves its own cluster

In our initial implementation, the same ES cluster powered both our classic search experience and alerting. The simplicity was nice, since our search and alerting pipelines operated over the same basic data. However, over time we noticed a fundamental scaling difference that ultimately led us to migrate alerting to its own separate cluster.

Percolation and search scale in different ways, and at different rates depending on the time of year (it turns out governments tend to be seasonal, and aren’t keen to give up their summer recess). Search traffic increases linearly with user-initiated search requests and is RAM/disk heavy. Our users are somewhat more active during sessions (periods when the government is actively pushing out legislation), but since we’re B2B we’ve never had the same kind of user-scale problems B2C software has. Alert traffic, on the other hand, scales linearly with our ingestion pipelines and is CPU/RAM heavy. When our scrapers find a lot of new content (also during sessions), the percolation side starts to get hot. We needed to be able to scale these two sides differently, so we ended up splitting them into different clusters.
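Mechanically, the split can be as simple as pointing two client instances at two endpoints and routing by workload; the hostnames and index names below are placeholders:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoints: one cluster sized for user-facing search
# (RAM/disk heavy), one sized for percolation (CPU/RAM heavy).
search_es = Elasticsearch("https://search-cluster.internal:9200")
percolate_es = Elasticsearch("https://percolate-cluster.internal:9200")

def handle_user_search(query: dict):
    # User-initiated traffic goes to the search cluster.
    return search_es.search(index="bills", query=query)

def handle_ingested_document(doc: dict) -> list[str]:
    # Ingestion traffic percolates against the alerting cluster.
    resp = percolate_es.search(
        index="alerts",
        query={"percolate": {"field": "query", "document": doc}},
    )
    return [hit["_id"] for hit in resp["hits"]["hits"]]
```

Each cluster can then be sized and scaled on its own schedule, following its own traffic pattern.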

Shared data models are your friend

Since percolation and search can use mostly identical queries, it makes sense to share as much of that logic as possible. Doing this lets you design fun features, like a button to turn a search into an alert — with very little logic in the backend.

For example, if the data you’re indexing into your search index has a field like “first_name”: “some_string” and you have a query that targets that field, you can reuse the same query structure in your percolation query, as long as the underlying data follows the same model.
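Sticking with that hypothetical first_name field, the very same query dict can power a one-off search and be registered as a percolator; the index names are made up, and the alert index is assumed to have a percolator mapping like the earlier sketches:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The same query structure serves both sides, as long as both indices
# share the same data model for the "first_name" field.
query = {"match": {"first_name": "some_string"}}

# 1) Classic search: run it against the document index.
resp = es.search(index="people", query=query)

# 2) Alerting ("turn this search into an alert"): store the same dict
#    as a percolator query in the alert index.
es.index(index="people-alerts", document={"query": query})
```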

Disable highlighting

Highlighting — the ES functionality that inserts tags around the pieces of text that matched a search — is very expensive. It might make sense to design your architecture to disable highlighting during percolation, and only highlight the documents that matched afterwards.
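One way to structure that, sketched against the hypothetical alerts index from earlier: percolate everything with highlighting off, then issue a second, much smaller request that highlights only for the alerts that matched. The percolate query supports standard highlighting, with the stored queries acting as the highlight queries against the percolated document:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {"text_en": "A bill to modernize nuclear energy licensing."}

# Phase 1: cheap percolation, no highlighting.
resp = es.search(
    index="alerts",
    query={"percolate": {"field": "query", "document": doc}},
)
matched_ids = [hit["_id"] for hit in resp["hits"]["hits"]]

# Phase 2: highlight only against the alerts that actually matched.
if matched_ids:
    resp = es.search(
        index="alerts",
        query={
            "bool": {
                "must": {"percolate": {"field": "query", "document": doc}},
                "filter": {"ids": {"values": matched_ids}},
            }
        },
        highlight={"fields": {"text_en": {}}},
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_id"], hit.get("highlight", {}))
```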

Disable high proximity and wildcards

Wildcard operators (*) and high phrase proximity (> ~50) can slow things down very quickly. They also rarely lead to higher-quality search results, so use them carefully.

Relying on clear stemming rules for fuzzy matches is, in our experience, a far better approach.
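To make those thresholds concrete, here’s roughly what an expensive query and a cheaper, stemming-based alternative look like in Query DSL terms (field names are the same hypothetical ones used above):

```python
# Expensive: a wildcard expansion plus a very loose phrase proximity.
expensive = {
    "query_string": {
        "query": 'nucl* AND "nuclear energy"~60',
        "default_field": "text_en",
    }
}

# Usually just as effective and far cheaper: let the `english` analyzer's
# stemming do the fuzzy work (e.g. "license", "licenses", and "licensing"
# are all reduced to a common stem at index time).
cheaper = {"match": {"text_en": "nuclear energy licensing"}}
```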

Jake is a software engineer at FiscalNote, where he works mainly on our backend search and alerting services.
