Why FiscalNote Chose Apache Flink for Data Pipeline Orchestration

Jose Vargas · Published in FiscalNoteworthy · 5 min read · Dec 13, 2021

To help connect people and organizations to their governments, FiscalNote’s data engineering team maintains more than one hundred scrapers that continuously collect legislative and regulatory data. Instead of sending all their data at once when finished, most scrapers will send data as soon as it is in a form that can be independently processed. This streaming, item-level data production isolates errors and is a good fit for functionality that users expect, such as alerting when bills match their search criteria.

However, the team has faced two high-level challenges:

  • Web scraping, an inherently brittle activity, was coupled with downstream processing.
  • Within the context of automated web scraping, it is difficult to achieve data completeness while also collecting and surfacing data with low latency, both of which users expect from our data.

For example, our automated system may fail to match one out of ten bill sponsors or be unable to summarize legislation from highly unstructured documents.

As a result, the team wanted a system that could:

  • Decouple our scrapers from downstream processing by indefinitely storing and caching scraped data.
  • Reprocess individual datums on command.
  • Account for any schema changes introduced between the last time a datum was received and the time that the datum needs to be reprocessed.
  • Supplement and augment data from the scrapers by other means, such as manually entered data and the analytics and summarization provided by our data science team.

This blog post will cover how the team has started to address these challenges by building an event-driven application with Apache Flink — plus, we’ll cover a couple of lessons we’ve learned along the way.

Why Apache Flink?

The decision to use Apache Flink for this system came after considering other open-source data orchestration systems, such as Apache Airflow, Apache NiFi, Kafka Streams, and Apache Spark. All of these systems can express data flows as directed acyclic graphs, which is a useful mental framework for the declarative operations we wanted to support on our data. We quickly realized that Airflow was designed for batch processing, which didn't fit the continuous, item-level data production from our scrapers. NiFi isn't intended for the declarative, per-item workflow that the team needed, while Kafka Streams requires infrastructure that we don't currently have (we use RabbitMQ as our message bus).

We also prototyped an in-house data flow API, but ran into issues that Flink already addresses, such as state schema evolution and distributing the tasks of a single directed acyclic graph across processes. Flink’s keyed state APIs were a natural fit for the data access pattern the team expected, while Flink’s ability to scale to hundreds of gigabytes of keyed state with RocksDB gives us more than enough runway to keep data around indefinitely. Meanwhile, Flink’s DataStream API, state serialization framework, schema evolution via Plain Old Java Objects (POJOs) and Apache Avro, and rich connector ecosystem allowed the team to focus on implementing the declarative workflow that lets analysts and automated systems enrich scraped data.
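
To give a sense of why the POJO route appealed to us, here is a minimal sketch, with hypothetical class and field names, of the kind of class Flink’s built-in POJO serializer handles without falling back to Kryo. Adding a field such as summary in a later release is a supported schema evolution: state written before the field existed is restored with the new field left at its default value.

// Hypothetical scraped-bill POJO; names are illustrative, not our real schema.
// Flink's POJO serializer requires a public class, a public no-argument
// constructor, and fields that are public or exposed via getters/setters.
public class BillRecord {

    // Fields present from the first version of the job.
    public String billId;
    public String title;
    public String sponsorName;

    // Field added later. The POJO serializer's schema evolution restores older
    // state with this field set to null.
    public String summary;

    public BillRecord() {}
}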

What does the Flink job look like?

There is a primary data stream that is continuously populated by the scrapers. Alongside it, there is a second data stream where command messages are published. Command messages are how external systems declaratively tell a Flink job what it should do with one or more datums. For example, an analyst may add a bill summary or increase the specificity of an otherwise ambiguous sponsor name via a content management system. The data from the content management system is then transformed into a command message and routed to the Flink job. Once received, command messages are hashed and applied to the respective datums. This application step is where most of the work actually happens: looking up data received earlier from RocksDB, augmenting it, and so on. Once all command messages have been applied, the scraped data is emitted to downstream systems.

High-level Flink job diagram
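
To make that description concrete, below is a heavily simplified sketch of the topology. The class names (ScrapedItem, CommandMessage), the single command-application operator, and the applyTo method are illustrative stand-ins rather than our production code, and the RabbitMQ sources that feed the two streams are omitted.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Sketch: key the scraper stream and the command stream by datum id, connect
// them, and keep the latest scraped datum in keyed state so that commands can
// be applied whenever they arrive.
public class EnrichmentJobSketch {

    // Stand-in for a scraped datum.
    public static class ScrapedItem {
        public String datumId;
        public String payload;
        public ScrapedItem() {}
    }

    // Stand-in for a declarative command, e.g. an analyst-written bill summary.
    public static class CommandMessage {
        public String datumId;
        public String summary;
        public CommandMessage() {}

        public ScrapedItem applyTo(ScrapedItem item) {
            // In the real job this is where the command is interpreted.
            item.payload = item.payload + " | summary: " + summary;
            return item;
        }
    }

    public static DataStream<ScrapedItem> build(
            DataStream<ScrapedItem> scrapedItems, DataStream<CommandMessage> commands) {
        return scrapedItems
                .keyBy(item -> item.datumId)
                .connect(commands.keyBy(cmd -> cmd.datumId))
                .process(new ApplyCommands());
    }

    public static class ApplyCommands
            extends KeyedCoProcessFunction<String, ScrapedItem, CommandMessage, ScrapedItem> {

        // Latest version of each scraped datum; backed by RocksDB in production.
        private transient ValueState<ScrapedItem> latest;

        @Override
        public void open(Configuration parameters) {
            latest = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("latest-datum", ScrapedItem.class));
        }

        @Override
        public void processElement1(ScrapedItem item, Context ctx, Collector<ScrapedItem> out)
                throws Exception {
            // New or re-scraped datum: cache it and send it downstream.
            latest.update(item);
            out.collect(item);
        }

        @Override
        public void processElement2(CommandMessage cmd, Context ctx, Collector<ScrapedItem> out)
                throws Exception {
            // Apply the command to the cached datum, if we have one, and re-emit.
            ScrapedItem cached = latest.value();
            if (cached != null) {
                ScrapedItem enriched = cmd.applyTo(cached);
                latest.update(enriched);
                out.collect(enriched);
            }
        }
    }
}

Keying both streams by datum id is what lets a command land on the same operator instance, and therefore the same RocksDB-backed state, as the datum it modifies.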

What did we learn about Flink?

Fast versus slow data streams can cause problems.

One interesting characteristic of this workflow is the fast versus slow nature of the two data streams. Our scrapers run and produce data continuously, while analysts clock in and out, and a particular fix may involve multiple phone calls with different offices in a state legislature. This duality revealed unintended behavior in Flink’s RabbitMQ connector, whereby Flink’s stop command would hang indefinitely unless a message arrived in every RabbitMQ queue consumed by the Flink job after the stop command was issued. This meant that stateful upgrades were not possible without sending sentinel values to the slow data streams. We reached out to the Flink community, which very rapidly triaged the problem and implemented a fix via FLINK-22698.
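
For illustration, the sentinel workaround amounted to publishing a throwaway message to each slow queue once the stop command had been issued. Below is a rough sketch using the RabbitMQ Java client; the queue name and sentinel payload are hypothetical, and the Flink job has to be written to recognize and discard the sentinel.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.nio.charset.StandardCharsets;

// Publish a sentinel to a slow queue so the RabbitMQ source sees traffic and
// Flink's stop command can complete. The queue name is made up for this sketch.
public class NudgeSlowQueue {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            channel.basicPublish(
                    "",                 // default exchange
                    "command-messages", // hypothetical slow queue
                    null,
                    "{\"type\":\"SENTINEL\"}".getBytes(StandardCharsets.UTF_8));
        }
        // Run after issuing the stop command: once every consumed queue has
        // seen a message, the stop can take its savepoint and complete.
    }
}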

Flink’s built-in PojoTypeInfo class can be used to support message tracing in a generic manner.

Our scrapers will attach JSON encoded metadata about a scraped item via RabbitMQ headers. This metadata allows us to, for example, tie errors that happen downstream to the scraped item that caused the issue. Since we were introducing Flink in between our scrapers and existing systems, we had to come up with a way to support passing along this metadata across all the operators in the Flink job. This was no small challenge as we wanted to do so in a manner that could be reused by different Flink jobs processing different data, and because operators within a single Flink job could be running in distinct processes on different machines.

The solution we came up with was to create a generic container class and tell Flink how to serialize and deserialize it via a custom TypeInformation object that internally uses Flink’s PojoTypeInfo class. This meant that we didn’t need to write a custom serializer and could leverage the POJO serializer’s built-in schema evolution. It was a good fit because, for any given Flink job, instances of the generic container class are effectively POJOs. Below is some sample code that shows how we used Flink’s PojoTypeInfo class. We confirmed that this works by disabling Flink’s fallback Kryo serializer.
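
The sketch below illustrates the approach; the container’s name, fields, and wiring are simplified stand-ins rather than our exact code. A TypeInfoFactory registered via the @TypeInfo annotation builds the container’s type information with Types.POJO, which produces a PojoTypeInfo under the hood:

import java.lang.reflect.Type;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.typeinfo.TypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInfoFactory;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;

// Illustrative generic container that carries a payload plus the JSON metadata
// taken from the scraper's RabbitMQ headers, threaded through every operator.
@TypeInfo(Traced.TracedTypeInfoFactory.class)
public class Traced<T> {

    public T value;
    public Map<String, String> metadata;

    public Traced() {}

    public Traced(T value, Map<String, String> metadata) {
        this.value = value;
        this.metadata = metadata;
    }

    // Tells Flink how to build type information for Traced<T>. Types.POJO(...)
    // constructs a PojoTypeInfo, so we get the POJO serializer and its schema
    // evolution instead of a fallback to Kryo.
    public static class TracedTypeInfoFactory<T> extends TypeInfoFactory<Traced<T>> {
        @Override
        @SuppressWarnings("unchecked")
        public TypeInformation<Traced<T>> createTypeInfo(
                Type t, Map<String, TypeInformation<?>> genericParameters) {
            Map<String, TypeInformation<?>> fields = new HashMap<>();
            fields.put("value", genericParameters.get("T"));
            fields.put("metadata", Types.MAP(Types.STRING, Types.STRING));
            return Types.POJO((Class<Traced<T>>) (Class<?>) Traced.class, fields);
        }
    }
}

With the factory in place, any DataStream of Traced values picks up the POJO serializer automatically, and calling env.getConfig().disableGenericTypes() makes the job fail fast if Flink would ever fall back to Kryo, which is how the behavior can be checked.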

The data engineering team is excited to continue to grow and support our data augmentation capabilities and to see what other features are possible with Flink. Ultimately, blending analyst- and machine-generated information improves our ability to deliver the high-quality, custom insights that help users better understand the impact that governments are making.

For more FiscalNote posts, subscribe to our F(N)novate publication.
