How FiscalNote is Leveraging a Data Lakehouse to Accelerate Integration from M&A

Hiancheong · Published in FiscalNoteworthy · Nov 23, 2021 · 7 min read

Since the beginning, FiscalNote’s mission has been to use data to connect organizations and people to their governments. In pursuit of that mission, FiscalNote has acquired over ten companies in the past three years as we’ve sought to build a portfolio of data-driven products and services to grant our clients a 360-degree view on geopolitical events and government actions from the local to international level. This has included capabilities for policy monitoring, news and analysis, grassroots advocacy, ESG measurement, constituent services and professional community building.

FiscalNote’s M&A strategy has helped us grow quickly as a business, enter new markets, and further help our clients stay aware of impactful policy changes. It also gives FiscalNote an opportunity to leverage data from across these disparate products to gain deeper internal insight into user activity and to experiment with new data-driven products and services. Data is the new ‘oil’ of the 21st century but, much like oil, the raw material isn’t enough without the logistics and infrastructure to support sharing, processing and analyzing that information.

Every new product comes with its own distinct tech stack, one that is well suited to that product, and it isn’t worth rewriting an entire tech stack for the sake of platform consistency. That means we need to find a way to integrate data across products that doesn’t assume rewriting any underlying tech systems, changing any existing process flows, or completely overhauling existing product roadmaps.

Our Data Integration Challenges

Data integration comes with a ton of challenges before we can even begin talking about rows and columns. These challenges include:

Multi-Cloud + On-Premises Integration Networking

Enabling web traffic across different cloud and on-premises virtual networks can be a big challenge, especially when the private IP ranges used by a newly acquired company overlap with previously established network setups. These challenges are not insurmountable, but they are an obstacle to even accessing the relevant data.

Data Discovery — Data Catalog + Dictionary

Knowing that some data exists and is a good fit for an integration use case is an obvious first step, but in an ever-growing organization this data discovery is non-trivial. Even after access is enabled, ‘discovering’ the data and understanding what it represents, how accurate and clean it is, and what the actual data model looks like are essential for proper integration.

Cross-Team Collaboration

Giving teams the right collaboration tools and processes is as important to the integration effort as the data systems themselves. This is especially true for a remote workforce, where sharing even basic snippets of code to explain how a database is set up can be painful because of differences in development environments between teams.

Access Controls and Compliance

Newly acquired data will often contain potentially sensitive information about users, and there may already be established policies and patterns for accessing that data responsibly. In a large organization, data access controls and compliance standards (e.g., SOC 2, FedRAMP, GDPR in Europe) become an even more important problem to account for given the larger number of potential data consumers.

Databricks

All of these challenges were on our minds as we looked for vendors and platforms that would let us scale our data integration efforts. We quickly identified Databricks and its vision of a ‘data lakehouse’ as a strong solution for our needs. In particular, Databricks and the lakehouse concept offered us the following advantages:

Flexible Data Storage Paradigm

Databricks’ tools don’t require all data to be hosted in a single file format or even in the same storage location. In our lakehouse, each distinct product has a dedicated section (e.g., a dedicated AWS S3 bucket or a dedicated path within one) that stores copies of its data in a variety of formats, including CSV, Parquet, and JSON. This lets us choose whichever tool or method for copying data from the ‘source’ systems best fits each use case, given its data latency and locality requirements, and gives us a lot of flexibility to quickly integrate data regardless of its native format or storage system.
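As a rough illustration, a minimal PySpark sketch of this pattern might look like the following; the S3 paths and table names are hypothetical, and writing the normalized copies out as Delta tables is an assumption about the target format rather than something stated above.

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession is already provided as `spark`; getOrCreate()
# keeps the sketch runnable elsewhere too.
spark = SparkSession.builder.getOrCreate()

# Each acquired product lands raw copies under its own dedicated S3 prefix,
# in whatever format its source system exports most easily.
bills = spark.read.option("header", "true").csv("s3://fn-lakehouse/product_a/bills/")   # hypothetical path
events = spark.read.json("s3://fn-lakehouse/product_b/events/")                         # hypothetical path
users = spark.read.parquet("s3://fn-lakehouse/product_c/users/")                        # hypothetical path

# Normalize everything into Delta tables registered in the metastore so that
# downstream teams can query the data without caring about its original format.
bills.write.format("delta").mode("overwrite").saveAsTable("product_a.bills")
events.write.format("delta").mode("overwrite").saveAsTable("product_b.events")
users.write.format("delta").mode("overwrite").saveAsTable("product_c.users")
```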

Integrated Catalog and Access Controls

Databricks ships with its own Hive-based metastore that helps us track and organize the data stored in S3, and it brings a variety of mechanisms for setting access control policies, so we can maintain security best practices while still making data easier to access and discover. In addition, the flexible nature of Databricks clusters means we can integrate with other tools, such as using the AWS Glue metastore or linking Databricks user groups to our entitlement groups defined in Okta.
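For example, with Databricks table access control enabled (an assumption about the workspace configuration), access policies can be expressed as simple grants against groups; the group and table names below are illustrative, with the groups imagined as mirrors of Okta entitlement groups.

```python
# Assumes table access control is enabled on the cluster; all names are illustrative.

# Broad, read-only access for analysts to a product's non-sensitive tables.
spark.sql("GRANT USAGE ON DATABASE product_a TO `data-analysts`")
spark.sql("GRANT SELECT ON TABLE product_a.bills TO `data-analysts`")

# Sensitive tables stay restricted to a narrower entitlement group.
spark.sql("GRANT SELECT ON TABLE product_c.users TO `privacy-approved-engineers`")

# Discovery: permitted users can browse what is available in each schema.
spark.sql("SHOW TABLES IN product_a").show()
```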

Collaboration Environment

Collaboration across disparate teams is always tricky: at best it involves a lot of IT permissioning, and it usually means one group has to work ‘inside’ another group’s infrastructure and pay a ramp-up cost. Despite living in the era of ‘everything in the cloud’, shared cloud-based development platforms are still the exception rather than the norm. That is why shared notebooks hosted on Databricks are a huge advantage, whether we are sharing snippets of code that illustrate how to access and interpret data in our lakehouse or sharing repeatable analysis notebooks. Collaborative notebooks have been especially useful while our workforce has been largely remote during the COVID-19 pandemic; for example, they greatly accelerate how quickly we can debug issues with our data analyses or ML model training.

Managed Spark for Streaming and Large Scale ELT

The initial FiscalNote application always had a lot of data, but never quite in the realm of ‘big data’ in terms of sheer volume; we measured things in tens of millions of database rows and hundreds of gigabytes. With our recent M&A activity, the sum of all of FiscalNote’s data has reached the scale where we need ‘big data’ tools such as Apache Spark to move and process it. Databricks’ managed Spark platform is helping us quickly build out the data engineering components we need for both streaming and batch processing.
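Below is a minimal Structured Streaming sketch of the kind of pipeline this enables; the schema, S3 paths, and checkpoint location are hypothetical placeholders rather than our actual configuration.

```python
from pyspark.sql import functions as F

# Continuously ingest JSON events dropped into an S3 prefix by one product's
# source system and append them to a Delta location that batch jobs can also query.
raw_events = (
    spark.readStream
    .format("json")
    .schema("event_id STRING, user_id STRING, event_type STRING, ts TIMESTAMP")
    .load("s3://fn-lakehouse/product_b/events_stream/")                               # hypothetical path
)

cleaned = raw_events.withColumn("event_date", F.to_date("ts"))

query = (
    cleaned.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://fn-lakehouse/_checkpoints/product_b_events/")  # hypothetical
    .outputMode("append")
    .start("s3://fn-lakehouse/delta/product_b/events/")                                # hypothetical output
)
```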

Controllable Cost Scaling

Many cloud-based systems live in their own ecosystem and deployment environment. Because that means moving data out of your own cloud network and into theirs, you are likely to incur a range of data ingress/egress costs and security risks. You are also fairly limited in how you can control usage costs, since few providers beyond the biggest cloud vendors offer budgets or price quotas you can set up. Databricks’ latest ‘E2’ deployment model means the managed Spark clusters are deployed inside our own AWS account, giving us full flexibility over the EC2 instance types used (and their related costs) and over the permissions granted to those servers for enhanced security.
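As an illustration of the control this gives us, a cluster specification submitted to the Databricks Clusters API can pin down instance types, autoscaling bounds, auto-termination, and the IAM instance profile; every value below (names, ARNs, workspace URL, token) is a placeholder rather than our actual configuration.

```python
import json
import requests  # assumes the requests library is available

# Every value below is illustrative, not our actual configuration.
cluster_spec = {
    "cluster_name": "ma-integration-etl",
    "spark_version": "9.1.x-scala2.12",                 # an LTS runtime string, illustrative
    "node_type_id": "i3.xlarge",                         # we choose the EC2 type and its cost profile
    "autoscale": {"min_workers": 2, "max_workers": 8},   # bounded autoscaling caps spend
    "autotermination_minutes": 30,                       # idle clusters shut themselves down
    "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/lakehouse-etl",  # hypothetical
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",            # spot instances for cheaper workers
    },
}

resp = requests.post(
    "https://<workspace-host>/api/2.0/clusters/create",  # placeholder workspace URL
    headers={"Authorization": "Bearer <personal-access-token>"},
    data=json.dumps(cluster_spec),
)
print(resp.status_code, resp.text)
```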

M&A Integration

The cloud-native nature of our lakehouse and the tools we’re able to deploy on top of Databricks have allowed us to quickly enable data exploration and basic forms of data integration from much of FiscalNote’s recent M&A activity. We now have a clear ‘playbook’ for enabling access to newly acquired data (ELT into the lakehouse) and for getting several different teams to collaborate on further integration efforts, and the lakehouse has dramatically accelerated that exploration. At the same time, Databricks lets us apply many of our ML models across new datasets and train new models on the joint data, unlocking the full potential of our NLP systems across the data we integrate from new products. The advantages of a common, cloud-hosted platform with near-infinite scalability have enabled FiscalNote to start taking advantage of our M&A work within a few weeks of closing a deal and are opening the door to brand-new data products and services.
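As a hedged sketch of the “apply existing models to newly integrated data” step: assuming the models are tracked in MLflow (an assumption, not stated above) and using hypothetical model and table names, a registered model can be wrapped as a Spark UDF and run across a lakehouse table from a newly acquired product.

```python
import mlflow.pyfunc

# Wrap a registered model as a Spark UDF so it can score a lakehouse table at scale.
classify = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/policy-topic-classifier/Production",  # hypothetical registry entry
    result_type="string",
)

new_docs = spark.table("product_b.documents")                 # data from a newly acquired product
scored = new_docs.withColumn("predicted_topic", classify("body_text"))

scored.write.format("delta").mode("overwrite").saveAsTable("integration.documents_scored")
```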

Our Future

Our work on data integration is not over. We are at the end of the beginning, not the beginning of the end. Going into 2022 we plan on further investing in our data integration efforts beyond our lakehouse including building self-service platforms for enterprise-wide data engineering, data cataloging, and machine learning model deployment. We’ll have a lot more to say about our data integration efforts in upcoming blog posts.
