How CBI Built Best-In-Class — And Low Overhead — Data Pipelines

Jeff Capobianco
Published in CBI Engineering
Feb 6, 2020 · 9 min read

Our new self-reporting data pipelines needed to fulfill a number of important goals. Here’s how we tackled the challenge.

CB Insights’ value proposition relies heavily upon the data we ingest and process. This includes extracting funding data or business relationships from many thousands of news articles daily.

Our approach is based on microservices, each handling a specific part of this data extraction and passing data to the others through message queues and databases. However, due to the distributed nature of this architecture, we often struggled to keep track of the processes performing data ingestion.

Tribal knowledge was siloed in the minds of certain individuals, which slowed down troubleshooting. We would find that data was not arriving where we expected, but it was tough to trace back to find the broken link in our chain of processing. We often lacked a shared picture of the entire system.

Additionally, siloed knowledge made onboarding new engineers more difficult and led to hidden, redundant processes throughout our system. To remedy this, we built a comprehensive visualization and monitoring solution.

The new Data Platform needed to satisfy the criteria below, which fall into product goals and functional goals. Product goals reflect the user experience when exploring the Data Platform, while functional goals affect our software development practices.

Product Goals

  1. Track and visualize the flow of data between data stores, jobs, services, and messaging queues.
  • Democratize institutional knowledge
  • Accelerate troubleshooting
  • Reduce onboarding overhead
  • Expose redundant processes

2. Augment and automate monitoring and alerting

3. Allow grouping of components into personalized data pipelines

Functional Goals

  1. Zero manual data entry
  2. Zero performance overhead introduced to our software
  3. Does not require a change in our current development practices
  4. Lightweight stack that requires minimal administrative overhead for our DevOps team
  5. Supports both batch and event-driven data processing
  • Note: We define event-driven software as software that listens to a queue for messages to process rather than being activated on a time-based schedule (a quick sketch of the difference follows). In general, event-driven architecture fits our use cases better: it makes our systems more resilient and less resource-intensive.
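For context, here is a minimal Go sketch of that distinction. The Queue interface and function names are placeholders for illustration, not code from our stack.

```go
package pipeline

import "time"

// Queue is a stand-in for a real message queue client (e.g. SQS).
type Queue interface {
	Receive() (msg string, ok bool) // ok is false when the queue is empty
}

// Scheduled (batch): wake up on a timer and drain whatever has accumulated.
func runOnSchedule(q Queue, process func(string)) {
	for range time.Tick(15 * time.Minute) {
		for m, ok := q.Receive(); ok; m, ok = q.Receive() {
			process(m)
		}
	}
}

// Event-driven: stay on the queue and react as soon as a message arrives.
func runOnEvents(q Queue, process func(string)) {
	for {
		if m, ok := q.Receive(); ok {
			process(m)
		} else {
			time.Sleep(time.Second) // brief back-off when the queue is empty
		}
	}
}
```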

Somebody Must Be Doing This Already

With the growing importance of data governance in the enterprise, we figured there must be an existing solution that met our requirements. So we divided and conquered, running a series of proofs of concept (PoCs).

Each PoC included visualization and monitoring of one of our existing data pipelines. To avoid getting bogged down in this phase, we split up the work by assigning a product per engineer. Each engineer documented pros and cons, quickly presented findings to the rest of the data engineering team, and then moved on to the next one.

None of Them Quite Fit

All of these solutions would require us to do at least one of the following:

  1. Considerably change our job development practices so that code originates and always lives inside a chosen product
  2. Dedicate more time for our DevOps team to administer a new stack
  3. Exclusively use batch processing

Nonetheless, these PoCs were well worth the time spent investigating: each product inspired us with its best aspects. So we designed a hyper-lightweight system that emulates the most suitable features of these solutions and fits snugly into our existing stack and practices.

The Vision

This project’s success was rooted in the engineering team taking ownership of its own efficiency. By actively collaborating on standardized utility libraries, the team set a foundation for a network of self-reporting microservices. This automated “registration” saves us from tinkering with GUIs and allows us to focus on building efficient systems. Decoupling registration from job schedulers and container orchestration enables platform-agnostic monitoring and visualization of scalable, streaming-style pipelines.

Our Design

We largely tried to emulate Apache Atlas by using a graph-based visualization system, a graph database backend, and a strong typing system. In the future, we hope to emulate more of the control-plane features of Apache NiFi.

That being said, we are not very interested in a plug-and-play, Lego-like approach where components are dragged and dropped on a UI. Most of our jobs are NLP-intensive, requiring more complex software than is available without custom code. That means our Data Platform needs to know about the software we write and how its pieces interact, without slowing down our developers or hurting performance.

This led to an instrumentation-based approach, where utility libraries report back to the mothership. The secret sauce lies in this functionality being abstracted away from developers so it all seems like auto-magic. This functionality is captured in the yellow box in the architecture diagram.

We broke the Data Platform’s architecture into four main sections, described below: the component and pipeline catalog, component registration, monitoring, and the frontend dashboard. These four sections are tied together by a backend service written in Go.

Component Catalog

The component catalog is where we store all the metadata about the pieces of our data pipelines and the pipelines themselves. Since data pipelines are inherently similar to graphs, we decided to use a graph database to store this data. Graph databases focus on storing information about relationships between entities, and they make it easy to query for all the components that a given entity is connected to.

In order to keep the stack as manageable as possible, we chose AWS Neptune as our graph database. We like that it interacts with the standard graph database query language Gremlin (that is the little green guy in the diagram above). Thanks to Amazon, the burden on our DevOps team is minimal, though we do have to pay a fee for this managed solution.

The graph database is home to our typing system. However, most graph databases, including Neptune, do not support a built-in schema, so it is up to the software that writes to the database to maintain strong typing. Here are the types of components within our catalog.
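To give a flavor of what that looks like, here is a hedged Go sketch: the component type names, fields, and the sample Gremlin traversal are hypothetical, but the pattern of validating every write in application code (since the database will not do it for us) is the point.

```go
package catalog

import "fmt"

// ComponentType enumerates the kinds of vertices allowed in the catalog.
// These names are illustrative; the real catalog defines its own set.
type ComponentType string

const (
	TypeJob     ComponentType = "job"
	TypeService ComponentType = "service"
	TypeQueue   ComponentType = "queue"
	TypeStore   ComponentType = "data_store"
)

var validTypes = map[ComponentType]bool{
	TypeJob: true, TypeService: true, TypeQueue: true, TypeStore: true,
}

// Component is a vertex in the catalog graph.
type Component struct {
	Name string
	Type ComponentType
}

// Validate enforces the typing that the graph database itself will not.
func (c Component) Validate() error {
	if c.Name == "" {
		return fmt.Errorf("component is missing a name")
	}
	if !validTypes[c.Type] {
		return fmt.Errorf("unknown component type %q", c.Type)
	}
	return nil
}

// Example Gremlin traversal (hypothetical names): everything a given job writes to.
//   g.V().hasLabel('job').has('name', 'news-extractor').out('writes_to').values('name')
```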

Registration Utilities

All of these components and their interactions need a way to get into the catalog. As stated in some of our functional goals, the registration utils (shown as a yellow box in the architecture diagram) cannot introduce new hurdles to our engineers or performance bottlenecks to our software.

Registering Component Connections

As shown in the architecture diagram, CB Insights’ standard Go and Python utility libraries are the cornerstone of our registration system.

If you do not have standard utility libraries at your organization, I cannot encourage you enough to get started on them. They provide a uniform way for developers to use common functions without reinventing the wheel.

Active development and group ownership of these libraries leads to interfaces that suit your use case and continuously reduce boilerplate code. They work best when contributions are frequent and originate from all around the engineering organization, and those contributions are complemented with lively discussions.

For those of us developing the Data Platform, this presented the perfect place to insert our hooks. We added calls to registration functions inside the functions that read from and write to SQL databases, S3, messaging queues, notification topics, etc.

At CB Insights, we use GRPC for inter-service communication so we added a registration hook in GRPC utility functions as well. All of these registration functions use GRPC themselves in order to communicate with our Data Platform backend which then writes to our Component Catalog. Keeping all of these registration functions under the hood of our util libraries allowed us to satisfy some of our functional goals.
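To illustrate the pattern, here is a simplified Go sketch of a publish utility with a registration hook riding along. PublishMessage, RegisterConnection, and both interfaces are placeholder names; the registration client is shown as an interface rather than the actual GRPC stub.

```go
package util

import "context"

// Registrar is a thin stand-in for the GRPC client that talks to the
// Data Platform backend.
type Registrar interface {
	RegisterConnection(ctx context.Context, source, target, kind string) error
}

// QueueClient is whatever messaging client the utility library wraps.
type QueueClient interface {
	Send(ctx context.Context, queueName, body string) error
}

// PublishMessage is the function developers actually call. The registration
// hook rides along; callers never see it.
func PublishMessage(ctx context.Context, q QueueClient, r Registrar, service, queueName, body string) error {
	// Record "service X writes to queue Y" in the component catalog.
	// Fire-and-forget: catalog problems must never break the publish.
	go func() {
		_ = r.RegisterConnection(context.Background(), service, queueName, "writes_to")
	}()

	return q.Send(ctx, queueName, body)
}
```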

Maintaining Pipeline Performance

To make sure we introduced zero performance overhead and did not slow down our software, we made all registration functions asynchronous. In Python, we achieved this with our custom @background.task decorator; in Go, with goroutines.

We do not want to introduce new headaches to our engineering team, so all of these functions should fail silently from the perspective of the developer working on a job or service, while sending up alerts to the Data Platform owners. In this way, the data engineering team can diagnose issues with the data platform without creating log messages or code failures for those writing services.
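On the Go side, the pattern looks roughly like the sketch below; alertPlatformOwners is a placeholder for whatever channel notifies the Data Platform team.

```go
package util

import "context"

// alertPlatformOwners is a placeholder for however the Data Platform team
// gets notified (e.g. a log index or an internal alert channel).
func alertPlatformOwners(problem interface{}) { /* ... */ }

// registerAsync runs a registration call off the caller's critical path.
// Errors and panics are invisible to the calling job or service and are
// surfaced only to the Data Platform owners.
func registerAsync(register func(context.Context) error) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				alertPlatformOwners(r)
			}
		}()
		if err := register(context.Background()); err != nil {
			alertPlatformOwners(err)
		}
	}()
}
```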

Taking Inventory

In addition to our utility libraries, we added a step to our standard Jenkins pipeline to register new processors as they are deployed to the production environment. We also have a job monitoring our inventory of AWS resources like SQS, SNS, and S3.
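As a rough illustration of such an inventory job, the sketch below lists S3 buckets and SQS queues with the AWS SDK for Go; the real job would upsert what it finds into the component catalog rather than print it.

```go
package inventory

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// ListResources enumerates S3 buckets and SQS queues so they can be
// recorded as components in the catalog.
func ListResources() error {
	sess := session.Must(session.NewSession())

	// S3 buckets.
	buckets, err := s3.New(sess).ListBuckets(&s3.ListBucketsInput{})
	if err != nil {
		return err
	}
	for _, b := range buckets.Buckets {
		fmt.Println("data_store:", aws.StringValue(b.Name))
	}

	// SQS queues.
	queues, err := sqs.New(sess).ListQueues(&sqs.ListQueuesInput{})
	if err != nil {
		return err
	}
	for _, q := range queues.QueueUrls {
		fmt.Println("queue:", aws.StringValue(q))
	}
	return nil
}
```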

Monitoring

In order to leverage all of the new data we are collecting to create useful alerts, we wrote a service that inspects the logs in Elasticsearch, the status of our deployed containers, and the volume of messages in our queues. This allows us to assign a red, amber, or green status to every component, and then aggregate these health scores for pipelines. Our frontend dashboard then provides pipeline health at a glance. In addition, pipeline owners are sent email alerts when a component in their pipeline is unhealthy.
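One plausible way to roll component statuses up to a pipeline score (not necessarily the exact rule we use) is to let the worst component win, as in this small Go sketch:

```go
package monitor

// Health is the red/amber/green status assigned to each component.
type Health int

const (
	Green Health = iota
	Amber
	Red
)

// PipelineHealth rolls component statuses up to the pipeline level:
// one red component makes the pipeline red; otherwise one amber makes
// it amber; otherwise it is green.
func PipelineHealth(components []Health) Health {
	worst := Green
	for _, h := range components {
		if h > worst {
			worst = h
		}
	}
	return worst
}
```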

Frontend Dashboard

From the frontend, users can drill down on specific components to see all of their connections and health. Custom pipelines are created through the frontend, which can suggest components based on the links in the graph database. Here are a few screenshots:


Didn’t Think of That…

As we ramped up adoption of the Data Platform, it received a lot of traffic, especially from registering SQL query strings (which you can see we store in the second screenshot).

To deal with this, we added throttling and caching to make sure we did not overwhelm the backend service or the graph database. A horizontally scalable mothership is key. Similarly, the graph database will generally be under heavy load; however, thanks to the growing adoption of the Gremlin query language, we can swap out graph databases if desired.
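A simplified sketch of the caching side, assuming a keyed, TTL-based dedupe in the utility library (the throttling piece is omitted here):

```go
package util

import (
	"sync"
	"time"
)

// registrationCache drops duplicate registrations for a while so that a
// hot code path (e.g. one running the same SQL query thousands of times)
// produces a single catalog write instead of thousands.
type registrationCache struct {
	mu   sync.Mutex
	seen map[string]time.Time
	ttl  time.Duration
}

func newRegistrationCache(ttl time.Duration) *registrationCache {
	return &registrationCache{seen: make(map[string]time.Time), ttl: ttl}
}

// shouldSend reports whether this connection has not been registered
// within the TTL; if so, it records it as sent.
func (c *registrationCache) shouldSend(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.seen[key]; ok && time.Since(t) < c.ttl {
		return false // already registered recently; skip
	}
	c.seen[key] = time.Now()
	return true
}
```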

What’s Next

Considering that this entire platform was developed alongside our team’s ongoing demands of productionizing new datasets, fixing bugs, and releasing code templates, I am very proud of what we have produced so far.

But the work is never done! In the next versions we hope to include more job control features (stop, start, pause, retry), performance and resource consumption monitoring, and, most ambitiously, a way to track specific pieces of data as they travel from the outside world into the CB Insights frontend.

If you are trying to solve a similar problem, I hope you can learn something from our journey. We would love to hear about any lessons that you have to share.

Interested in joining a team tackling tough data platform challenges? We are hiring!

Originally published at https://www.cbinsights.com.
