Carta Spotlight: Data Engineering

Troy Harvey · Published in Building Carta · Jan 25, 2023

It’s 2018. My fourth day at Carta. I’m sitting in the CEO’s office in SF alone with the departing Data Science Lead. We’re doing a typical handoff. When we get to the topic of version control, my soon-to-be-ex-colleague opens Google Drive and scrolls through a hundred Google Docs filled with SQL queries! I wasn’t surprised by the chaos. A year prior I had decided to transition from full-stack software engineering to data engineering to help solve problems exactly like this one. The data science boom in the early 2010s created significant value for companies like Carta, but there were problems scaling and maintaining the solutions. My inbox was flooded with recruiters looking for consultants to help clean up the mess. So I decided to go all in on data engineering. Four years later, using Google Docs for SQL version control is a distant memory. Carta’s Data Engineering team has matured into a specialized software engineering team.

You may not know much about data engineering. In a recent tweet, Gergely Orosz put Data Engineer at the top of the list of roles that “many software engineers know little about.” Barely 5% of respondents in the 2022 Stack Overflow developer survey classified themselves as a Data Engineer. What do Data Engineers at Carta do? In this post I will share some stories highlighting data engineering and the role we played in building Carta’s data platform — from data pipelines to production machine learning (ML) to Data Products.

What is a Data Engineer?

In the past decade, data engineering has matured beyond fragile Big Data Hadoop clusters, ad-hoc scripts, and cron jobs to leveraging software engineering tools, frameworks, and automation. In the diagram below you can see the transition from Big Data to cloud data warehousing as the price of a GB of data storage shrinks and the amount of data the world is processing and storing continues to explode.

The Data Engineering specialization coincides with the emergence of Cloud Data Warehousing.

The data engineering skill set cuts across DevOps, backend engineering, database administration, and automation engineering. Here’s our working definition of a data engineer at Carta:

Data engineers are software engineers who specialize in reliably consolidating, organizing, and refreshing Carta’s datasets to enable strategic decision making with traditional BI tools and product features backed by machine learning.

All companies need software engineers who understand data warehousing, databases, workflow management systems, monitoring, continuous integration/deployment (CI/CD), and ML. The software engineers working at the intersection of these tools are called data engineers.

Evolution of Carta Data Engineering

In the past four years, the scope of the Data Team has expanded from ad-hoc reports to building a data platform. In the early days of Carta, reporting and insights worked the way they do at every other small startup. Henry would ask a simple question — “Is there a problem with the invoice report this month?” And then a lengthy Slack thread would ensue: “Checking it out now. For some reason the script that runs it automatically wasn’t running.”

An actual Slack exchange from 2015 highlighting data quality problems at Carta.

We had no monitoring on our analytics stack. We were scrappy. At the time our “data warehouse” was stitched together by a high school intern using Postgres Foreign Data Wrapper. Our on-prem Looker installation would regularly run out of memory and fall over with no alerting. And when reports failed or data was missing, the leadership team would rightfully question the quality of the entire fledgling analytics system.

At Carta, the data engineering function was born when our reporting needs started to outpace the spare cycles of a single friendly backend engineer. The first data engineering project was a Redshift data warehouse and an S3 data lake. As Carta grew, so did the number of leaders who had seen high-quality data rooms at other companies. After a series of Looker outages, failed ETL scripts, and impossible data requests, our CTO prioritized building new data infrastructure. We disconnected Looker from the foreign data wrapper and connected it to Redshift.

So, back to that story about SQL in Google Docs. Carta’s data is complex, and using Google Docs to “version” our data modeling work was inadequate. Around that time an innovative team in Philly had been hacking on a new CLI tool that would bring git version control and templating to SQL projects. That team was Fishtown Analytics (now Series-D dbt Labs) and the project was dbt. I hastily put together a dbt demo and asked our very skeptical, command-line-averse team to “just try it out.” We were one of the first 200 dbt projects, running an alpha 0.10 version of dbt in production. The gamble on dbt paid off, allowing us to scale our data team as Carta grew; today our dbt project has 47 contributors and 2,553 models.

Containerization was next for us. The DevOps team encouraged us to adopt their newly minted Rancher installation for deploying our data pipelines and our dbt project. Rancher was a misstep for Carta. It was difficult to predict at the time, but the open source community quickly converged on Kubernetes as the default container orchestration system. We pivoted quickly. A few months later, in early 2019, we were setting up a kOps cluster and starting the move to Kubernetes.

Carta’s Engineering organization has since completed the move to Kubernetes, and we use Airflow to orchestrate ELT workloads in Kubernetes pods. Running data pipelines on the same containerized platform used by all the other software engineers at Carta lets us share ideas and code, and get expert support from Carta’s Production Engineering team. For example, databot is one of the most popular apps created by Carta Data Engineering. databot is a thin API around dbt that allows data scientists to run models across different environments right inside Slack. The full workflow allows our team to write SQL and run dbt locally in a sandbox environment, do code review and CI in GitHub, continuously deploy new data models to production, and refresh individual models with a simple Slack message.

Running the mypy enrollment dbt model using Carta’s Slack databot.
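To make that workflow concrete, here is a minimal sketch of a databot-style handler: it parses a Slack-style command and shells out to the dbt CLI. The command format, targets, and function names are illustrative assumptions, not Carta’s actual implementation.

    # A sketch of a databot-style command handler: parse a Slack-style message
    # and shell out to the dbt CLI. All names here are illustrative
    # assumptions, not Carta's actual implementation.
    import shlex
    import subprocess

    ALLOWED_TARGETS = {"dev", "prod"}  # hypothetical dbt targets

    def handle_message(text: str) -> str:
        """Handle a message like 'run enrollment --target prod'."""
        parts = shlex.split(text)
        if len(parts) < 2 or parts[0] != "run":
            return "Usage: run <model> [--target dev|prod]"
        model = parts[1]
        target = parts[3] if len(parts) >= 4 and parts[2] == "--target" else "dev"
        if target not in ALLOWED_TARGETS:
            return f"Unknown target: {target}"
        # `dbt run --select <model>` refreshes just that model in the warehouse.
        result = subprocess.run(
            ["dbt", "run", "--select", model, "--target", target],
            capture_output=True,
            text=True,
        )
        status = "succeeded" if result.returncode == 0 else "failed"
        return f"dbt run for {model} ({target}) {status}"

The scheduled side of the platform follows the same pattern. Here is a hedged sketch of an Airflow DAG that launches a dbt run in a Kubernetes pod; the DAG id, image, and schedule are assumptions, not our actual configuration.

    # A sketch of the scheduled side: an Airflow DAG launching a dbt run in a
    # Kubernetes pod. DAG id, image, and schedule are illustrative assumptions.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
        KubernetesPodOperator,
    )

    with DAG(
        dag_id="nightly_dbt_run",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_dbt = KubernetesPodOperator(
            task_id="dbt_run",
            name="dbt-run",
            namespace="data",
            image="registry.example.com/dbt-project:latest",  # hypothetical image
            cmds=["dbt"],
            arguments=["run", "--target", "prod"],
        )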

With this solid foundation of containerized pipelines and reliable analytics dashboards in place, our data team grew and scaled to serve every team at Carta. Airflow has executed 1.3 million automated pipeline tasks for us in the past year. Carta’s Data Team has expanded to include data engineering, analytics engineering, data science, and machine learning. At the same time, the scope of Carta has expanded from a product to a platform. Today Carta manages $2.5 trillion in equity value for over 2 million stakeholders. Our Data Team is now working on two strategic goals beyond traditional analytics dashboards — Data Products and Machine Learning.

Data Products

Carta’s Funding Benchmarks Data Product

At Carta we have a data product called Funding Benchmarks — a dashboard-style page with charts, graphs, and filters that allows users to compare their startup fundraising round with similar companies. The first version of the fundraising dashboard was backed by OLTP database queries. The code was filled with expensive database queries and data-munging Python. The Valuations team had the idea to work with Data Engineering to move the number crunching into dbt SQL models and replace the transactional database queries with a new OLAP API backed by aggregated datasets produced by dbt.

OLTP stands for online transactional processing. Think Postgres and MySQL — databases optimized for high-throughput, real-time operations. OLAP is short for online analytical processing. Think Snowflake, Redshift, and BigQuery. OLAP databases are excellent at handling complex queries on large datasets.

When you see a dynamic chart or a dashboard with filters in a product, the API backing that feature is a good candidate for an OLAP API. ORMs backed by a transactional database are excellent for CRUD operations, but when engineers build features that require lots of joins across tables, aggregating, or filtering data, there are three common patterns (a short sketch contrasting the first two follows this list):

  1. The first pattern is the most common for engineers who don’t know SQL — use the ORM to pull all the data into memory and reshape it into the format the API needs. The main problem with this approach is performance: if the dataset is large, the API will be slow.
  2. Use a SQL query builder like the SQLAlchemy Query API to build a complex SQL query in Python. The problem with this approach is that it’s a complicated abstraction on top of SQL.
  3. Write raw SQL embedded in Python code. This works well for simple SQL queries with no aggregation, filtering, or grouping. But if the SQL needs to be dynamic, with SQL injection protection and parameter binding, this approach starts to fall apart.
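To make the tradeoff concrete, here is a short sketch contrasting the first two patterns. The FundingRound model, its fields, and the connection string are assumptions invented for the example, not Carta’s schema.

    # Contrasting pattern 1 (aggregate in Python) with pattern 2 (query builder).
    # The FundingRound model, fields, and DSN are illustrative assumptions.
    from collections import defaultdict

    from sqlalchemy import Column, Integer, Numeric, String, create_engine, func, select
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class FundingRound(Base):
        __tablename__ = "funding_rounds"
        id = Column(Integer, primary_key=True)
        stage = Column(String(32))
        amount = Column(Numeric)

    engine = create_engine("postgresql://user:pass@localhost/app")  # placeholder

    # Pattern 1: pull every row into memory and reshape in Python.
    # Simple to write, but slow once the table has millions of rows.
    with Session(engine) as session:
        totals, counts = defaultdict(float), defaultdict(int)
        for row in session.scalars(select(FundingRound)):
            totals[row.stage] += float(row.amount)
            counts[row.stage] += 1
        averages = {stage: totals[stage] / counts[stage] for stage in totals}

    # Pattern 2: push the aggregation into SQL with the query builder.
    # One round trip; the database does the heavy lifting.
    with Session(engine) as session:
        stmt = (
            select(FundingRound.stage, func.avg(FundingRound.amount))
            .group_by(FundingRound.stage)
        )
        averages = dict(session.execute(stmt).all())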

Additionally, when the data needed for a report or chart comes from another service, the code gets even more complicated: fetch some data from the transactional database we can query directly, fetch other data from an API, and merge the results. These problems are not new, and there are many great solutions (e.g., GraphQL), but in this case Data Engineering chose to build an OLAP API for fundraise benchmarks.

Architecture diagram of Carta’s Data Products backend

At Carta, OLAP APIs are one solution we’ve been using for analytics-heavy product features. OLAP APIs are a data product — a programmatic interface to documented, versioned, code-reviewed dbt data models. Providing a great BI dashboarding experience for these data models is the baseline, but product teams at Carta also want to pull these aggregated datasets into applications. We started with a gRPC interface for backend engineers to query datasets produced by the data team. Almost immediately, frontend engineers requested REST APIs. So I went to work coding a separate REST interface for our OLAP APIs. Fortunately, Carta’s Infrastructure Team caught me reinventing the wheel during code review: with just a few lines of transcoder configuration, we added REST support to our gRPC endpoints. We now have consistent data across BI dashboards and data products, because both use OLAP APIs to query the same underlying dbt data models. We have 7 OLAP APIs in production, with 11 contributors to the project.
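As an illustration of what this dual surface looks like to a caller, here is a hedged sketch; the module, service, method, host, and route names are all hypothetical, not Carta’s actual API.

    # Querying a hypothetical OLAP API two ways: natively over gRPC, and over
    # the REST surface added by gRPC-JSON transcoding. Every name here is an
    # assumption, not Carta's actual API.
    import grpc
    import requests

    # Stubs generated from the service's .proto definition (hypothetical names).
    from benchmarks_pb2 import FundingBenchmarksRequest
    from benchmarks_pb2_grpc import BenchmarksStub

    # 1. Backend services call the gRPC endpoint directly.
    channel = grpc.insecure_channel("olap-api.internal:50051")
    stub = BenchmarksStub(channel)
    grpc_response = stub.GetFundingBenchmarks(
        FundingBenchmarksRequest(stage="Series A", industry="SaaS")
    )

    # 2. Frontend code hits the same method through the transcoded REST route.
    rest_response = requests.get(
        "https://olap-api.internal/v1/benchmarks",
        params={"stage": "Series A", "industry": "SaaS"},
    )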

Machine Learning in Production (MLOps)

Data products backed by machine learning will soon be in every piece of software we use — ChatGPT, GitHub Copilot, and Stable Diffusion are just the beginning. Machine Learning engineers at Carta are constantly researching, prototyping, and iterating on models. For example, until recently our Launch team was manually scouring documents for the information they needed to complete company profiles, so we helped an ML engineer launch a model that extracts key fields from articles of incorporation. The Launch team can now upload a PDF to an API that returns the extracted data in a structured response.
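From the consuming team’s perspective, that is a single HTTP call. Here is a sketch with a hypothetical endpoint and response fields:

    # Uploading a PDF to a document-extraction service and reading the
    # structured response. The endpoint and response fields are hypothetical.
    import requests

    with open("articles_of_incorporation.pdf", "rb") as pdf:
        response = requests.post(
            "https://ml-extraction.internal/v1/extract",
            files={"file": pdf},
        )
    response.raise_for_status()
    fields = response.json()
    print(fields.get("company_name"), fields.get("authorized_shares"))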

Production-grade Machine Learning (ML) is an essential data team capability. Every organization is racing to figure out how to bring more ML applications online. Data engineers know how to build and deploy containerized backend services with automated test suites. Data engineers at Carta work closely with MLOps to rapidly take research prototypes from a rough draft to production. Typically data engineers enable the ML team by setting up gRPC boilerplate, CI/CD, testing automation, and Helm/Kubernetes configuration. This allows ML engineers to focus on improving models and running tests without needing to become DevOps experts.
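Here is a minimal sketch of the kind of gRPC boilerplate we mean, with a hypothetical servicer standing in for the generated code:

    # A minimal gRPC serving scaffold of the kind data engineers hand to ML
    # engineers. The servicer registration uses code that would be generated
    # from a .proto file; extraction_pb2_grpc is a hypothetical module name.
    from concurrent import futures

    import grpc
    import extraction_pb2_grpc  # generated stubs (hypothetical)

    class ExtractionServicer(extraction_pb2_grpc.ExtractionServicer):
        def Extract(self, request, context):
            # Model inference goes here; the data engineer owns the scaffold
            # (CI/CD, tests, Helm charts), the ML engineer owns this body.
            raise NotImplementedError

    def serve() -> None:
        server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
        extraction_pb2_grpc.add_ExtractionServicer_to_server(
            ExtractionServicer(), server
        )
        server.add_insecure_port("[::]:50051")
        server.start()
        server.wait_for_termination()

    if __name__ == "__main__":
        serve()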

A Call for Software Engineers

The stories in this post about our data engineering culture may be unique to Carta, but the problems are not. Data Engineering is an essential role. Standardization and improvements in data engineering tooling have allowed Carta to move beyond internal dashboards and reports. The same datasets and ML models used for BI are now unlocked for engineering teams to build data products using OLAP APIs. But this doesn’t happen without software engineers.

Data Teams need more software engineers. If you are a developer reading this and the work sounds interesting, go talk to a data scientist or ML engineer about the problems they are facing. If you are a leader, go recruit DevOps and software engineers to join the data team. Data Scientists and ML Engineers are doing incredible work with applications beyond reporting and research. New infrastructure and automation are required to productionize that work for use in data products.

Data teams need engineers with experience in containerization, CI/CD, test automation, backend APIs, and SQL ORMs. As with ML, we need tool builders. The frameworks available to data teams are much better than they were ten years ago, but we need the same diversity of tooling that full-stack developers enjoy. If you are interested in a career in Data Engineering, check out Carta Careers. Send us a note to share your experience. What have we missed in this brief survey of data engineering?
