Build or Buy? Lessons From Ten Years Building Customer Data Pipelines

Jun 9

“Before RudderStack, I tried to build customer data pipelines inside a large enterprise using homegrown and vendor solutions. This article summarizes what I learned both building and buying customer data pipelines over the last ten years and how I would approach the challenge today.”

- Soumyadeb Mitra

A major initiative for all modern companies is aligning every team around driving a better experience for their users or customers across every touchpoint. This makes sense: happy, loyal users increase usage, business growth, and ultimately revenue.

Creating powerful experiences for each user, especially when it comes to use case personalization, is easier said than done.

To drive great experiences across every team and technology, you must:

  • Collect every touchpoint each customer has with your business
  • Unify that data in a centralized, accessible repository
  • Enrich and deliver that data to every downstream tool (product, marketing, data science, sales, etc.)

Overlooking any of these measures makes eliminating bad user experiences nearly impossible. Your marketing and product teams will continue to send email promotions for products your customers have already bought, and data science won’t be able to produce models that increase lifetime value.

With customer expectations higher than ever, failing to unify and activate your customer data means slower growth and higher churn.

The Big Decision: Build or Buy?

I’m an engineer by trade, and before starting RudderStack, I spent years at a large enterprise trying to unify and activate customer data with both homegrown and vendor pipeline solutions.

For the last two years, I've worked with a team of 20+ engineers building customer data pipeline solutions at RudderStack. Based on those combined ten years of experience, here's why I think trying to build your own solution is a mistake.

The Challenges I Faced in Building

As is often the case, the idea of building something differs from the reality. Here are the things I wish I had considered before going down the build path:

Scaling is a Challenge

Ultimately, you want to track every action your users take, from clicks to searches to transactions. That event volume for even a mid-sized B2C company can easily get into millions of events per day.

First, building any system to handle that scale isn’t trivial from a collection and processing standpoint, and because the data is driving actual experiences, latency can cause major problems.

Event volume peaks are another major issue. Planning for average usage is one thing, but the system needs to handle significant spikes in volume without any additional configuration or infrastructure management. (Otherwise, the engineering team is always putting out fires whenever the business has a big event like a sale, peak season, etc.).
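One common pattern for absorbing spikes is to decouple ingestion from delivery: accept events into a queue as fast as they arrive, and drain them downstream in fixed-size batches. A minimal in-memory sketch (a production pipeline would back this with a durable log so events survive restarts; the class and sizes here are illustrative):

```python
from collections import deque

class EventBuffer:
    """Absorbs bursts of incoming events and drains them in fixed-size
    batches, so downstream delivery sees a smooth rate even when
    ingestion spikes. In-memory sketch only."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.queue = deque()

    def ingest(self, event):
        # Ingestion is cheap: just append, never block the client.
        self.queue.append(event)

    def drain(self):
        # Deliver at most one batch per call, regardless of queue depth.
        batch = []
        while self.queue and len(batch) < self.batch_size:
            batch.append(self.queue.popleft())
        return batch

buf = EventBuffer(batch_size=3)
for i in range(7):          # a burst of 7 events arrives at once
    buf.ingest({"id": i})
print([e["id"] for e in buf.drain()])  # → [0, 1, 2]
```

The point of the design is that a traffic spike only deepens the queue; it never increases the rate at which the warehouse or downstream tools are hit.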

If you don’t build the system for efficiency at scale from the outset, costs can also become a major problem.

Managing these considerations while ensuring low latency and minimal performance overhead almost always means engineering resources are being used to collect and route data instead of working on the actual product and product-related infrastructure.

Building and Managing Data Source Integrations is an Annoying, Never-Ending Problem

Customer data comes from a wide range of sources:

  • Website
  • App
  • Cloud tools
  • Payment/transaction data
  • Messaging data (email, push, SMS)
  • Advertising data (Google, Facebook, etc.)
  • Cloud apps (CRM, marketing automation, etc.)

Building integrations for every data source is not only a huge amount of work but requires knowledge of a diverse set of programming languages, libraries, and application architectures. Worse yet, the integrations are constantly evolving with new features and API versions.

Another major challenge is that there are different categories of pipelines. Event data comes from the website and app and needs to be delivered in real-time, but customer data from cloud sources is often tabular and needs to be loaded in batches.
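The two categories even have different natural shapes: a behavioral event is a nested, per-action record delivered within seconds, while batch sources arrive as flat rows. Bridging them means flattening one into the other. A sketch (field names follow common event-tracking conventions, not any fixed spec):

```python
from datetime import datetime, timezone

# A behavioral event: one nested record per user action,
# delivered in (near) real time.
event = {
    "type": "track",
    "event": "Order Completed",
    "user_id": "u_42",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "properties": {"order_id": "o_981", "total": 59.99},
}

def event_to_row(evt):
    """Flatten a nested event into the tabular shape a warehouse
    (or a batch pipeline from a cloud source) expects."""
    row = {k: v for k, v in evt.items() if k != "properties"}
    for key, value in evt.get("properties", {}).items():
        row[f"prop_{key}"] = value   # hoist nested fields into columns
    return row

print(event_to_row(event)["prop_total"])  # → 59.99
```

A homegrown system ends up maintaining both shapes, plus the conversion logic between them, for every integration.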

Keeping an engineering team motivated to keep messy integrations up to date is a huge challenge, not to mention concerns around the value of using expensive, smart resources to maintain base-line functionality.

Unifying Data Isn’t as Easy as Dumping Customer Events Into Your Data Warehouse / Data Lake.

It turns out to be hard. To unify customer data in a warehouse, you must:

  • Handle millions of events per hour while managing warehouse performance and cost
  • Manage schema changes as your event structures change (which happens all of the time with customer data)
  • Dedupe events
  • Maintain event ordering
  • Automatically handle failures (during warehouse maintenance, for example)

There are other challenges, but even that small list shows you how complex the actual architecture is.
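Two items on that list, deduping and ordering, can at least be sketched in a few lines, assuming each event carries a unique `message_id` and a `timestamp` (common conventions in event-tracking payloads, not a fixed spec):

```python
def dedupe_and_order(events):
    """Drop events whose message_id was already seen (client retries
    routinely produce duplicate deliveries), then restore timestamp
    order, which network delays routinely scramble."""
    seen = set()
    unique = []
    for evt in events:
        if evt["message_id"] in seen:
            continue
        seen.add(evt["message_id"])
        unique.append(evt)
    return sorted(unique, key=lambda e: e["timestamp"])

events = [
    {"message_id": "b", "timestamp": 2},
    {"message_id": "a", "timestamp": 1},   # arrived late
    {"message_id": "b", "timestamp": 2},   # duplicate delivery
]
print([e["message_id"] for e in dedupe_and_order(events)])  # → ['a', 'b']
```

The hard part in practice isn't this logic; it's doing it at millions of events per hour without the `seen` set (really a persistent store) becoming a bottleneck.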

Every Downstream Team Wants Data Integrated Into Their Tools Too

For example, your product team will want event stream data in analytics tools such as Amplitude or Mixpanel. The marketing team will want those events in Google Analytics and Marketo, and the data science team will want them delivered to its own infrastructure.

As I said above, managing these integrations is a full-time engineering function.

More importantly, event stream connections (source-destination) aren’t the only pipeline you have to manage. Downstream teams are increasingly requiring data enriched in the warehouse for use cases like lead scoring, win-back campaigns for customers likely to churn, and maintaining a single view of the customer in each tool. Building pipelines from your warehouse to each downstream tool is a significant engineering problem in and of itself.
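At its simplest, a warehouse-to-tool pipeline is a scheduled sync: query the enriched rows, diff them against what was sent last run, and push only the changes. A sketch (the row shape, snapshot store, and `push_to_tool` callback are placeholders for whatever your stack actually uses):

```python
def sync_changed_rows(rows, last_synced, push_to_tool):
    """Push only rows that changed since the previous run.
    `rows` is the current warehouse query result keyed by user_id;
    `last_synced` is the snapshot persisted from the previous run."""
    pushed = {}
    for user_id, row in rows.items():
        if last_synced.get(user_id) != row:
            push_to_tool(user_id, row)   # e.g. a CRM update call
            pushed[user_id] = row
    # Return the new snapshot to persist for the next run.
    return {**last_synced, **pushed}

current = {"u1": {"score": 80}, "u2": {"score": 55}}
previous = {"u1": {"score": 80}}        # u1 unchanged, u2 is new
sent = []
snapshot = sync_changed_rows(current, previous,
                             lambda uid, row: sent.append(uid))
print(sent)  # → ['u2']
```

Multiply this by every downstream tool, each with its own rate limits and auth, and the "significant engineering problem" above becomes concrete.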

Dealing With Privacy Regulations is Complex When There are so Many Sources and Destinations

Complying with regulations means tracking consent and honoring deletion requests across every source and destination. Not only is the exercise hard and boring, but deciding to build that functionality into your internal pipeline product significantly increases the complexity and extends the project into building internal tooling.

Unfortunately, you don't have a choice on this one: in the era of GDPR, CCPA, and other regulatory measures, non-compliance can lead to millions of dollars in fines.

The Complexity of Error Handling Scales With the Number of Integrations

The short version: error handling is hard. Any of your sources, destinations, or warehouses can go down for any length of time. (Keep in mind that many systems, like the warehouse, are both sources and destinations!)

The reality is that your system needs to ensure you don’t lose events in any circumstance and maintain the event order for future delivery when there is a problem.
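The usual shape of the fix is a per-destination queue: when a delivery fails, stop draining that queue and retry the same event with backoff, so later events wait behind the failed one instead of arriving out of order. A sketch:

```python
import time

def deliver_in_order(queue, send, max_retries=3, base_delay=0.01):
    """Drain `queue` front-to-back. If `send` raises, retry the SAME
    event with exponential backoff; never skip ahead, so ordering is
    preserved. Returns whatever is still undelivered (e.g. if the
    destination stays down), for a later resume."""
    while queue:
        event = queue[0]
        for attempt in range(max_retries):
            try:
                send(event)
                queue.pop(0)
                break
            except Exception:
                time.sleep(base_delay * 2 ** attempt)
        else:
            return queue   # give up for now; resume later from here
    return []

calls = []
def flaky(evt):
    calls.append(evt)
    if evt == "e2" and calls.count("e2") < 2:
        raise RuntimeError("destination hiccup")   # fails once, then recovers

print(deliver_in_order(["e1", "e2", "e3"], flaky))  # → []
```

Note that "e2" is attempted twice before "e3" is ever sent; that head-of-line blocking is the price of ordering, and tuning it per destination is exactly the kind of work that never ends.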

The Challenges I Faced With Buying

It’s Not Just Vendor Lock-in Anymore; it’s Data Lock-in

Vendors who store data leverage it to drive a few specific features well, but that never serves every need you have to fill for downstream teams. For example, pipelines built for enterprise-scale don’t enable real-time use cases, while pipelines that serve marketing teams fail when ingesting tabular data from cloud sources. Neither serves data science well.

Lastly, and this is a personal pet peeve, paying a vendor to store a copy of data you’re already storing in your warehouse is an unnecessary cost in an already expensive system.

Most Vendors Don’t Support Complex Customer Data Stacks.

Central management would be great, but buying a separate solution for each part of the pipeline means managing multiple vendors, reconciling inconsistent data formats, and giving up system-wide control over data governance. More complex needs, such as identity resolution, are even harder to satisfy.
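Identity resolution alone is a nontrivial subsystem: it merges the anonymous IDs, emails, and device IDs that actually refer to the same person. The classic approach is union-find over observed identifier pairs. A minimal sketch (the pair source, e.g. an `identify`-style call linking an anonymous ID to an email, is assumed):

```python
def resolve_identities(linked_pairs):
    """Union-find over identifier pairs. Returns a mapping from each
    identifier to a canonical representative, so two identifiers map
    to the same value iff they belong to the same person."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in linked_pairs:
        parent[find(a)] = find(b)           # merge the two clusters

    return {x: find(x) for x in parent}

pairs = [("anon_1", "alice@example.com"),
         ("anon_2", "alice@example.com"),   # same email: same person
         ("anon_3", "bob@example.com")]
ids = resolve_identities(pairs)
print(ids["anon_1"] == ids["anon_2"])  # → True
```

Running this incrementally over billions of events, with merges that can retroactively reassign history, is far harder than the ten-line core suggests.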

The Cost Outweighs the Benefits

When a vendor manages both the processing and storage of your customer data, they up-charge you on their costs. Even for a moderately sized business (a few million users or a few billion events per month), the cost to operationalize the tool could be $250,000 to $500,000 annually, plus all of your internal costs.

Building yourself has massive hidden internal costs, but paying a vendor half a million dollars a year is a hard investment to justify, especially when it takes a while to get results.

What I Would do Today: Implement Warehouse-First Customer Data Pipelines

A warehouse-first approach puts your owned infrastructure (warehouse, systems, etc.) at the center of the stack, so you own and fully control all of your data, while outsourcing the parts that don't make sense to build with internal engineering resources. Importantly, warehouse-first pipelines don't store any data; they ingest, process, and deliver it.

Said another way, warehouse-first customer data pipelines allow you to build an integrated customer data lake on your warehouse. You don’t have to build the plethora of source and destination integrations or deal with peak volume or error handling. Instead, you retain full control over your data and the flexibility to enable any possible use case.

There are many benefits to the warehouse-first approach, but here are the top four based on my experience:

  • You can build for any use case: Instead of being limited to vendor-specific use cases, owning your data with flexible pipelines means you can enable all sorts of valuable use cases. You own everything, from driving real-time personalization on your website to delivering enriched lead profiles from your warehouse to your sales team.
  • You can deliver better data products: If you have flexible schemas, event stream pipelines, tabular pipelines, and warehouse → SaaS pipelines managed for you, your team can leverage the power of unified data in your warehouse to build creative and valuable data products for the entire company (instead of building and managing infrastructure).
  • You don’t have to deal with vendor security concerns: Because your warehouse is the central repository for data and the vendor doesn’t store any data, you can eliminate most security concerns common among 3rd-party vendors who store your customer data.
  • You can decrease costs: Quite simply, you don’t have to pay a vendor a premium to store data that already lives in your warehouse.

I Built RudderStack to Make it Easier For You to Build Better Customer Data Pipelines.

Sign up for Free and Start Sending Data

This blog was originally published at Plumbers Of Data Science, the data engineering community.