Data Pipeline Architecture Examples And Diagrams From Real Teams

Michael Segner
9 min read · Aug 17, 2023

Data pipelines move data from a source system to a target system, often with some transformation along the way.

It’s important to understand that this movement is rarely linear; rather, it is a series of highly complex and interdependent processes.

For example, data may be extracted from multiple sources, reorganized, and joined at different times before ultimately being delivered to its final destination.

Good data pipeline architecture is critical to solving the five Vs posed by big data: volume, velocity, veracity, variety, and value. A well-designed pipeline will meet use case requirements while remaining efficient from a maintenance and cost perspective.

Now, let’s take a look at some common data pipelines before diving into some diagrams from real data teams.

In This Article:

- 5 Data pipeline architecture designs and their evolution

- 7 Best Practices

- 6 Data Pipeline Architecture Diagrams From Real Data Teams

5 Data pipeline architecture designs and their evolution


The Hadoop era, roughly 2011 to 2017, arguably ushered big data processing capabilities into mainstream organizations.

Data then, and even today for some organizations, was primarily hosted in on-premises databases with non-scalable storage. Despite Hadoop’s parallel and distributed processing, compute was a limited resource as well.

As a result, data engineers spent considerable time modeling data and optimizing queries to fit within these constraints. Data pipeline architecture typically consisted of hardcoded pipelines that cleaned, normalized, and transformed the data prior to loading into a database using an ETL pattern.

Some organizations still choose to leverage an ETL pattern in the cloud, particularly for production pipelines where data contracts can help reduce data downtime.
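To make the pattern concrete, here is a minimal sketch of an ETL flow in Python, where cleaning and normalization happen before anything lands in the target database. The file, column, and table names are hypothetical placeholders, and sqlite3 simply stands in for the destination database.

```python
# Minimal ETL sketch: transform *before* loading into the target database.
# File, column, and table names are hypothetical placeholders.
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: pull raw records from a source system (here, a CSV export).
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and normalize before the data ever reaches the database.
    df = df.dropna(subset=["order_id"])
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["amount"] = df["amount"].round(2)
    return df

def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # Load: write the already-modeled data into the target table.
    df.to_sql("orders", conn, if_exists="append", index=False)

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("orders_export.csv")), conn)
```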


The modern data stack era, roughly 2017 to the present, saw the widespread adoption of cloud computing and modern data repositories that decoupled storage from compute, such as data warehouses, data lakes, and data lakehouses. The term modern data stack refers to the multiple modular SaaS solutions that comprise the data platform and pipeline (more on those later).

With cost and physical compute/storage limitations largely lifted, data engineers started to optimize data pipeline architecture for speed and agility. Data could now be extracted and loaded prior to being transformed for its ultimate use. The scale and flexibility of the cloud, combined with an ELT design pattern, unlocked additional valuable use cases such as more widespread analytics, experimentation, and machine learning applications.
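For contrast, here is a minimal ELT sketch using the same hypothetical names: the raw extract is loaded as-is, and the modeling happens afterward inside the warehouse with SQL, which is the layer tools like dbt manage in a modern data stack. sqlite3 again stands in for a cloud warehouse.

```python
# Minimal ELT sketch: land raw data first, then transform it in the warehouse.
# File, table, and column names are hypothetical placeholders.
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")

# Extract + Load: dump the source extract as-is into a raw staging table.
raw = pd.read_csv("orders_export.csv")
raw.to_sql("raw_orders", conn, if_exists="replace", index=False)

# Transform: modeling happens after load, close to where analysts query it.
conn.executescript("""
DROP TABLE IF EXISTS stg_orders;
CREATE TABLE stg_orders AS
SELECT
    order_id,
    DATE(order_date) AS order_date,
    ROUND(amount, 2) AS amount
FROM raw_orders
WHERE order_id IS NOT NULL;
""")
conn.close()
```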


While these batch data pipelines were ideal for analytical use cases, there has been an increasing need for near real-time data. Streaming data pipeline architectures typically run in parallel to modern data stack pipelines and are used mainly for data science or machine learning use cases. The pattern can be described as stream, collect, process, store, and analyze.

This pipeline architecture can present challenges, such as the inability to fully validate data quality or model the data. However, the value and benefits of near real-time data are hard to deny.
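Here is a minimal sketch of the stream, collect, process, store, and analyze pattern, using kafka-python as the collection layer. The broker address, topic, and table names are hypothetical, and the in-stream validation is deliberately lightweight, reflecting the limited quality checks and modeling possible at this stage.

```python
# Minimal streaming sketch: stream -> collect -> process -> store -> analyze.
# Broker address, topic, and table names are hypothetical placeholders.
import json
import sqlite3
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "ride_events",                       # hypothetical topic
    bootstrap_servers="localhost:9092",  # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

conn = sqlite3.connect("events.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS ride_events (event_id TEXT PRIMARY KEY, duration_s REAL)"
)

for message in consumer:                 # stream / collect
    event = message.value
    if event.get("duration_s", 0) <= 0:  # process: lightweight validation only
        continue
    conn.execute(                        # store
        "INSERT OR IGNORE INTO ride_events VALUES (?, ?)",
        (event["event_id"], event["duration_s"]),
    )
    conn.commit()
# analyze: downstream jobs or notebooks query events.db in near real time.
```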

Data pipeline architectures are constantly being reinvented. Two emerging data pipeline architectures include zero ETL and data sharing.


Zero ETL is a bit of a misnomer. These pipelines differ from traditional ELT pipelines by doing the data cleaning and normalization prior to load. In many cases the data stays in a data lake and is queried from there rather than being moved to a data warehouse.

At the moment, this tight integration is possible because most zero-ETL architectures require both the transactional database and data warehouse to be from the same cloud provider. Some examples include AWS (Aurora to Redshift), GCP (Bigtable to BigQuery), and Snowflake (Unistore).

This trend competes with another undercurrent in data pipeline architecture, in which some organizations are considering decoupling their storage and compute at the vendor level with companies like Tabular.


No-copy data sharing is another emerging data pipeline architecture being pioneered by Snowflake (Secure Data Sharing) and Databricks (Delta Sharing). Unlike zero ETL, this process doesn’t actually involve ETL at all; instead, it provides expanded access (permissions) to the data where it’s stored.

Remember, this isn’t a binary choice. Most organizations deploy some or all of these data pipeline architectures.

7 Best Practices

  1. Map and understand the dependencies of your data pipeline using an automated data lineage solution. No human can keep track of all the different dependencies within a complex data pipeline, and documentation (when it exists) rarely gets the job done. Without this context your data team will either unintentionally commit breaking changes or be too petrified to change anything for fear of causing havoc downstream. Check out our article on 17 data lineage use cases for more context.
  2. Design your data pipeline so it is modular and automated. Your team will need to make frequent changes, so a simple, modular pipeline that is easy to change is often better than the perfect pipeline that needs to be completely refactored each time data changes at the source. While open source data pipeline solutions are attractive from a cost perspective, keep in mind the cost of maintenance: integrations typically lag, and teams can spend more time on their architecture than getting value from it.
  3. Create data pipeline SLAs (service level agreements). It’s crucial to match the data pipeline architecture to the use case. Do your consumers need data refreshed every second, minute, hour, or day? That will likely determine whether your data pipeline architecture features stream, micro-batch, or batch ingestion. Does data quality need to be high, or will directionally accurate data suffice? Codifying these expectations keeps all parties accountable (a sketch of codifying a freshness check appears after this list).
  4. Let the data drive the data pipeline architecture. Most data teams will be handling mostly structured data for analytical purposes making a data warehouse based data pipeline architecture a natural fit. Some data teams may be handling more unstructured data for data science use cases and consider a data lake. Others may have to create more custom built architectures as Netflix Studios Senior Data Engineer Dao Mi told us about his time at Nauto, which develops AI software for driver safety. “There the main data types we handled were telemetry and video from the dashcams. Our storage and infrastructure paradigm was built around how we could best process this data, which essentially meant a lot of homegrown builds because commercial options didn’t exist,” he said.
  5. Create data products. Data-as-a-product is a shift in thinking introduced in Zhamak Dehghani’s data mesh concept. A data product is a data asset (typically important tables or dashboards) that holds significant value for the company. Data products are discoverable, secure, governed, trustworthy (high data quality), and interoperable.
  6. Continuously review and optimize costs. Poorly written and degrading queries are not only expensive, but they can create data reliability issues as well. Check out some helpful strategies for optimizing Snowflake costs and identifying these problematic queries.
  7. Make pipelines idempotent. A pipeline is idempotent when executing an operation multiple times produces the same result as executing it once; making your pipelines idempotent prevents retries and reruns from creating inconsistent or duplicate data (a minimal sketch follows this list).
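Here is a minimal sketch of one common way to achieve this: a delete-and-reload of a single date partition inside one transaction, so a retry overwrites rather than appends. Table and column names are hypothetical placeholders.

```python
# Minimal idempotent-load sketch: rerunning a batch for the same date replaces
# that date's partition instead of appending duplicates.
import sqlite3

def load_partition(conn: sqlite3.Connection, rows: list, run_date: str) -> None:
    with conn:  # one transaction: the delete and insert commit (or roll back) together
        # Wipe whatever a previous (possibly failed or retried) run wrote for
        # this date, then insert the fresh batch.
        conn.execute("DELETE FROM daily_orders WHERE order_date = ?", (run_date,))
        conn.executemany("INSERT INTO daily_orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (order_id TEXT, order_date TEXT, amount REAL)")

batch = [("o-1", "2023-08-17", 42.0), ("o-2", "2023-08-17", 13.5)]
load_partition(conn, batch, "2023-08-17")
load_partition(conn, batch, "2023-08-17")  # retry with the same inputs

# Still 2 rows, not 4 -- the rerun left the table in the same state.
print(conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0])
```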

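Going back to best practice 3, here is a minimal sketch of what codifying a freshness SLA can look like in practice: the expectation is written down as a threshold and enforced by an automated check. The table, column, and 60-minute threshold are hypothetical placeholders.

```python
# Minimal freshness-SLA sketch: the table must be no more than 60 minutes stale.
# Table, column, and threshold values are hypothetical placeholders.
import sqlite3
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(minutes=60)  # agreed with data consumers up front

def check_freshness(conn: sqlite3.Connection) -> bool:
    # Assumes the load job writes an ISO-8601 UTC timestamp with every batch.
    (last_loaded,) = conn.execute("SELECT MAX(loaded_at) FROM daily_orders").fetchone()
    age = datetime.now(timezone.utc) - datetime.fromisoformat(last_loaded)
    return age <= FRESHNESS_SLA

# A scheduler or data observability tool runs this check on an interval and
# alerts the on-call engineer when it returns False, keeping the SLA enforceable.
```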
6 Data Pipeline Architecture Diagrams From Real Data Teams

JetBlue’s data pipeline architecture

JetBlue’s data pipeline architecture balances multiple data sources and data freshness requirements. Snowflake Tasks are used for near real-time loading in lieu of a more traditional streaming data pipeline architecture. Fivetran is used to batch ingest data, which lands in Snowflake and is transformed by dbt and monitored for quality by Monte Carlo. Databricks supports its ML and AI use cases. Learn more by checking out the webinar they did with Snowflake.

Fox Networks’ data pipeline architecture

Fox’s data pipeline architecture is designed for resilience and reliability, in alignment with the network’s goals. When you are airing the Super Bowl, you want everything to run without a hiccup. As such, reliability tools such as Datadog (application performance management), Monte Carlo (data observability), and PagerDuty (incident management) are all present. You can also see a robust streaming and microbatch architecture with Spark and various AWS services, given their need for real-time data.

Fox’s VP of Data Services, Alex Tverdohleb, told us his data pipeline architecture is designed with self-service in mind.

“Everything else, especially within data discovery and your ad-hoc analytics, should be free,” said Alex. “We give you the source of the data and guarantee it’s trustworthy. We know that we’re watching those pipelines multiple times every day, and we know that the data inside can be used for X, Y, and Z — so just go ahead and use it how you want. I believe this is the way forward: striving towards giving people trust in the data platforms while supplying them with the tools and skill sets they need to be self-sufficient.”

Swimply’s data pipeline architecture

Startup Swimply’s data pipeline architecture is well-automated and designed with simplicity and scale in mind. Fivetran, Snowflake, dbt, Monte Carlo, and Looker are solutions that integrate very well together.

Head of data Michael Shledon says, “Because we had this mandate as a data team to support the entire company, we needed a data stack that could solve two central issues. One, to centralize all of the data from all of the different parts of the company in one stable place that everyone could use and refer to as a source of truth. And two, to enable us to have enough time to really focus on the insights and not just the data infrastructure itself.”

Backcountry’s data pipeline architecture

Backcountry relies on data to power nearly every facet of its business, including customer acquisition, segmentation and personalization, marketing, merchandising, and customer support. But the data team was ingesting large amounts of data from an ever-increasing number of sources, and the legacy system wasn’t able to perform at scale.

So the Backcountry team had to move quickly to migrate from a legacy stack to a modern data stack. They built their stack around the Google Cloud Platform, including BigQuery, Looker, Airflow, and Fivetran.

BlaBlaCar’s data pipeline architecture

BlaBlaCar’s data pipeline architecture is part of its data mesh implementation. This transition to a more decentralized structure was critical to accelerate team efficiency following two M&As.

“We had some new use cases around data science that were hard to fit into our existing org. That’s essentially what triggered the discussion of ‘We need to do things differently,’” said Emmanuel Martin-Chave, VP of data analytics.

Drata’s data pipeline architecture

VP of data Lior Soloman got to build this modern data architecture from the ground up and he stood it up fast. He said, “Choosing the right technology stack is critical for a fast-growing startup. I wanted to select technologies that would scale with our fast-growing business, be easy to use, and seamlessly integrate with our existing systems.”

Now Go Build Some Data Pipelines!

Remember, it’s not about having the “right” data pipeline architecture; it’s about having the right data pipeline architecture for you.


Michael Segner

I write about data quality and data engineering for Monte Carlo.