Stop Thinking in Data Pipelines, Think in Data Platforms: Introducing the Analytics Engineering Framework

Oscar Pulido
Google Cloud - Community
7 min read · Oct 28, 2024

Imagine a world where you could deploy your entire enterprise-ready data platform in minutes and empower your data practitioners to independently write complex, end-to-end data pipelines in a standardized and scalable way, allowing them to focus on insights from day one.

This is the vision behind the Analytics Engineering Framework (AEF), a comprehensive, opinionated design and sample code for building and deploying robust, flexible, and cost-effective data platforms in Google Cloud Platform (GCP).

The challenge with maintainable Data Pipelines

While BigQuery offers a comprehensive solution for modern data challenges by unifying data management, governance, and analysis within a single platform, building a robust DataOps strategy around it often feels like assembling a complex puzzle: organizations must piece together various tools for IaC, CI/CD, data ingestion, orchestration, and governed cataloging.

This complexity leads to substantial engineering costs for setup, maintenance, and training, creating a high barrier to entry for organizations seeking to leverage the power of data analytics on Google Cloud Platform while simultaneously adhering to software engineering principles for a maintainable platform.

The situation is further complicated by the often unclear ROI of data initiatives. Investing heavily in a complex platform can be risky, especially when the value derived from the data is uncertain.

  • Fragmented Development: Individual pipelines are built in isolation, leading to inconsistency and redundancy, and making it harder to grow the platform in an organized way or implement new use cases quickly.
  • Centralized Bottleneck: Data practitioners often rely on data engineers to build and manage pipelines, creating a bottleneck and hindering agility.
  • Limited Scalability and Flexibility: Scaling and adapting pipelines to new use cases can be challenging and time-consuming.
  • Cost Inefficiency: Orchestration tools like Apache Airflow can be costly to operate at scale; some organizations need more cost-effective alternatives.

Data teams often end up with technical debt around CI/CD, IaC, observability, and the principle of least privilege. A foundational data platform that proactively addresses these gaps lets teams concentrate their efforts on building their data pipelines.

AEF: A Paradigm Shift in Data Platform CI/CD

The AEF addresses these challenges by introducing a paradigm shift in data platform development. Its core design principles are:

1. CI/CD and Parameter-Driven Approach:

AEF leverages a multi-repository strategy, with dedicated repositories for:

  1. Orchestration Framework: Maintained by analytics engineers to provide seamless, extensible orchestration and execution infrastructure.
  2. Data Model: Directly used by end data practitioners to manage data models, schemas, and Dataplex metadata.
  3. Data Orchestration: Directly used by end data practitioners to define and deploy data pipelines using levels, threads, and steps.
  4. Data Transformation: Directly used by end data practitioners to define, store, and deploy data transformations.

This separation of concerns allows for independent deployment, scalability, and clear ownership of different platform components. Each repository should have its own CI/CD pipeline, enabling independent deployment and faster iteration cycles.

2. Embracing the Analytics Engineering Concept:

AEF is built on the principles of analytics engineering, empowering data practitioners to independently build, organize, transform, and document data using software engineering best practices. This fosters a self-service data platform where data practitioners can create their own data products while adhering to a federated computational governance model.

3. Agnostic to Orchestration and Processing Tools:

AEF is designed to be agnostic to the orchestration tool and data processing engine. While it provides sample orchestration code for Cloud Workflows and Cloud Composer, it can be integrated with other tools based on specific needs. This flexibility allows for seamless integration with existing systems and future-proofs the data platform.

4. Serverless-First Approach:

AEF prioritizes serverless technologies, leveraging the scalability, cost-effectiveness, and ease of use of services like Cloud Functions, BigQuery, and Cloud Workflows. This minimizes the need for long-running servers, reducing operational overhead and costs.

5. Cost-Effectiveness:

By leveraging serverless technologies and providing a standardized framework for data pipeline development, AEF significantly reduces the overall cost of building and operating a data platform, making it accessible to a wider range of organizations.

Multi-Repository Strategy is Core to AEF

Data models change rarely, because changes to them have significant impact, but they require robust schema evolution controls. Data pipelines, by contrast, change more frequently and are developed by various personas. Reusable data transformations can also be shared across teams.

Additionally, core capabilities must be secured and standardized to ensure availability for multiple diverse teams while enabling a self-served data platform experience.

Therefore, segregation of responsibilities, independent repositories, and CI/CD pipelines are crucial for a scalable and robust data platform.

The multi-repository strategy and a robust CI/CD strategy are at the heart of AEF’s design, enabling:

  • Clear Ownership and Responsibility: Different teams can own and manage specific repositories, fostering a sense of ownership and accountability.
  • Independent Deployment and Scalability: Independent CI/CD pipelines for each repository allow for granular control over deployment and scaling of individual components that are released at different frequencies.
  • Isolation of Failures: Failures in one repository are less likely to impact other components, ensuring overall platform stability.
  • Easier Rollbacks: Changes can be rolled back more easily at a granular level, minimizing the impact of deployment issues.
  • Parallel Development: Multiple teams can work on different repositories simultaneously, accelerating development and fostering collaboration.

CI/CD plays a crucial role by automating deployment, enabling a self-service approach and minimizing errors. It also provides version control and rollbacks, allowing for easy recovery when issues arise. Automated tests can be integrated into the CI/CD pipeline to ensure the quality and integrity of the code. By streamlining these processes, CI/CD enables faster iteration cycles, accelerating the development and deployment of new features and bug fixes.

Domain-Based Orchestration vs. Central Orchestration

Data orchestration is crucial for data lakes and warehouses. While Cloud Composer offers benefits, it doesn’t remove Airflow’s operational challenges. A serverless alternative based on Cloud Workflows, with automatic scaling and a pay-per-use model, optimizes resource usage, reduces costs, and accelerates time to value for some use cases.

The AEF offers the flexibility to choose the orchestration framework that best fits each organization.

Domain-Based Orchestration: Isolating orchestration by domain potentially simplifies IAM and networking management but may lead to increased operational overhead and complexity. This approach is preferred in multi-domain environments with distinct data access and processing needs. This is demonstrated in the Data Orchestration repository, where one Composer environment is managed and deployed for each data domain team, with each team owning a copy of the repository.

Central Orchestration: Consolidating orchestration into a single project centralizes Data Ops and potentially reduces management complexity. This approach may be simpler to understand and manage, particularly in single-domain environments or those with shared networks. However, it may necessitate IAM adjustments. This is easily managed when using Cloud Workflows, as its serverless nature enables deployment within a single centralized project for all data domain teams.

Both approaches are valid, and the choice can be adapted to specific organizational requirements and constraints. Factors such as the number of domains, data access patterns, and networking configurations should inform the decision.

Data pipeline abstractions

One of the central components of the AEF is its approach to data pipeline orchestration. Recognizing the need for a simplified yet powerful orchestration abstraction, AEF introduces three core abstractions:

  • Steps: Represent individual data transformations, such as executing a BigQuery saved query, running a Dataflow or serverless Dataproc job, or even triggering a Dataform repository run.
  • Threads: Group a sequence of steps that execute one after the other; multiple threads can run in parallel, enabling concurrent execution of different sets of tasks.
  • Levels: Allow for a combination of sequential and parallel execution, with multiple threads running concurrently within a level and subsequent levels executing only after all tasks in the previous level are completed.

With these three simple concepts, easily defined as parameter files, end data practitioners can independently define complex data pipelines.
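
To make these abstractions concrete, here is a minimal sketch of what such a parameter file could look like, expressed as a Python dictionary and serialized to JSON. The field names (levels, threads, steps) and the step types shown are illustrative assumptions and do not necessarily match the exact schema used by the AEF sample code.

```python
import json

# Hypothetical pipeline definition: two levels, the first with two
# parallel threads, each thread running its steps sequentially.
pipeline = {
    "name": "sales_daily_load",
    "levels": [
        {
            "threads": [
                {"steps": [
                    {"type": "bq_saved_query", "id": "stg_orders"},
                    {"type": "bq_saved_query", "id": "stg_customers"},
                ]},
                {"steps": [
                    {"type": "dataflow_job", "id": "ingest_clickstream"},
                ]},
            ]
        },
        {
            # Runs only after every thread in the previous level finishes.
            "threads": [
                {"steps": [{"type": "dataform_run", "id": "sales_mart"}]},
            ]
        },
    ],
}

with open("sales_daily_load.json", "w") as f:
    json.dump(pipeline, f, indent=2)
```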

Translating these high-level orchestration definitions into actual Cloud Workflows definitions or Airflow DAGs, and then deploying them, is handled as CI/CD pipeline steps.
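
As an illustration of what such a CI/CD step might produce, the following sketch expands the hypothetical definition above into an Airflow DAG: steps within a thread are chained sequentially, threads within a level run in parallel, and each level starts only after the previous one finishes. It assumes a recent Airflow 2.x and uses EmptyOperator as a stand-in for the real operators a generator would emit; it is not the AEF's actual translation code.

```python
import datetime
import json

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Load the high-level pipeline definition produced by the data practitioner.
with open("sales_daily_load.json") as f:
    pipeline = json.load(f)

with DAG(
    dag_id=pipeline["name"],
    start_date=datetime.datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    previous_level_end = None
    for i, level in enumerate(pipeline["levels"]):
        # Barrier tasks mark the start and end of each level.
        level_start = EmptyOperator(task_id=f"level_{i}_start")
        level_end = EmptyOperator(task_id=f"level_{i}_end")
        if previous_level_end is not None:
            previous_level_end >> level_start
        for j, thread in enumerate(level["threads"]):
            upstream = level_start
            for step in thread["steps"]:
                # A real generator would map step["type"] to a concrete
                # operator (BigQuery, Dataflow, Dataform, ...).
                task = EmptyOperator(task_id=f"{i}_{j}_{step['id']}")
                upstream >> task
                upstream = task
            upstream >> level_end
        previous_level_end = level_end
```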

Similarly, data practitioners will write simple parameter files to define their data transformations in the corresponding repository.
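
As a rough illustration, a transformation entry might be little more than a reference to a versioned SQL file plus its execution parameters, which a runner passes to the BigQuery client. Both the format and the runner below are assumptions for the sake of the example (including the "my-project" placeholder), not the AEF's actual schema.

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

# Hypothetical transformation definition, as it might live in the
# data transformation repository.
transformation = {
    "id": "daily_sales_summary",
    "query_file": "queries/daily_sales_summary.sql",
    "destination_table": "analytics.daily_sales_summary",
    "write_disposition": "WRITE_TRUNCATE",
}

client = bigquery.Client()
with open(transformation["query_file"]) as f:
    sql = f.read()

job_config = bigquery.QueryJobConfig(
    destination=f"my-project.{transformation['destination_table']}",  # placeholder project
    write_disposition=transformation["write_disposition"],
)
client.query(sql, job_config=job_config).result()  # run the query and wait
```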

In the same way, the data model repository will store Dataplex metadata definitions to enable data governance, discoverability, and access control. Additionally, the data model repository will keep track of DDLs, schemas and datasets, including their deployment and version control.
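
As an illustrative sketch of that deployment step, a CI/CD job could create datasets and tables from JSON schema files kept under version control. The project, dataset, and file names below are placeholders, and this is one possible approach using the google-cloud-bigquery client rather than the AEF's actual implementation.

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client()

# Hypothetical files kept under version control in the data model repository.
dataset_id = "my-project.analytics"               # placeholder project and dataset
schema_path = "schemas/daily_sales_summary.json"  # BigQuery JSON schema file

# Create the dataset if it does not exist yet, then create the table
# from the versioned schema definition.
client.create_dataset(dataset_id, exists_ok=True)
schema = client.schema_from_json(schema_path)
table = bigquery.Table(f"{dataset_id}.daily_sales_summary", schema=schema)
client.create_table(table, exists_ok=True)
```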

Hands on

This demo lets you easily deploy and test the AEF. It includes everything you need to get started, including sample data and sources. It deploys the entire AEF in a single Google Cloud project and walks through the complete lifecycle of a data product, from raw data to a final, usable table.

Conclusion

The AEF provides a blueprint for building modern, scalable, and cost-effective data platforms on GCP. Its multi-repository strategy, coupled with a robust CI/CD implementation, empowers data practitioners to independently build and manage complex data pipelines, accelerating time-to-insight and enabling organizations to focus on extracting value from their data.
