Data Engineering on Cloud Data Platform

To realize data-as-a-product, both within the organization and for our clients, the cloud data platform plays a very important role in our data engineering tech stack. In this article, we review how our data team manages our clients' data on Google Cloud Platform.

Basically, we try to have our data management meet the modern ground rules:

Today’s organizations want to cost-effectively load, transform, integrate, and analyze unlimited amounts of structured and semi-structured data.

This article follows the framework of Chapter 5, Outlining Technology Requirements, in Cloud Data Engineering for Dummies. We would like to review and share our technology choices and experience around the following three aspects, which capture what the ground rules mean in practice. This article is more a memoir than a formal sharing. 🌝

  • 7 Fundamental Principles
  • Ingesting Data in Streaming or Batch
  • Delivering Business Value

7 Fundamental Principles

Here are 7 principles against which we review (or preview) our data management.

1. Simplicity

Minimize data movement and reduce the number of systems we need.

► How Crescendo Lab has done:

So far, we have several systems serving as databases, ingesting data, and running our pipelines.

1. PostgreSQL database on Cloud SQL (OLTP)

2. BigQuery (OLAP)

3. Pub/Sub (handle streaming data)

4. Google Cloud Storage (handle batch files/ serve as data lake)

5. Airflow (pipeline management)

6. Dataflow (pipeline management)

We avoid having these systems overlap in responsibility, so that every workload is necessary and easy to maintain.

2. Flexibility

The data pipelines are flexible rather than brittle and uphold prevailing technology standards. Minimize maintenance when deploying new data sources and workloads.

► How Crescendo Lab has done:

1. Airflow: We manage our pipelines with Airflow on GKE. From a flexibility standpoint, this popular pipeline-management tool has a few pros and cons.

✔︎ pros:
It is pretty convenient to compose many kinds of workloads into a single pipeline, thanks to the community's efforts in creating and optimizing a wide range of operators.

✘ cons:
Airflow has strict constraints on the packages that can be installed in its own environment. Fortunately, KubernetesPodOperator works around this by launching new pods with customized images (see the sketch after this list).
For more details about KubernetesPodOperator: https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html

2. Others: We leverage GCP's native services to handle data integration, because most of our data systems are on GCP (e.g., Dataflow can take charge of moving data from Pub/Sub to BigQuery).
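To make the KubernetesPodOperator workaround concrete, here is a minimal sketch; the image name, namespace, and entrypoint script are assumptions for illustration, not our actual setup.

```python
# A minimal KubernetesPodOperator sketch: run a task in its own pod built
# from a customized image, so its dependencies never touch Airflow's own
# environment. Image, namespace, and script names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="custom_image_job",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = KubernetesPodOperator(
        task_id="transform_in_custom_image",
        name="transform-in-custom-image",
        namespace="airflow",
        image="gcr.io/my-project/my-etl:latest",  # hypothetical image
        cmds=["python", "run_transform.py"],      # hypothetical entrypoint
        arguments=["--date", "{{ ds }}"],         # Airflow templating passes the run date
        get_logs=True,                            # stream pod logs back to Airflow
    )
```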

3. Scale

Data pipelines are architected to accommodate growing workloads, more use cases, and an escalating number of users.

► How Crescendo Lab has done:

PostgreSQL (OLTP):
- Fine-tuning:
In the current phase, we address PostgreSQL's scalability issues by fine-tuning our workloads and schema indexing. We have not pushed PostgreSQL very hard so far, and our workloads are still far from real scalability issues (e.g., very large data volumes or high write pressure).

- Monitoring:
We use Query Insights for our PostgreSQL instances on Cloud SQL to monitor the query load on each database.

(Screenshot: Query Insights for Cloud SQL on GCP)

- Right tooling:
In addition to fine-tuning workloads and indexing on PostgreSQL, we have also started to clarify whether some use cases are suitable for PostgreSQL at all. For example, relational databases are not good at OLAP-style workloads such as aggregating member status or querying by complex conditions, so we moved these use cases to Elasticsearch for better performance (a sketch of such a query follows).
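For illustration only, a member-status aggregation of the kind we moved off PostgreSQL might look like the sketch below; the endpoint, index, and field names are assumptions rather than our real mapping.

```python
# A minimal sketch of an Elasticsearch terms aggregation: count members per
# status for one client, instead of running the GROUP BY on PostgreSQL.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")  # hypothetical endpoint

resp = es.search(
    index="members",  # hypothetical index
    body={
        "size": 0,  # we only need the aggregation, not the documents
        "query": {"term": {"client_id": "demo-client"}},
        "aggs": {"by_status": {"terms": {"field": "status"}}},
    },
)

for bucket in resp["aggregations"]["by_status"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```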

BigQuery (OLAP / data warehouse):
- Internal use:
We use Data Studio as our internal BI tool for different departments, e.g., PM and CSM. We leverage IAM to grant just enough permission for them to access the data while staying compliant with our data privacy policy.

- External use:
We use BigQuery to execute some heavyweight jobs. They have not caused any problems so far, but we prefer to keep query costs down, and partitioning and clustering are pretty helpful for that (see the table-definition sketch after this list).

- Patterns for organizing resources:
Following Google's recommended patterns for organizing BigQuery resources, we adopt `Department data lakes, central data warehouse`. It keeps our analytical results simple to manage in a centralized way, while the raw data and output from each product/feature are organized in their respective projects.
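As a concrete illustration of the partitioning and clustering mentioned above, here is a minimal sketch using the Python BigQuery client; the project, dataset, table, and field names are assumptions.

```python
# A minimal sketch: create a partitioned and clustered BigQuery table so that
# heavyweight queries scan less data and cost less.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project

schema = [
    bigquery.SchemaField("client_id", "STRING"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("event_time", "TIMESTAMP"),
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table("my-gcp-project.analytics.client_events", schema=schema)
# Partition by day on event_time, cluster by the columns we filter on most.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_time"
)
table.clustering_fields = ["client_id", "event_type"]

client.create_table(table, exists_ok=True)
```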

Pub/Sub:
- Dataflow:
We used to run a homemade Pub/Sub-to-BigQuery worker. It was pretty simple and cost-effective, but these workers struggled to scale during huge traffic peaks. Dataflow, the serverless ETL service and native choice on GCP, solves the scalability problem for us.
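A minimal sketch of such a streaming Pub/Sub-to-BigQuery pipeline with the Apache Beam Python SDK follows; the project, topic, bucket, and table names are assumptions, and our production pipeline is not necessarily this simple.

```python
# A minimal streaming pipeline on Dataflow: read JSON messages from Pub/Sub
# and append them to an existing BigQuery table.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-gcp-project",             # hypothetical project
    region="asia-east1",
    temp_location="gs://my-bucket/tmp",   # hypothetical bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-gcp-project/topics/events")
        | "ParseJSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-gcp-project:analytics.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```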

Google Cloud Storage:
We use GCS to preserve both system-generated and user-generated files for long-term storage. It is a very cost-effective storage solution, and we aim to leverage more of its features, e.g., lifecycle rules and different storage classes, to optimize our storage costs.

More about storage classes: https://cloud.google.com/storage/docs/storage-classes
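As a small example of the lifecycle features mentioned above, this sketch uses the google-cloud-storage client; the bucket name and the exact age thresholds are assumptions.

```python
# A minimal sketch: add lifecycle rules to a bucket so older objects move to
# a cheaper storage class and eventually get deleted.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")   # hypothetical project
bucket = client.get_bucket("my-archive-bucket")     # hypothetical bucket

# Move objects to Coldline after 90 days, delete them after ~3 years.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=1095)
bucket.patch()  # persist the updated lifecycle configuration
```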

4. Build or Buy

The more problems we run into, the more likely we need specific data systems to relieve the pipeline. We compare price, performance, and total cost of ownership across different data systems when deciding whether to buy or build data integration tools, utilities, and infrastructure.

► How Crescendo Lab has done:

- Dataflow / owning workers:
As we mentioned earlier, we used to rely on workers we owned on GKE to ingest Pub/Sub messages into BigQuery. That method did not work well once we hit scaling issues, so we now take a hybrid approach: our home-built, cost-effective workers serve as the baseline, and Dataflow is enabled on high-traffic Pub/Sub topics as the scaling solution.
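For context, the home-built baseline worker is conceptually close to the following sketch: a streaming pull from a Pub/Sub subscription with a streaming insert into BigQuery. The project, subscription, and table names are assumptions.

```python
# A minimal sketch of a Pub/Sub-to-BigQuery worker running as a long-lived
# pod: pull JSON messages and stream each row into BigQuery.
import json

from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client(project="my-gcp-project")            # hypothetical project
table = bq.get_table("my-gcp-project.analytics.events")   # hypothetical table

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-gcp-project", "events-sub")


def callback(message):
    row = json.loads(message.data.decode("utf-8"))
    errors = bq.insert_rows_json(table, [row])  # streaming insert
    if not errors:
        message.ack()
    else:
        message.nack()  # let Pub/Sub redeliver on failure


streaming_pull = subscriber.subscribe(subscription, callback=callback)
streaming_pull.result()  # block forever; the pod just keeps consuming
```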

5. Self-service

Democratize data by providing a friendly data environment within the organization.

► How Crescendo Lab has done:

As a company that tries to solve clients' problems with data, we believe every member of the organization, including people in business functions, should be able to access the data and see how their contribution is reflected in it.
- Metrics layer:
A system providing a standard set of definitions on top of the data warehouse. It helps teams that do not own the data understand its meaning efficiently.

- Data warehouse:
We follow the pattern for organizing BigQuery resources recommended by Google, `Department data lakes, central data warehouse`. All users can query data in our warehouse project, and the permissions on each dataset in the project can be adjusted for different teams (see the sketch after this list). The metrics-layer definitions are also presented in BigQuery schemas.

- Dashboards:
Thanks to the good integration between Data Studio and BigQuery, once users are granted the right access they are free to visualize our BigQuery data with Data Studio. To make this democratization actually happen, we also run internal workshops to train members to become Data Studio experts.
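A minimal sketch of adjusting dataset-level access with the Python BigQuery client; the project, dataset, and group email are assumptions used for illustration.

```python
# A minimal sketch: grant a team's Google group read access to one dataset
# in the central warehouse project.
from google.cloud import bigquery

client = bigquery.Client(project="my-warehouse-project")  # hypothetical project
dataset = client.get_dataset("my-warehouse-project.marketing_metrics")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="csm-team@example.com",  # hypothetical group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```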

6. Investments

By reviewing our existing languages, processing engines, and integration procedures, how do we maximize investments, minimize training, and make the most of available talent?

► How Crescendo Lab has done:

- KubernetesPodOperator on Airflow:
The existing asset we have is Airflow. To maximize its utility, we extended our existing pipelines by building an ML pipeline on top of it.

- Elasticsearch:
For high-frequency queries triggered from the customers' end, it is much safer and more cost-efficient to serve them from Elasticsearch than to rely on BigQuery.

7. Versatility

The data pipeline should be able to handle structured, semi-structured, and unstructured data.
It should also be able to ingest both batch data and real-time data.

► How Crescendo Lab has done:
- Structured: PostgreSQL, BigQuery
- Semi-structured: stored in BigQuery, plus one more table for the flattened version
- Unstructured: GCS

We will talk more about ingesting batch data and real-time data in the following parts.

Ingesting Data in Streaming or Batch

We basically ingest our data from different systems, and all of our ingestion sources/storage destinations are:

  • GCS
  • BigQuery
  • Pub/Sub

Therefore, we have some pipeline components like:

  • Anywhere to Pub/Sub (streaming/batch)
  • Pub/Sub to BigQuery (streaming/batch)
  • Pub/Sub to GCS (batch)
  • BigQuery to BigQuery (batch)

Since all our data workloads run on GCP, it is easier for data engineers to build the pipelines and solve most scaling issues (e.g., Dataflow did solve one!). The composition of these components aims to feed our data warehouse and the datasets used by machine-learning applications. That is, our business goal is to create value from the warehouse and the models.

Replicating Existing Databases

A replication job involves a preexisting dataset and its APIs; it needs a new destination location and a schema for the modeled data.
Jobs that move data incrementally are usually executed with SQL MERGE statements. We use such statements to update our users' shopping events pulled from the Google Analytics API. Here is a simple example of how we merge the shopping events for e-commerce clients in BigQuery:
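The sketch below runs the merge through the Python BigQuery client; the table and column names are illustrative, not our real schema.

```python
# A minimal sketch: upsert freshly staged shopping events into the target
# table, updating matched transactions and inserting new ones.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project

merge_sql = """
MERGE `analytics.shopping_events` AS target
USING `staging.ga_shopping_events` AS source
ON target.client_id = source.client_id
   AND target.transaction_id = source.transaction_id
WHEN MATCHED THEN
  UPDATE SET target.revenue = source.revenue,
             target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (client_id, transaction_id, revenue, event_date, updated_at)
  VALUES (source.client_id, source.transaction_id, source.revenue,
          source.event_date, source.updated_at)
"""

client.query(merge_sql).result()  # wait for the MERGE job to finish
```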

So far, we use Airflow to create the tables automatically and to run tasks that insert/update them periodically. Airflow is a perfect match for our current data pipeline usage, and we continuously invest in it to optimize our pipelines and improve operations.

Transforming Data to Address Specific Business Needs

The goal of data transformation is to integrate, cleanse, augment, and model the data so the business can derive more value from it. We have to ensure that:

  • we know what our business goal is
  • the data is accurate
  • the data is easy to consume
  • the data is governed by consistent business rules and processes

We achieve this goal through transformations, which have two main phases: design and execution.

Design

In the design phase, we have several approaches to take:

  1. Data integration tools we have:
    - Airflow
    - BigQuery scheduled queries
    - Dataflow
  2. These tools let us design jobs in Python and SQL.
    Airflow provides much more flexibility for extending existing workflows; jobs for cleaning and transforming data can be handled with Pandas and other Python libraries (see the sketch after this list). In comparison, BigQuery scheduled queries are pretty convenient for users familiar with SQL to run lightweight jobs without dependencies. Another point worth mentioning is that scheduled queries support scheduled backfills, which really helps our analytic jobs on BigQuery.
    The other case is Dataflow. Dataflow is based on Apache Beam, which is written in Java but supports Python too. There might be some restrictions when developing in Python, but we have not hit them yet.
  3. Carry existing transformations forward: Airflow is the better choice here, since it expresses dependencies as a DAG.
  4. Complexity: Airflow also supports many kinds of provider packages, and when those fall short, KubernetesPodOperator can always solve the problem.
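To illustrate the Python-plus-Pandas style of job mentioned in point 2, here is a minimal sketch; the bucket, table, and column names are assumptions, and it relies on gcsfs and pandas-gbq being installed alongside Pandas.

```python
# A minimal sketch of a light cleaning job in Airflow: load a CSV drop from
# GCS with Pandas, tidy it, and append it to a BigQuery table.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def clean_and_load(ds, **_):
    # gcsfs lets Pandas read gs:// paths directly.
    df = pd.read_csv(f"gs://my-raw-bucket/exports/{ds}.csv")  # hypothetical bucket
    df = df.dropna(subset=["member_id"]).drop_duplicates()
    df["event_date"] = pd.to_datetime(df["event_date"]).dt.date
    # pandas-gbq appends the cleaned rows into the warehouse table.
    df.to_gbq("analytics.member_events", project_id="my-gcp-project",
              if_exists="append")


with DAG(
    dag_id="clean_member_events",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="clean_and_load", python_callable=clean_and_load)
```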

Execution

Transformation processes require lots of compute cycles.

We always push the resource-intensive transformation workloads to the cloud so we can run the complicated transformation logic after ingesting the data. Our Airflow runs on Google Kubernetes Engine, which means we can auto-scale and adjust the maximum parallelism in the config file at the same time. We can even have a dedicated node pool providing GPU nodes to help us allocate resources well.

By having an ELT process that lands data in the cloud data systems first, we can use the boundless resources of the cloud to process and transform it quickly and efficiently (e.g., we use BigQuery to run our data processing in a serverless fashion; a sketch follows).
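A minimal sketch of pushing such a transformation down to BigQuery from Airflow with BigQueryInsertJobOperator; the dataset and table names are assumptions.

```python
# A minimal ELT sketch: the heavy aggregation runs inside BigQuery
# (serverless), and Airflow only submits and tracks the job.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

TRANSFORM_SQL = """
CREATE OR REPLACE TABLE `analytics.daily_member_summary` AS
SELECT client_id,
       DATE(event_time)          AS event_date,
       COUNT(*)                  AS events,
       COUNT(DISTINCT member_id) AS active_members
FROM `analytics.client_events`
GROUP BY client_id, event_date
"""

with DAG(
    dag_id="elt_daily_member_summary",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BigQueryInsertJobOperator(
        task_id="summarize_in_bigquery",
        configuration={"query": {"query": TRANSFORM_SQL, "useLegacySql": False}},
    )
```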

Conclusion

As we mentioned at the very beginning, we put in the effort to make our data management meet the ground rules:

Today’s organizations want to cost-effectively load, transform, integrate, and analyze unlimited amounts of structured and semi-structured data.

Based on the 7 principles, we examine our pipeline and understand which problems might become our bottleneck. We do our best to ingest data, in streaming or batch, from all source systems. Most importantly, we can polish our data and turn it into products.

Over the last two years, we have grown the pipeline from a minimum viable pipeline into a mature one. Owning our data platform on GCP helps our data team address most of our requirements, and the GCP workloads also relieve the pain of integration and of handling a variety of data types.

Our final goal is to publish our data as a product, externally and internally. We are not dedicated to building a complete, fast, and rigid data system; instead, creating and exploring the value of data is what we aim for. We try hard to keep all processing and features as simple and extensible as possible.

Future Work

As we ingest and collect more data, we will definitely meet bottlenecks sooner or later. Most of our collection jobs run on Airflow; as more tasks queue up and durations increase, we will have to move some heavyweight jobs to new solutions on the cloud. As with moving data from Pub/Sub to BigQuery, cloud solutions have saved us a lot of time. For a small data team like ours, this is arguably the best practice to follow on GCP.
