Data Integration patterns for ML/Data Engineers

Nikhil (Srikrishna) Challa
Google Cloud - Community
5 min read · Mar 6, 2024

If you are a data engineer or an ML engineer, it is essential to have a good understanding of the different data integration techniques. In this blog, let us explore those techniques and the ecosystem of tools that can be brought together when building architectures around any of them.

ETL, ELT, and EL all represent different techniques for integrating data from various sources into a target system like a data warehouse or data lake. They differ in the order of data transformation:

  • ETL (Extract, Transform, Load): In ETL, data is extracted from source systems, transformed to a consistent format, and then loaded into the target system.
  • ELT (Extract, Load, Transform): In ELT, data is first extracted from source systems and loaded directly into the target system. Then, the data is transformed within the target system itself.
  • EL (Extract, Load): This is a variation where data is simply extracted from source systems and loaded into the target system in its raw format, with minimal or no transformation.
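The ordering difference between the three patterns can be sketched in a few lines of Python. This is only an illustration: sqlite3 stands in for the target warehouse, and the records and field names are made up.

```python
import sqlite3

# Hypothetical raw records extracted from a source system.
raw = [{"name": " Alice ", "amount": "10"}, {"name": "Bob", "amount": "5"}]

def transform(rows):
    # Normalise names and cast amounts to integers.
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in rows]

wh = sqlite3.connect(":memory:")  # stand-in for the target warehouse
wh.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")

# ETL: transform first, then load the clean rows.
for r in transform(raw):
    wh.execute("INSERT INTO sales VALUES (?, ?)", (r["name"], r["amount"]))

# EL: load raw rows as-is, with no transformation step.
wh.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
wh.executemany("INSERT INTO sales_raw VALUES (?, ?)",
               [(r["name"], r["amount"]) for r in raw])

# ELT: after the raw load, SQL inside the warehouse does the transformation.
wh.execute("""CREATE TABLE sales_curated AS
              SELECT TRIM(name) AS name, CAST(amount AS INTEGER) AS amount
              FROM sales_raw""")

print(wh.execute("SELECT * FROM sales_curated").fetchall())
```

Either route ends with the same curated rows; the patterns differ only in where and when the transformation runs.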

ETL → Extract, Transform and Load

This is the conventional approach to moving data from source to target: transformations on the data are performed as part of the pipeline that loads it into the target data warehouse.

Modern data architectures, though, often favour a semi-ETL approach in which persisting the raw data is given high importance. Because the data is retained in its original format, it can be recovered and re-processed, which is especially valuable for disaster or fault recovery.

Multiple tools and products come together when building ETL solutions on GCP; below is a reference architecture.

  • In this architecture, the data sources on the extreme left support multiple connection patterns, namely CDC/event-based, API-based, file-based, and JDBC-based (which allows SQL queries to be run against the source database).
  • Data is extracted using either an on-premises-hosted ETL tool or a data transfer service and loaded into a landing zone, which is either BigQuery (for structured data) or Cloud Storage, an object store (for semi-structured or unstructured data).
  • This is followed by transformation pipelines, which perform the curation, processing/cleansing, and standardisation of the data, fix any data quality errors, and load it into a data warehouse, often called the curated zone.
  • On GCP, there are multiple ETL options, namely Cloud Dataproc, Cloud Dataflow, Cloud Dataprep, and Cloud Data Fusion.
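The transformation stage in that flow (curation, standardisation, data quality fixes) can be sketched as a plain-Python pipeline. The field names, date formats, and quality rules below are illustrative assumptions, not part of the reference architecture:

```python
from datetime import datetime

# Illustrative raw extract; field names and formats are assumptions.
extracted = [
    {"id": "1", "country": "usa", "signup": "2024-03-06"},
    {"id": "2", "country": "IN",  "signup": "06/03/2024"},
    {"id": "",  "country": "UK",  "signup": "2024-03-06"},  # bad row: no key
]

def standardise_date(value):
    # Accept the two formats seen in the sources; normalise to ISO.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    return None

def curate(rows):
    for row in rows:
        if not row["id"]:             # data-quality rule: drop rows without a key
            continue
        yield {
            "id": int(row["id"]),
            "country": row["country"].upper(),     # standardise country codes
            "signup": standardise_date(row["signup"]),
        }

curated = list(curate(extracted))
# `curated` is what the load step would write into the warehouse.
```

In a real pipeline this logic would run inside Dataflow, Dataproc, or Data Fusion rather than a plain script, but the shape of the work is the same.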

ELT → Extract Load & Transform

BigQuery, as a data warehouse, comes with serious compute power and is a true representative of an MPP (massively parallel processing) data warehouse.

It's a crime not to leverage BigQuery's compute power when transforming data, especially since many transformations become easier to express with SQL functions.

Below is a reference architecture for ELT:

  • In this architecture, the key is to export the data from persistent on-premises storage to Google Cloud Storage buckets, where the data is persisted in the cloud in its native format.
  • The movement of data from on-premises persistent storage such as SFTP, MFT, or HDFS to Cloud Storage can be facilitated using the Storage Transfer Service, DistCp, or a Transfer Appliance.
  • The data in the Cloud Storage bucket can be moved to BigQuery's landing zone using the BigQuery Data Transfer Service or a Dataflow template (which does not require heavy coding effort).
  • The real ELT happens when the data is moved from BigQuery's staging tables to BigQuery's data warehouse tables, and this step is performed using SQL.
  • Since transformations via SQL consume the same compute as interactive queries (slots), the operational overhead will simply be about managing slot utilisation and forecasting.
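As a rough sketch of that staging-to-warehouse step, here is the ELT pattern with sqlite3 standing in for BigQuery. The table names and columns are made up; on BigQuery the transformation would be the same kind of single SQL statement, just running on slots:

```python
import sqlite3

# sqlite3 stands in for BigQuery; table and column names are illustrative.
bq = sqlite3.connect(":memory:")
bq.execute("CREATE TABLE staging_orders (order_id TEXT, amount TEXT, status TEXT)")
bq.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", [
    ("A1", "100", "complete"),
    ("A2", "250", "cancelled"),
    ("A3", "75",  "complete"),
])

# The "T" of ELT: one SQL statement that casts and filters inside the
# warehouse, writing the curated table from the staging table.
bq.execute("""
    CREATE TABLE dw_orders AS
    SELECT order_id,
           CAST(amount AS INTEGER) AS amount
    FROM staging_orders
    WHERE status = 'complete'
""")

print(bq.execute("SELECT SUM(amount) FROM dw_orders").fetchone()[0])  # 175
```

Notice that no data ever leaves the warehouse: the extract and load have already happened, and the transformation is pure SQL against tables that are already in place.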

There is a feature that many modern ETL tools are developing, known as 'transformation pushdown', which in a way promotes the ELT approach.

Transformation pushdown is a capability that allows transformations to be performed on the target system's servers instead of on the data integration engine's own servers. For example, if Apache Spark is used as the ETL engine to load data into BigQuery, transformation pushdown allows the execution to happen on BigQuery's servers instead of on the Spark clusters. Tools offering this capability include Cloud Data Fusion/CDAP, IBM DataStage, and Matillion.
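Conceptually, pushdown is a switch that decides where the transformation runs. The sketch below is a toy model of that idea only (sqlite3 stands in for the target system; none of this reflects any tool's actual implementation):

```python
import sqlite3

def run_transform(source_rows, target, pushdown=True):
    """Load rows to the target, then either transform them in the
    integration engine (no pushdown) or emit SQL so the target's own
    compute does the work (pushdown). Conceptual sketch only."""
    if pushdown:
        # Push raw rows down and let the target run the transformation SQL.
        target.execute("CREATE TABLE staging (v TEXT)")
        target.executemany("INSERT INTO staging VALUES (?)",
                           [(v,) for v in source_rows])
        target.execute("""CREATE TABLE curated AS
                          SELECT UPPER(v) AS v FROM staging""")
    else:
        # Transform on the engine's own workers, then load finished rows.
        transformed = [(v.upper(),) for v in source_rows]
        target.execute("CREATE TABLE curated (v TEXT)")
        target.executemany("INSERT INTO curated VALUES (?)", transformed)

target = sqlite3.connect(":memory:")
run_transform(["a", "b"], target, pushdown=True)
print(target.execute("SELECT v FROM curated ORDER BY v").fetchall())
```

Both branches produce identical curated data; the difference is purely whose compute does the work, which is exactly the ETL-versus-ELT trade-off.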

EL → Extract and Load

This particular approach does not involve any transformation of the data. It is purely a lift-and-shift of the data from source(s) to target.

Below is a reference architecture for EL:

  • In this architecture, the objective is to load the data into BigQuery directly, avoiding hops as much as possible.
  • The key product is the BigQuery Data Transfer Service, which can connect directly to a source and load the data into BigQuery staging tables.
  • Follow-up data movement can involve transformations using ELT or ETL products, but the advantage here is hassle-free data movement from the source to GCP.
  • The BigQuery Data Transfer Service offers multiple connectors, e.g. to AWS, Azure, Facebook, Oracle, Salesforce, Teradata, etc.
  • It can also be used to synchronise source and target systems, which is extremely handy when implementing offload migrations.
  • This approach normally does not involve any transformation; it relies on just extracting the data and loading it into the target.
  • If a source is not compatible with the Data Transfer Service, any other data integration tool can be considered too.
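The EL pattern itself is trivial to illustrate: rows are copied verbatim from source to target with no transformation in between. Here sqlite3 stands in for both the operational source and BigQuery, and the schema is invented for the example:

```python
import sqlite3

# Stand-ins: `source` for the operational system, `target` for BigQuery.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER, email TEXT)")
source.executemany("INSERT INTO users VALUES (?, ?)",
                   [(1, "a@example.com"), (2, "b@example.com")])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE users_staging (id INTEGER, email TEXT)")

# EL: extract the rows and load them verbatim -- no transformation step.
rows = source.execute("SELECT id, email FROM users").fetchall()
target.executemany("INSERT INTO users_staging VALUES (?, ?)", rows)

print(target.execute("SELECT COUNT(*) FROM users_staging").fetchone()[0])  # 2
```

In practice the Data Transfer Service plays the role of this copy loop, with scheduling and connector management handled for you.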

Choosing the right Data integration pattern:

Various factors influence the choice of the right integration pattern. Consider the points below if you are designing a data architecture:

  • High Volume & Velocity — ETL
  • Low volume & velocity — EL/ELT
  • Complex transformations — ETL
  • Simpler transformations — ELT
  • Data Governance with Clear Lineage — ETL
  • Less stringent on lineage — ELT
  • Highly cost effective — EL/ELT
  • Costly yet balanced — ETL
  • High technical expertise required — ETL/ELT
  • Easy to implement — EL
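If it helps, the checklist above can be collapsed into a toy decision helper. Treat it as a mnemonic rather than a rule: a real choice would also weigh cost, team skills, and tooling.

```python
def suggest_pattern(volume_high, transforms_complex, lineage_strict):
    """Toy heuristic distilled from the checklist above.

    Any of high volume/velocity, complex transformations, or strict
    lineage requirements points towards ETL; otherwise EL/ELT is the
    cheaper, easier-to-implement choice.
    """
    if volume_high or transforms_complex or lineage_strict:
        return "ETL"
    return "EL/ELT"

print(suggest_pattern(volume_high=True,  transforms_complex=False, lineage_strict=False))  # ETL
print(suggest_pattern(volume_high=False, transforms_complex=False, lineage_strict=False))  # EL/ELT
```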

Thanks for reading!
