ELT with Google BigQuery

Pooja Kelgaonkar
Google Cloud - Community
4 min read · Aug 26, 2021

ELT — Extract, Load and Transform. Building an ELT pipeline is a little different from building an ETL pipeline; the two approaches differ in where and when the transformation happens.

In ETL, data is loaded into the target system after the transformation has taken place inside the data integration tool, whereas ELT loads the data into the target system first and transforms it afterwards. The ELT process is nowadays often preferred over the traditional ETL process because it is simpler to realize and loads the data faster. The differences are shown in the image below -

ETL transforms data while loading it into the target system: read, transform and load happen at run time within a single data pipeline. ETL pipelines on GCP can be implemented with your choice of services from the list below (a minimal Dataflow sketch follows the list) -

  • Data Fusion
  • Dataflow
  • Dataproc
  • Cloud Functions
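
For illustration, here is a minimal Dataflow (Apache Beam) sketch of such an ETL pipeline. The project, bucket and table names are placeholders for this example, not part of any real setup: it reads CSV lines from GCS, transforms them in flight, and writes the rows to BigQuery.

    # A minimal Apache Beam pipeline (runnable on Dataflow) that reads CSV lines
    # from GCS, transforms them in flight, and writes the rows to BigQuery.
    # Project, bucket and table names are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_line(line):
        # Transform step: turn one CSV line into a BigQuery row dict.
        order_id, amount = line.split(",")
        return {"order_id": order_id, "amount": float(amount)}

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/source/orders.csv")
         | "Transform" >> beam.Map(parse_line)
         | "Load" >> beam.io.WriteToBigQuery(
             "my-project:staging.orders",
             schema="order_id:STRING,amount:FLOAT",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))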

Transforming data after it has been loaded into the target system is referred to as ELT. Post-load transformations can be implemented in SQL for better performance, and this is the preferred approach for GCP migrations to BigQuery. An ELT pipeline on GCP can be implemented as below -

Extract/upload job -> load files/data to GCS -> BigQuery -> Transform jobs (SQL/Procedures/functions) [Orchestration using Airflow]
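
To make the flow concrete, below is a minimal Airflow DAG sketch of such a pipeline, assuming hypothetical bucket, dataset and table names: it loads files from GCS into a BigQuery staging table and then runs a SQL transform job once the load succeeds.

    # Minimal Airflow DAG: load files from GCS into a BQ staging table,
    # then run a SQL transform job. All names are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    with DAG("elt_gcs_to_bq", start_date=datetime(2021, 8, 1),
             schedule_interval="@daily", catchup=False) as dag:

        # Load: raw files land in BigQuery as-is (the "L" before the "T")
        load_staging = GCSToBigQueryOperator(
            task_id="load_staging",
            bucket="my-landing-bucket",
            source_objects=["orders/*.csv"],
            destination_project_dataset_table="my-project.staging.orders",
            source_format="CSV",
            write_disposition="WRITE_TRUNCATE",
        )

        # Transform: SQL runs inside BigQuery after the load
        transform = BigQueryInsertJobOperator(
            task_id="transform_orders",
            configuration={"query": {
                "query": """CREATE OR REPLACE TABLE target.daily_orders AS
                            SELECT order_id, SUM(amount) AS total_amount
                            FROM staging.orders
                            GROUP BY order_id""",
                "useLegacySql": False,
            }},
        )

        load_staging >> transform

The key point is the ordering: the raw data lands in BigQuery first, and the transformation runs afterwards inside BigQuery itself.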

To upload source data to GCP, one or more of the options below need to be set up (a simple GCS upload sketch follows the list) -

  • Data Transfer Service
  • Source extract jobs using Dataflow
  • Data import jobs using Sqoop/Spark on Dataproc
  • Source file uploads to GCS
  • Pub/Sub for real-time data streaming to GCS
  • Using ETL tools to push data from source systems to GCP/GCS/BigQuery
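
As a sketch of the simplest option, pushing source files to GCS, the Cloud Storage Python client can be used as below (the project, bucket and object names are placeholders):

    # Minimal sketch: upload a local source extract to a GCS landing bucket.
    # Project, bucket and object names are placeholders.
    from google.cloud import storage

    client = storage.Client(project="my-project")
    bucket = client.bucket("my-landing-bucket")

    # A dated prefix keeps older extracts archived alongside the latest one.
    blob = bucket.blob("landing/orders/2021-08-26/orders.csv")
    blob.upload_from_filename("/tmp/orders.csv")
    print(f"Uploaded to gs://{bucket.name}/{blob.name}")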

For the transform and target layer, we can choose one of the below GCP data services -

  • BigQuery
  • Cloud Spanner

BigQuery is widely used as the serverless data warehouse service on GCP, and it integrates with services and processes both inside and outside GCP. BigQuery performance is one of the keys to getting data warehouses/data lakes migrated to or set up on GCP. BigQuery can also be used as an analytical service, with BI tools and AI/ML models built on top of it.

Approach 1 -> Data Transfer Service + BigQuery

The Data Transfer Service is used to copy/migrate data to GCP, and using this service for data uploads is the preferred choice for performance. Once the data has landed on GCP, BigQuery is used as the DW service to expose the landed source data as a staging layer. Transformations can be built using SQL: BQ query jobs, procedures and functions. BQ views can be created to expose the results to BI users and data scientists. A minimal transform sketch is shown below.
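
As a rough sketch of the transform step (the dataset and table names here are assumptions for illustration), a SQL transform can be submitted as a BQ query job from the Python client, and a view created for the reporting layer:

    # Sketch: run a SQL transform as a BigQuery query job and expose a reporting view.
    # Dataset and table names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    transform_sql = """
    CREATE OR REPLACE TABLE target.customer_summary AS
    SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spend
    FROM staging.orders
    GROUP BY customer_id;

    CREATE OR REPLACE VIEW reporting.v_customer_summary AS
    SELECT * FROM target.customer_summary;
    """

    # Multi-statement scripts run as a single query job.
    client.query(transform_sql).result()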

Advantages:

  • Use of GCP native services
  • Transformations can be done in SQL; the ANSI SQL standard is supported in BQ (even less technically experienced staff can implement the data preparation logic)
  • DTS and BigQuery are fully managed services (no extra setup needed for infrastructure and scaling)
  • DTS can handle a wide range of data types, volumes and source systems to be migrated to GCP

Approach 2 -> GCS + BigQuery

With this approach, source data is first landed in a GCS landing layer/bucket. This can be done using NiFi, GCS utilities, DistCp, Sqoop, Spark jobs, Dataflow etc., depending on the source system and the type of data unload. Some possible choices →

  • Source system is an RDBMS — Dataflow, Sqoop/Spark on an ephemeral Dataproc cluster, ETL tools
  • Source system is file based — SFTP/gsutil to push files to GCS
  • Source system is Hadoop based — gsutil or DistCp to push/pull files to GCS (based on volume and target systems), or Sqoop/Spark jobs to push data to GCS
  • Source is Teradata — TD to BQ connector to get the source data

Landing source data on GCS first is the preferred approach, rather than loading directly into BQ. With GCS, we can archive datasets and reuse them for failure recovery, testing source data issues, version control etc. Once the data is available on GCS, it can be loaded into the BQ staging layer; a minimal load sketch is shown below.
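
A minimal sketch of that load step, using the BigQuery Python client with placeholder project, bucket, dataset and table names, could look like this:

    # Sketch: load files already landed on GCS into the BQ staging layer.
    # Project, bucket, dataset and table names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    load_job = client.load_table_from_uri(
        "gs://my-landing-bucket/landing/orders/*.csv",
        "my-project.staging.orders",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish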

The BQ transform and target layer is typically implemented with the below approach →

BQ Staging layer (source data as-is) -> BQ Transform layer (transformations using BQ SQL) -> BQ Target layer (analytical/aggregation layer) -> BQ Reporting layer (using views if needed)
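
One possible way to express these layers, sketched here with placeholder dataset, table and procedure names, is a BQ stored procedure that populates the transform and target layers, plus a reporting view on top:

    # Sketch of the layered flow: a BQ stored procedure builds the transform
    # and target layers, and a view exposes the reporting layer.
    # Dataset, table and procedure names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    layered_sql = """
    CREATE OR REPLACE PROCEDURE transform.sp_build_orders()
    BEGIN
      -- Transform layer: cleanse/standardise the staged data
      CREATE OR REPLACE TABLE transform.orders_clean AS
      SELECT CAST(order_id AS STRING) AS order_id,
             CAST(amount AS NUMERIC) AS amount
      FROM staging.orders
      WHERE amount IS NOT NULL;

      -- Target layer: aggregate for analytics
      CREATE OR REPLACE TABLE target.orders_daily AS
      SELECT order_id, SUM(amount) AS total_amount
      FROM transform.orders_clean
      GROUP BY order_id;
    END;

    CALL transform.sp_build_orders();

    -- Reporting layer: expose the target table through a view
    CREATE OR REPLACE VIEW reporting.v_orders_daily AS
    SELECT * FROM target.orders_daily;
    """
    client.query(layered_sql).result()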

Advantages:

  • Use of GCP native services
  • Transformations can be done in SQL; the ANSI SQL standard is supported in BQ (even less technically experienced staff can implement the data preparation logic)
  • GCS and BigQuery are fully managed services (no extra setup needed for infrastructure and scaling)
  • GCS can store any type and volume of data to be migrated to GCP
References -> https://cloud.google.com/architecture/dw2bq/dw-bq-data-pipelines

About Me

I am a DWBI and Cloud Architect! I have been working with various legacy data warehouses, big data implementations, and cloud platforms/migrations. I am a Google Certified Professional Cloud Architect. You can reach out to me on LinkedIn if you need any further help with certification or GCP implementations!
