Don’t use SQL for your Big Data Pipelines

Giulio Mazzeo · The Startup · Mar 17, 2020


The time has come to code data pipelines in high-level programming languages

Good old “use the right tool for the job”

ETL pipelines have been built with SQL for decades, and for many well-known reasons that has worked very well (at least in most cases).

In the era of Big Data, engineers and companies rushed to adopt new processing frameworks for their ETL/ELT pipelines, such as Spark, Beam and Flink, and started writing code instead of SQL procedures to extract, load and transform their huge (or not so huge!) amounts of data.
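To make that concrete, here is a minimal sketch of what such a step can look like as code. It uses PySpark and an entirely hypothetical orders dataset (the paths, column names and the daily-revenue aggregation are illustrative, not taken from any real pipeline): the kind of GROUP BY that would once have lived inside a stored procedure becomes a small, composable piece of code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

# Extract: read the raw data (path and format are illustrative)
orders = spark.read.parquet("s3://raw/orders/")

# Transform: the equivalent of a GROUP BY query, expressed as composable code
daily_revenue = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .groupBy(F.to_date("created_at").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

# Load: write the result to the curated layer
daily_revenue.write.mode("overwrite").parquet("s3://curated/daily_revenue/")
```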

The Age of Testing Your ETL/ELT Pipelines

Since ETL/ELT pipelines can now be implemented as (a sequence of) jobs written in high-level programming languages like Scala, Java or Python, every processing step becomes a piece of code that can be properly structured, designed and, most of all, fully covered with automated tests, which is exactly what robust development and deployment pipelines are built on.

Data engineers finally get to work with the tooling every other developer takes for granted: git, pull requests, automated tests, builds and even automated deployments. They can say goodbye to visual ETL tools that embed huge SQL queries which are close to untestable, unreadable, unmaintainable and disliked by every developer (anyone who has worked on traditional DWH development can relate).
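As a rough illustration, assuming the aggregation from the sketch above is factored out into a plain function, a unit test for it could look like this (PySpark with pytest; the function name and the sample data are made up for the example):

```python
import pytest
from pyspark.sql import SparkSession, functions as F

def completed_daily_revenue(orders_df):
    """The transformation step from the sketch above, extracted into a testable function."""
    return (
        orders_df
        .filter(F.col("status") == "COMPLETED")
        .groupBy(F.to_date("created_at").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"))
    )

@pytest.fixture(scope="module")
def spark():
    # A small local session is enough for unit tests
    return SparkSession.builder.master("local[1]").appName("etl-tests").getOrCreate()

def test_cancelled_orders_are_excluded(spark):
    orders = spark.createDataFrame(
        [("COMPLETED", "2020-03-01", 10.0),
         ("COMPLETED", "2020-03-01", 5.0),
         ("CANCELLED", "2020-03-01", 99.0)],
        ["status", "created_at", "amount"],
    )
    result = completed_daily_revenue(orders).collect()
    assert len(result) == 1
    assert result[0]["revenue"] == 15.0
```

Because the transformation is just a function over DataFrames, a test like this runs locally and in CI on every pull request, with no warehouse or production data involved.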
