SQL is not dead (never was!)

Nicolas Colomer
Published in Tinyclues Vision · 6 min read · Nov 4, 2021

[Image courtesy of Columbia University Archives]

The SQL language has allowed users to query and transform structured data for decades now.

Although it might appear obvious to make SQL a first-class citizen of your data stack today, it wasn’t when Tinyclues started.

This blog post shares our journey of transitioning from a homegrown, non-SQL data stack to a SQL-native, fully managed one.

Our early days

Tinyclues was born 10+ years ago, in the early big-data era. It was pretty much the Wild Wild West when it came to big data use cases like loading, transforming, and storing large, heterogeneous structured datasets.

Although we had started to see inspiring production use cases of frameworks like Hadoop back in Tinyclues’ early days, they were not really a good bet for a small startup team that wanted to focus on its core business (complex ML and data science problems) instead of spending time experimenting with emerging tech.

That’s why we built our own custom data framework, dedicated to serving data to our core predictive engine. This meant being able to Extract, Load, and Transform our customers’ first-party data and prepare it for learning and inference. This framework served us well for nearly 10 years.

We maintained it over all those years (backlog and roadmap, bug fixing, library upgrades, etc.) up to the moment we realized that the state of the art had not only caught up with us but offered even more: managed runtimes, complex data model orchestration, the ability to work incrementally, all of it compatible with SQL, a language that actually covers 90% of our data transformation use cases.

Why (we) chose SQL

So came the moment to switch from our custom data framework (a Python DSL plus its implementation, including a low-level I/O layer) to SQL and everything its ecosystem had to offer in terms of orchestration. Here is what motivated our choice:

SQL is universal

  • taught at school, so nearly everyone knows the concepts and has at least minimal experience
  • a standard for decades, crossing generations!
  • an easy-to-hire skill, a direct consequence of the above
  • seen almost everywhere, across roles (engineers, scientists, analysts)

SQL is THE data language

  • decades of maturity thanks to widespread usage and standards
  • purpose-built for data, including Data Definition (DDL) and Data Manipulation (DML) command sets
  • even supports semi-structured and unstructured data (full text, JSON); see the sketch below
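
For instance, here is a minimal sketch (BigQuery syntax; the table name and JSON path are hypothetical) of querying a JSON payload directly from SQL:

    -- Extract a scalar field from a JSON column; returns NULL when the path is absent.
    SELECT
      JSON_EXTRACT_SCALAR(payload, '$.user.email') AS email
    FROM `project.dataset.raw_events`
    WHERE JSON_EXTRACT_SCALAR(payload, '$.event_type') = 'signup';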

SQL is multipurpose

  • a wide range of use cases: OLTP, OLAP, big data
  • extensible via User-Defined Functions and other language extensions
  • you can solve complex data problems: joins, aggregations (like HyperLogLog), pivots, trees, sorting (see the sketch after this list)
  • the ability to import/export data (format and structure SerDes)
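
As an illustration, here is a hedged sketch of a HyperLogLog-style aggregation using BigQuery’s HLL_COUNT functions (table and column names are hypothetical):

    -- Build one HyperLogLog sketch of distinct users per day...
    SELECT
      event_date,
      HLL_COUNT.INIT(user_id) AS users_sketch
    FROM `project.dataset.events`
    GROUP BY event_date;

    -- ...then merge the daily sketches into an approximate global distinct count.
    SELECT HLL_COUNT.MERGE(users_sketch) AS approx_distinct_users
    FROM `project.dataset.daily_user_sketches`;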

SQL can execute on many backends

  • closed and open source (e.g. Postgres, Presto)
  • self-hosted or as a managed service (e.g. GCP BigQuery, AWS Athena)
  • other SQL-fluent frameworks like Spark (Spark SQL) or MongoDB (MQL)

SQL backends are powerful

  • cardinality is no longer a limit
  • parallelized/distributed runtimes are common
  • most operations are highly optimized

All those characteristics make the SQL language ideal for Transforming, and in some cases Loading, data. Coupled with the right orchestration framework, it quickly became a no-brainer for us, ticking the box for the majority of our current and future use cases.

Tinyclues’ 2021 data stack

Choosing SQL alone is obviously not enough: we needed to wrap it in a stack that could leverage it. That is where the engineering part comes in. We gathered the learnings from building our legacy stack and converted them into requirements:

  • SQL-native, for all the reasons above;
  • priority to managed runtimes, as we didn’t want to deploy and maintain complex runtimes ourselves;
  • keep it simple, to avoid complex architectures that are harder to monitor and evolve; integrated with the ecosystem is better;
  • capitalize on Airflow, as past experience made us comfortable with this great orchestrator;
  • git-friendly, to ease configuration updates, audits, and reviews. See our “How we tamed our multi-tenancy model in less than a week” blog post for more details.

After digging into the various possibilities and combinations, we finally ended up with the following choices:

  • BigQuery as our SQL data warehouse backend;
    The managed data warehouse offering is broad in 2021: Snowflake, Databricks, Athena (AWS), etc. We chose to stick to the GCP ecosystem, especially because BigQuery is fully integrated (e.g. you benefit from ML features; see the sketch after this list) and 100% serverless.
  • dbt as our logical data transformation layer;
    dbt is a fast-growing data transformation framework. It includes tooling and conventions to write, organize, and maintain the quality of a SQL transformation codebase.
  • Composer (Apache Airflow) as our orchestration backend;
    Airflow needs no introduction as an orchestration layer. Composer makes it even easier to deploy, monitor, and scale.
  • Google’s Cloud Storage + Cloud Functions as the thin glue that pumps inbound data (i.e. the Extract & Load part)
    In line with the serverless, reactive architecture spirit, this duo helps you build a thin data loading layer while ensuring proper monitoring.
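
To illustrate the ML features mentioned above, here is a minimal BigQuery ML sketch (model, table, and column names are hypothetical): a model is trained and scored with plain SQL, no extra infrastructure required.

    -- Train a logistic regression directly in the warehouse...
    CREATE OR REPLACE MODEL `project.dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT age, total_orders, days_since_last_order, churned
    FROM `project.dataset.customer_features`;

    -- ...then score rows with ML.PREDICT.
    SELECT *
    FROM ML.PREDICT(
      MODEL `project.dataset.churn_model`,
      TABLE `project.dataset.customer_features`);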

Any points of attention so far?

These are things we discovered during the design phase or along the transition. None of them was a blocker, and a solution can always be found, but it’s worth preparing yourself to face them at some point.

SQL is not a programming language

SQL is a domain-specific language (DSL). Even though you can get close to a general-purpose programming language (using scripting, UDFs, and procedural functions), that does not mean you should: it can induce extra complexity you don’t want to maintain.

Let’s say you need to implement a very specific aggregation. Depending on the backend, you might not be able to implement it natively and, in the worst case, you’ll be forced to use another system to achieve it. Some SQL frameworks do allow this, like Spark’s UDFs and UDAFs. BigQuery allows UDFs too, but they come with limitations.
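
For example, here is a minimal sketch of a BigQuery SQL UDF (function, table, and column names are hypothetical); it stays well within those limitations since it is a simple scalar expression:

    -- A temporary scalar UDF clamping a value into a range.
    CREATE TEMP FUNCTION clamp(x FLOAT64, lo FLOAT64, hi FLOAT64)
    RETURNS FLOAT64
    AS (LEAST(GREATEST(x, lo), hi));

    SELECT clamp(raw_score, 0.0, 1.0) AS normalized_score
    FROM `project.dataset.predictions`;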

SQL can be verbose

As with any programming language, there is no unique way to implement what you need to achieve. To overcome this, you will need to enforce strong conventions within your organization via shared coding standards (e.g. naming, indentation, CTEs). dbt Labs provides a concise guide with the conventions they use internally.
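
As a hedged example of such conventions (in the spirit of the dbt Labs style guide; table names are hypothetical): lowercase keywords, one import CTE per source, small logical CTEs, and a trivially simple final select.

    with orders as (
        select * from raw.orders
    ),

    customer_orders as (
        select
            customer_id,
            count(*) as order_count
        from orders
        group by customer_id
    )

    select * from customer_orders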

Moreover, your SQL codebase will grow along with the data stack itself, as the number of contributors and use cases increases. As with a programming language, to ensure the produced code remains syntactically consistent, you can use a linter like SQLFluff.

SQL is permissive

SQL is a permissive language, which means you can easily write unoptimized queries. Performance issues will grow with the complexity of the data problems you’re trying to solve.

You’ll need to understand the optimization levers of your backend and might have to build up notions of distributed processing at some point, e.g. to spot when a query triggers full scans or shuffles data.
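
A typical lever, sketched here for BigQuery (table and column names are hypothetical): filtering on the partitioning column lets the engine prune partitions instead of triggering a full scan.

    -- Assuming the table is partitioned on transaction_date,
    -- this predicate limits the bytes scanned (and billed) to one month.
    SELECT user_id, amount
    FROM `project.dataset.transactions`
    WHERE transaction_date BETWEEN '2021-10-01' AND '2021-10-31';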

Prepare your data engineers/analysts to understand how your SQL backend works and what its limits are, to dig into an execution plan, etc. It will make them more autonomous and will eventually save you time and costs.

SQL is more difficult to test

Testing a SQL query means having a runtime up, data to process, and the ability to compare results. As a SQL contributor, you’d like to focus on the SQL logic, input data, and expectations, without bothering with context preparation and test-execution boilerplate.

Fortunately, dbt comes with an extensible SQL testing framework for testing both the structure and the content of your models. You can either use declarative, high-level tests on your models or implement more specialized ones thanks to generic tests.
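
As a sketch, a dbt generic test is just a SQL query (templated with Jinja) that returns the failing rows; the test name and logic below are hypothetical:

    {% test is_positive(model, column_name) %}

    -- dbt flags the test as failed if this query returns any row.
    select *
    from {{ model }}
    where {{ column_name }} <= 0

    {% endtest %}

Once defined, it can be declared on any model column in the schema YAML, just like dbt’s built-in not_null or unique tests.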

When a SQL query becomes too complex, you can break it down into macros, which can also be unit tested independently.
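
For instance, a minimal macro sketch (the name and conversion are illustrative):

    {% macro cents_to_dollars(column_name) %}
        round({{ column_name }} / 100, 2)
    {% endmacro %}

    -- Used inside a model:
    select
        order_id,
        {{ cents_to_dollars('amount_cents') }} as amount_usd
    from {{ ref('stg_payments') }}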

SQL is more complex to debug

Unlike programming languages like Python, which give you the exact location and context of any raised error and even let you debug the runtime, SQL runtimes are often more opaque when your query fails.

Often, debugging a query ends in “test and learn” iterations that eventually converge toward a solution. The experience here totally depends on the backend you chose: explicit error messages, contextualization, and documentation must be considered during your evaluation.

Conclusion

SQL is not just a query language to produce insights. It can also simplify the task of transforming data for organizations.

Migrating from a non-SQL data stack to a SQL-native one allowed us to go faster and deliver better outcomes than ever before: complex workflows became simpler, existing transformations became an order of magnitude faster to execute, and we improved our velocity to prototype and deliver new ones.

But above all, we gained enough confidence in this stack to bet on it for the years to come.
