Do you really need a (Cloud) Data Warehouse in 2021?

Apr 9, 2021 · 6 min read

By Andor Markus, Data Engineer

TLDR: Any of the cloud data warehouses is a great solution. You don’t need to go down the rabbit hole of data lakehouses unless you have specific requirements like end-to-end streaming, end-to-end testing or full control over the stored data.

Always start with why

In early 2020 Databricks wrote an article about the data lakehouse. The lakehouse is a new, open data management paradigm that combines the capabilities of data lakes and data warehouses. They positioned it as the fifth generation of data storage platforms.

At Marc O’Polo we had been using Microsoft SQL Server as our single source of truth for the past 10 years. The system started to show its age, and we decided to replace it completely with a cloud-native solution. We had the luxury of rebuilding our ETL pipelines from scratch, so our hands were not tied.

Cloud data warehouse

Marc O’Polo main warehouse

The top four players in the cloud data warehouse market are Amazon Redshift, Azure Synapse Analytics, Google BigQuery and Snowflake. Amazon Redshift was lagging behind its competitors because it did not decouple storage from compute. At the end of 2019 AWS introduced the RA3 instance family, which enables decoupled storage. However, AQUA (Advanced Query Accelerator) for Amazon Redshift is still (as of April 2021) in preview in a limited set of regions. Until AQUA is generally available in the major AWS regions, Redshift on RA3 leaves performance on the table.

I will not go into detail about which one to choose; Elvin Li and Altexsoft wrote great comparisons of these products. As a general recommendation: go with your cloud provider’s offering, and if it does not work out for you, consider Snowflake. Our high-level observations were:

  • Amazon Redshift on RA3: you need some experience with sizing and performance tuning
  • Azure Synapse Analytics: auto-scaling (concurrency scaling) is currently not offered.
  • Google BigQuery: you get little control over data backups.
  • Snowflake: not all cloud vendor regions are supported. Pricing is complicated and gets very expensive quickly.

Advantages:

  • Data warehouse developers do not have to learn new technologies
  • Great selection of commercial ETL tools
  • Everything can be up and running within weeks and the migration process can be started
  • Your options are limited (I’m not joking, that’s a good thing)

Disadvantages:

  • All of the ETL code will be in SQL, so it won’t be DRY (Don’t Repeat Yourself); see the sketch after this list
  • Stream ingestion is possible, but stream processing is not, and incremental processing is complicated
  • Vendor lock-in
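
To make the DRY point concrete, here is a minimal PySpark sketch (the table and column names are made up): the cleansing logic is written once as a Python function and reused across tables, whereas in pure SQL the same expressions would be copy-pasted into every statement.

```python
# Minimal sketch; raw.orders / raw.returns and their columns are hypothetical.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dry-example").getOrCreate()

def standardize(df: DataFrame, ts_col: str) -> DataFrame:
    """Reusable cleansing step: trim strings, normalise the timestamp, drop duplicates."""
    for col_name, col_type in df.dtypes:
        if col_type == "string":
            df = df.withColumn(col_name, F.trim(F.col(col_name)))
    return df.withColumn(ts_col, F.to_timestamp(ts_col)).dropDuplicates()

# The same function is applied to every source table instead of repeating the SQL.
orders = standardize(spark.table("raw.orders"), "order_ts")
returns = standardize(spark.table("raw.returns"), "return_ts")
```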

Data lakehouse

If you are dealing with structured and semi-structured data nowadays, there is no strong reason to go with a pure data lake platform. In the past few years Apache Spark has become the de facto tool for big data processing. It combines batch and stream (Structured Streaming) processing, but that comes with a trade-off: speed. It is not as fast as alternatives like Apache Flink or Kafka Streams. Chandan Prakash wrote a great article about this topic.

Spark SQL is a great milestone for data warehouse developers who are transitioning into data engineering. Spark SQL uses DataFrames, which are “similar” to temporary tables in SQL, and the logic is similar to SQL. At the same time, you have the option to use a programming language (Python or Scala) for the ETL pipelines.
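
As a small illustration (with made-up data), the same aggregation can be expressed either in Spark SQL against a temporary view or with the DataFrame API in Python:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

sales = spark.createDataFrame(
    [("DE", 120.0), ("DE", 80.0), ("AT", 50.0)],
    ["country", "revenue"],
)

# SQL flavour: register the DataFrame like a temporary table and query it.
sales.createOrReplaceTempView("sales")
by_country_sql = spark.sql(
    "SELECT country, SUM(revenue) AS revenue FROM sales GROUP BY country"
)

# DataFrame flavour: the same logic expressed in Python.
by_country_df = sales.groupBy("country").agg(F.sum("revenue").alias("revenue"))

by_country_sql.show()
by_country_df.show()
```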

At Marc O’Polo we are moving from on-premise SQL Server to an Apache Spark + OSS Delta Lake solution, which is a huge transition. We are turning from data warehouse developers into data engineers. Data engineering is very close to software development, as the toolsets and methods largely overlap (object-oriented programming, CI/CD, unit testing).

Most data lakehouse frameworks are built for data processing in Spark and support other engines, like Presto, for reading the data. Mature frameworks in this segment are Delta Lake (Databricks), Apache Hudi (originally from Uber) and Apache Iceberg (originally from Netflix). Junjie Chen gave a keynote with a “Thorough Comparison of Delta Lake, Iceberg and Hudi”.

At Marc O’Polo we ran two POCs (proofs of concept): OSS Delta Lake and Apache Hudi. You can read more about our experience in the article below.

If you read into the “marketing material” of Delta Lake or Apache Hudi, on the surface everything looks like a very promising alternative to any of the cloud data warehouses. They offer the best of both worlds: the data warehouse (ACID transactions, schema enforcement and much more) and Apache Spark (rapid scaling, a programming language for ETL). However, this is only one side of the coin; the other is the complexity of the surrounding infrastructure and the limitless options.
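
A minimal sketch of what those data warehouse features look like on top of Spark, assuming the open-source delta-spark package is installed; the table paths and columns are hypothetical:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-example")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

updates = spark.createDataFrame([(1, "shipped")], ["order_id", "status"])

# ACID upsert (MERGE) into an existing Delta table; the path is hypothetical.
target = DeltaTable.forPath(spark, "s3a://datalake/silver/orders")
(target.alias("t")
       .merge(updates.alias("u"), "t.order_id = u.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Schema enforcement: appending a DataFrame whose schema does not match
# the table is rejected instead of silently corrupting the data.
bad = spark.createDataFrame([(2, 99)], ["order_id", "priority"])
bad.write.format("delta").mode("append").save("s3a://datalake/silver/orders")  # raises AnalysisException
```

The MERGE runs as an ACID transaction, and the mismatched append fails fast; that is the warehouse-like behaviour you otherwise give up on a plain data lake.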

In mid-2020 two emerging frameworks tried to solve data versioning for the data lakehouse: Dremio’s Project Nessie and Treeverse’s lakeFS. I call these solutions “data lakehouse on steroids”. Can you imagine the possibilities that a git for data provides for development and testing purposes? However, I would not yet use it in production or for merging data branches.
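
To give a flavour of the “git for data” idea, here is a rough sketch of how an isolated lakeFS branch could be used from Spark through its S3-compatible gateway; the endpoint, credentials, repository and branch names are assumptions, not a tested setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakefs-sketch").getOrCreate()

# Point the S3A client at the lakeFS gateway (hypothetical endpoint and credentials).
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "https://lakefs.example.com")
hconf.set("fs.s3a.access.key", "<LAKEFS_ACCESS_KEY>")
hconf.set("fs.s3a.secret.key", "<LAKEFS_SECRET_KEY>")
hconf.set("fs.s3a.path.style.access", "true")

# The branch name is part of the path: s3a://<repository>/<branch>/<table path>.
# A test pipeline can write to an isolated 'etl-test' branch without touching 'main'.
df = spark.read.format("delta").load("s3a://my-repo/main/silver/orders")
df.write.format("delta").mode("overwrite").save("s3a://my-repo/etl-test/silver/orders")
```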

You can read more about the complexity of a data lakehouse project in our article:
Definite guide for Delta lake on AWS

Advantages:

  • You have full control over everything
  • You can submit features and bug fixes to the project
  • Open source
  • End-to-end streaming (see the sketch after this list)
  • End-to-end testing
  • Data versioning
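
The end-to-end streaming advantage can be sketched in a few lines: the same Delta table format that batch jobs use can also be the sink of a Spark Structured Streaming job (the Kafka topic, schema and paths below are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read a Kafka topic as a stream and parse the JSON payload.
orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
          .select("o.*"))

# Write the stream straight into a Delta table that batch jobs and BI tools also query.
(orders.writeStream
       .format("delta")
       .option("checkpointLocation", "s3a://datalake/checkpoints/orders")
       .start("s3a://datalake/silver/orders_stream"))
```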

Disadvantages:

  • Much longer project timeline (6-12 months)
  • Your options are unlimited (I’m not joking, that’s often a bad thing)
  • You have to manage a lot more infrastructure

Conclusion

The two key deciding features are:

  • Do you need the full freedom and modularity that data lakehouses bring and are you willing to invest the working hours, or do you need a quick solution?
  • Do you need end-to-end streaming and end-to-end testing?

Neither type of solution is better or cheaper than the other. They are fundamentally different and serve different purposes. From a cost perspective, you are trading licence costs against labour costs, and in the end the totals will be very close to each other. A fully featured data lakehouse can cost much more than a cloud data warehouse, but it also brings much more to the table.

From an organisational perspective: do you want to have the coolest, state-of-the-art tech stack, one that attracts top talent and keeps your current employees highly motivated?

At Marc O’Polo we ultimately decided to go with the data lakehouse architecture on AWS. Our business departments require very low data latency (<15 minutes after the data becomes available in the source system), and we needed a solution that is maintainable over the long term (10 years). Therefore we need a programming language for the ETL jobs and end-to-end testing, which ensures data integrity in the long run.
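
To illustrate what testing ETL code in a programming language looks like, here is a minimal sketch using pytest and a local Spark session; the transformation under test is hypothetical:

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def add_gross_amount(df, vat_rate: float = 0.19):
    """Hypothetical transformation under test: derive the gross amount from the net amount."""
    return df.withColumn("gross_amount", F.round(F.col("net_amount") * (1 + vat_rate), 2))

@pytest.fixture(scope="session")
def spark():
    # A small local Spark session is enough for fast, deterministic unit tests.
    return SparkSession.builder.master("local[1]").appName("etl-tests").getOrCreate()

def test_add_gross_amount(spark):
    source = spark.createDataFrame([("o-1", 100.0)], ["order_id", "net_amount"])
    result = add_gross_amount(source).collect()[0]
    assert result["gross_amount"] == 119.0
```

Tests like this run in CI/CD on every change, which is what keeps a 10-year-old codebase trustworthy.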

Do you want to shape the digital future of the fashion industry? Start creating with us. The Digital Intelligence & Tech teams at Marc O’Polo are always looking for talented and driven software engineers, data engineers, data scientists, data analysts, ML engineers and project managers to join our team. Learn about life at Marc O’Polo and visit our career site.

Andor Markus
Marc O’Polo Digital Intelligence

Big Data Engineer. Open source advocate. Currently helping Marc O’Polo to build next generation cloud data platform.