The evolution from SQL-based query engines for big data to data lake engines, and its impact on data warehouses and data lakes
From a high-level perspective, most data and analytics solutions have been built the same way for many years now: diverse integration processes load all data into one central place, which then serves as the single source of truth for subsequent data modeling and analytical use cases. While in the old days those central places were mostly expensive, inflexible, tightly coupled hardware/software systems, nowadays they usually take advantage of the cloud and distributed architectures, including the separation of compute and storage. Yet, despite the enormous technological progress made in recent years, centralizing data is still the most obvious way for an organization to get the most out of its data and to conduct proper data management.
So what is wrong with this centralized approach and what does it have to do with distributed query engines in the first place?
To start with, there is nothing to say against it; quite the opposite, in fact. Building massive data warehouses or data lakes containing all the data in one place, in a clean and fresh state, is often the only way to ensure consistency so that everyone is using the same definitions. In this regard, cloud data lake services such as Microsoft's Azure Data Lake Storage or Amazon Web Services' S3 present an interesting twist by amplifying the advantages of centralization, since they offer a very scalable and inexpensive way to park vast amounts of any kind of data.
Nevertheless, there are plenty of reasons why it is becoming increasingly difficult to centralize all the data. The number of data sources is growing, and so is the variety of datasets required to satisfy the growing number of business domains relying on that data. Business users in general are getting closer to the data and demanding more flexibility, as opposed to static prebuilt datasets. The same applies to advanced analytics use cases, whose methods often need to be applied to raw and untransformed data. Moreover, in some cases organizations are even forbidden from moving data around due to internal or external regulations. In other cases, there are still pipelines on top of the centralized data, loading it further into downstream systems to meet all analytical requirements; this, in turn, can end up in the same lock-in experienced with traditional on-premises systems. And in still other cases, the benefit of centralizing the data is too small to justify the effort involved, or the data is simply too big and would take too long to move. And so forth…
So what to do in such circumstances?
There is a plethora of options today when it comes to analytical solutions and managing their data. Not only the number of providers and their offerings, but also the variety of technologies is overwhelming, and the pace of technological advancement is faster than ever before. There also isn't a clear winner; taken together, they will all help in converting more of those data calories into something useful, no doubt about that. However, there is indeed a discernible trend towards SQL-based distributed query engines helping to cope with the data explosion. This is confirmed by the product lineups of existing data and analytics service providers and their latest developments: they all seek to seamlessly integrate those cost-efficient cloud storages and to allow interactive SQL queries on top of them using exactly such query engines. Thus, they can fill in the gaps mentioned above and allow mature enterprises to achieve expanded big data capabilities while maintaining organizational and platform stability by still having a central point of truth.
The basic idea behind distributed query engines is nothing more than data virtualization: the attempt to create an abstraction layer that provides data access across disparate data sources. The difference from traditional data virtualization software (Linked Servers, DBLink, etc.) is that you can query relational and non-relational data together in a scale-out fashion for better query performance. Hence, the word distributed refers not only to the query itself, but also to computation and storage. These engines are built for intensive OLAP queries, and as such they aren't fragile or inconsistent in terms of performance.
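To make the idea of data virtualization concrete, here is a minimal, purely illustrative sketch in Python. The table and column names are hypothetical, and the "sources" are a local SQLite database and an in-memory CSV standing in for a relational system and a data lake file; a real engine would do the same join in a distributed, scale-out fashion.

```python
import csv
import io
import sqlite3

# Source 1: a relational table, standing in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])

# Source 2: raw CSV, as it might sit untouched in a data lake.
orders_csv = "customer_id,amount\n1,30\n2,10\n1,5\n"
orders = list(csv.DictReader(io.StringIO(orders_csv)))

# The "virtualization layer": one logical query spanning both sources,
# joined in memory without moving either dataset anywhere first.
names = {cid: name for cid, name in conn.execute("SELECT id, name FROM customers")}
totals = {}
for row in orders:
    name = names[int(row["customer_id"])]
    totals[name] = totals.get(name, 0) + int(row["amount"])

print(totals)  # {'Alice': 35, 'Bob': 10}
```

The point is only the shape of the problem: neither source needs to know about the other, and the abstraction layer is where relational and non-relational data meet.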
The technology used for this was initially called, and is often still called, SQL-on-Hadoop, which relies on MPP (Massively Parallel Processing) engines. It allows querying and analyzing data stored on HDFS (Hadoop Distributed File System) using a familiar SQL-like language, hiding the complexity of MapReduce/Tez and making it much more accessible to database developers. Hive was arguably the first SQL engine on Hadoop and is still widely used for batch-style data processing, as it has proven to be very robust alongside the advancements made over the years. Hive translates SQL queries into multiple stages and stores intermediate results on disk. In the meantime, other purpose-built tools were developed natively in the Hadoop ecosystem, such as Impala, which also supports HBase as a data source. Compared to Hive, it takes advantage of in-memory and caching technologies, making it better suited for interactive analysis than for long-running batch jobs; another example in this category would be SparkSQL. All of them require a metadata definition done upfront, such as views or external tables, an approach also known as schema-on-read. This definition is stored in a centralized store such as the Hive metastore.
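The schema-on-read idea above can be sketched in a few lines of Python. This is a hypothetical toy, not Hive itself: the raw JSON lines play the role of files on HDFS, and the `schema` dict plays the role of an external table definition in the metastore, i.e. metadata that casts raw fields into typed columns only when a query actually reads the data.

```python
import json

# Raw data as it lands in storage: no schema enforced at write time.
raw_events = """\
{"ts": "2021-01-01", "user": "alice", "value": "42"}
{"ts": "2021-01-02", "user": "bob", "value": "7"}
"""

# Metadata definition, analogous to an external table or view:
# column names mapped to the types applied at read time.
schema = {"ts": str, "user": str, "value": int}

def read_with_schema(lines, schema):
    # Types are applied here, when the data is read -- not when it was stored.
    for line in lines.strip().splitlines():
        record = json.loads(line)
        yield {col: cast(record[col]) for col, cast in schema.items()}

rows = list(read_with_schema(raw_events, schema))
print(sum(r["value"] for r in rows))  # 49
```

Swapping the schema for a different one (say, keeping `value` as a string) requires no rewrite of the stored data, which is exactly what makes this approach attractive for data lakes.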
As technology evolved, more openness was required: engines no longer strictly tied to Hadoop, but loosely coupled and supporting many other kinds of databases. This way, query engines enabled a kind of plug-and-play discovery over huge amounts of data without any prerequisites or preparations. Further, standard ANSI SQL was provided as the interface, which made them even more accessible to data analysts and developers. At the same time, the schema no longer necessarily needed to be defined upfront; some engines, such as Drill, can even resolve it automatically at the original storage layer by pushing queries down. Another pioneering tool in this space is Presto, which can even query live streaming data coming from Kafka and Redis. Presto is an in-memory distributed SQL query engine developed by Facebook for exactly those needs: to drive interactive analytics across disparate datasets. For companies such as Netflix, Twitter, Airbnb or Uber it is crucial for their daily business; otherwise they would not be able to process and analyze their petabytes of data. Presto can be used along with many different BI tools, including Power BI, Looker, Tableau, Superset or any other ODBC- and JDBC-compliant tool. In this context, the name SQL-on-Anything was finally coined for the first time.
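The push-down mentioned above is worth a small illustration. In this hypothetical sketch, an in-process SQLite database stands in for a remote source system; the point is that the engine hands the filter to the source instead of pulling every row over the network and filtering afterwards, so only matching rows travel.

```python
import sqlite3

# A local SQLite database standing in for a remote data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, value INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 42), ("bob", 7), ("alice", 1)])

def scan(predicate_sql=None):
    # When the engine can, it pushes the WHERE clause down to the source,
    # so the source itself filters before any rows are transferred.
    query = "SELECT user, value FROM events"
    if predicate_sql:
        query += " WHERE " + predicate_sql
    return conn.execute(query).fetchall()

full_scan = scan()                 # all rows cross the boundary
pushed_down = scan("value > 10")   # only matching rows cross it
print(pushed_down)  # [('alice', 42)]
```

In a real engine the boundary is a network hop to HDFS, S3, Kafka or a relational database, and push-down is often the difference between an interactive query and an unusable one.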
Data Lake Engines
The technological approach of data lake engines isn't all that different. After all, it is still just data virtualization, combining data from different sources. They usually differ in providing more features with regard to data modeling, data transformation, data lineage and data security. Generally, they are also more directed towards the cloud, and one could argue that they now come with a rich user interface, bringing a kind of data self-service philosophy to non-technical users as well. This approach allows taking full advantage of data centralization in public clouds and enables interactive analysis at lower cost thanks to the price elasticity of the cloud, without any lock-in risk. Data lake engines do not necessarily support more data sources either, but due to their later arrival they can take advantage of the latest technologies from the ground up. Databricks, for instance, recently announced SQL Analytics, powered by their Delta Engine, which directly queries Delta Lake tables on data lakes. Further, it provides a SQL-native interface for data exploration, and dashboards can be shared with others. Another very promising tool, and one of my favorites in this regard, is Dremio, which is open-source at its core but backed by the company of the same name, offering a commercialized enterprise version with some additional features.
Dremio builds a direct bridge between BI tools and the queried data source systems, as opposed to traditional multi-layer architectures. The main technologies used behind the scenes are Drill, Arrow, Calcite, and Parquet. This combination provides schema-free SQL for a variety of data sources plus a columnar in-memory analytics execution engine with push-down functionality, and queries can easily be materialized to improve query performance. Arrow, by the way, is meanwhile considered a kind of de-facto standard for in-memory analytics.
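The columnar idea behind Arrow and Parquet can be shown with plain Python lists. This is only a sketch of the memory layout, not Arrow's actual buffers: storing one array per column keeps each column's values together, so an analytical aggregation touches only the column it needs instead of walking every record.

```python
# Row-oriented layout: one record per entry, as an OLTP system stores data.
rows = [
    {"id": 1, "amount": 30},
    {"id": 2, "amount": 10},
    {"id": 3, "amount": 5},
]

# Columnar layout: one contiguous array per column, as Arrow/Parquet do.
columns = {
    "id": [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# An aggregation over one column now scans a single array instead of
# picking the "amount" field out of every record.
total = sum(columns["amount"])
print(total)  # 45
```

At data-lake scale this layout is what makes vectorized execution, compression and CPU-cache-friendly scans possible, which is why an engine like Dremio builds on it from the ground up.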
In the end, whether or not to physically centralize data totally depends on the use case, and such engines give you an alternative solution by querying data where it actually lives. Likewise, even though such query engines might seem like one-size-fits-all solutions, there will still be use cases which can't be solved on the fly and which require data integration processes and proper data modeling, not to mention real-time data based on microservices architectures. It is also important to note that older distributed query engines such as Hive won't go away that fast; they serve too well in so many existing data architectures, and besides, they integrate seamlessly with most of the newer technologies. Let's see what the future will bring.