The Importance of Apache Drill to the Big Data Ecosystem
There are many lessons that our high school teachers tried to teach us. Some stuck, and others went in one ear and out the other. The one that really stuck for me is that “history repeats itself.” The lesson wasn’t meant to be strictly literal; the point was that we should learn from the mistakes of the past so that we do not repeat them.
You might be wondering what bearing a history lesson has on a technology project such as Apache Drill. To truly appreciate Apache Drill, it is important to understand the history of the projects in this space, as well as the design principles and goals behind its implementation. The lessons learned in this space are a direct reason why Apache Drill is a serious big data tool with zero barriers to entry, one that will enable organizations to leverage big data in ways that were not possible with other tools.
Inspired predominantly by Google’s Dremel, Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL that can query across data sources. It can handle flat, fixed schemas and is purpose-built for semi-structured and nested data.
In the early days of Hadoop, SQL was regarded as a somewhat archaic concept, if only because it was mostly used in relational databases. The concept of NoSQL was born because these big data systems were very different from traditional relational database systems, and accessing them meant writing new software rather than using standard off-the-shelf tools.
Later, SQL-derivative big data projects were created, and NoSQL was repurposed to mean “Not only SQL.” These projects did not deliver enough SQL functionality to allow relational database users to function fully in a big data environment. Hive, an asynchronous, SQL-like, batch-oriented MapReduce job creator, is one such project. It was not created to be ANSI SQL-compliant, and because it runs queries as batch jobs, people wanting fast answers had to wait, which runs counter to the expectations set by traditional relational databases. Its biggest technical limitation is that it requires users to predefine schemas for the data it queries. In this respect it is very similar to a relational database, and so it lacks the flexibility to deal with rapidly changing data formats. Impala then came along, leveraging Hive’s metadata and enabling queries to execute in a non-batch-oriented fashion. While it sped up the process of querying big data, it still had to use a predefined metadata store. In addition, Impala is implemented natively in C++, while the software in the Hadoop ecosystem is primarily written in Java.
In a time when data is being created at ever-increasing rates, people need to rethink how they get data into their systems and into a format that their tools can use to find the information they require. With Drill, this burden will be significantly reduced: because Drill can query data quickly in place, in the same format it landed in, expensive ETL processes may very well be abandoned. After all, why waste time converting data between file formats when the tool can read it in the format in which it landed? Using the data as it is simplifies both the design and the engineering of a platform. The tools of the past were not capable of operating in this manner; Drill is finally opening the door to this inevitable future for real-time business intelligence. The walls between the data silos that contain important information will come down. Data received and stored in real time will no longer sit waiting on lengthy ETL processes; instead, it can be queried on the fly, moving the business to real time with a much simpler approach.
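To make "queried in place" a little more concrete, here is a minimal Python sketch of submitting a SQL query over a raw JSON file to Drill's REST interface. It is an illustration under stated assumptions, not a reference setup: the Drillbit address (localhost:8047), the file path /data/landing/events.json, the field names customer.name and event_type, and the helper names build_query_request and run_query are all hypothetical, and the dfs storage plugin is assumed to be enabled.

```python
import json
import urllib.request

# Assumption: a Drillbit is running locally with its web/REST port on 8047.
DRILL_URL = "http://localhost:8047/query.json"

# Assumption: a raw JSON file sits at this hypothetical path and is reachable
# through the `dfs` storage plugin. No table definition or schema is declared
# beforehand; the nested field `customer.name` is addressed directly.
SQL = (
    "SELECT t.customer.name AS customer_name, t.event_type "
    "FROM dfs.`/data/landing/events.json` t "
    "LIMIT 10"
)


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) the HTTP POST that submits a SQL query."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the result rows (needs a live Drillbit)."""
    with urllib.request.urlopen(build_query_request(sql, url)) as response:
        return json.load(response)["rows"]
```

The point of the sketch is what is absent: there is no load step, no conversion to another file format, and no schema registered in a metastore before the SELECT runs. Calling run_query(SQL) against a live Drillbit would return the matching records.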
At this point, there is little argument to be made against the idea that SQL is here to stay. With Apache Drill, organizations now have a solution that enables them to perform easy analysis of complex data structures and datasets using well-known SQL semantics. In essence, Drill has taken the approach of learning from history instead of repeating it. By understanding the limitations of other tools in this space, Drill is enabling businesses to leverage big data in new and powerful ways that were not previously available within the big data ecosystem.
Originally published at www.dbta.com on April 8, 2015.