A Gentle Introduction to Apache Drill
I hear you saying "yet another Apache project", but this one is really useful right away.
Drill is designed to use SQL statements to query S3, HBase, HDFS, MapR-DB, Swift, Google Cloud Storage, local files, RDBMSs and a certain garbage schemaless DB whose name I am not going to mention (please don’t use it; use RethinkDB if you need one). This blog post explains how to get started with it on your own machine.
Broadly speaking, Drill, Impala, Hive and Spark SQL all fit into the SQL-on-Hadoop category. What sets Drill apart is that it lets you explore datasets without defining any schema upfront in the Hive metastore. Drill is built to work with schemas that are dynamic and with data that is complex. Compared with Impala, it handles nested data better, and again it does not need schema definitions upfront.
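To make the nested-data point concrete, here is a minimal sketch of what such a query can look like once Drill is running. The file path and the field names (name, address.city, orders) are hypothetical and only there for illustration; the dotted access through a table alias and the FLATTEN function are regular Drill SQL.

```sql
-- Hypothetical file: each record has a nested "address" map
-- and an "orders" array. No schema is declared anywhere.
SELECT t.name,
       t.address.city    AS city,        -- reach into a nested map
       FLATTEN(t.orders) AS order_item   -- one output row per array element
FROM dfs.`/path/to/customers.json` t
LIMIT 10;
```

Note the table alias t: Drill needs it to tell a nested field apart from a schema-qualified name.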
One of the nice things about Apache Drill is that you don’t need a cluster of machines to make it useful. You can benefit from it right on your desktop. Let’s look at an example.
If you have a JSON file on your disk, you will probably format it using jsbeautifier.org or your IDE and try to find data using search, or start typing some code in your favorite programming language. Slow, boring, daunting. You just want to get it over with. Good luck if your JSON file is over 10 MB.
You can install Drill by following this article: https://drill.apache.org/docs/drill-in-10-minutes/
The Drill folder also contains sample data, so let’s issue some SQL statements, shall we?
SELECT * FROM cp.`employee.json` LIMIT 3;
SELECT AVG(salary) FROM cp.`employee.json` WHERE salary > 10000;
SELECT AVG(salary) AS avg_salary FROM cp.`employee.json` WHERE salary > 10000 AND supervisor_id = 1;
Each of these returns a nice ASCII table view. cp stands for classpath.
The second example uses the Parquet format. In case you are not familiar with it:
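Staying with the same sample file for a moment, aggregates work as you would expect from any SQL engine. The sketch below assumes that employee.json carries a position_title field next to salary; if your copy of the sample data differs, swap in any other column.

```sql
-- Average salary per position, highest first.
-- position_title is assumed to exist in the bundled sample file.
SELECT position_title,
       COUNT(*)    AS headcount,
       AVG(salary) AS avg_salary
FROM cp.`employee.json`
GROUP BY position_title
ORDER BY avg_salary DESC
LIMIT 5;
```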
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
SELECT * FROM dfs.`<path-to-installation>/apache-drill-<version>/sample-data/nation.parquet`;
How about TSV and CSV?
Drill doesn’t require you to do anything as long as the file has the proper extension. One catch is that there are no column names: Drill exposes each row as an array called columns, so you have to select by index, e.g. columns[0], columns[1]
SELECT columns[0], columns[1], columns[2] FROM dfs.`/home/bahadir/october_retention.tsv`;
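Since everything in a text file comes back as a string inside the columns array, in practice you will usually want to alias and cast the entries. A small sketch follows; the meaning given to each column here (user_id, day_number, status) is made up for illustration, since the actual layout of the file is not shown in this post.

```sql
-- columns is zero-indexed; the aliases and the INT cast are assumptions
-- about what the TSV file contains.
SELECT columns[0]              AS user_id,
       CAST(columns[1] AS INT) AS day_number,
       columns[2]              AS status
FROM dfs.`/home/bahadir/october_retention.tsv`
LIMIT 10;
```

With the aliases in place, the result reads like a normal table and the cast column can be used in numeric comparisons and aggregates.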
As you may have realized, we are using the dfs.`<file_path>` syntax to query local files. dfs stands for Distributed File System; it is a storage plugin and it is configurable. Check out the storage plugin pages of the Drill documentation for details.
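One thing that configuration buys you is workspaces, which are named directories inside a storage plugin, so you don’t have to type full paths every time. The sketch below assumes the default dfs configuration, which normally ships with a tmp workspace pointing at /tmp, and a hypothetical copy of the TSV file placed there.

```sql
-- List the storage plugins and workspaces Drill currently knows about.
SHOW SCHEMAS;

-- Switch to the tmp workspace of the dfs plugin (assumed default config).
USE dfs.tmp;

-- Paths are now relative to the workspace directory.
SELECT * FROM `october_retention.tsv` LIMIT 5;
```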
You may even configure file formats to query compressed files. Nice!