Query Engine looks interesting — my pre-check checklist

Published in Wrong AI, Mar 16, 2018

Modern data platforms decouple compute and storage. Storage is centralized, typically on S3 or HDFS. There is a plethora of choices for the query engine, including Spark, Presto, Impala, Drill, and Hive.

These are the key dimensions I look at before getting serious about a query engine:

Data Sources/Formats Supported

Data Sources supported including Hadoop, Filesystem, NoSQL stores, etc.
Data formats supported including JSON, Parquet, ORCFile, Avro, etc.
Support for compressed data
Plug-and-play support for specialized query engines
Reliance on Hive metastore for the data schema. An alternative is to push down queries to the storage layer and let the schema be resolved at the storage layer.

Query Processing Capabilities

Degree of ANSI SQL syntax and semantic fidelity
Support for generating incremental results
ETL support
Support for OLAP functions
Ability to run concurrent user queries
Ability to push down predicates to the storage layer
Indexing support vs. columnar storage
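
To make the predicate pushdown item concrete, here is a minimal sketch in plain Python. It is a toy model, not any engine's actual implementation: the `RowGroup` class, its min/max statistics, and both scan functions are hypothetical names invented for illustration. The idea is that when the storage layer keeps per-chunk statistics, a filter such as `value >= 25` can rule out whole chunks before any rows are read.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A chunk of rows plus min/max statistics for one column (hypothetical layout)."""
    min_value: int
    max_value: int
    rows: List[Dict[str, int]]


def make_groups(values: List[int], size: int) -> List[RowGroup]:
    """Split values into fixed-size groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append(RowGroup(min(chunk), max(chunk), [{"value": v} for v in chunk]))
    return groups


def scan_without_pushdown(groups: List[RowGroup], lower: int) -> Tuple[List[dict], int]:
    """The engine reads every row and applies the filter itself."""
    rows_read, result = 0, []
    for group in groups:
        for row in group.rows:
            rows_read += 1
            if row["value"] >= lower:
                result.append(row)
    return result, rows_read


def scan_with_pushdown(groups: List[RowGroup], lower: int) -> Tuple[List[dict], int]:
    """The storage layer skips groups whose statistics rule out any match."""
    rows_read, result = 0, []
    for group in groups:
        if group.max_value < lower:
            continue  # no row in this group can satisfy value >= lower
        for row in group.rows:
            rows_read += 1
            if row["value"] >= lower:
                result.append(row)
    return result, rows_read


groups = make_groups(list(range(30)), size=10)
plain_result, plain_read = scan_without_pushdown(groups, lower=25)
pushed_result, pushed_read = scan_with_pushdown(groups, lower=25)
print(plain_read, pushed_read)
```

Both scans return the same rows; the difference is how many rows had to be read to get there. That gap is roughly what to look for when evaluating an engine's pushdown support.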

Update/Ingestion support

Real-time data ingestion
ACID transaction support
Support for updatable data and DML statements vs. append-only

Support for semi-structured data

Data Schemas: Schema on write vs schema on read
Queries over nested data
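
As a rough illustration of schema on read and nested-data queries, the sketch below parses raw JSON lines with no schema declared up front, infers the field paths at read time, and looks up a nested field by dotted path. The sample records and the helper names `infer_schema` and `get_path` are assumptions made up for this example; real engines expose this through their own SQL dialects.

```python
import json

# Raw records as they might land in storage; no schema was declared on write.
RAW_LINES = [
    '{"user": {"id": 1, "geo": {"country": "US"}}, "clicks": 3}',
    '{"user": {"id": 2, "geo": {"country": "IN"}}, "clicks": 5, "referrer": "ad"}',
]


def get_path(record, path, default=None):
    """Follow a dotted path such as 'user.geo.country' through nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def infer_schema(records):
    """Map each dotted leaf path to the set of Python type names seen for it."""
    schema = {}

    def walk(value, prefix):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{prefix}.{key}" if prefix else key)
        else:
            schema.setdefault(prefix, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


records = [json.loads(line) for line in RAW_LINES]
schema = infer_schema(records)
countries = [get_path(r, "user.geo.country") for r in records]
print(sorted(schema))
print(countries)
```

Note that `referrer` appears in only one record. Under schema on read, the field simply shows up in the inferred schema and resolves to a default where it is missing; under schema on write, the ingest step would have to accept or reject it up front.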

Operational support

Auto-scaling
Mid-query fault tolerance
Tools integration and ecosystem support

Performance benchmarking numbers for loading, selection queries, aggregation tasks, join tasks, and UDF aggregation tasks
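
A small timing harness for these task types might look like the sketch below. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; SQLite is not one of the engines discussed here, and the table names, synthetic data, and `run_benchmark` function are all assumptions for illustration. The UDF aggregation task is left out for brevity.

```python
import sqlite3
import time


def run_benchmark(n_rows=10_000):
    """Time load, selection, aggregation, and join tasks on synthetic data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (user_id INTEGER, duration INTEGER)")
    conn.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")

    visit_rows = [(i % 100, i % 60) for i in range(n_rows)]
    user_rows = [(i, "US" if i % 2 == 0 else "IN") for i in range(100)]

    def load():
        conn.executemany("INSERT INTO visits VALUES (?, ?)", visit_rows)
        conn.executemany("INSERT INTO users VALUES (?, ?)", user_rows)
        conn.commit()
        return n_rows

    def selection():
        query = "SELECT COUNT(*) FROM visits WHERE duration > 30"
        return conn.execute(query).fetchone()[0]

    def aggregation():
        query = "SELECT user_id, SUM(duration) FROM visits GROUP BY user_id"
        return len(conn.execute(query).fetchall())

    def join():
        query = (
            "SELECT u.country, COUNT(*) FROM visits v "
            "JOIN users u ON v.user_id = u.user_id GROUP BY u.country"
        )
        return len(conn.execute(query).fetchall())

    results = {}
    for label, task in [("load", load), ("selection", selection),
                        ("aggregation", aggregation), ("join", join)]:
        start = time.perf_counter()
        value = task()
        results[label] = {"seconds": time.perf_counter() - start, "value": value}

    conn.close()
    return results


if __name__ == "__main__":
    for label, outcome in run_benchmark().items():
        print(f"{label:12s} {outcome['seconds']:.4f}s  value={outcome['value']}")
```

To compare real engines, the same task list would be pointed at each engine's client, with identical data and several repetitions per task, since a single run like this is dominated by noise.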

Written by Sandeep Uttamchandani