Query Engine looks interesting — my pre-check checklist
Modern data platforms decouple compute from storage. Storage is centralized, typically on S3 or HDFS. There is a plethora of query engines to choose from, including Spark, Presto, Impala, Drill, and Hive.
These are the key dimensions I look at before getting serious about a query engine:
Data Sources/Formats Supported
- Data Sources supported including Hadoop, Filesystem, NoSQL stores, etc.
- Data formats supported including JSON, Parquet, ORCFile, Avro, etc.
- Support for compressed data
- Plug-&-Play support for specialized query engines
- Reliance on Hive metastore for the data schema. An alternative is to push down queries to the storage layer and let the schema be resolved at the storage layer.
Query Processing Capabilities
- Degree of ANSI SQL syntax and semantic fidelity
- Support for generating incremental results
- ETL support
- Support for OLAP functions
- Ability to run concurrent user queries
- Ability to push down predicates to the storage layer
- Indexing support vs columnar-storage
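To make the predicate pushdown item concrete, here is a minimal sketch in Python. Everything in it (`ToyStorage`, `RowGroup`, the `ts` column) is hypothetical and not the API of any real engine; the min/max statistics are only loosely modeled on what columnar formats keep per row group. It shows that both query paths return the same rows, but the pushed-down version lets the storage layer skip chunks it can rule out.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class RowGroup:
    """A chunk of rows plus min/max statistics on the 'ts' column."""
    rows: List[dict]
    min_ts: int
    max_ts: int


class ToyStorage:
    """Hypothetical storage layer, for illustration only."""

    def __init__(self, row_groups: List[RowGroup]) -> None:
        self.row_groups = row_groups
        self.groups_read = 0  # number of row groups actually scanned

    def scan(self, min_ts: Optional[int] = None) -> Iterator[dict]:
        for group in self.row_groups:
            # Pushed-down predicate: skip groups whose stats rule out a match.
            if min_ts is not None and group.max_ts < min_ts:
                continue
            self.groups_read += 1
            for row in group.rows:
                if min_ts is None or row["ts"] >= min_ts:
                    yield row


def make_storage() -> ToyStorage:
    """Build three row groups covering ts 0-99, 100-199, and 200-299."""
    groups = []
    for start in (0, 100, 200):
        rows = [{"ts": ts, "value": ts * 2} for ts in range(start, start + 100)]
        groups.append(RowGroup(rows=rows, min_ts=start, max_ts=start + 99))
    return ToyStorage(groups)


def query_without_pushdown(storage: ToyStorage, min_ts: int) -> List[dict]:
    # The engine pulls every row and filters on its own side.
    return [row for row in storage.scan() if row["ts"] >= min_ts]


def query_with_pushdown(storage: ToyStorage, min_ts: int) -> List[dict]:
    # The engine hands the predicate to the storage layer.
    return list(storage.scan(min_ts=min_ts))
```

Comparing `groups_read` on two fresh `make_storage()` instances after running each query with the same `min_ts` shows how much I/O the pushdown avoids in this toy setup.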
Update/Ingestion support
- Real-time data ingestion
- ACID transaction support
- Support for updatable data and DML statements vs append-only
Support for semi-structured data
- Data Schemas: Schema on write vs schema on read
- Queries over nested data
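As a rough illustration of schema on read over nested data, the sketch below keeps raw JSON lines as they are and resolves dotted paths only at query time. The sample events, `get_path`, and `select` are made up for this example; a real engine would expose this through its SQL dialect rather than a Python helper.

```python
import json
from typing import Any, Iterable, Iterator, List

# Hypothetical raw events; note that the records do not share one fixed shape.
RAW_EVENTS = [
    '{"user": {"id": 1, "geo": {"country": "US"}}, "event": "click"}',
    '{"user": {"id": 2}, "event": "view"}',
    '{"user": {"id": 3, "geo": {"country": "IN"}}, "event": "click", "extra": 7}',
]


def get_path(record: dict, path: str, default: Any = None) -> Any:
    """Resolve a dotted path such as 'user.geo.country' against a nested dict."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def select(lines: Iterable[str], paths: List[str]) -> Iterator[tuple]:
    """Schema on read: structure is interpreted per record, at query time."""
    for line in lines:
        record = json.loads(line)
        yield tuple(get_path(record, p) for p in paths)


rows = list(select(RAW_EVENTS, ["user.id", "user.geo.country"]))
```

Records missing a nested field come back with `None` in that position instead of failing the query. With schema on write, the same mismatch would typically be rejected or coerced at ingestion time.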
Operational support
- Auto-scaling
- Mid-query fault tolerance
- Tools integration and ecosystem support
Performance benchmarking
- Benchmark numbers for loading, selection queries, aggregation tasks, join tasks, and UDF aggregation tasks
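A small harness for collecting these five numbers might look like the sketch below. It is only a skeleton: Python's built-in `sqlite3` stands in for the engine under test, and the `visits`/`pages` tables, the `sum_sq` aggregate, and the row counts are all assumptions made for the example. The timings it prints say nothing about any real query engine.

```python
import sqlite3
import time
from typing import Callable, Dict


class SumOfSquares:
    """User-defined aggregate, registered below as sum_sq()."""

    def __init__(self) -> None:
        self.total = 0

    def step(self, value: int) -> None:
        self.total += value * value

    def finalize(self) -> int:
        return self.total


def timed(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_benchmark(n_rows: int = 10_000) -> Dict[str, float]:
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("sum_sq", 1, SumOfSquares)
    conn.execute("CREATE TABLE visits (id INTEGER, page_id INTEGER, revenue INTEGER)")
    conn.execute("CREATE TABLE pages (page_id INTEGER, page_rank INTEGER)")

    def load() -> None:
        conn.executemany(
            "INSERT INTO pages VALUES (?, ?)",
            [(p, p % 10) for p in range(100)],
        )
        conn.executemany(
            "INSERT INTO visits VALUES (?, ?, ?)",
            [(i, i % 100, i % 7) for i in range(n_rows)],
        )
        conn.commit()

    queries = {
        "selection": "SELECT id FROM visits WHERE revenue > 5",
        "aggregation": "SELECT page_id, SUM(revenue) FROM visits GROUP BY page_id",
        "join": (
            "SELECT p.page_rank, COUNT(*) FROM visits v "
            "JOIN pages p ON v.page_id = p.page_id GROUP BY p.page_rank"
        ),
        "udf_aggregation": "SELECT page_id, sum_sq(revenue) FROM visits GROUP BY page_id",
    }

    results = {"loading": timed(load)}
    for name, sql in queries.items():
        # Bind sql as a default argument so each lambda runs its own query.
        results[name] = timed(lambda sql=sql: conn.execute(sql).fetchall())
    conn.close()
    return results


if __name__ == "__main__":
    for task, seconds in run_benchmark().items():
        print(f"{task:>16}: {seconds:.4f}s")
```

For a real comparison I would swap the connection for each candidate engine's client, use a representative dataset, and repeat each task several times instead of trusting a single run.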