Query Engine looks interesting — my pre-check checklist

Sandeep Uttamchandani
Published in Wrong AI, Mar 16, 2018

Modern data platforms decouple compute from storage. Storage is centralized, typically on S3 or HDFS, and there is a plethora of query engines to run on top of it, including Spark, Presto, Impala, Drill, and Hive.

These are the key dimensions I check before getting serious about a query engine:

Data Sources/Formats Supported

  • Data sources supported, including Hadoop, file systems, NoSQL stores, etc.
  • Data formats supported, including JSON, Parquet, ORC, Avro, etc.
  • Support for compressed data
  • Plug-and-play support for specialized query engines
  • Reliance on Hive metastore for the data schema. An alternative is to push down queries to the storage layer and let the schema be resolved at the storage layer.

Query Processing Capabilities

  • Degree of fidelity to ANSI SQL syntax and semantics
  • Support for generating incremental results
  • ETL support
  • Support for OLAP functions
  • Ability to run concurrent user queries
  • Ability to push down predicates to the storage layer
  • Indexing support vs. columnar storage
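
To make the pushdown item concrete, here is a minimal sketch of what a storage layer can do with a pushed-down predicate: skip whole row groups whose min/max statistics rule out a match. The `RowGroup` class and the `scan` function are hypothetical names made up for illustration, not any engine's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical chunk of a columnar file with min/max stats for one column."""
    min_value: int
    max_value: int
    values: List[int]


def scan(groups: List[RowGroup], lower: int) -> Tuple[List[int], int]:
    """Return the values greater than `lower` and the number of row groups read."""
    matched: List[int] = []
    groups_read = 0
    for group in groups:
        # Pushdown: the stats prove no value in this group can match, so skip it.
        if group.max_value <= lower:
            continue
        groups_read += 1
        matched.extend(v for v in group.values if v > lower)
    return matched, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
rows, groups_read = scan(groups, lower=18)
print(rows, groups_read)  # [20, 21, 25, 30] 2
```

The first row group is never read because its maximum is below the filter value. An engine that cannot push the predicate down would have to read all three groups and filter afterwards.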

Update/Ingestion support

  • Real-time data ingestion
  • ACID transaction support
  • Support for updatable data/DML statements vs. append-only
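
One way to picture the updatable vs. append-only trade-off: on an append-only store, updates and deletes are often emulated by appending new versions and resolving them at read time. The sketch below assumes a simple (key, version, value) record layout with `None` as a delete marker. The layout and the `latest_view` helper are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Each record is (key, version, value); a value of None marks a delete (tombstone).
Record = Tuple[str, int, Optional[str]]


def latest_view(log: List[Record]) -> Dict[str, str]:
    """Collapse an append-only log into current state; the highest version wins."""
    newest: Dict[str, Tuple[int, Optional[str]]] = {}
    for key, version, value in log:
        if key not in newest or version > newest[key][0]:
            newest[key] = (version, value)
    # Drop keys whose newest record is a tombstone.
    return {k: v for k, (_, v) in newest.items() if v is not None}


log: List[Record] = [
    ("user1", 1, "alice"),
    ("user2", 1, "bob"),
    ("user1", 2, "alicia"),  # "update" expressed as a newly appended row
    ("user2", 2, None),      # "delete" expressed as a tombstone
]
state = latest_view(log)
print(state)  # {'user1': 'alicia'}
```

The cost moves from write time to read time: every query pays for the merge unless the store compacts old versions in the background.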

Support for semi-structured data

  • Data schemas: schema on write vs. schema on read
  • Queries over nested data
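
As a rough illustration of schema on read combined with nested data, the sketch below keeps records as raw JSON lines and applies structure only at query time, projecting dotted paths into the nested objects. The sample records and the `get_path` and `select` helpers are assumptions made up for this example.

```python
import json
from typing import Any, List

RAW_LINES = [
    '{"id": 1, "user": {"name": "ann", "geo": {"country": "US"}}}',
    '{"id": 2, "user": {"name": "raj"}}',
]


def get_path(record: dict, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    current: Any = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def select(lines: List[str], paths: List[str]) -> List[tuple]:
    """Parse each raw line at query time and project the requested paths."""
    return [tuple(get_path(json.loads(line), p) for p in paths) for line in lines]


result = select(RAW_LINES, ["id", "user.geo.country"])
print(result)  # [(1, 'US'), (2, None)]
```

The second record has no `geo` field and still loads fine; the missing value only shows up as a null at query time. With schema on write, that record would have been rejected or padded at ingestion instead.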

Operational support

  • Auto-scaling
  • Mid-query fault tolerance
  • Tools integration and ecosystem support
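
For the mid-query fault tolerance item, the distinction I care about is whether a lost worker restarts the whole query or only the affected task. A minimal sketch of the second behavior follows, with a simulated flaky task. `run_query` and `make_flaky_task` are hypothetical helpers, not taken from any engine.

```python
from typing import Callable, List


def run_query(tasks: List[Callable[[], int]], max_attempts: int = 3) -> int:
    """Run each task, retrying only the failed task instead of restarting the query."""
    total = 0
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            try:
                total += task()
                break
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # out of retries: the whole query fails
    return total


def make_flaky_task(value: int, failures: int) -> Callable[[], int]:
    """Build a task that raises `failures` times before returning `value`."""
    state = {"remaining": failures}

    def task() -> int:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated worker loss")
        return value

    return task


tasks = [make_flaky_task(10, 0), make_flaky_task(20, 2), make_flaky_task(30, 0)]
result = run_query(tasks)
print(result)  # 60
```

The second task fails twice and succeeds on its third attempt, while the work already done by the first task is kept. An engine without mid-query recovery would surface the first failure to the user and rerun everything.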

Performance benchmark numbers for loading, selection queries, aggregation tasks, join tasks, and UDF aggregation tasks
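
When published numbers are missing or not comparable, a small timing harness helps collect your own. The sketch below takes the median wall-clock time over a few runs per task. The workloads are in-memory stand-ins that I made up, to be replaced with calls into the engine under evaluation.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(workloads: Dict[str, Callable[[], object]], runs: int = 5) -> Dict[str, float]:
    """Return the median wall-clock seconds per workload over `runs` executions."""
    results: Dict[str, float] = {}
    for name, workload in workloads.items():
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            workload()
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)
    return results


# Stand-in workloads over an in-memory list; swap in real engine queries.
data = list(range(100_000))
workloads = {
    "selection": lambda: [x for x in data if x % 7 == 0],
    "aggregation": lambda: sum(data),
    "join": lambda: len(set(data) & set(range(50_000, 150_000))),
}
timings = benchmark(workloads)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f}s")
```

The median is used instead of the mean so that a single cold-cache or noisy run does not dominate the result.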

Sandeep Uttamchandani

Sharing 20+ years of real-world exec experience leading Data, Analytics, AI & SW Products. O’Reilly book author. Founder AIForEveryone.org. #Mentor #Advise