Benchmark: Spark SQL vs. Presto

Cluster Setup:

Presto:

  • Presto 0.152 (latest)
  • 1 c3.xlarge node as coordinator; no work is scheduled on the coordinator
  • 3 c3.2xlarge nodes as workers
  • 8 vCPUs, 15GB mem per worker node
  • Max query memory per node: 9GB
  • Hive metastore and thrift server running on coordinator node

Spark:

  • Spark 1.6.1 with default params
  • 1 c3.xlarge node as master
  • 3 c3.2xlarge nodes as workers
  • 8 vCPUs, 15GB mem per worker node

Tuning applied to Presto:

  • distributed-joins-enabled=false
  • optimizer.processing-optimization=columnar_dictionary
  • hive.parquet-optimized-reader.enabled=true
  • hive.parquet-predicate-pushdown.enabled=true
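
As a rough sketch, the settings above would typically be split across two Presto configuration files. The file paths below are assumptions based on Presto's standard layout, and the Hive catalog is assumed to be named hive. The query.max-memory-per-node line is also an assumption about how the 9GB per-node limit from the cluster setup was configured.

```properties
# etc/config.properties (assumed path; set on every Presto node)
# Disable distributed (hash-partitioned) joins, so joins run as broadcast joins
distributed-joins-enabled=false
optimizer.processing-optimization=columnar_dictionary
# Assumed to be the setting behind the 9GB per-node query limit
query.max-memory-per-node=9GB

# etc/catalog/hive.properties (assumed path; catalog assumed to be named "hive")
hive.parquet-optimized-reader.enabled=true
hive.parquet-predicate-pushdown.enabled=true
```

The two hive.parquet-* properties belong to the Hive connector, so they go in the catalog file rather than in the node-level config.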

Benchmark result:

I don’t know why Presto performs so poorly when joining large data sets.