Hao Gao
Jul 14, 2017 · 2 min read

Recently I have been working on making all of our warehouse data queryable by Presto. We have lots of data in Parquet format, and our batch data pipelines are all Spark jobs. They are normal ETL jobs: data flows into Kafka, then Spark/Flink, and is finally persisted on S3.

After I deployed a Presto cluster on our Mesos cluster, everything worked great. I am really happy, since Presto is really stable and fast. Finally, I created external tables in the Hive metastore to map most of the data on S3.

Presto provides two Parquet readers. The older one depends on Hive (a shaded Hive 1.2), and the newer one was contributed by Uber. Since the new one has lots of improvements, I decided to give it a try. But to my surprise, it doesn’t work well :(

To summarize:

  1. List is not working
  2. Null handling in struct

For #1, I found:

https://github.com/prestodb/presto/issues/8133

https://github.com/prestodb/presto/issues/5316

For #2, I found:

https://github.com/prestodb/presto/pull/7601

https://github.com/prestodb/presto/issues/7947

In this post I will focus only on #1.

Nested structures in Parquet are probably not a good idea. But I think sometimes it just makes a lot of sense to use a nested structure to represent data.

After digging further into the issues, I also found that different systems (Spark/Hive/parquet-avro) can use different annotations to represent an array.

The following one was written by parquet-thrift:

message ParquetSchema {
  optional group persons (LIST) {
    repeated group persons_tuple {
      required group name {
        optional binary first_name (UTF8);
        optional binary last_name (UTF8);
      }
      optional int32 id;
      optional binary email (UTF8);
      optional group phones (LIST) {
        repeated group phones_tuple {
          optional binary number (UTF8);
          optional binary type (ENUM);
        }
      }
    }
  }
}

Next, this is the example written by Spark 1.6:

message spark_schema {
  optional group persons (LIST) {
    repeated group list {
      optional group element {
        optional group name {
          optional binary first_name (UTF8);
          optional binary last_name (UTF8);
        }
        optional int32 id;
        optional binary email (UTF8);
        optional group phones (LIST) {
          repeated group list {
            optional group element {
              optional binary number (UTF8);
              optional binary type (UTF8);
            }
          }
        }
      }
    }
  }
}

As you can see, a list can be represented with either a 2-level structure or a 3-level structure. So either the Parquet reader handles all of these layouts, or the pipelines must generate a “Presto-readable structure”, which is the second case.

Surprisingly, Spark can read both of them. So I dug into the Spark source code and found this:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L456
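To give a feel for the kind of check a reader needs, here is a minimal Python sketch of the backward-compatibility rules for LIST as I understand them from the Parquet format spec. It is not the Spark code and not my Presto patch. The `Node` class, the `leaf` helper and the `list_element` function are hypothetical names made up for illustration, and the two example schemas are trimmed-down versions of the ones above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for one node of a Parquet schema."""
    name: str
    repetition: str                          # "required", "optional" or "repeated"
    children: Optional[List["Node"]] = None  # None means a primitive column


def list_element(list_group: Node) -> Node:
    """Return the schema node that describes a single element of a LIST group."""
    if not list_group.children or len(list_group.children) != 1:
        raise ValueError("a LIST group must contain exactly one field")
    repeated = list_group.children[0]
    if repeated.repetition != "repeated":
        raise ValueError("the field inside a LIST group must be repeated")

    # Rule 1: a repeated primitive is itself the element.
    if repeated.children is None:
        return repeated
    if len(repeated.children) == 0:
        raise ValueError("empty repeated group")
    # Rule 2: a repeated group with more than one field is itself the element.
    if len(repeated.children) > 1:
        return repeated
    # Rule 3: legacy writers name a single-field element group
    # "array" or "<list name>_tuple".
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated
    # Rule 4: otherwise this is the standard 3-level layout,
    # and the element is the only child of the repeated group.
    return repeated.children[0]


def leaf(name: str) -> Node:
    """Hypothetical helper: an optional primitive column."""
    return Node(name, "optional")


# 2-level layout, like the thrift-written schema above (fields trimmed).
thrift_persons = Node("persons", "optional", [
    Node("persons_tuple", "repeated", [leaf("id"), leaf("email")]),
])

# 3-level layout, like the Spark 1.6 schema above (fields trimmed).
spark_persons = Node("persons", "optional", [
    Node("list", "repeated", [
        Node("element", "optional", [leaf("id"), leaf("email")]),
    ]),
])

print(list_element(thrift_persons).name)  # persons_tuple
print(list_element(spark_persons).name)   # element
```

In the 2-level case the repeated group itself is the element, while in the 3-level case the element sits one level deeper, so a reader that assumes only one layout will mis-read the other.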

Okay then, we have all we need, so let's copy some code to make it happen. Finally, I got it working here:

https://github.com/hadoop-noob/presto/tree/parquet_reader_nested_struct

If you face the same issue, you can:

  1. Use the old Parquet reader
  2. Wait for the above issues to be resolved
  3. Try my patch (which is small) and cross your fingers
  4. Patch yourself!

Feel free to comment down below or message me!

Hadoop Noob

Elephant trainers
