Druid parquet extension on Array/List type

Hao Gao
Published in Hadoop Noob
Apr 6, 2018

As we rolled out and stabilized our Realtime Flink Parquet Data Warehouse, we started considering ingesting parquet data into Druid directly. We followed the guideline here, and everything seemed to work well in the beginning. Then, when our QA team ran integration tests on a dashboard powered by Druid, they found that some fields were not correct. As you can tell from the title, all of these fields are List types in parquet.

Let me give an example. Say we have a field that is a List of Strings, and the data looks like {“Alice”, “Bob”}. When we use the druid-parquet-extension, the data shows up as {“element”: “Alice”, “element”: “Bob”}. If you happened to read my previous blog about parquet, you know that in parquet a list is represented as a three-level structure, something like:

optional group name_tag_array (LIST) {
  repeated group list {
    optional binary element (UTF8);
  }
}
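By the way, if you want to check what your own files look like, you can dump a parquet file’s schema directly. Here is a minimal sketch, assuming parquet-hadoop is on the classpath; the file name data.parquet is just a placeholder for illustration.

// Small sketch: print a parquet file's schema to see whether list fields
// use the three-level LIST structure shown above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class DumpParquetSchema {
  public static void main(String[] args) throws Exception {
    HadoopInputFile file =
        HadoopInputFile.fromPath(new Path("data.parquet"), new Configuration());
    try (ParquetFileReader reader = ParquetFileReader.open(file)) {
      // Prints the message type, including the repeated group wrapped
      // around each LIST element (e.g. the name_tag_array group above).
      MessageType schema = reader.getFileMetaData().getSchema();
      System.out.println(schema);
    }
  }
}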

So it looks like the extra element key comes from the schema. From what we can tell so far, the problem happens in the parquet reader.

We took a closer look at druid-parquet-extension, and it says it depends on druid-avro-extension. From there it is not hard to guess what happens: the Druid Hadoop indexer actually reads parquet files from HDFS or S3 and then uses Avro objects to hold the data in memory.
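To see that conversion step in isolation, here is a minimal sketch using parquet-avro’s AvroParquetReader, which is roughly the kind of thing the indexer does under the hood. This is not Druid’s actual indexer code, and the file name is made up for illustration.

// Minimal sketch: read a parquet file into Avro GenericRecords.
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadParquetAsAvro {
  public static void main(String[] args) throws Exception {
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("data.parquet")).build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        // Depending on how the schema converter handles the LIST annotation,
        // the list field can come back as nested records such as
        // [{"element": "Alice"}, {"element": "Bob"}] instead of ["Alice", "Bob"],
        // which is exactly the problem described above.
        System.out.println(record.get("name_tag_array"));
      }
    }
  }
}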

If you are confused about why Avro is needed here, I suggest you read this blog. To quote from it:

1. Storage formats, which are binary representations of data. For Parquet this is contained within the parquet-format GitHub project.

2. Object model converters, whose job it is to map between an external object model and Parquet’s internal data types. These converters exist in the parquet-mr GitHub project.

3. Object models, which are in-memory representations of data. Avro, Thrift, Protocol Buffers, Hive and Pig are all examples of object models. Parquet does actually supply an example object model (with MapReduce support), but the intention is that you’d use one of the other richer object models such as Avro.

Parquet is a storage format. Avro can be used as both an object model and a storage format. When we want to use parquet in our programs (Spark, Flink, Java, Python), we need an object model to hold the data, which could be Protobuf, a plain POJO, or Avro.
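To make that split concrete, here is a small sketch of writing data with Avro as the object model and parquet as the storage format. The schema and file name are invented for illustration.

// Object model (Avro) vs storage format (parquet): parquet-avro is the
// converter that maps the Avro record into parquet's on-disk representation.
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteAvroAsParquet {
  public static void main(String[] args) throws Exception {
    // Object model: an Avro schema with a list-of-string field.
    Schema schema = SchemaBuilder.record("Event").fields()
        .name("name_tag_array").type().array().items().stringType().noDefault()
        .endRecord();

    GenericRecord record = new GenericData.Record(schema);
    record.put("name_tag_array", Arrays.asList("Alice", "Bob"));

    // Storage format: on disk the list becomes parquet's three-level
    // LIST structure shown earlier.
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
                 .withSchema(schema)
                 .build()) {
      writer.write(record);
    }
  }
}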

In Druid’s case, the issue is in the object model converter.

How can we prove our assumption? The easiest way is to debug it in your favourite IDE. I cloned the druid repo and set several breakpoints in the druid-parquet-extension module. Here is what I got from the parquet file in its test folder:

Look at the {“array”: “en”}: the extra key is array instead of element simply because in this parquet schema the field is named array. The actual schema is:

optional group language (LIST) {
  repeated group bag {
    optional binary array (UTF8);
  }
}

All right, to conclude: if we fix the schema converter, this will be resolved. Let me give you a pointer to get started: the converter is here. Let me know if anyone patches it. As for me, no time for open source right now, I need to make some money :).
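If you just need a stopgap on your own read path before the converter is patched, one possibility (purely a sketch of mine, not anything that ships with Druid) is to flatten those single-field wrapper records after the Avro conversion:

// Hypothetical workaround sketch, not Druid's converter fix: unwrap
// single-field wrapper records such as {"element": "Alice"} or {"array": "en"}
// that parquet's three-level LIST representation leaves behind.
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.generic.GenericRecord;

public final class ListUnwrapper {

  // Turn a converted LIST value such as [{"element": "Alice"}, {"element": "Bob"}]
  // back into ["Alice", "Bob"]. Items that are not single-field records are
  // passed through unchanged.
  public static List<Object> unwrap(List<?> converted) {
    List<Object> flattened = new ArrayList<>();
    for (Object item : converted) {
      if (item instanceof GenericRecord) {
        GenericRecord wrapper = (GenericRecord) item;
        if (wrapper.getSchema().getFields().size() == 1) {
          // The single field is the synthetic "element"/"array" wrapper.
          flattened.add(wrapper.get(wrapper.getSchema().getFields().get(0).name()));
          continue;
        }
      }
      flattened.add(item);
    }
    return flattened;
  }
}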
