Dremio — best Parquet viewer

Hao Gao · Published in Hadoop Noob · Feb 13, 2018 · 2 min read

Sorry, the title is a bit of a troll. Probably because I expected too much from it.

Back in early last year, Dremio came to our office and did a demo. It was a very informative talk, and we asked a lot of Parquet-related questions, since they are contributors behind Parquet (well, they originally came from Twitter, I think). At that time I also learned about Apache Arrow, which had fewer than 100 GitHub stars back then. Apache Arrow seemed very legit to me, since I deal with different data formats (Protobuf, Avro, Parquet) and different systems (MapReduce, Spark, Flink, Presto) every day. A unified data layer would be a dream.

Anyway, very recently I saw that there is a community version of Dremio, so I decided to give it a try. I downloaded it from here. I wrapped it in a Docker container and put it on our Mesos cluster under Marathon. I used all the default configs, with 1 CPU and 4 GB of memory.
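For anyone curious what running it under Marathon looks like, here is a minimal sketch of a Marathon app definition matching the resources above (1 CPU, 4 GB). The image name and app id are my own placeholders, not something from Dremio's docs:

```json
{
  "id": "/dremio-test",
  "cpus": 1,
  "mem": 4096,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "your-registry/dremio-community:latest"
    }
  }
}
```

You would POST this to Marathon's `/v2/apps` endpoint; in hindsight, bumping `cpus` and `mem` here is probably the first thing to try before blaming the product.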

Now I needed to choose a data source.

Since we don’t have Hive or Redshift (we have Presto), I just chose HDFS. All our data is partitioned by year, month, day, and hour, and stored in Parquet format, so I felt it should be very easy for Dremio to dynamically create a dataset based on the partitions (partition discovery) and schemas (schema merging). Think about it: it is just like writing

spark.read.option("mergeSchema", "true").parquet("data/test_table")

Dremio should just persist the metadata somewhere.
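To make the partition-discovery expectation concrete, here is a minimal sketch of the idea in plain Python: with Hive-style `key=value` directory names (the convention Spark follows, and what I expected Dremio to pick up), the partition columns can be inferred straight from the file paths. The paths below are illustrative, not our real layout:

```python
def discover_partitions(paths):
    """Infer Hive-style partition columns (e.g. year=2018) from file paths."""
    rows = []
    for path in paths:
        # Keep only the key=value path segments; plain segments
        # like "data" or "part-0.parquet" carry no partition info.
        parts = dict(
            seg.split("=", 1)
            for seg in path.split("/")
            if "=" in seg
        )
        rows.append(parts)
    return rows

paths = [
    "data/test_table/year=2018/month=02/day=13/hour=00/part-0.parquet",
    "data/test_table/year=2018/month=02/day=13/hour=01/part-0.parquet",
]
print(discover_partitions(paths))
# → [{'year': '2018', 'month': '02', 'day': '13', 'hour': '00'},
#    {'year': '2018', 'month': '02', 'day': '13', 'hour': '01'}]
```

Schema merging is the harder half (unioning the Parquet footers across files), but the directory walk alone is cheap, which is why a crash at this step surprised me.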

But it just crashed. Maybe I need to give it more resources or tune the config a little bit. I think that, as a product, it should be intuitive to use and evaluate. Of course, since it is an open source project, it makes sense that it has some glitches.

So do I still use it? Yes, I do. I use it as a Parquet viewer. When I want to look at a single Parquet file on HDFS or S3, it works well, and its UI is pretty good :). My initial look is definitely biased; I just haven’t had much time to play with it. I don’t think it suits our needs right now. To me, it just has a better dashboard, without much advantage over Presto. Although its Data Reflections seem very legit, our current Presto setup is fast enough for our current volume (40 cores handle terabytes easily).

I will definitely update this when I play with it more! If anyone can point me to a Docker image (maybe a cluster mode?) I can play with, that would be awesome.

I just remembered why I didn’t just use the Mac version: I downloaded it, and it wouldn’t start :(
