It is a little late to write something about Flink Forward 2018, but I have to, because we (well, not me personally) co-presented with Mesosphere about Flink on Mesos!
As I write this, Flink 1.4 has been released and the 1.5 snapshot is already out, but we are still on Flink 1.3.2.
I want to talk a little bit about Flink's externalized checkpoints. Flink's checkpointing is a great…
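For context, in the Flink 1.3/1.4 line an externalized checkpoint needs a target directory in `flink-conf.yaml`; a minimal sketch (the HDFS path below is an illustrative placeholder, not from the original post):

```yaml
# flink-conf.yaml -- where externalized checkpoint metadata is written.
# Example path only; point this at your own HDFS/S3 location.
state.checkpoints.dir: hdfs:///flink/checkpoints
```

In the job itself, externalized checkpoints are then switched on via `env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)`, which keeps the checkpoint around after the job is cancelled so it can be used for a restore.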
As we rolled out and stabilized our real-time Flink Parquet data warehouse, we considered ingesting Parquet data into Druid directly. We followed the guideline here, and everything seemed to work well in the beginning. Then, when our QA team ran integration tests on…
Sorry, I trolled a bit with the title. Probably because I expected too much from it.
Back in early last year, Dremio came to our office and gave a demo. It was a very informative talk, and we asked a lot of Parquet-related questions since they are contributors…
From the last post, we learned that if we want streaming ETL output in Parquet format, we need to implement a Flink Parquet writer. So let's implement the Writer interface.
public class FlinkAvroParquetWriterV1<T> implements…
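The class declaration above is cut off, so the following is only a sketch of what such a writer could look like, assuming Flink 1.3's `Writer<T>` interface from `flink-connector-filesystem` (used by `BucketingSink`) and parquet-avro's `AvroParquetWriter`; the field names and the schema-as-string trick are my assumptions, not the original implementation.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.streaming.connectors.fs.Writer;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Sketch only: one possible shape for a Parquet Writer plugged into BucketingSink.
public class FlinkAvroParquetWriterV1<T extends GenericRecord> implements Writer<T> {

    // Avro's Schema class is not reliably serializable, so ship it as a string
    // and re-parse it on the task manager (an assumption, not the post's code).
    private final String schemaString;
    private transient ParquetWriter<T> writer; // created lazily in open()
    private transient long position;

    public FlinkAvroParquetWriterV1(Schema schema) {
        this.schemaString = schema.toString();
    }

    @Override
    public void open(FileSystem fs, Path path) throws IOException {
        Schema schema = new Schema.Parser().parse(schemaString);
        writer = AvroParquetWriter.<T>builder(path).withSchema(schema).build();
    }

    @Override
    public void write(T element) throws IOException {
        writer.write(element);
        position = writer.getDataSize(); // approximate bytes buffered/written so far
    }

    @Override
    public long flush() throws IOException {
        // ParquetWriter has no incremental flush; data is only durable on close().
        return position;
    }

    @Override
    public long getPos() throws IOException {
        return position;
    }

    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
        }
    }

    @Override
    public Writer<T> duplicate() {
        return new FlinkAvroParquetWriterV1<>(new Schema.Parser().parse(schemaString));
    }
}
```

The lack of a real `flush()` is the known pain point with this approach: Parquet's columnar layout means a file is only readable after `close()`, which interacts badly with the sink's checkpoint-time flushing.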
“Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.”
Recently I have been working on migrating our current pipelines (mostly PySpark) to JVM-based ones. Our plan is to use Spark for batch processing and Flink for real-time processing.
Cluster Setup:
Presto: