my two cents

Sep 10, 2018

Debug Flink OOM in Docker Container

Recently I was planning to deploy a new Flink pipeline. I tested it in my local and staging environments. When I deployed it to serve the full traffic, the TaskManagers were killed randomly. Since I had enabled externalized checkpoints, I didn’t lose any data or state. But…

Hao Gao in Hadoop Noob

Jul 13, 2017

Presto Parquet Reader

Recently I have been working on making all our warehouse data queryable by Presto. We have lots of data in Parquet format, and our batch data pipelines are all Spark jobs. They are normal ETL jobs: data flows into Kafka, then Spark/Flink, and is finally persisted on S3.

Hao Gao in Hadoop Noob

Sep 13, 2018

json vs msgpack

Which is better? It is really hard to say without some context or constraints, because if I could build it from scratch, I might choose neither.

So I have a cluster of Fluentd aggregators which stream data to Treasure Data. I need to fork the stream to Kinesis…

Hao Gao in Hadoop Noob

Apr 17, 2018

Flink Forward 2018

It is a little late to write something about Flink Forward 2018. But I have to, because we (well, not me personally) actually co-presented with Mesosphere about Flink on Mesos!

Hao Gao in Hadoop Noob

Jul 30, 2018

Notes on Apache Mesos Setup

Recently, I have been working on building a new data ingestion pipeline. I need to ingest data from Kinesis and dump it on S3. Since I am familiar with Flink and Parquet, I decided to just use them. But before I can write and run any Flink jobs, I need a cluster. I really like Docker and I…

Hao Gao in Hadoop Noob

Feb 12, 2018

Dremio — best Parquet viewer

Sorry, I trolled a bit with the title, probably because I expected too much from it.

Early last year, Dremio came to our office and did a demo. It was a very informative talk, and we asked a lot of Parquet-related questions since they are contributors…

Hao Gao in Hadoop Noob

Feb 17, 2017

Benchmark: Spark SQL VS Presto

Cluster Setup:

Presto:

Presto 0.152 (latest)
1 c3.xlarge node as coordinator. No work scheduled on the coordinator
3 c3.2xlarge nodes as workers

Hao Gao in Hadoop Noob

Feb 17, 2017

Recursive avro schema for parquet

I know it sounds stupid to use a recursive data structure (e.g. a tree) in Parquet, but sometimes it happens. Why? Because you may need to consume data that you do not control.

Hao Gao in Hadoop Noob

Apr 5, 2018

Druid parquet extension on Array/List type

As we rolled out and stabilized our real-time Flink Parquet data warehouse, we started considering ingesting Parquet data into Druid directly. We followed the guideline here, and everything seemed to work well in the beginning. When our QA team ran integration tests on…

Hao Gao in Hadoop Noob

Jun 2, 2017

Speed up Kafka queries on Presto

After I added protobuf and Avro decoders to Presto, I can now query my Kafka cluster through Presto. It has saved me lots of time debugging data issues in my data pipelines. Basically, if I don’t see data in Kafka, I do not need to debug my downstream data pipelines.

About

Hadoop Noob

Elephant trainers
