Archive of stories published by Hadoop Noob

Hao Gao in Hadoop Noob

Nov 7, 2017

Flink Parquet Writer

In the last post, we learned that if we want a streaming ETL in Parquet format, we need to implement a Flink Parquet writer. So let’s implement the Writer interface.

Writer V1:

public class FlinkAvroParquetWriterV1<T> implements…

5 responses

Hao Gao in Hadoop Noob

Feb 17, 2017

Benchmark: Spark SQL VS Presto

Cluster Setup:

Presto:

Presto 0.152 (latest)
1 c3.xlarge node as coordinator (no work is scheduled on it)
3 c3.2xlarge nodes as workers

2 responses

Hao Gao in Hadoop Noob

Feb 12, 2018

Dremio — best Parquet viewer

Sorry, I trolled with the title, probably because I expected too much from it.

Early last year, Dremio came to our office and did a demo. It was a very informative talk, and we asked a lot of Parquet-related questions since they are contributors…

2 responses

Hao Gao in Hadoop Noob

Sep 22, 2017

A Realtime Flink Parquet Data Warehouse

Recently I have been working on migrating our current pipelines (mostly PySpark) to JVM-based ones. Our plan is to use Spark for batch processing and Flink for real-time processing.

1 response

Hao Gao in Hadoop Noob

May 19, 2017

Query Kafka on Presto

Recently I have been working on a new data pipeline. It needs to consume data from Kafka, apply some transformations, and then persist the data on HDFS. Once I finished the pipeline, I needed to start integration testing on the staging cluster. When some records are missing on HDFS, I need to figure out…

7 responses

Hao Gao in Hadoop Noob

Jul 13, 2017

Presto Parquet Reader

Recently I have been working on making all of our warehouse data queryable by Presto. We have lots of data in Parquet format, and our batch data pipelines are all Spark jobs. They are normal ETL jobs: data flows into Kafka, then through Spark/Flink, and is finally persisted on S3.

1 response

Hao Gao in Hadoop Noob

Feb 17, 2017

Recursive avro schema for parquet

I know it sounds stupid to use a recursive data structure (e.g. a tree) in Parquet, but sometimes it happens. Why? Because you may need to consume data that you do not control.

Hao Gao in Hadoop Noob

Sep 10, 2018

Debug Flink OOM in Docker Container

Recently I was planning to deploy a new Flink pipeline. I tested it in my local and staging environments. When I deployed it to serve full traffic, the TaskManagers were killed at random. Since I had enabled externalized checkpoints, I won’t lose any data or state. But…

Hao Gao in Hadoop Noob

Nov 3, 2017

Presto In Production

So what’s Presto?

“Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.”

Challenges

Hao Gao in Hadoop Noob

Apr 5, 2018

Druid parquet extension on Array/List type

Now that we have rolled out and stabilized our Realtime Flink Parquet Data Warehouse, we are considering ingesting Parquet data into Druid directly. We followed the guideline here, and everything seemed to work well in the beginning. When our QA team runs integration test on…

These were the top 10 stories published by Hadoop Noob; you can also dive into yearly archives: 2017, 2018, and 2019.

About

Hadoop Noob

Elephant trainers
