“Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.”
Recently I have been working on making all our warehouse data queryable by Presto. We have lots of data in Parquet format, and our batch data pipelines are all Spark jobs. They are ordinary ETL jobs: data flows into Kafka, then through Spark/Flink, and is finally persisted on S3.
Recently I have been working on a new data pipeline that needs to consume the Kafka data, apply some transformations, and then persist the data on HDFS. Once I finished the pipeline, I needed to start integration testing on the staging cluster. When some records are missing on HDFS, I need to figure out…
Cluster Setup:
Presto: