Real-time Analytics with Presto and Apache Pinot — Part I

Xiang Fu
Apache Pinot Developer Blog
2 min read · Feb 2, 2021

Most analytics products today focus either on ad-hoc analytics, which offers query flexibility but no latency guarantees, or on low-latency analytics with limited query capability. In this blog, we will explore how to get the best of both worlds using Apache Pinot and Presto.

Today’s compromises: Latency vs. Flexibility

Let’s take a step back and see how we analyze data with different trade-offs.

Flexibility takes time

Salvador Dali, The Persistence of Memory (1931)

Let’s assume your raw data is hosted in a data lake, available for users to process with Spark/Hadoop compute jobs or a SQL engine.

In this case, users have full flexibility to access all of the data, pick whatever they want, and apply whatever computations they need.

However, it takes a long time to retrieve results, as raw data must be read from the data lake (S3/HDFS/etc.) and the computations are launched on demand (in an ad-hoc manner).

Take this simple example on Presto: an e-commerce business has a dimension table of user information (100k rows with schema <customer_id, name, gender, country, state, city, …>) and a fact table of user orders (1 billion rows with schema <order_id, date, customer_id, items, amount>).

To find the monthly sales of all female customers in California, we need to write a SQL query to join the two tables and then aggregate the results.
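Such a query might look like the following sketch in Presto SQL. The table and column names follow the schemas above; the monthly bucketing via `date_trunc` and the exact grouping columns are assumptions for illustration:

```sql
SELECT date_trunc('month', o.date) AS month,
       c.city,
       SUM(o.amount) AS monthly_sales
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
WHERE c.state = 'California'
  AND c.gender = 'Female'
GROUP BY date_trunc('month', o.date), c.city;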

This query executes in four phases:

  1. The table scanning phase scans data from both tables: one full scan on table orders with columns <date, customer_id, amount>, and another full scan on table customers to fetch all matching <customer_id, city> pairs given the predicate <state = ‘California’ AND gender = ‘Female’>.
  2. The join phase shuffles the data from both tables based on customer_id.
  3. The group-by phase shuffles the data again based on <orders.date, customers.city>, then aggregates the order amounts.
  4. The reduce phase collects all the group-by results and renders the final result.

The data shuffling phases contribute most of the query latency: every shuffled row pays for data partitioning, serialization, network I/O, and deserialization.
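To see why shuffling is expensive, here is a minimal sketch (in Python, not Presto’s actual implementation) of the hash-partitioning step of the join phase: each row is serialized and routed to a worker based on the hash of its join key, so rows with the same customer_id land on the same worker and can be joined locally. The worker count and row data are hypothetical:

```python
import json

NUM_WORKERS = 4  # hypothetical number of workers in the cluster


def partition_rows(rows, key, num_workers=NUM_WORKERS):
    """Route each row to a worker by hashing its join key,
    serializing it as it would be for network transfer."""
    partitions = [[] for _ in range(num_workers)]
    for row in rows:
        worker = hash(row[key]) % num_workers  # partitioning cost
        partitions[worker].append(json.dumps(row))  # serialization cost
    return partitions


orders = [
    {"order_id": 1, "customer_id": 42, "amount": 9.99},
    {"order_id": 2, "customer_id": 7, "amount": 25.00},
    {"order_id": 3, "customer_id": 42, "amount": 5.50},
]
customers = [
    {"customer_id": 42, "city": "San Francisco"},
    {"customer_id": 7, "city": "Los Angeles"},
]

# Rows sharing a customer_id always land in the same partition,
# so each worker can join its slice of orders and customers locally.
order_parts = partition_rows(orders, "customer_id")
customer_parts = partition_rows(customers, "customer_id")
```

On the receiving side, each worker must deserialize its partition before joining, and the group-by phase repeats the whole cycle with a different key, which is why two shuffles dominate the latency.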

ETL trades off data freshness for low query time

…..

Read more at this new link!


Co-founder of StarTree, Apache Pinot Founding Member and PPMC Member