What's Presto? Why is it an important big data tool?

Abhinav Vinci
2 min read · Jan 7, 2024


Presto is an open-source distributed SQL query engine for big data created by Facebook.

  • It is developed for fast analytic queries on large datasets.
  • SQL query engine → You can use SQL syntax!
  • It is designed for high performance: queries run in parallel across a cluster of machines.
  • It can query data from various sources, such as Hadoop Distributed File System (HDFS), relational databases, and more.
https://thenewstack.io/presto-a-data-analytics-ecosystem-built-on-an-open-all-sql-platform/

Key features:

1. Distributed Architecture: It is designed to run queries on a cluster of machines, allowing for parallel processing and efficient use of resources.

2. Versatile: It can query data from multiple sources, including Hadoop, relational databases (like MySQL and PostgreSQL), and other data stores. It provides a unified interface for querying diverse datasets.

3. SQL Compatibility: It supports SQL syntax, making it familiar to users who are accustomed to relational databases.

4. High Performance: It is optimized for low-latency interactive queries, making it suitable for ad-hoc data analysis and business intelligence.

5. Complex queries, joins, subqueries: It supports joins, subqueries, and aggregations, which makes it suitable for a wide range of analytical tasks.

Simple examples of Presto queries:

  • Selecting data from a table
SELECT column1, column2
FROM my_table
WHERE column3 > 100;
  • Aggregation — Calculating the average value of a column
SELECT AVG(salary) AS average_salary
FROM employee_data;
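
The key features above also mention joins, subqueries, and aggregations. A slightly larger sketch might combine all three. The `departments` table and the `employee_id` and `department_id` columns are assumptions made up for illustration; only `employee_data` and `salary` come from the example above.

```sql
-- Hypothetical tables:
--   employee_data(employee_id, department_id, salary)
--   departments(department_id, department_name)
SELECT d.department_name,
       COUNT(*)      AS employee_count,
       AVG(e.salary) AS average_salary
FROM employee_data AS e
JOIN departments AS d
  ON e.department_id = d.department_id
-- Scalar subquery: keep only employees paid above the overall average
WHERE e.salary > (SELECT AVG(salary) FROM employee_data)
GROUP BY d.department_name
ORDER BY average_salary DESC
LIMIT 10;
```

This should list up to ten departments, ranked by the average salary of their above-average earners.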

Use Cases for Presto:

1. Ad-hoc big data analytics: Presto is well suited to ad-hoc analysis, where users need to interactively explore and analyze large datasets. Example: printing raw rows and columns with simple SELECT queries.

2. Interactive querying: low-latency responses make it practical to refine a query step by step while exploring data.

3. Querying across sources: connect to multiple data sources and combine them in a single SQL query.
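
To make the multi-source use case concrete, here is a minimal sketch of a query that joins data from two different systems. Presto addresses tables as `catalog.schema.table`. The catalog names `hive` and `mysql`, and every schema, table, and column name below, are assumptions for illustration; the real names depend on how the cluster's connectors are configured.

```sql
-- Assumed catalogs: "hive" (data stored on HDFS) and "mysql" (a relational database).
-- Schema, table, and column names are hypothetical.
SELECT c.country,
       SUM(o.amount) AS total_amount
FROM hive.sales.orders AS o        -- large fact table in Hadoop
JOIN mysql.crm.customers AS c      -- smaller lookup table in MySQL
  ON o.customer_id = c.customer_id
GROUP BY c.country
ORDER BY total_amount DESC;
```

Presto reads from both sources and performs the join itself, so neither dataset has to be copied into the other system first.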

But there are other SQL query engines, right?

Yes, there are, such as Hive and Spark SQL.

Hive:

  • It's good for batch processing. It's generally slower for interactive queries but works well with HDFS.

Spark SQL:

  • Both Presto and Spark SQL run queries on large distributed datasets with high performance, so the choice mostly depends on the use cases each has become popular for.
  • Spark SQL is more popular for ML and other advanced data science use cases because of Spark's libraries for ML and graph computation, while Presto is more popular for interactive querying and BI-like use cases.
