Chapter 1. Introduction to PrestoDB: Massively Parallel Processing

Published in

DevOps DeepDive

3 min read · Apr 10, 2021

What is PrestoDB?

PrestoDB (Presto) is an open-source SQL query engine used to speed up analytics queries against data of any size.
It can be used with relational databases such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata, and with non-relational data sources such as Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase.
Presto queries data where it is stored, so there is no need to move the data into a separate analytics system. Queries execute in parallel over a memory-based architecture, which lets many of them return results in seconds.

How does PrestoDB work?

Presto is typically deployed alongside Hadoop and uses an architecture similar to a classic massively parallel processing (MPP) database management system. It generally has one coordinator node and multiple worker nodes that work in sync with the coordinator.

When a user submits an SQL query to the coordinator, the coordinator uses a custom query and execution engine to parse the query, plan it, and schedule a distributed query plan across the worker nodes.

Once the query is compiled, Presto splits the work into multiple stages that run on the worker nodes. Processing happens in memory and is pipelined across the network between stages, which avoids unnecessary I/O overhead. This is how parallelism in query execution leads to fast processing. Adding worker nodes increases parallelism and generally yields faster results.
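The staged, in-memory flow described above can be sketched in a few lines of Python. This is only a toy model, not Presto's engine: threads stand in for worker nodes, and the data, the function names, and the number of workers are assumptions made for illustration. It mimics a query like SELECT country, sum(amount) ... GROUP BY country, where each worker aggregates its own slice of the data and a final stage merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: (country, amount) rows, already divided into splits.
SPLITS = [
    [("IN", 10), ("US", 5)],
    [("US", 7), ("IN", 1)],
    [("DE", 4), ("IN", 2)],
]

def partial_sum(split):
    """Worker-side stage: aggregate one split entirely in memory."""
    totals = Counter()
    for country, amount in split:
        totals[country] += amount
    return totals

def final_sum(partials):
    """Final stage: merge the partial aggregates into one answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the counts together
    return dict(merged)

def run_query(splits, workers=3):
    # Threads stand in for worker nodes. Partial results are handed
    # straight to the final stage without being written to disk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return final_sum(pool.map(partial_sum, splits))

result = run_query(SPLITS)
```

With more splits and more workers, more of the partial aggregation can happen at the same time, which is the intuition behind adding worker nodes.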

Presto and Hadoop

Presto is designed for fast, interactive queries on data in HDFS and other data sources.
Unlike Hadoop, Presto doesn't have its own storage system, so it is complementary to Hadoop. Presto can be used with any implementation of Hadoop and is packaged in the Amazon EMR Hadoop distribution.

Presto Architecture

Components:

Coordinator:
The coordinator is the main component of a Presto installation, and every installation must have one. It parses statements, plans queries, and manages the Presto worker nodes, keeping track of all worker activity in order to coordinate queries. It gets results from the workers and returns the final results to the client. Coordinators communicate with workers and clients via REST.
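To make the REST interaction with a client a little more concrete, here is a minimal sketch that uses only the Python standard library. It assumes the coordinator's statement endpoint (/v1/statement) and the X-Presto-* request headers from Presto's client protocol. The coordinator address, user name, and function names are hypothetical, and retries and authentication are left out.

```python
import json
import urllib.request

def build_statement_request(coordinator_url, sql, user, catalog=None, schema=None):
    """Build (but do not send) the HTTP request that submits a query."""
    headers = {"X-Presto-User": user}
    if catalog:
        headers["X-Presto-Catalog"] = catalog
    if schema:
        headers["X-Presto-Schema"] = schema
    return urllib.request.Request(
        url=coordinator_url.rstrip("/") + "/v1/statement",
        data=sql.encode("utf-8"),  # the SQL text is the request body
        headers=headers,
        method="POST",
    )

def run_query(coordinator_url, sql, user):
    """Submit a query, then follow nextUri links until no more pages remain."""
    request = build_statement_request(coordinator_url, sql, user)
    rows = []
    while request is not None:
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        request = (
            urllib.request.Request(next_uri, headers={"X-Presto-User": user})
            if next_uri
            else None
        )
    return rows
```

Against a running cluster, a call such as run_query("http://localhost:8080", "SELECT 1", "alice") would return the result rows; the address and user here are placeholders.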

Worker:
A worker runs tasks and processes data. Worker nodes fetch data from connectors and exchange intermediate data with each other, while the coordinator assigns them tasks. Once a worker node is up and running, it advertises itself to the coordinator's discovery service, which makes it available for task execution.

Presto also has several important components that manage access to the data itself.

Catalog:
A catalog manages information about where data is located.
It contains schemas and references a data source through a connector. When an SQL statement is executed in Presto, it runs against one or more catalogs.
Catalogs are defined in properties files stored in the Presto configuration directory.
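As a rough illustration of what such a properties file holds, the sketch below parses a made-up catalog file. The file name, host, and user are hypothetical. The connector.name, connection-url, and connection-user keys follow the MySQL connector's documented settings, and the sketch relies on Presto naming a catalog after its properties file.

```python
from pathlib import Path

# Hypothetical contents of etc/catalog/sales.properties.
EXAMPLE_FILE = Path("etc/catalog/sales.properties")
EXAMPLE_TEXT = """
# The connector that backs this catalog
connector.name=mysql
connection-url=jdbc:mysql://example.net:3306
connection-user=reporting
"""

def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def describe_catalog(path, text):
    """Return the catalog name (the file stem) and the connector it uses."""
    props = parse_properties(text)
    return {"catalog": path.stem, "connector": props["connector.name"]}

info = describe_catalog(EXAMPLE_FILE, EXAMPLE_TEXT)
```

Here the file sales.properties would give a catalog called sales, backed by the MySQL connector.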

Connector:
Connectors integrate Presto with external data sources such as object stores, relational databases, or Hive.
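One way to picture a connector is as an adapter that lets the engine read rows without knowing how a source stores them. The sketch below is a toy analogy only, not Presto's actual connector interface. Every class and function name is invented for illustration, and table names are simplified to catalog.table, whereas Presto uses catalog.schema.table.

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Toy stand-in for a connector: hides how a source stores its rows."""

    @abstractmethod
    def scan(self, table):
        """Yield the rows of the given table as dicts."""

class ListConnector(Connector):
    """Pretend relational source backed by in-memory lists of dicts."""

    def __init__(self, tables):
        self.tables = tables

    def scan(self, table):
        yield from self.tables[table]

class CsvTextConnector(Connector):
    """Pretend file or object-store source backed by CSV text."""

    def __init__(self, files):
        self.files = files

    def scan(self, table):
        header, *lines = self.files[table].strip().splitlines()
        columns = header.split(",")
        for line in lines:
            yield dict(zip(columns, line.split(",")))

def count_rows(catalogs, qualified_name):
    """Engine side: resolve catalog.table, then scan through the connector."""
    catalog, table = qualified_name.split(".", 1)
    return sum(1 for _ in catalogs[catalog].scan(table))

catalogs = {
    "mysql": ListConnector({"users": [{"id": 1}, {"id": 2}]}),
    "files": CsvTextConnector({"events": "id,kind\n1,click\n2,view\n3,click"}),
}
```

The count_rows function never needs to know which kind of source sits behind a catalog, which is the role connectors play for Presto.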

This was an overview of PrestoDB and its architecture.

Hope this was helpful!
See you in the next chapter!
Happy learning!
Shivani S.


Written by Shivani Singh