Introducing Presto

O'Reilly Media
6 min read · Sep 3, 2020

Editor’s Note: We think this piece is important because it reviews a tool designed to efficiently query vast amounts of data, allowing for analytics across an entire organization. As foremost experts in Presto, Matt Fuller, Manfred Moser, and Martin Traverso provide an introductory overview of this powerful open source, distributed SQL query engine. We’d love to hear from you about what you think about this piece.

The Problems with Big Data

Everybody is capturing more and more data from device metrics, user behavior tracking, business transactions, location data, software and system testing procedures and workflows, and much more. The insights gained from understanding that data and working with it can make or break the success of any initiative, or even a company.

At the same time, the diversity of storage mechanisms available for data has exploded: relational databases, NoSQL databases, document databases, key-value stores, object storage systems, and so on. Many of them are necessary in today’s organizations, and it is no longer possible to use just one of them. Dealing with this can be a daunting task that feels overwhelming.

In addition, all these different systems do not allow you to query and inspect the data with standard tools. Different query languages and analysis tools for niche systems are everywhere. Meanwhile, your business analysts are used to the industry standard, SQL. A myriad of powerful tools rely on SQL for analytics, dashboard creation, rich reporting, and other business intelligence work.

The data is distributed across various silos, and some of them cannot even be queried at the performance your analytics needs require. Other systems, unlike modern cloud applications, store data in monolithic systems that cannot scale horizontally. Without these capabilities, you are narrowing the number of potential use cases and users, and therefore the usefulness of the data.

The traditional approach of creating and maintaining large, dedicated data warehouses has proven to be very expensive in organizations across the globe. Most often, this approach is also found to be too slow and cumbersome for many users and usage patterns.

You can see the tremendous opportunity for a system to unlock all this value.

Presto to the Rescue

Presto is capable of solving all these problems, and of unlocking new opportunities with federated queries to disparate systems, parallel queries, horizontal cluster scaling, and much more.

Presto is an open source, distributed SQL query engine. It was designed and written from the ground up to efficiently query data against disparate data sources of all sizes, ranging from gigabytes to petabytes. Presto breaks the false choice between having fast analytics using an expensive commercial solution, or using a slow “free” solution that requires excessive hardware.

Designed for Performance and Scale
Presto is a tool designed to efficiently query vast amounts of data by using distributed execution. If you have terabytes or even petabytes of data to query, you are likely using tools such as Apache Hive that interact with Hadoop and its Hadoop Distributed File System (HDFS). Presto is designed as an alternative to these tools to more efficiently query that data.

Analysts who expect SQL response times ranging from milliseconds for real-time analysis to seconds and minutes should use Presto. Presto supports SQL, which is commonly used in data warehousing and analytics for analyzing data, aggregating large amounts of data, and producing reports. These workloads are often classified as online analytical processing (OLAP).

Even though Presto understands and can efficiently execute SQL, Presto is not a database, as it does not include its own data storage system. It is not meant to be a general-purpose relational database that serves to replace Microsoft SQL Server, Oracle Database, MySQL, or PostgreSQL. Further, Presto is not designed to handle online transaction processing (OLTP). This is also true of other databases designed and optimized for data warehousing or analytics, such as Teradata, Netezza, Vertica, and Amazon Redshift.

Presto leverages both well-known and novel techniques for distributed query processing. These techniques include in-memory parallel processing, pipelined execution across nodes in the cluster, a multithreaded execution model to keep all the CPU cores busy, efficient flat-memory data structures to minimize Java garbage collection, and Java bytecode generation. For Presto users, these techniques translate into faster insights into your data at a fraction of the cost of other solutions.
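To make the idea of parallel processing a little more concrete, here is a minimal Python sketch of one common pattern: the data is divided into splits, each worker reduces its split to a small partial result, and a coordinating step merges the partials. This is an illustration only, not Presto's actual implementation. The function names, the use of threads in place of cluster nodes, and the choice of an average as the aggregate are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(split):
    """Worker step (hypothetical): reduce one split to a (sum, count) pair."""
    return sum(split), len(split)


def distributed_average(values, num_splits=4):
    """Coordinator step (hypothetical): fan out splits, then merge partials."""
    # Divide the input into roughly equal splits and drop any empty ones.
    splits = [values[i::num_splits] for i in range(num_splits)]
    splits = [s for s in splits if s]

    # Threads stand in for worker nodes here; a real engine would ship
    # the work to separate machines that each read their own data.
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        partials = list(pool.map(partial_aggregate, splits))

    # Merge the small partial results into the final answer.
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(distributed_average(list(range(1, 101))))
```

The point of the pattern is that only the small partial results travel to the merging step, so adding workers lets the expensive scan of the data proceed in parallel.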

Presto was initially designed to query data from HDFS. And it can do that very efficiently. But that is not where it ends. On the contrary, Presto is a query engine that can query data from object storage, relational database management systems (RDBMSs), NoSQL databases, and other systems.

Presto queries data where it lives and does not require a migration of data to a single location. Presto allows you to query data in HDFS and other distributed object storage systems, as well as RDBMSs and other data sources. As such, it can query data wherever it lives and can therefore replace the traditional, expensive, and heavy extract, transform, and load (ETL) processes. Or at a minimum, it can help you with them and lighten the load. So Presto is clearly not just another SQL-on-Hadoop solution.

Object storage systems include Amazon Web Services (AWS) Simple Storage Service (S3), Microsoft Azure Blob Storage, Google Cloud Storage, and S3-compatible storage such as MinIO and Ceph. Presto can query traditional RDBMSs such as Microsoft SQL Server, PostgreSQL, MySQL, Oracle, Teradata, and Amazon Redshift. Presto can also query NoSQL systems such as Apache Cassandra, Apache Kafka, MongoDB, or Elasticsearch. Presto can query virtually anything and is truly a SQL-on-Anything system.

For users, this means that they no longer have to rely on specific query languages or tools to interact with the data in those specific systems. They can simply leverage Presto with their existing SQL skills and their well-understood analytics, dashboarding, and reporting tools. These tools, built on top of SQL, allow analysis of those additional data sets, which are otherwise locked in separate systems. Users can even use Presto to query across different systems with the SQL they know.
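As a rough illustration of what querying across systems with a single SQL statement looks like, the following Python sketch uses SQLite as a stand-in. Two separate in-memory databases are attached under the names lake and crm, and one query joins a table from each. In Presto, tables are addressed in a similar spirit as catalog.schema.table, with each catalog backed by a different data source. The database names, table names, and sample rows here are all hypothetical, and SQLite is used only so that the example runs without a Presto cluster.

```python
import sqlite3


def federated_demo():
    """Join tables that live in two separate databases with one SQL query."""
    conn = sqlite3.connect(":memory:")

    # Each ATTACH creates an independent in-memory database. They play the
    # role of two different source systems (hypothetical names).
    conn.execute("ATTACH DATABASE ':memory:' AS lake")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")

    conn.execute("CREATE TABLE lake.page_views (customer_id INTEGER, url TEXT)")
    conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")

    conn.executemany(
        "INSERT INTO lake.page_views VALUES (?, ?)",
        [(1, "/home"), (1, "/pricing"), (2, "/home")],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )

    # A single statement spans both sources, addressed by qualified names.
    rows = conn.execute(
        """
        SELECT c.name, COUNT(*) AS views
        FROM lake.page_views AS v
        JOIN crm.customers AS c ON c.id = v.customer_id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, views in federated_demo():
        print(name, views)
```

The analyst writes ordinary SQL in both cases. What changes with Presto is that the qualified names can point at entirely different storage systems rather than at two local databases.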

Separation of Data Storage and Query Compute Resources
Presto is not a database with storage; rather, it simply queries data where it lives. When using Presto, storage and compute are decoupled and can be scaled independently. Presto represents the compute layer, whereas the underlying data sources represent the storage layer.

This allows Presto to scale its compute resources for query processing up and down, based on the analytics demand for access to this data. There is no need to move your data, to provision compute and storage to match the exact needs of the current queries, or to change that provisioning regularly as your query needs change.

Presto can scale the query power by scaling the compute cluster dynamically, and the data can be queried right where it lives in the data source. This characteristic allows you to greatly optimize your hardware resource needs and therefore reduce cost.


Matt Fuller is a cofounder at Starburst, the Presto Company. Prior to founding Starburst, Matt was a director of engineering at Teradata, where he worked to build the new Center for Hadoop division within the company. As a major part of this, Matt worked to bring Presto to the enterprise market. Matt has managed a team contributing to the open source Presto project since 2015 and led the internal Presto product roadmap. Starburst was later formed from this team at Teradata.

Manfred Moser is a community advocate, writer, trainer and software engineer at Starburst. Manfred has a long history of developing and advocating open source software. He is an Apache Maven committer, wrote the Hudson book and others, and continues to be active in the open source community and his projects. He is a seasoned trainer and conference presenter for CI/CD, Cloud Native, Agile and other software development tools and processes, having trained well over 20,000 developers for companies including Walmart Labs, Sonatype, and Telus.

Martin Traverso is the cofounder of the Presto Software Foundation and CTO at Starburst. Prior to Starburst, Martin worked as a software engineer at Facebook where he saw the need for fast interactive SQL analytics. Martin and three other engineers worked to create what became Presto. Martin led the Presto development team and in the spring of 2013 Presto was rolled out into production, later made open source in the fall of 2013. Since then, Presto has gained wide adoption both internal and external to Facebook.


