Large Scale Data Transformations Anywhere

Trino was founded to democratize large-scale data processing for everyone. I'm now working on the next phase of Trino: enabling everyone to run resource-intensive SQL ETL data transformations across a wide variety of data sources.

Andrii Rosa
8 min read · Jan 26, 2022
Photo by Alistair MacRobert on Unsplash

My Story

Soon after my graduation in 2012, I joined 4Sync, a Kyiv-based startup specializing in consumer-grade cloud storage solutions. I had the opportunity to work on content delivery optimizations and automatic prevention of service abuse. The lack of reliable, easy-to-use data infrastructure was often one of our biggest challenges.

To make decisions, we had to extract and join data from multiple data stores used across the company. We had to write custom MapReduce jobs in Java to extract data from Redis or MySQL, as well as to read events from Apache Flume. The process involved recompilation and redeployment every time we needed to extract or collect some extra data. Once the data was in HDFS, we had to run a set of unreasonably slow Hive queries to transform it and store it in a format optimized for querying.

I kept asking myself a question: “Why is there no solution that would hide all of this complexity under a single convenient interface?”

In 2015 I decided to join Teradata, where I worked on Presto, a database engine designed to solve the data infrastructure problems we faced at 4Sync in an elegant way. Presto provided a pluggable connector interface that made it easy to connect multiple data sources together. It was also blazing fast.

Matt Fuller and I presenting Presto Cost Based Optimizer at Data@Scale (Screenshot from the presentation)

Eventually, I joined Facebook, where I worked on improving Presto's capabilities to support ETL at petabyte scale. With Presto on Spark, we allowed Presto to be used for enormous data preparation jobs that often consume multiple CPU-years, terabytes of distributed memory, and hours of runtime. However, Presto on Spark at Facebook benefits from a highly optimized distributed shuffle service, a custom resource manager, and a customized internal version of Apache Spark.

Presenting Presto on Spark at Spark+AI Summit (Screenshot from the presentation)

I have chosen to move to Starburst, which builds an enterprise product powered by Trino (formerly PrestoSQL). I saw the opportunity at Starburst to solve large-scale analytics problems for a broad variety of customers and use cases with an all-in-one, best-of-breed solution.

Community-driven projects bring in a much wider array of use cases that offer new challenges. Bringing Trino to companies outside of FAANG puts a heavier incentive on being pragmatic in our approach. I also couldn't pass up on the opportunity to reunite with Trino founders Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang, and my old friends from my Teradata days, Matt Fuller and Justin Borgman, all of whom have a laser focus on making data analytics far more accessible to the broader community.

What Drew Me To Trino For ETL

Trino Logo (from Wikimedia)

Though we originally envisioned Trino as an interactive analytics engine, it quickly got picked up as an ETL engine as well. Salesforce, Shopify, Pinterest, and Slack are among the many companies that use Trino for ETL. In short, it can process large volumes of data quickly while making it easy to bring in data from multiple sources. You can find many blog posts about using Trino as an ETL engine, such as those from Seattle Data Guy and the Salesforce Engineering Blog, and we did a drilldown on what makes Trino so fast at ETL here. Below are the key advantages to highlight:

Simple

There is a huge trend toward low-code and no-code technology to lower the barrier of entry for less technical users. Despite not being a fancy UI-based tool, SQL is really the grandfather of these efforts, and it is already widely adopted.

This is why we chose to stick to standard SQL statements to do ETL in Trino. If this capability did not exist in Trino, users would have to convert the standard SQL statements used in Trino to Spark SQL for ETL and then re-optimize each query to work efficiently in the Spark execution engine. To get a sense of how hard that is, just take a look at Stack Overflow. Some examples: 1, 2. This is why you see numerous data engineers pounding their heads trying to figure out how to convert their Trino SQL queries to Spark SQL to do ETL.

Now, with Trino for ETL, users get to focus on delivering business value rather than spending time rewriting and re-optimizing their SQL for a different engine. Moreover, this simplicity means that almost anyone can do ETL. For example, someone working on a specific product's ads team could transform the main ads table to get the views they need, as in the sketch below.
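
To make that concrete, here is a minimal sketch of what such a transformation could look like in Trino, using nothing but standard SQL. The catalog, schema, table, and column names (hive.ads.impressions and so on) are hypothetical and purely illustrative:

```sql
-- Hypothetical example: build a daily per-campaign summary from a raw ads table.
-- All catalog/schema/table names are made up for illustration.
CREATE TABLE hive.ads.daily_campaign_summary AS
SELECT
    date(event_time)  AS event_date,
    campaign_id,
    count(*)          AS impressions,
    count_if(clicked) AS clicks,
    sum(revenue)      AS revenue
FROM hive.ads.impressions
WHERE event_time >= date '2022-01-01'
GROUP BY date(event_time), campaign_id;
```

The same statement that an analyst would run interactively can be scheduled as a production ETL job, which is exactly the kind of reuse that conversion to another SQL dialect breaks.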

Optionality

Platform lock-in is real. In the early part of my career technologists would get locked into being a Hortonworks shop or an Oracle shop. Their CIOs would then need to negotiate massive, ever-increasing contracts.

The ideal world I imagine is one in which individual technologists are in control: if they ever feel their needs are not being met, or that costs are unreasonable, they have the ability to switch platforms.

Query Federation

Analytics should be simple. But today, data is stored in disparate locations. Transactional data lives in a wide variety of relational and non-relational databases, log data sits in distributed file systems, and operational data is kept in various time series databases. When M&As happen, the data stack gets even more disparate. Check out DoorDash's data stack to see how complicated analytics can end up getting.

Many data engineers spend more time centralizing data in a single data lake/warehouse than they spend doing their job: the data analysis. With Trino, we let people jump straight to doing what delivers business value: the data analysis itself.
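
As a rough illustration of what federation looks like in practice, a single Trino query can join data living in entirely different systems. The catalog, schema, and table names below (mysql, hive, and their tables) are hypothetical placeholders for whatever connectors a deployment has configured:

```sql
-- Hypothetical federated query: join transactional data in MySQL
-- with event logs stored on a data lake, in one statement.
SELECT
    c.customer_id,
    c.signup_country,
    count(*) AS page_views
FROM mysql.crm.customers AS c
JOIN hive.weblogs.page_views AS v
    ON v.customer_id = c.customer_id
WHERE v.view_date >= date '2022-01-01'
GROUP BY c.customer_id, c.signup_country;
```

No pipeline is needed to copy the customers table into the lake first; the connectors fetch the data at query time.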

Coming Soon: Changing The Game For Long-Running SQL ETL

Project Tardigrade Logo, courtesy of Brian Olsen

Over the last decade, data volumes have grown exponentially. The sheer volume of data forces users to run scheduled SQL ETL jobs rather than wait hours for an interactive SQL query to finish. SQL ETL is commonly used to prepare data for production systems, like Netflix's recommendation engine. It is also common for day-to-day analytics needs, like calculating Facebook's top-level metrics.

The problem with many of the existing engines on the market today is that they're likely to fail if a query runs for a long time. You can configure Airflow to retry on task failure, but that means you can't count on a consistent data landing time. Now, imagine a dating app not having the recommended match of the day because the data doesn't land until anywhere between 9am and 6pm due to engine issues.

What was so game-changing about Hive was that it provided SQL support for long-running ETL queries. Apache Spark then supported long-running ETL queries with orders-of-magnitude efficiency improvements over Hive. Coming soon, with Project Tardigrade, Trino will provide highly efficient long-running ETL queries along with the developer experience that drew people to our platform. How are we going to achieve this?

Currently, Trino relies on streaming execution and tries to keep data in memory as much as possible. This allows it to achieve high efficiency and low latency for interactive workloads. However, it also means that query execution is effectively all-or-nothing, which makes it significantly more difficult to use Trino for scheduled large-scale data transformations.

We’ve talked to many users, and here is the list of problems that we have identified as the biggest pain points:

High Memory Queries

Trino tries to keep data in memory as much as possible, which allows it to achieve high efficiency and low latency for interactive workloads. The downside is that there must always be enough memory available to hold the intermediate state of memory-intensive queries. Users often have to manually scale up their clusters to run a small number of queries that require more memory than is available. When most of the queries in a data transformation graph are not memory hungry, keeping a larger cluster running is not cost efficient. The sketch below shows the kind of query this refers to.
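
For a sense of what "intermediate state" means here, consider a hypothetical aggregation over a very large request log (table and column names are invented for illustration). The engine has to hold one aggregation entry per distinct user, including the set of distinct session IDs, in distributed memory while streaming through the input:

```sql
-- Hypothetical memory-hungry query: the per-user aggregation state
-- (one hash table entry per distinct user, plus distinct session tracking)
-- must fit in cluster memory under streaming execution.
SELECT
    user_id,
    count(DISTINCT session_id)          AS sessions,
    approx_percentile(latency_ms, 0.99) AS p99_latency
FROM hive.logs.requests
GROUP BY user_id;
```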

Resource Management

Streaming-based execution in Trino has proven to be extremely efficient for interactive workloads. However, while it achieves low latency and high efficiency, it makes resource management difficult. When the queries running on a cluster start using more resources than are available (especially memory), Trino cannot absorb the surge by gradually slowing down processing; instead, the engine has to select and kill some of the running queries. To make sure clusters don't get overloaded, users have to pay attention to the resource utilization of their queries and avoid submitting more than a cluster can handle. While this strategy has worked well for many large companies that can afford to build a complex resource management system outside of Trino, smaller companies would like the engine to handle this problem for them.

Fault Tolerance

Over the last decade, cloud infrastructure has come a long way. Hardware has become more reliable, and machines no longer crash as often as they used to. However, with the increased popularity of spot instances, the problem is now viewed from a slightly different angle: software that can tolerate failures becomes significantly more cost efficient. To provide a predictable execution platform that our users can rely on while running on unreliable hardware, we have to improve our execution primitives so that the engine can better tolerate failures and be more flexible at managing available resources.

You can find more details about the current status and our thinking by following this issue on GitHub: https://github.com/trinodb/trino/issues/9101

What Does This All Mean?

For the average software engineer, data scientist, or business analyst, SQL ETL data transformation becomes much more accessible. They can use scheduled SQL jobs to feed transformed data into production systems like ad targeting. Or they can use scheduled SQL queries to speed up long-running analyses, the most common example being top-level product metrics like Facebook's Daily Active Users.

For the average data engineer, the daily workflow becomes orders of magnitude faster, because there is no longer a need to migrate data between systems to do the analytics, or to convert ad-hoc queries from Trino to Spark for ETL.

For data leaders, you get optionality, which safeguards you against the classic issues around vendor lock-in, while also reducing your cloud spend by using a highly efficient system.

Interested in learning more?

I recently spoke about Trino ETL and Tardigrade on the Trino Community Broadcast with Brian Olsen and Manfred Moser. I look forward to telling you more about the project as new developments occur. You can also find me (arosa) on Trino Slack.

Andrii Rosa

Software Engineer at Starburst, Distributed Systems Enthusiast