I explored Presto DB for HyperLogLog

Last week, I finished reading my first whitepaper on HyperLogLog and summarized the main points in this article. I also read about Presto DB through a Meta article on HLL. Since then, I’ve been eager to try out Presto DB myself.

Where Do I Begin?

Well, the official documentation, of course! I could have jumped straight onto ChatGPT to get all the steps from scratch. However, I’ve been relying on GPT lately and decided I didn’t want that much hand-holding this time. So, I swore to myself I’d only take GPT’s help if I was truly desperate.

Quickly Covering some of the Presto Jargon

Before diving into some concepts, it’s important to understand that Presto is not a replacement for traditional SQL databases like MySQL or PostgreSQL. While it understands SQL statements and provides standard database features, it’s designed for Online Analytical Processing (OLAP) workloads, not for Online Transaction Processing (OLTP).

Server Types

  • Coordinator: This is the “brain” of the Presto installation. It’s responsible for parsing statements, planning queries, and managing worker nodes.
  • Worker: Workers fetch data from connectors and exchange intermediate data with each other. When a worker server starts, it advertises itself to the Coordinator for task execution.
  • Resource Manager: This server aggregates data from all coordinators and workers, maintaining a global view of the cluster.
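
By the way, you don’t have to take the topology on faith: both Presto and Trino expose cluster metadata through the built-in system catalog. A quick sanity check you can run once your cluster is up:

-- List every node in the cluster and whether it is the coordinator.
SELECT node_id, http_uri, node_version, coordinator, state
FROM system.runtime.nodes;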

Query Execution Model

  • Statement: Just your standard SQL text. Presto uses ANSI-compatible SQL statements.
  • Query: More than just SQL text, a query includes configurations and components like stages, tasks, splits, etc., working together to produce a result. When a statement is parsed, it’s converted into a distributed query plan.
  • Stage: A query’s execution is broken down into stages, forming a tree-like structure. The root stage aggregates results from other stages.
  • Task: Stages themselves don’t execute on Worker servers. Instead, a stage is implemented as a series of tasks, which are distributed across the workers and actually execute there.
  • Split: These are sections of a larger dataset. Tasks operate on splits, and the Coordinator tracks which splits are processed by each task and which machines are executing the tasks.
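
You can see this stage breakdown for yourself: EXPLAIN with TYPE DISTRIBUTED prints the distributed plan, one fragment per stage. A sketch, assuming the mysql.connections.connections table we create later in this post:

-- Print the distributed query plan; each Fragment corresponds to a stage.
EXPLAIN (TYPE DISTRIBUTED)
SELECT count(*) FROM mysql.connections.connections;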

Data Sources

  • Catalog: A catalog contains schemas and references a data source via a connector (MySQL, JMX, Hive, etc.). When we run a query in Presto, we run it against one or more catalogs.
  • Schema: A schema is a way to organize tables. Together, a catalog and schema define a set of tables that can be queried.
  • Table: This is the same as a table in any relational database.
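
Putting the three together, every table in Presto is addressed by a fully-qualified catalog.schema.table name. For example, with the mysql catalog and connections schema we’ll set up below:

-- Fully-qualified: catalog.schema.table
SELECT * FROM mysql.connections.connections LIMIT 5;

-- Or set a default catalog and schema for the session first:
USE mysql.connections;
SELECT * FROM connections LIMIT 5;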

Hands-On!

I’ve outlined the installation steps in my repository. Notice that we haven’t created a “Resource Manager” server, as our setup is small-scale and doesn’t require one; we just need Coordinator and Worker nodes. We’ll be using Trino, which is essentially Presto: the project forked into PrestoDB and PrestoSQL, and PrestoSQL was renamed Trino in 2020. Both share the same core design, so anywhere I mention Trino from here on out, remember we’re still dealing with Presto.
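
For reference, the one piece of setup that matters most for what follows is the catalog file that tells Trino how to reach MySQL. A minimal sketch (the host and credentials here are assumptions; adjust them to your own setup):

# etc/catalog/mysql.properties -- registers MySQL as the "mysql" catalog
connector.name=mysql
connection-url=jdbc:mysql://localhost:3306
connection-user=trino
connection-password=trino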

Assuming you’ve completed the installation, let’s create some fake data.

Creating Fake Data in JS

const mysql = require("mysql2/promise");
const faker = require("faker"); // legacy faker API; faker.datatype.number() defaults to the range 0-99999

async function main() {
  try {
    // Connect to MySQL
    const mysqlConnection = await mysql.createConnection({
      host: "localhost",
      port: 3306,
      user: "trino", // update to use your credentials
      password: "trino",
      database: "connections"
    });

    console.log("Connected to MySQL");

    const createTable = `CREATE TABLE IF NOT EXISTS connections (
      id INT AUTO_INCREMENT PRIMARY KEY,
      user_id INT,
      timestamp DATETIME
    );`;

    await mysqlConnection.execute(createTable);
    console.log("Table created");

    // Generate 500,000 fake connection events, tracking the true
    // number of distinct users as we go.
    const distinctUsers = new Set();
    const values = [];
    for (let i = 0; i < 500000; i++) {
      const userId = faker.datatype.number();
      values.push([userId, faker.date.recent()]);
      distinctUsers.add(userId);
    }

    // Bulk insert; mysql2 expands the single ? into a multi-row VALUES list.
    const sql = "INSERT INTO connections (user_id, timestamp) VALUES ?";
    await mysqlConnection.query(sql, [values]);
    console.log("Data inserted");

    console.log("Number of distinct users:", distinctUsers.size);

    await mysqlConnection.end();
  } catch (err) {
    console.error("Error:", err.message);
  }
}

main();

Tip: I ran this script multiple times to insert more rows. Since faker.datatype.number() draws user IDs from a fixed default range (0 to 99,999), the number of distinct users tops out at 100,000 no matter how many rows you add.

HyperLogLog in Presto

Presto’s approx_distinct function returns the approximate number of distinct values, computed under the hood with the HyperLogLog data structure (the companion approx_set function returns the HyperLogLog sketch itself). Let’s use it.

  1. Fire up a terminal in your Coordinator container.
  2. Start a Trino session by running trino in the terminal.

Testing MySQL Connection from Trino CLI

  • List the available catalogs and verify that mysql is among them:
SHOW CATALOGS;
  • List the schemas in the mysql catalog and verify that connections is among them:
SHOW SCHEMAS FROM mysql;
  • List the tables in the connections schema and verify that connections is among them:
SHOW TABLES FROM mysql.connections;

Querying the approximate count

Now for the moment we’ve been waiting for:

SELECT approx_distinct(user_id) FROM mysql.connections.connections;

While the SQL query is straightforward, I want to quickly highlight the fully-qualified name: mysql is the “catalog” we added while setting up Presto, connections (the middle part) is the “schema”, and the final connections is the “table”.
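
Relatedly, if you want the HyperLogLog sketch itself rather than just the final number, Presto exposes it through approx_set. A small sketch of how that looks, assuming the same table; cardinality() reads the estimate out of the sketch, and sketches from different groups can later be combined with the merge() aggregate:

-- Build an HLL sketch over user_id, then read the estimate out of it.
-- Equivalent in spirit to approx_distinct(user_id).
SELECT cardinality(approx_set(user_id))
FROM mysql.connections.connections;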

I also ran a query to get the distinct count so that we could calculate the error.
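
For reference, a straightforward way to get the exact count looks like this (it’s much slower than approx_distinct, since it has to track every distinct value):

-- Exact distinct count, for comparison with approx_distinct.
SELECT count(DISTINCT user_id) FROM mysql.connections.connections;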

[Screenshot: queries fired in the Trino CLI]

Error Calculation

Given

  • HLL Estimate: 102,875
  • Actual Count: 100,000
  • Relative Error (E): 2.3% = 0.023 (the default standard error of approx_distinct, per the Presto documentation)

Expected Error Range

  • Lower Bound: Actual Count - (Actual Count x E) = 100,000 - 2,300 = 97,700
  • Upper Bound: Actual Count + (Actual Count x E) = 100,000 + 2,300 = 102,300

Comparing the HLL Estimate

  • Expected Range for 2.3% Error: 97,700 to 102,300
  • HLL Estimate: 102,875

The HLL estimate is slightly above the error boundary: the observed relative error is (102,875 - 100,000) / 100,000 = 2.875%, against an expected 2.3%. While this may seem problematic, we need to consider other factors:

  • Randomness of Data: The fake user IDs are drawn from faker’s small default range (0 to 99,999), so the data has far less randomness than real-world IDs would.
  • Probabilistic Nature of HLL: The 2.3% figure is a standard error, not a hard bound, so individual estimates will sometimes land outside it. Presto is designed to work with data at scale (think petabytes); given that, an estimate landing just outside the range is still a win!

Conclusion

In this experiment, we saw how Presto and HyperLogLog can help us handle large amounts of data efficiently. Presto breaks a big query down into stages and tasks, making it easier to distribute, while HyperLogLog gives us fast estimates of the number of unique values. Even though our result fell slightly outside the expected error range, it was still very close, which is exactly the trade-off these tools promise for big-data work.
