Federated analytics with Starburst: unlock the value of decentralized data

Published in Data Reply IT | DataTech, Jul 13, 2023

In recent years, advances in data processing technologies have made it possible to extract meaningful insights, and therefore business value, from the ever-growing volume of data that companies collect every day.

There are, however, a number of factors that can hamper an organization’s ability to extract the hidden value of its data. Among them, data fragmentation and heterogeneity play a central role. In many companies, for historical and organizational reasons, data are scattered across different systems, managed by different departments, and stored using different formats and technologies. These “data silos” make it extremely difficult to build a unified view of all the information of interest, severely limiting the ability to perform proper data analytics.

Many companies try to tackle this challenge by moving huge quantities of data to centralized stores, using complex and expensive copy processes. While this approach works in theory, in practice it is far from perfect. Every time data are copied, new problems arise: consistency must be maintained across the copies, and both data governance effort and storage costs are duplicated. What if there were a better solution?

Starburst and the “single point of access” approach

Starburst (https://www.starburst.io/) offers a different approach to this problem. It is an enterprise-level analytics engine that, instead of requiring you to move data into a single source of truth, aims to provide a single point of access to all of your data, without the need to centralize information. You simply access the data wherever they reside, regardless of their format, size, and location, using a unified SQL interface.

Starburst is based on a hardened, production-tested version of the open-source Trino query engine, focusing on query federation at scale. Starburst extends Trino by providing enterprise-grade performance, connectivity, security, management, and support. In particular, compared with the open-source version of Trino, it provides the following additional features and enhancements:

Insights UI, which allows you to monitor your cluster and access your data directly from a web-based interface;
Data products, which allow you to implement the data mesh paradigm to govern access to your data;
Connectors for additional data sources;
Enhanced performance, using proprietary optimizations such as Warp Speed;
Enhanced security, with integrated role-based access control (RBAC) at the table, column, and row level, end-to-end data encryption, and single sign-on (SSO) with your identity provider (IdP).

Starburst’s offerings include both a fully managed, cloud-based, software-as-a-service solution (Starburst Galaxy) and a self-managed distribution that can be deployed either on-premises or in the cloud (Starburst Enterprise). Starburst also supports mixed or multi-cloud environments thanks to Stargate, a Starburst-specific connector that lets you link a local catalog on your cluster to a catalog on a remote cluster. Using Stargate, you can access data connected to a remote cluster (possibly running on-premises, or on another cloud provider) as if they were attached directly to your own cluster.

Get started with Starburst Enterprise

Starburst Enterprise Platform (SEP) is Starburst’s self-managed offering. It is built on Kubernetes, so you can deploy it on any cloud-based or on-premises system that runs Kubernetes. In our examples, we will refer to a demo deployment on AWS’s Elastic Kubernetes Service (EKS).

Before starting to explore data with SEP, we need to configure a few catalogs that point to the data we want to query. Starburst offers 50+ connectors for different object and non-object storage layers (including relational databases via JDBC and a few NoSQL databases) that you can use to connect to your data. Catalogs are set up declaratively, so you don’t need to implement anything yourself.

An example configuration for two catalogs: an RDS-backed PostgreSQL instance, and the Glue Data Catalog
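As a rough illustration of what such a declarative catalog definition can look like, the sketch below renders two catalogs as key=value property lines. The property keys follow the open-source Trino PostgreSQL and Hive connector documentation; the host name, database, region, and environment variable names are placeholders, and the exact way these properties are supplied to an SEP deployment (for example through Helm values) is not shown here and may differ.

```python
def render_catalog(properties: dict) -> str:
    """Render a catalog definition as key=value lines (properties-file style)."""
    return "\n".join(f"{key}={value}" for key, value in properties.items())


# Hypothetical PostgreSQL catalog backed by an RDS instance.
# Host, database and environment variable names are placeholders.
postgres_catalog = {
    "connector.name": "postgresql",
    "connection-url": "jdbc:postgresql://example-rds-host:5432/sales",
    "connection-user": "${ENV:PG_USER}",
    "connection-password": "${ENV:PG_PASSWORD}",
}

# Hypothetical Hive catalog that uses the AWS Glue Data Catalog as metastore.
# The region is a placeholder.
glue_catalog = {
    "connector.name": "hive",
    "hive.metastore": "glue",
    "hive.metastore.glue.region": "eu-west-1",
}

catalogs = {"postgres": postgres_catalog, "glue": glue_catalog}

for name, props in catalogs.items():
    print(f"# catalog: {name}")
    print(render_catalog(props))
    print()
```

The catalog names chosen here (postgres and glue) become the first part of every table reference in SQL, so it pays to keep them short and descriptive.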

After you have created your SEP cluster and configured your connectors, you can access the cluster’s Insights UI. This allows you to check the state of your cluster, run interactive queries, manage data products, and configure RBAC to regulate data access.

Homepage of Starburst Enterprise’s Insights UI

In the query editor, you can explore and query all configured catalogs using standard ANSI SQL. You can query data within a single catalog or across different catalogs, leveraging Starburst’s query federation capabilities.
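To make the idea of a cross-catalog query concrete, here is a minimal sketch that assembles a federated SQL statement in Python. Tables are addressed with fully qualified catalog.schema.table names, which is how Trino-based engines reference objects. The catalog, schema, table, and column names are all hypothetical.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified catalog.schema.table name."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical objects: customers live in PostgreSQL, orders in a Glue-backed catalog.
customers = qualified("postgres", "public", "customers")
orders = qualified("glue", "sales", "orders")

# A single statement joins data that physically resides in two different systems.
federated_query = f"""
SELECT c.country, SUM(o.amount) AS total_amount
FROM {customers} AS c
JOIN {orders} AS o ON o.customer_id = c.id
GROUP BY c.country
ORDER BY total_amount DESC
""".strip()

print(federated_query)
```

The resulting text can be pasted into the query editor or submitted through any client; the only thing that marks the query as federated is that the two table names start with different catalogs.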

If Starburst is granted sufficient permissions on the catalogs, you can use the query editor not only for read-only queries (DQL), but also for DML and DDL operations as needed. For example, you can create and populate schemas and tables on supported catalogs directly from Starburst.

You can also create both views and materialized views. The difference is that materialized views persist their results to a data store, so they do not need to be recomputed each time they are accessed. Materialized views are refreshed periodically, optionally in an incremental fashion.

Creating a materialized view in Starburst
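The difference between the two kinds of view can be reproduced on a small scale with Python's built-in sqlite3 module. This is only an analogy: SQLite has no materialized views, so the sketch emulates one with a table that is rebuilt on refresh, whereas Trino SQL provides the CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW statements for this purpose. Table and view names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# A plain view stores only the query text; it is re-evaluated on every access.
conn.execute("CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM orders")


def refresh_materialized(connection):
    """Emulate a refresh: re-run the query and persist its result as a table."""
    connection.execute("DROP TABLE IF EXISTS mv_total")
    connection.execute(
        "CREATE TABLE mv_total AS SELECT SUM(amount) AS total FROM orders"
    )


refresh_materialized(conn)

# New data arrives after the last refresh.
conn.execute("INSERT INTO orders VALUES (3, 30.0)")

view_total = conn.execute("SELECT total FROM v_total").fetchone()[0]
stale_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]

refresh_materialized(conn)
fresh_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]

print(view_total, stale_total, fresh_total)  # 60.0 30.0 60.0
```

The view always reflects the latest data at the cost of recomputation, while the persisted copy is cheap to read but only as fresh as its last refresh.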

Role-based access control and data filters

Starburst provides a built-in role-based access control (RBAC) authorization scheme you can use to define permissions for different users and groups. Using RBAC, you can define which users or groups are allowed to perform which operations on which data objects.

RBAC is based on roles, which bundle together one or more privileges that allow users to perform certain actions on specific entities. Users are assigned one or more roles and may switch among their assigned roles to obtain the privileges associated with each role.

Starburst’s roles and privileges interface

In Starburst, you manage roles using the dedicated page in the Insights UI. From there, you can assign roles to users and privileges to roles. Privileges can be granted in a very granular fashion: you may define permissions at the catalog, schema, table, and column level. For each grant, you can define the desired level of access (e.g., select, insert, or delete), and optionally column masks and row-level filters.
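The relationship between users, roles, and privileges described above can be sketched as a small in-memory model. This is a conceptual illustration only, not Starburst's implementation or API: the class names, the way entities are identified, and the rule that only the active role's privileges apply are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Privilege:
    action: str  # e.g. "select", "insert", "delete"
    entity: str  # e.g. "postgres.public.customers" (hypothetical name)


@dataclass
class Role:
    name: str
    privileges: Set[Privilege] = field(default_factory=set)


@dataclass
class User:
    name: str
    roles: Dict[str, Role] = field(default_factory=dict)
    active_role: Optional[str] = None

    def assign(self, role: Role) -> None:
        """Assign a role to the user."""
        self.roles[role.name] = role

    def switch_role(self, role_name: str) -> None:
        """Activate one of the roles the user has been assigned."""
        if role_name not in self.roles:
            raise PermissionError(f"{self.name} is not assigned role {role_name}")
        self.active_role = role_name

    def can(self, action: str, entity: str) -> bool:
        """Check the privileges of the currently active role only."""
        if self.active_role is None:
            return False
        return Privilege(action, entity) in self.roles[self.active_role].privileges


table = "postgres.public.customers"
analyst = Role("analyst", {Privilege("select", table)})
loader = Role("loader", {Privilege("insert", table)})

alice = User("alice")
alice.assign(analyst)
alice.assign(loader)

alice.switch_role("analyst")
print(alice.can("select", table), alice.can("insert", table))  # True False

alice.switch_role("loader")
print(alice.can("insert", table))  # True
```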

Column masks allow you to apply a masking operation to specific columns, so that your users do not have direct access to sensitive data in your tables. You may use one of the built-in masking schemes (e.g., string hashing), or define your own using custom SQL expressions. Row-level filters, expressed as a WHERE clause, allow you to exclude the records that you don’t want your users to access. These two features further extend the capabilities of Starburst’s RBAC, removing the need to create ad-hoc views for classes of users requiring different levels of access.
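The effect of a column mask combined with a row-level filter can be imitated in a few lines of Python. The sketch below is an analogy, not Starburst's mechanism: in Starburst both are declared as part of a grant and expressed in SQL, while here they are plain Python callables. The sample rows, the SHA-256 hashing mask, and the choice to evaluate the filter on unmasked values are assumptions for illustration.

```python
import hashlib

# Hypothetical sample data.
rows = [
    {"id": 1, "email": "anna@example.com", "country": "IT"},
    {"id": 2, "email": "bob@example.com", "country": "US"},
    {"id": 3, "email": "carla@example.com", "country": "IT"},
]


def hash_mask(value: str) -> str:
    """Replace a string with its SHA-256 hex digest (one possible hashing mask)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def apply_policy(rows, column_masks, row_filter):
    """Return the rows a user may see: filtered first, then masked."""
    visible = []
    for row in rows:
        if not row_filter(row):
            continue  # the row-level filter hides this record entirely
        masked = dict(row)  # copy, so the underlying data stays untouched
        for column, mask in column_masks.items():
            masked[column] = mask(masked[column])
        visible.append(masked)
    return visible


# Equivalent in spirit to the filter "WHERE country = 'IT'" plus a hash mask on email.
result = apply_policy(
    rows,
    column_masks={"email": hash_mask},
    row_filter=lambda row: row["country"] == "IT",
)

for row in result:
    print(row["id"], row["country"], row["email"][:12] + "...")
```

Because the policy is applied at query time, the same table can serve users with different levels of access without maintaining a separate view for each of them.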

If the built-in RBAC does not meet your requirements, you may also configure Starburst to use other access control mechanisms, such as Apache Ranger.

Configuring and accessing data products

In addition to the traditional approach to data organization, based on the physical location of objects within the different catalogs, Starburst allows you to structure your data using a data mesh-oriented approach centered around data products. Data products are collections of curated data assets within your organization. Each one maps directly to a schema composed of one or more datasets, which are either views or materialized views.

Each data product is characterized by a title, a description, a data owner, a reference domain, a set of associated tags, and optionally some usage examples that tell users how the datasets should be used. For each individual dataset, you may also provide column-level documentation and produce a preview of the data it contains.
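The metadata listed above maps naturally onto a small data structure. The following sketch models it with Python dataclasses purely for illustration: the class and field names are hypothetical and do not correspond to any Starburst API, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Datasets in a data product are either views or materialized views.
ALLOWED_KINDS = {"view", "materialized view"}


@dataclass
class Dataset:
    name: str
    kind: str
    column_docs: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"unsupported dataset kind: {self.kind!r}")


@dataclass
class DataProduct:
    title: str
    description: str
    owner: str
    domain: str
    tags: List[str] = field(default_factory=list)
    usage_examples: List[str] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)


# Hypothetical data product with a single documented dataset.
product = DataProduct(
    title="Customer orders",
    description="Order totals aggregated by customer country.",
    owner="sales-data-team",
    domain="sales",
    tags=["orders", "customers"],
    usage_examples=["SELECT * FROM orders_by_country LIMIT 10"],
    datasets=[
        Dataset(
            name="orders_by_country",
            kind="materialized view",
            column_docs={
                "country": "Customer country code",
                "total_amount": "Sum of order amounts",
            },
        )
    ],
)

print(product.title, [d.name for d in product.datasets])
```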

You may limit access to a data product and its underlying datasets to specific users or groups using RBAC, as if they were tables in a physical schema. If your user is granted access to a given data product, you can query its datasets directly from the query editor, possibly in federation with other data objects registered in your cluster.

Connecting Starburst to third-party applications

Aside from the Insights UI, you can connect to a Starburst cluster from a number of clients. In this way, you can easily integrate Starburst as the query layer for your existing processes, making it the single point of access to all of your data.

Some of the most notable client platforms that Starburst can be connected to include:

The Trino command-line interface;
Python scripts using the dedicated library;
BI tools such as Microsoft Power BI and Tableau Desktop;
Data transformation tools such as dbt;
Data governance tools such as Collibra.

In addition to native integrations, Starburst exposes JDBC and ODBC interfaces that you can use to connect to the cluster from virtually any JDBC- or ODBC-compatible application. Regardless of how they connect to the cluster, client platforms can take full advantage of Starburst’s capabilities, including federated data access, access control rules and data masking, and data products.
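For the Python option in the list above, a connection might look like the following sketch, which uses the open-source trino package (trino-python-client) and its DB-API interface. The host, port, user, and catalog values are placeholders, and the authentication setup is an assumption: a real cluster may require a different mechanism. The client library is imported inside the function, so the helper that builds the connection settings can be used and inspected without the package installed.

```python
def connection_kwargs(host, port, user, catalog=None, schema=None, use_tls=True):
    """Build keyword arguments for trino.dbapi.connect (all values are placeholders)."""
    return {
        "host": host,
        "port": port,
        "user": user,
        "catalog": catalog,
        "schema": schema,
        "http_scheme": "https" if use_tls else "http",
    }


def run_query(sql, password=None, **kwargs):
    """Run a query against the cluster and return all rows.

    Requires the third-party package: pip install trino
    """
    from trino.auth import BasicAuthentication
    from trino.dbapi import connect

    if password is not None:
        # Basic authentication is one option; adapt this to your cluster's setup.
        kwargs["auth"] = BasicAuthentication(kwargs["user"], password)
    conn = connect(**kwargs)
    cursor = conn.cursor()
    cursor.execute(sql)
    return cursor.fetchall()


# Hypothetical cluster coordinates.
settings = connection_kwargs(
    host="starburst.example.com", port=443, user="alice", catalog="postgres"
)
print(settings)

# Example call (not executed here, since it needs a reachable cluster):
# rows = run_query("SELECT 1", password="...", **settings)
```

Queries submitted this way go through the same access control rules, masks, and filters as queries run from the Insights UI.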

Conclusion

In this blog post, we presented Starburst, an integrated solution to enable analytics on distributed data while addressing the problems deriving from data fragmentation and heterogeneity.

Starburst’s single-point-of-access approach allows you to query your data wherever they reside, without requiring complex and expensive data movement procedures. By exploiting the power of query federation, Starburst allows you to unlock the value of your decentralized data, extracting analytic insights in minutes by combining and integrating all of your data sources.