Supercharging Apache Superset

How Airbnb customized Superset for business intelligence at scale

Erik Ritter
Feb 9 · 9 min read

By: Erik Ritter, Grace Guo, Jesse Yang, John Bodley, and Zuzana Vejrazkova

Introduction

Superset is an open-source data exploration and visualization platform designed to be visual, intuitive, and interactive. It enables users to analyze data using its SQL editor and to easily build charts and dashboards. It began at Airbnb in 2015 as a hackathon project, was open sourced in 2016, and joined the Apache Incubator program in May 2017. After nearly four years of incubation, Superset was announced as a top-level project by the Apache Software Foundation on January 21, 2021. This coincided with the official release of Superset 1.0, the first major version and a turning point for the project.

As Superset has progressed, Airbnb has remained a consistent contributor to the project. In the past, we discussed how this journey started and introduced new product features. This post covers high-level technical details on how we scaled Superset into a mature BI tool that supports enterprise use cases, and how working with the community enabled us to build custom integrations with our other enterprise tools and systems.

By the Numbers

  • 2,000 users
  • 50,000 SQL Lab queries
  • 6,000 dashboard views and 125,000 chart views†

† Views are defined as unique (day, user, entity) tuples, and chart views encompass both dashboard and explorer surfaces.
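As a minimal sketch of that definition, the snippet below counts views by deduplicating (day, user, entity) tuples. The event records and entity names are hypothetical and only illustrate the counting rule.

```python
from datetime import date

# Hypothetical raw view events as (day, user, entity) tuples.
events = [
    (date(2021, 2, 1), "alice", "dashboard:42"),
    (date(2021, 2, 1), "alice", "dashboard:42"),  # same day, user, and entity: counted once
    (date(2021, 2, 2), "alice", "dashboard:42"),  # a new day counts as a new view
    (date(2021, 2, 1), "bob", "dashboard:42"),
    (date(2021, 2, 1), "bob", "chart:7"),
]


def count_views(events):
    """Count views as the number of unique (day, user, entity) tuples."""
    return len(set(events))


views = count_views(events)
```

Under this rule, reloading the same dashboard many times in one day adds nothing to the count, while returning the next day does.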

Our ecosystem now comprises more than 100,000 tables and virtual datasets backing over 200,000 charts and 14,000 dashboards.

All this analysis, slicing and dicing, and decision making was performed by users across many job functions at Airbnb; over 25% of the company uses Superset on a weekly basis.

How We Scaled Superset

Cache Warmup Job

Domain Sharding

Database Engine Load Management

  • Route queries based on importance: We route queries to different clusters (using the DB_CONNECTION_MUTATOR configuration) to avoid resource contention; optimized dashboard queries are sent to one cluster while ad hoc SQL Lab and explore queries are sent to another.
  • Limit concurrency for each user: We limit each user to running only three queries simultaneously on our database engine. While this may seem like a small number, it’s actually more than sufficient given the cache warmup job previously mentioned. In the ideal case, very few queries actually make their way to our database engine, and the cache efficiently returns the results instead of rerunning the query.
  • Restrict large queries: We limit the size of queries that can be run to a certain memory size or partition count. This encourages users to create efficient queries of reasonable complexity.
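The routing in the first bullet can be sketched as a DB_CONNECTION_MUTATOR function in the Superset config. The signature below follows the commented example in Superset's default configuration as we understand it and should be checked against the installed version. The QuerySource and ConnectionURL classes are stand-ins so the sketch runs on its own, and the cluster hostnames are hypothetical.

```python
from dataclasses import dataclass, replace
from enum import Enum


class QuerySource(Enum):
    """Stand-in for the query-source value Superset passes to the mutator."""
    CHART = "chart"
    DASHBOARD = "dashboard"
    SQL_LAB = "sql_lab"


@dataclass(frozen=True)
class ConnectionURL:
    """Minimal stand-in for a SQLAlchemy connection URL."""
    host: str
    database: str


# Hypothetical cluster hostnames.
DASHBOARD_CLUSTER = "dashboards.db.example.internal"
ADHOC_CLUSTER = "adhoc.db.example.internal"


def DB_CONNECTION_MUTATOR(uri, params, username, security_manager, source):
    """Send dashboard queries to one cluster and all other queries to another."""
    host = DASHBOARD_CLUSTER if source == QuerySource.DASHBOARD else ADHOC_CLUSTER
    return replace(uri, host=host), params


base = ConnectionURL(host="default.db.example.internal", database="analytics")
dash_uri, _ = DB_CONNECTION_MUTATOR(base, {}, "alice", None, QuerySource.DASHBOARD)
lab_uri, _ = DB_CONNECTION_MUTATOR(base, {}, "alice", None, QuerySource.SQL_LAB)
```

Keeping the decision in a single function means the routing policy lives in configuration rather than in each chart or query.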

Where Superset Really Shines

As an active contributor to the Apache Superset project, we were able to adapt Superset for our business needs via:

  • Helping to define the open-source roadmap
  • Proposing and implementing open-source features
  • Creating custom, in-house overrides or mutations. The Superset backend is written in Python, which supports easy augmentation and customization via monkey patching.

When evaluating whether to build an internal solution or buy something off the shelf, one is faced with the 80/20 conundrum. An off-the-shelf solution will likely get you 80% of what you need, but the final 20% may be fraught with insurmountable challenges. Although Superset currently lacks some of the features and polish that other SaaS solutions offer, it makes up for these deficits in spades through the level of potential customization it provides.

Below are a few projects where Airbnb has leveraged or augmented Superset to enhance — either by streamlining or enriching — the user experience.

Metric Explorer

When designing Metric Explorer, we were faced with a dilemma. We wanted to provide a highly curated and vetted experience for slicing and dicing metrics that leveraged rich metadata and surfaced business context, while providing sufficient guardrails. However, we did not want to build another dashboarding tool and reimplement large swaths of Superset’s features.

We decided to solve this dilemma by factoring out the frontend foundation of Superset visualizations into the @superset-ui NPM packages. Not only did this solve the Metric Explorer use case, it also enabled any Superset installation to build other custom data applications that leverage the Superset backend. Figures 1 and 2 are Metric Explorer screenshots illustrating the Superset integrations.

The Metric Explorer UI, showing highlighted metrics in cards with line charts, and other metrics in a table
Figure 1: Metric Explorer illustrating a collection of metrics powered by @superset-ui.
A single metric view in Metric Explorer, with Y/Y comparisons on the line chart and metadata available in the side pane
Figure 2: Metric Explorer illustrating a single metric where the header and left-hand panel are powered by @superset-ui. Metric Explorer purposefully has limited slice-and-dice functionality, so a link to Superset is also provided for more advanced analytics.

Security Manager and Data Access Policy Integration

We leveraged Superset’s custom security manager functionality, via the CUSTOM_SECURITY_MANAGER configuration option, along with some RESTful API and Flask-AppBuilder overrides. Using this approach, we were able to make Superset adhere to the data access policy enforced by Airbnb’s internal security controller.

We wanted to further enrich the user experience by integrating the access request flow directly within Superset. This was achieved by adding frontend customizations, in conjunction with the custom security manager, which prompted users through a flow whenever we detected a priori (i.e., before the actual query was run) that they did not have the relevant permissions to access the underlying data. Figures 3–5 illustrate the data access policy integration within Superset.

By surfacing access requests in-place, instead of having users decipher cryptic database errors or directing them elsewhere, we were able to preserve the user flow and simultaneously provide the approvers with the necessary context about the request. This deeply integrated experience would most likely be very difficult to provide with other, less customizable tools.
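A rough sketch of the idea, under stated assumptions: a security manager subclass delegates access decisions to a policy service and reports, before any query runs, whether an access request should be offered. In a real deployment the class would subclass Superset's own SupersetSecurityManager and be constructed by the framework; the stand-in base class, the PolicyClient, the precheck method, and the constructor arguments here are all hypothetical and are not Airbnb's implementation.

```python
class SupersetSecurityManager:
    """Stand-in for Superset's security manager base class (keeps the sketch runnable)."""

    def can_access_datasource(self, datasource):
        return True


class PolicyClient:
    """Hypothetical client for an internal data access policy service."""

    def __init__(self, grants):
        self._grants = grants  # maps username -> set of datasource names

    def is_allowed(self, username, datasource):
        return datasource in self._grants.get(username, set())


class PolicySecurityManager(SupersetSecurityManager):
    """Delegates datasource access decisions to the policy service."""

    def __init__(self, policy, get_username):
        self._policy = policy
        self._get_username = get_username  # callable returning the current user

    def can_access_datasource(self, datasource):
        return self._policy.is_allowed(self._get_username(), datasource)

    def precheck(self, datasource):
        """Hypothetical a priori check: decide before any query is run."""
        if self.can_access_datasource(datasource):
            return {"allowed": True}
        # A frontend could use this payload to open an access request flow.
        return {"allowed": False, "action": "request_access", "datasource": datasource}


# In superset_config.py the class would be registered with:
# CUSTOM_SECURITY_MANAGER = PolicySecurityManager

policy = PolicyClient({"alice": {"bookings"}})
alice = PolicySecurityManager(policy, lambda: "alice")
bob = PolicySecurityManager(policy, lambda: "bob")
```

Returning a structured result instead of raising a database error is what lets the UI offer the request flow in place.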

The user is blocked from viewing data in Superset Explore view because they do not have permission to access this datasource.
Figure 3: A user is denied access if they do not have the relevant permissions to access either a datasource or a metric. Access can be requested in place (Figure 4).
A modal where the user can request access to protected data, with information about who will approve their request
Figure 4: The modal for requesting access to a restricted datasource or metric. In addition to being provided a reason, the approvers are also informed of the context for the request — i.e., which chart or dashboard the user is trying to access.
The user is now waiting for approvers to give them access to the datasource that backs this chart.
Figure 5: If the user is denied access but has a pending request, the state of the request (which may require multiple approvers) is shown.

Metrics for the Masses

The single Minerva virtual datasource (Figure 6) now encompasses thousands of metrics and dimensions. Since most metrics and dimensions are generally scoped to products or projects by construction, the vast majority of metric-dimension combinations are not viable.

To spare users the frustration of accidentally selecting an infeasible combination of metrics and dimensions, we made an open-source contribution to Superset’s chart controls, introducing a hook to asynchronously update control props based on user inputs. This allows us to filter out invalid metrics and dimensions according to what users have selected. Figures 6 and 7 illustrate the behavior.

The popover for viewing metrics from Minerva in Superset. 1000 dimensions for grouping by are available.
Figure 6: The Superset query panel and metric popover for the Minerva virtual datasource containing an immense number of metrics and dimensions. Given the vastness of the datasource, most metric-dimension combinations would be invalid without adding custom logic to determine the feasible subset (Figure 7). Numbers are shown for illustrative purposes only.
After selecting a metric, only 100 dimensions are available for grouping by, followed by only 50 after selecting another dim.
Figure 7: The Superset query panel. Selecting the bookings metric reduces the viable set of dimensions from 1,000 to 100. Grouping by the dim_origin_city dimension reduces the viable set further to around 50, because the increased specificity narrows the set of feasible Apache Druid datasources. Numbers are shown for illustrative purposes only.
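The narrowing shown in Figures 6 and 7 can be sketched as set filtering over the underlying datasources. This is only an illustration of the filtering logic in Python, not the frontend hook itself. The catalog, its datasource names, and most metric and dimension names are hypothetical, and the rule that a selection is feasible when at least one datasource supports every selected metric and dimension is an assumption.

```python
# Hypothetical catalog: the metrics and dimensions each underlying datasource supports.
DATASOURCES = {
    "bookings_by_origin": {
        "metrics": {"bookings"},
        "dimensions": {"dim_origin_city", "dim_origin_country", "dim_device"},
    },
    "bookings_by_listing": {
        "metrics": {"bookings", "nights"},
        "dimensions": {"dim_listing_type", "dim_device"},
    },
    "searches": {
        "metrics": {"searches"},
        "dimensions": {"dim_origin_city", "dim_search_filter"},
    },
}


def feasible_datasources(metrics, dimensions):
    """Datasources that support every selected metric and dimension."""
    return {
        name
        for name, spec in DATASOURCES.items()
        if set(metrics) <= spec["metrics"] and set(dimensions) <= spec["dimensions"]
    }


def viable_dimensions(metrics, dimensions):
    """Dimensions still selectable: the union over all feasible datasources."""
    viable = set()
    for name in feasible_datasources(metrics, dimensions):
        viable |= DATASOURCES[name]["dimensions"]
    return viable


all_dims = viable_dimensions([], [])
after_metric = viable_dimensions(["bookings"], [])
after_group_by = viable_dimensions(["bookings"], ["dim_origin_city"])
```

Each additional selection can only shrink the set of feasible datasources, so the list of viable dimensions narrows as the user refines the query.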

A traditional BI tool would likely either be unable to handle data at this scale or suffer from severe usability issues, since it would not be apparent to the user which dimensions can be used to filter or group by for a specific set of metrics.

Conclusion

In this post, we focused on how we evolved Superset for Airbnb’s large-scale needs, but other companies have leveraged Superset in different ways as well. You can learn about some of these here:

Acknowledgments

Apache Superset, Apache Druid, Apache Airflow, Superset, Druid, Airflow, Apache, and the Apache Superset logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.

* Redis is a trademark of Redis Labs Ltd. Any rights therein are reserved to Redis Labs Ltd. Any use by Airbnb is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and Airbnb.

All trademarks, service marks, company names and product names are the property of their respective owners. Any use of these is for identification purposes only and does not imply sponsorship or endorsement.

The Airbnb Tech Blog
