Supercharging Apache Superset
How Airbnb customized Superset for business intelligence at scale
At Airbnb, many employees rely on data every day to do their jobs. While several different tools are used for analysis, at the core of Airbnb’s self-serve business intelligence (BI) solution is Apache Superset™ (“Superset”).
Superset is an open-source data exploration and visualization platform designed to be visual, intuitive, and interactive. It enables users to analyze data using its SQL editor, and easily build charts and dashboards. It began at Airbnb in 2015 as a hackathon project, was open sourced in 2016, and joined the Apache Incubator program in May 2017. After nearly four years of incubating at Apache, on January 21st, 2021, the Apache Software Foundation announced Superset as a top-level project. This coincided with the official release of Superset 1.0, the first major version and a turning point for the project.
As Superset has progressed, Airbnb has remained a consistent contributor to the project. In the past, we discussed how this journey started and the introduction of new product features. This post covers high-level technical details on how we scaled Superset into a mature BI tool that supports enterprise use cases, and how working with the community enabled us to build custom integrations with our other enterprise tools and systems.
By the Numbers
Airbnb operates the longest-running Superset environment and has spent the last five years working closely with open-source contributors to build a product that scales and grows with the company. This has culminated in the ability to enable staggering amounts of data-driven intelligence. Each week, Superset at Airbnb handles approximately:
- 2,000 users
- 50,000 SQL Lab queries
- 6,000 dashboard views and 125,000 chart views†
† Views are defined as unique (day, user, entity) tuples and chart views encompass both dashboard and explorer surfaces.
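To make the footnote's definition concrete, here is a minimal sketch of counting views as unique (day, user, entity) tuples. The event records and entity labels are hypothetical and are not Airbnb's actual logging schema.

```python
from datetime import date


def count_views(events):
    """Count views as unique (day, user, entity) tuples.

    `events` is an iterable of (day, user, entity) tuples. Repeated loads of
    the same entity by the same user on the same day count only once.
    """
    return len(set(events))


# Hypothetical raw page-load events.
events = [
    (date(2021, 3, 1), "alice", "dashboard:42"),
    (date(2021, 3, 1), "alice", "dashboard:42"),  # same day and user: not a new view
    (date(2021, 3, 2), "alice", "dashboard:42"),  # new day: counts again
    (date(2021, 3, 1), "bob", "dashboard:42"),    # different user: counts
]

print(count_views(events))  # 3
```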
Our ecosystem now comprises more than 100,000 tables and virtual datasets backing over 200,000 charts and 14,000 dashboards.
All this analysis, slicing and dicing, and decision making is performed by users across many job functions at Airbnb; over 25% of the company uses Superset on a weekly basis.
How We Scaled Superset
To support Airbnb’s scale, we built several custom features in and around Superset. These configuration options, daily offline jobs, and warehouse optimizations were key to scaling Superset.
Cache Warmup Job
Of all dashboards viewed each day at Airbnb, 90% are viewed more than once. Combined with the fact that most new data currently lands only once a day through our ETL (extract, transform, load) jobs, this means that caching the results of dashboard charts daily can dramatically improve performance for the majority of users. Using Apache Airflow™, we implemented an offline cache warmup strategy focused on recently viewed dashboards, resulting in an 86% cache hit rate for Presto®-backed charts. Since Superset natively supports caching chart requests in Redis™*, we were able to programmatically load the popular dashboards during non-business hours, thus reducing the load on our query engines, Presto® and Apache Druid™, during peak hours. Charts that took over 30 seconds to load uncached now load in under four seconds when served from the cache.
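The selection step of such a warmup job can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Airbnb's actual job: the view log format, the function names, and the `load_dashboard` callback (a stand-in for whatever triggers a dashboard's chart queries so that results land in the cache) are all hypothetical. In an Airflow deployment, functions like these could be wrapped in a scheduled task.

```python
from collections import Counter
from datetime import date, timedelta


def dashboards_to_warm(view_log, today, lookback_days=7, limit=None):
    """Pick dashboards viewed within the lookback window, most viewed first.

    `view_log` is an iterable of (day, dashboard_id) pairs.
    """
    cutoff = today - timedelta(days=lookback_days)
    counts = Counter(
        dashboard_id for day, dashboard_id in view_log if cutoff <= day < today
    )
    ranked = [dashboard_id for dashboard_id, _ in counts.most_common()]
    return ranked if limit is None else ranked[:limit]


def warm_cache(dashboard_ids, load_dashboard):
    """Call `load_dashboard` once per dashboard id, in the given order."""
    for dashboard_id in dashboard_ids:
        load_dashboard(dashboard_id)


# Hypothetical view log: dashboard 3 was last viewed too long ago to qualify.
view_log = [
    (date(2021, 3, 1), 7),
    (date(2021, 3, 2), 7),
    (date(2021, 3, 2), 9),
    (date(2021, 1, 1), 3),
]
warmed = []
warm_cache(dashboards_to_warm(view_log, today=date(2021, 3, 3)), warmed.append)
print(warmed)  # [7, 9]
```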
Domain Sharding
When loading a dashboard in Superset, individual requests are fired off concurrently for every visible chart on the dashboard. Although this works when requests are fast and dashboards are small, with dashboards containing many charts we quickly run into issues with browser settings. Most modern browsers limit the number of concurrent requests made to a single domain (in our case, the Superset API) to six, resulting in a bottleneck that slows down large dashboards. To handle this issue, we built the SUPERSET_WEBSERVER_DOMAINS configuration option. By setting this option, admins can allow as many concurrent dashboard queries as their database engine can support (an effective cache may be required to ensure that the engine is not overloaded). We route four different subdomains to our web server, supporting up to 24 concurrent queries on a single dashboard. This functionality was key to allowing users to build complex dashboards and improving performance.
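As a rough illustration, the option takes a list of hostnames that all resolve to the same Superset web server. The hostnames below are placeholders, and the round-robin helper only illustrates the idea of spreading chart requests across domains; it is not a copy of Superset's frontend logic.

```python
# superset_config.py (sketch). The hostnames are placeholders.
SUPERSET_WEBSERVER_DOMAINS = [
    "superset-1.example.com",
    "superset-2.example.com",
    "superset-3.example.com",
    "superset-4.example.com",
]

# Per-domain connection cap of most modern browsers, as described above.
BROWSER_CONNECTIONS_PER_DOMAIN = 6


def max_concurrent_chart_requests(domains, per_domain=BROWSER_CONNECTIONS_PER_DOMAIN):
    """Upper bound on simultaneous chart requests a browser can issue."""
    return len(domains) * per_domain


def assign_domain(chart_index, domains):
    """Illustrative round-robin assignment of a chart request to a domain."""
    return domains[chart_index % len(domains)]


print(max_concurrent_chart_requests(SUPERSET_WEBSERVER_DOMAINS))  # 24
```

With four domains and six connections each, the browser-side ceiling rises from 6 to 24 concurrent requests, which is why an effective cache matters to keep the database engine from absorbing the extra load.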
Database Engine Load Management
While Superset allows for a lot of native optimizations, some performance and stability improvements can only be made at the database engine level. Because of complex business needs and the size of our datasets, many dashboard-triggered queries take 25 seconds on average to execute. Therefore, we took the following steps to make sure our database engine clusters do not become overloaded:
- Route queries based on importance: We route queries to different clusters (using the DB_CONNECTION_MUTATOR configuration) to avoid resource contention; optimized dashboard queries are sent to one cluster while ad hoc SQL Lab and explore queries are sent to another.
- Limit concurrency for each user: We limit each user to running only three queries simultaneously on our database engine. While this may seem like a small number, it’s actually more than sufficient given the cache warmup job previously mentioned. In the ideal case, very few queries actually make their way to our database engine, and the cache efficiently returns the results instead of rerunning the query.
- Restrict large queries: We limit the size of queries that can be run to a certain memory size or partition count. This encourages users to create efficient queries of reasonable complexity.
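The routing step in the first bullet can be sketched as follows. The argument order mirrors the example mutator in Superset's default configuration, but check it against your Superset version. The cluster hostnames are hypothetical, the `QuerySource` enum is a local stand-in used only for the demonstration, and the sketch assumes `uri` behaves like a SQLAlchemy 1.4+ URL, whose `set()` method returns a modified copy.

```python
from enum import Enum

# Hypothetical cluster hostnames.
DASHBOARD_CLUSTER_HOST = "presto-dashboards.example.com"
ADHOC_CLUSTER_HOST = "presto-adhoc.example.com"


class QuerySource(Enum):
    """Local stand-in for Superset's query-source enum (demo only)."""
    CHART = 0
    DASHBOARD = 1
    SQL_LAB = 2


def pick_cluster_host(source):
    """Send dashboard queries to one cluster and everything else to another.

    Compares by member name so the check works with any enum that has a
    DASHBOARD member, not just the stand-in defined above.
    """
    if getattr(source, "name", None) == "DASHBOARD":
        return DASHBOARD_CLUSTER_HOST
    return ADHOC_CLUSTER_HOST


def DB_CONNECTION_MUTATOR(uri, params, username, security_manager, source):
    """Rewrite the connection host based on where the query originated."""
    return uri.set(host=pick_cluster_host(source)), params


print(pick_cluster_host(QuerySource.DASHBOARD))  # presto-dashboards.example.com
print(pick_cluster_host(QuerySource.SQL_LAB))    # presto-adhoc.example.com
```

Keeping the routing decision in a small pure function makes it easy to test separately from the connection handling.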
Where Superset Really Shines
As a BI solution, Superset is capable of satisfying most of our needs, though there are also many similar products on the market. We continue to use and invest in Superset for many reasons — familiarity, content, migration costs, etc. — but where Superset really shines for us is the fact that it is open source. This has allowed Airbnb to implement a number of advanced customizations that would likely have been difficult with commercial products.
As an active contributor to the Apache Superset project, we were able to adapt Superset for our business needs via:
- Helping to define the open-source roadmap
- Proposing and implementing open-source features
- Creating custom, in-house overrides or mutations. The Superset backend is written in Python, which supports easy augmentation and customization via monkey patching.
When evaluating whether to build an internal solution or buy something off the shelf, one is faced with the 80/20 conundrum. An off-the-shelf solution will likely get you 80% of what you need, but the final 20% may be fraught with insurmountable challenges. Although Superset currently lacks some of the features and polish that other SaaS solutions offer, it makes up for these deficits in spades through the level of potential customization it provides.
Below are a few projects where Airbnb has leveraged or augmented Superset to enhance — either by streamlining or enriching — the user experience.
Metric Explorer
Metric Explorer, a component of the Dataportal (Airbnb’s search and discovery tool), enables out-of-the-box data exploration for teams across Airbnb. The goal was to make it easy and safe for anyone to explore curated business metrics, courtesy of the Minerva framework, for a typical reporting period such as the last 7 days or the prior week.
When designing Metric Explorer, we were faced with a dilemma. We wanted to provide a highly curated and vetted experience for slicing and dicing metrics that leveraged rich metadata and surfaced business context, while providing sufficient guardrails. However, we did not want to build another dashboarding tool and reimplement large swaths of Superset’s features.
We decided to solve this dilemma by factoring out the frontend foundation of Superset visualizations into the @superset-ui NPM packages. Not only did this solve the Metric Explorer use case, it also enabled any Superset installation to build other custom data applications that leverage the Superset backend. Figures 1 and 2 are Metric Explorer screenshots illustrating the Superset integrations.
Security Manager and Data Access Policy Integration
Though Superset ships with a default Security Manager, the scale of the data at Airbnb and the complexity of our data access policy required a custom implementation. Restrictions are defined at the underlying table or metric level rather than at the level of Superset entities — charts, dashboards, etc.
We leveraged Superset’s custom security manager functionality, via the CUSTOM_SECURITY_MANAGER configuration option, and some RESTful API and Flask-AppBuilder overrides. Using this approach, we were able to seamlessly integrate Superset to adhere to the data access policy enforced by Airbnb’s internal security controller.
We wanted to further enrich the user experience by integrating the access request flow directly within Superset. This was achieved by adding frontend customizations, in conjunction with the custom security manager, which prompted users through a flow whenever we detected a priori (i.e., before the actual query was run) that they did not have the relevant permissions to access the underlying data. Figures 3–5 illustrate the data access policy integration within Superset.
By surfacing access requests in-place, instead of having users decipher cryptic database errors or directing them elsewhere, we were able to preserve the user flow and simultaneously provide the approvers with the necessary context about the request. This deeply integrated experience would most likely be very difficult to provide with other, less customizable tools.
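The a priori check described above boils down to comparing what a query needs against what the user has been granted, before anything is sent to the database. The sketch below shows only that comparison. The function names, the entity names, and the `grants_by_user` mapping (a stand-in for the internal security controller) are hypothetical; in a real deployment this logic would sit behind the class supplied through CUSTOM_SECURITY_MANAGER.

```python
def missing_access(required_entities, granted_entities):
    """Return the tables or metrics the user still lacks, sorted by name."""
    return sorted(set(required_entities) - set(granted_entities))


def check_before_query(user, required_entities, grants_by_user):
    """Decide, before the query runs, whether to prompt for access."""
    missing = missing_access(required_entities, grants_by_user.get(user, ()))
    if missing:
        # The frontend can use this list to start an in-place access request.
        return {"allowed": False, "request_access_for": missing}
    return {"allowed": True, "request_access_for": []}


# Hypothetical grants and query requirements.
grants_by_user = {"alice": {"core.bookings"}}
result = check_before_query(
    "alice", ["core.bookings", "core.payments"], grants_by_user
)
print(result)  # {'allowed': False, 'request_access_for': ['core.payments']}
```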
Metrics for the Masses
As mentioned previously, Airbnb developed the in-house Minerva metric framework. Data can be queried in Superset via the Minerva API, a metric-centric, pseudo-datasource-agnostic SQL database backed by an Apache Druid cluster. To aid with discovery and enhance the user experience, all the metrics and dimensions are defined in a single non-mutable virtual Superset datasource with pre-defined metric expressions. Since our internal security controller supports permissioning at the metric level, the datasource remains functional from an access control perspective.
This single datasource now encompasses thousands of metrics and dimensions. Since most metrics and dimensions are generally scoped to products or projects by construction, the vast majority of metric-dimension combinations are not viable.
To keep users from accidentally selecting an infeasible combination of metrics and dimensions, we made an open-source contribution to Superset’s chart controls, introducing a hook to asynchronously update control props based on user inputs. This allows us to filter out invalid metrics and dimensions according to what users have selected. Figures 6 and 7 illustrate the behavior.
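The filtering itself can be pictured as set arithmetic over a metric-to-dimensions mapping. The real hook runs asynchronously in Superset's frontend, so the Python below only sketches the logic; the metric and dimension names and the `dimensions_by_metric` mapping are invented for the example.

```python
def valid_dimensions(selected_metrics, dimensions_by_metric):
    """Dimensions usable with every selected metric (set intersection)."""
    if not selected_metrics:
        # Nothing selected yet: offer every dimension attached to any metric.
        return set().union(*dimensions_by_metric.values())
    return set.intersection(
        *(set(dimensions_by_metric[metric]) for metric in selected_metrics)
    )


def valid_metrics(selected_dimensions, dimensions_by_metric):
    """Metrics that support every selected dimension."""
    wanted = set(selected_dimensions)
    return {
        metric
        for metric, dimensions in dimensions_by_metric.items()
        if wanted <= set(dimensions)
    }


# Hypothetical metadata: each metric lists the dimensions it supports.
dimensions_by_metric = {
    "bookings": {"market", "device"},
    "listings_created": {"market", "host_tenure"},
}

print(valid_dimensions(["bookings", "listings_created"], dimensions_by_metric))  # {'market'}
print(valid_metrics(["device"], dimensions_by_metric))  # {'bookings'}
```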
A traditional BI tool would likely not be able to handle data at this scale, or would suffer severe usability issues, given that it would not be apparent to the user which dimensions can be used to filter or group by for a specific set of metrics.
As evidenced by the above examples, we at Airbnb have invested heavily in Superset over the past half decade. The time and effort we have put in allowed us to create a BI ecosystem that enables any employee to self-serve the analytics they need to perform their job in a data-informed manner. Superset’s configuration and customization capabilities, along with the ability to build the product roadmap through open source, have provided a stable foundation to keep Superset relevant for years to come.
In this post, we focused on how we evolved Superset for Airbnb’s large-scale needs, but other companies have leveraged Superset in different ways as well. You can learn about some of these here:
- Dropbox: Why we chose Apache Superset as our data exploration platform
- Nielsen: How Nielsen Scaled Access To Data Analytics Using Apache Superset
- Preset: Apache Superset 1.0 is out!
Thanks to everyone who contributed to the work represented in this blog post, especially Chris Williams, Gustavo Torres, Jinyang Li, Krist Wongsuphasawat, Michelle Thomas, Serena Jiang, and Sylvia Tomiyama.
Apache Superset, Apache Druid, Apache Airflow, Superset, Druid, Airflow, Apache, and the Apache Superset logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
* Redis is a trademark of Redis Labs Ltd. Any rights therein are reserved to Redis Labs Ltd. Any use by Airbnb is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and Airbnb.
All trademarks, service marks, company names and product names are the property of their respective owners. Any use of these is for identification purposes only and does not imply sponsorship or endorsement.