Democratizing Data at Airbnb
Like many startups, the number of employees at Airbnb has grown significantly over the past several years. In parallel we have seen explosive growth in both the amount of data and the number of internal data resources: data tables, dashboards, reports, metrics definitions, etc. On one hand, the growth in data resources is healthy and reflects our heavy investment in data tooling to promote data-informed decision making. However it also creates a new challenge: effectively navigating a sea of data resources of varying quality, complexity, relevance, and trustworthiness. In this post we describe our observation of this problem and the Dataportal, a novel data resource search and discovery tool that addresses this issue.
The overarching goal of the Dataportal is to democratize data and empower Airbnb employees to be data informed by aiding with data exploration, discovery, and trust.
Who Are We?
Our team consists of a collection of data misfits: recovering data scientists who understand the numerous pain points associated with data, and visualization engineers who specialize in data communication. We’ve spent time in the trenches, often working in a reactionary, time-critical space. We wanted to design and build proactive solutions that help alleviate common, well-defined data problems.
Designed for All in Mind
Our view of the data landscape is merely one of many. To ensure that we developed a data product that provides universal value we talked to employees across departments, roles, tenure, and data literacy levels, to better understand their pain points and concerns around data.
A constant theme appeared: users often had to ask others where to find the appropriate resource as it was difficult to navigate the data landscape. Additionally, the lack of metadata and context made it hard to trust data. This lack of trust prevented employees from using resources outside of their sphere of knowledge, making them afraid of accidentally using outdated or incorrect information. This resulted in people creating additional resources, further muddying the landscape.
Complexities of a Fragmented Data Landscape
As Airbnb grows so do the challenges around the volume, complexity, and obscurity of data. Information and people become siloed which necessitates navigating an invisible landscape of tribal knowledge. This is an inefficient use of time for people on the journey and for those providing directions.
Scale aside, data is often isolated by tool or team, each providing a myopic localized view of the data-space while lacking global context. For example a dashboard is naive with regards to where the data originated from, and a data table lacks context of its relevance to downstream visualization tools. Furthermore, many data tools have complex permission rules which further fragments sharing and understanding.
The understanding of the entire data ecosystem, from the production of an event log to its consumption in a visualization, provides more value than the sum of its parts.
Defining a Path Forward
It was apparent that we needed to develop a system that enabled a shift in thinking. Relying solely on tribal knowledge stifles data discovery and thus we sought to develop a self-service system which provided transparency to our complex and often-obscure data landscape.
We hope to shift people from thinking of an individual datasource towards the concept of an integrated data-space; the data-space presents a holistic view of the data and thus provides the necessary context for people to be data informed.
The Dataportal provides a framework for best practices with data, providing guard rails where necessary. Our hope is that any employee, regardless of role, can easily find or discover data and feel confident about its trustworthiness and relevance.
From a transparency perspective we set out with the intent to have a single lens into our data-space by providing as much context as possible while observing per-tool access controls to the underlying data.
Modeling the Ecosystem
Our ecosystem is best represented as a graph, which we leverage in the Dataportal as described below. Nodes are the various resources: data tables, dashboards, reports, users, teams, business outcomes, etc. Their connectivity reflects their relationships: consumption, production, association, etc.
In our model the relationships are just as pertinent as the nodes. Knowing who produced or consumed a resource is just as valuable as the resource itself. Relationships provide the necessary linkages between our siloed data components and the ability to understand the entire data-space.
People are data resources too. Finding employees who have used or own a given data resource can increase the efficiency of knowledge sharing.
A graph of the ecosystem has value far beyond tracking lineage and cross-functional information. Data is a proxy for the operations of a company. Analyzing the network helps to surface lines of communication and identify facets or disconnected information.
Enter the Dataportal
To address the pain points above we built a data tool to improve data discoverability and exploration within Airbnb. To achieve this we built out four major product features:
The most important feature of the Dataportal is a unified search across our entire data ecosystem. Employees can search logging schemas, data tables, charts, dashboards, employees, and teams. We surface as much metadata about resources as possible in search cards to build context and trust.
We leverage the topology of the graph to boost search relevance using PageRank for promoting high-quality relevant resources. Well-documented and frequently-consumed resources will result in a higher score which helps to ensure that search drives users to the most desirable entities.
ii) Context and Metadata
From search a user can further explore a resource by visiting its detailed content page.
Data without context is often meaningless and can lead to ill-informed and costly decisions. Therefore content pages surface all of the information we have for a resource across data tools to show how it fits into the entire data ecosystem: who has consumed a resource, who created it, when it was created or updated, which other resources it’s related to, etc.
More metadata translates to being more data informed. This is especially true for data tables, the foundation of any data warehouse. Easily editable metadata promotes the updating of table descriptions and column comments, bypassing complicated and user restricted commands. Surfacing table lineage provides additional context of the data landscape.
We also surface column popularity and the distribution of column values for low-cardinality columns, which helps to bring context to the forefront. Commonly joined tables, related queries, etc. are on our roadmap and will provide even more context.
iii) Employee-centric Data
Employees are the ultimate holders of tribal knowledge so we created a dedicated user page to reflect this. We consolidate all of the data resources an employee has created, consumed, or favorited. Any employee in the company can view the page of any other employee, which promotes transparency from both production and consumption standpoints.
iv) Team-centric Data
Another source of tribal knowledge within Airbnb is teams. Teams have tables they query, dashboards they create and view, team metrics they track, etc. We found that employees spend a lot of time telling people about the same resources so we designed a way for them to link to and curate these popular items.
A Peek at the Tech Stack
Given our ecosystem is represented as a graph, it was both logical and performant to use a graph database to store the data. We choose Neo4j because it integrates well with Elasticsearch (commonplace at Airbnb), and it has Python bindings. We use Flask as a lightweight Python web framework for the API, which is consistent with a number of open source Airbnb data tools like Airflow, The Knowledge Repository, and Superset. The single page web app leverages React and Redux.
One of our project goals is to help build trust in data. We therefore embraced a product mindset from the start to ensure a thoughtful UI and UX, and ultimately build a delightful data product to achieve this goal.
Learnings + Future Work
Our work creating a self-service data culture is not over. It takes time for employees to change habits and incorporate a new tool into their workflows. Although asking a colleague about data is easy, it is highly inefficient at scale. It can be hard to re-train yourself to search for data on your own. Education about data and the Dataportal are ongoing efforts.
Our future work will focus on improving discovery by surfacing insights from the graph in the form of user-specific recommendations and notifications. To further address the problem of trust, we plan on creating a data certification process with a rubric for both certifiers and content. Certified content will be boosted in rankings to help promote its relevance.