Contextualizing Airbnb by Building Knowledge Graph
An introduction to Airbnb’s knowledge graph, which helps us categorize our inventory and deliver useful travel context to our users.
By Xiaoya Wei and Yizheng Liao
Imagine you are planning a trip to Los Angeles. The first step is to visit Airbnb.com and search for “Los Angeles.” On the backend, the query “Los Angeles” is translated into a block on the map; available Homes within this block are returned in many pages of search results. Is that enough for you to make your trip plan?
As Airbnb moves towards becoming an end-to-end travel platform, it is increasingly important for us to deliver travel insights that help people decide when to travel, where to go, and what to do on their trips. For example, what are the most popular landmarks and neighborhoods in L.A.? Are there any upcoming concerts or sporting events that might coincide with my trip? Which Airbnb listings are best for families? What is the most affordable time of year to visit L.A.? All of this content and context is helpful for vacation planners, and the more accurate, useful travel information we can provide, the more our users will trust us.
To scale our ability to answer these travel queries, we needed a systematic approach to storing and serving high-quality information about entities (e.g. cities, landmarks, events, etc.) and the relationships between them (e.g. the most popular landmark in a city, the best neighborhood for tacos, etc.). To tackle this problem, we have spent the past year building and applying a knowledge graph that stores and serves structured data that connects what makes our inventory unique, what our users are looking for, and what the world of travel has to offer.
In our previous post, we shared a high-level overview of what a knowledge graph is and how it works at Airbnb. In this post, we will dive into how we built the knowledge graph infrastructure and share what we learned along the way. We will also introduce how we use the knowledge graph to categorize our inventory and contextualize the entire platform.
The diagram below illustrates the current architecture of the knowledge graph service at Airbnb. It can be divided into three components: graph storage, graph query API, and storage mutator. In this section, we will go into the details of each of them.
The first thing we built for the knowledge graph infrastructure was a graph storage module. We adopted an in-house relational data store as the underlying database, on top of which we implemented a node store and an edge store so that one can directly perform CRUD (create, read, update, and delete) operations on nodes (entities) and edges (relationships), instead of dealing with rows in database tables. Each node or edge is assigned a globally unique identifier (GUID). We can fetch nodes and edges by GUID; in addition, we can fetch edges of a specific type that connect certain nodes.
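To make the node store and edge store idea concrete, here is a minimal in-memory sketch in Python. It is not Airbnb's implementation: the class and method names (`GraphStorage`, `create_node`, `edges_from`, and so on) are hypothetical, and a plain dictionary stands in for the relational tables.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Node:
    guid: str
    node_type: str
    properties: dict


@dataclass
class Edge:
    guid: str
    edge_type: str
    from_guid: str
    to_guid: str


class GraphStorage:
    """Toy stand-in for a node store and an edge store (hypothetical API)."""

    def __init__(self):
        self._nodes = {}  # guid -> Node
        self._edges = {}  # guid -> Edge

    def create_node(self, node_type, properties):
        node = Node(str(uuid.uuid4()), node_type, dict(properties))
        self._nodes[node.guid] = node
        return node

    def get_node(self, guid):
        return self._nodes.get(guid)

    def update_node(self, guid, **changes):
        self._nodes[guid].properties.update(changes)

    def delete_node(self, guid):
        # Remove the node together with every edge that touches it.
        self._nodes.pop(guid, None)
        self._edges = {
            g: e for g, e in self._edges.items()
            if guid not in (e.from_guid, e.to_guid)
        }

    def create_edge(self, edge_type, from_guid, to_guid):
        if from_guid not in self._nodes or to_guid not in self._nodes:
            raise KeyError("both endpoints must exist before creating an edge")
        edge = Edge(str(uuid.uuid4()), edge_type, from_guid, to_guid)
        self._edges[edge.guid] = edge
        return edge

    def get_edge(self, guid):
        return self._edges.get(guid)

    def edges_from(self, from_guid, edge_type):
        # Fetch edges of one type that start at a given node.
        return [
            e for e in self._edges.values()
            if e.from_guid == from_guid and e.edge_type == edge_type
        ]
```

Callers work with nodes, edges, and GUIDs only; how rows are laid out underneath stays hidden behind the class.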
To build ontology and relationships into the knowledge graph, the nodes in the graph storage are divided into different node types, and each node type is defined by a unique schema. For example, a place node is defined by its name and GPS coordinates, while an event node is defined by its name, date, and venue. These different node types are stored in separate tables in the underlying database.
Similarly, edges can be of different edge types to reflect different types of relationships among entities (such as landmark-in-city and language-spoken-in-country). Analogous to domain and range in RDFS, each edge type has a configurable constraint on the type of node that it starts from and the type of node that it connects to. For example, a landmark-in-city edge must connect a landmark node to a city node.
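The schema and constraint checks described above could look roughly like the following sketch. The place and event fields and the two edge types come from the text; everything else (the field spellings, the `language` and `country` node types, the function names) is an assumption made for illustration.

```python
from typing import NamedTuple

# Required fields per node type. "place" and "event" follow the text;
# the exact field names are illustrative.
NODE_SCHEMAS = {
    "place": {"name", "gps_coordinate"},
    "event": {"name", "date", "venue"},
}


class EdgeConstraint(NamedTuple):
    domain_type: str  # node type the edge must start from
    range_type: str   # node type the edge must point to


EDGE_CONSTRAINTS = {
    "landmark-in-city": EdgeConstraint("landmark", "city"),
    "language-spoken-in-country": EdgeConstraint("language", "country"),
}


def validate_node(node_type, properties):
    """Raise ValueError if a node is missing fields its schema requires."""
    missing = NODE_SCHEMAS[node_type] - set(properties)
    if missing:
        raise ValueError(f"{node_type} node is missing fields: {sorted(missing)}")


def validate_edge(edge_type, from_node_type, to_node_type):
    """Raise ValueError if an edge violates its domain/range constraint."""
    constraint = EDGE_CONSTRAINTS[edge_type]
    if (from_node_type, to_node_type) != constraint:
        raise ValueError(
            f"{edge_type} must connect {constraint.domain_type} -> "
            f"{constraint.range_type}, got {from_node_type} -> {to_node_type}"
        )
```

With this in place, `validate_edge("landmark-in-city", "city", "landmark")` is rejected because the direction is reversed, which is the kind of mistake a domain/range constraint is meant to catch at write time.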
Moreover, the graph storage is designed to store edges from different data sources, so that multiple teams (as data owners) can contribute data to the knowledge graph. Thus, each edge also stores its source and a confidence score. To keep one data owner’s operations from affecting data that belongs to other teams, we store edges from each data source in a separate table in the underlying database. The storage can also hold an additional payload for an edge; an example is the distance between the Home listing and the landmark for a home-near-landmark edge.
At Airbnb, the data in the knowledge graph is not always consumed through online queries, so we also dump a daily snapshot of the nodes and edges into a data warehouse for offline use. Applications such as our auto-complete service depend on the knowledge graph’s data dump for their product needs. In addition, we apply machine learning to the data dump for purposes including graph embedding and knowledge inference.
Lastly, we’d like to reflect on our choice of the underlying database for graph storage. Why did we adopt a relational database instead of a graph database? The short answer is operational overhead. At the time, we didn’t have a production-ready graph database at Airbnb, and using the existing relational database had the following advantages:
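A small sketch of how per-source edge storage might be modeled is given below. The class names, the `[0, 1]` confidence range, and the `distance_m` payload key are assumptions for illustration; a list per source stands in for a separate database table per source.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SourcedEdge:
    edge_type: str
    from_guid: str
    to_guid: str
    source: str          # which data owner contributed the edge
    confidence: float    # assumed here to lie in [0, 1]
    payload: dict = field(default_factory=dict)


class MultiSourceEdgeStore:
    """Keeps one 'table' (a list) of edges per data source."""

    def __init__(self):
        self._tables = defaultdict(list)  # source -> list of SourcedEdge

    def add(self, edge):
        if not 0.0 <= edge.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        self._tables[edge.source].append(edge)

    def drop_source(self, source):
        # Only this source's table is touched; other owners' data is unaffected.
        self._tables.pop(source, None)

    def find(self, edge_type, from_guid, sources=None, min_confidence=0.0):
        # Query all sources by default, or only the ones the caller names.
        names = list(self._tables) if sources is None else sources
        return [
            e
            for name in names
            for e in self._tables.get(name, [])
            if e.edge_type == edge_type
            and e.from_guid == from_guid
            and e.confidence >= min_confidence
        ]
```

For example, a home-near-landmark edge could carry `payload={"distance_m": 350}`, and a reader could restrict a lookup to a trusted source with `find(..., sources=["team_a"])`.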
- Our in-house relational database proved reliable as it had been widely used. It also came with a lot of useful features, such as an easy-to-use client, schema migration tools, monitoring and alerting as well as daily data export.
- Using a graph database meant we would have to set it up within Airbnb’s infrastructure, debug any reliability or performance issues, and develop the additional features we would need. It would have slowed our progress and distracted us from the knowledge graph itself.
So far, the graph storage has performed satisfactorily on the relational database. We also carefully encapsulated the logic that deals with the database in one place and hid it from the rest of the knowledge graph codebase. As a result, we have the flexibility to replace the underlying database if that becomes necessary in the future.
Graph Query API
As we started using the knowledge graph in production, we noticed that most of the product use cases needed to traverse a subgraph and retrieve nodes and edges from that traversal. For example, on Airbnb’s product detail page (PDP, or listing page), the knowledge graph is queried to display points of interest near the Home listing, along with photos for each of the restaurants, museums, or landmarks mentioned. In graph terms, this query needs to traverse (1) all place nodes that are connected to a specific Home listing node, and (2) the photo nodes connected to the place nodes fetched in the previous step.
To support these product needs, we implemented a graph query endpoint in addition to CRUD endpoints for nodes and edges in the knowledge graph API module. With a graph query, one can traverse the graph by specifying a path, which is a sequence of edge types and data filters, starting from certain nodes, and receive the traversed subgraph in a structured format. The graph query API has a recursive interface such that one can traverse the knowledge graph with multiple steps.
To give you a taste, let’s look at an example. Suppose one wants to find all place nodes connected to the city node “Beijing” by edges of type “contains_location” such that they (1) have more than 5,000 listings around them and (2) belong to the “scenic” category. This query can be written as follows.
As mentioned above, the knowledge graph is designed to store data from multiple data sources. Through our knowledge graph API, data from all sources is available to query. In a graph query, one can specify which data sources to query. Meanwhile, we are also working on a data reconciliation layer, which aims to aggregate data from different sources, reconcile conflicts, and provide a consistent view of the data when users don’t know which data sources to trust.
By now, the knowledge graph supports use cases such as fetching all landmarks close to a Home at Airbnb, since these can be expressed as a graph query. However, some use cases cannot be directly expressed as a graph query, for example, fetching the most popular landmark around a Home. We are now actively investing in handling such fuzzy queries by incorporating the landmark’s metadata and the user’s personalization signals via ML.
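As a rough illustration only (this is not the actual syntax of Airbnb's query API), the same traversal can be sketched in Python over a toy in-memory graph. The `Step` and `traverse` names, the property names `listing_count` and `category`, and all node data are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One hop of a path: an edge type plus an optional filter on the target node."""
    edge_type: str
    node_filter: Optional[Callable[[dict], bool]] = None


def traverse(nodes, edges, start_ids, path):
    """Follow `path` from each start node and return the traversed subgraph."""
    if not path:
        return []
    step, rest = path[0], path[1:]
    results = []
    for start in start_ids:
        for edge_type, src, dst in edges:
            if src != start or edge_type != step.edge_type:
                continue
            if step.node_filter is not None and not step.node_filter(nodes[dst]):
                continue
            # Recurse so that multi-step paths produce a nested result.
            results.append({
                "node": dst,
                "via": edge_type,
                "next": traverse(nodes, edges, [dst], rest),
            })
    return results


# Toy data: identifiers and numbers are invented.
nodes = {
    "city:beijing": {"name": "Beijing"},
    "place:a": {"listing_count": 8000, "category": "scenic"},
    "place:b": {"listing_count": 1200, "category": "scenic"},
    "place:c": {"listing_count": 9000, "category": "shopping"},
}
edges = [
    ("contains_location", "city:beijing", "place:a"),
    ("contains_location", "city:beijing", "place:b"),
    ("contains_location", "city:beijing", "place:c"),
]

query = [
    Step(
        "contains_location",
        lambda n: n["listing_count"] > 5000 and n["category"] == "scenic",
    )
]
result = traverse(nodes, edges, ["city:beijing"], query)
print([r["node"] for r in result])  # ['place:a']
```

Appending a second `Step` to `query` would continue the traversal from each matching place, which mirrors the recursive, multi-step interface described above.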
For many of our product use cases, we need to constantly import data to the graph storage and propagate these mutations downstream. There are cases when it is suboptimal to synchronously write data through the knowledge graph API, for the following reasons:
- It is an operational burden to synchronously call the knowledge graph API in every pipeline that writes data to the knowledge graph, since the pipelines are implemented in different tech stacks (e.g. Airflow, IDL service, etc.) and each pipeline needs to deal with issues like rate limiting and retrying on exceptions.
- Writing data through the API can interfere with other crucial online uses of the knowledge graph (e.g., search, PDP, etc.), especially when there is a spike in write traffic or when the write path on the graph storage is faulty.
On top of the graph storage, we built a storage mutator to resolve this issue. In addition to calling the API, a data pipeline can also send a mutation request to the knowledge graph by emitting a message to a specific topic on our Kafka message bus; the mutation consumer subscribes to this topic and writes the corresponding data into the knowledge graph upon receiving the messages. This pattern makes it easier to write data into the knowledge graph from various pipelines and is now the primary way for us to import data. We are also planning to use it for functionality such as storage rollback and third-party data ingestion.
In the storage mutator, we also built a mutation publisher to propagate data mutations to the Kafka message bus. Downstream pipelines can consume these messages for their product use cases. An example is the search index pipeline, in which the knowledge graph populates categorization data into the search index via this pattern. We will dive into this use case in the next section.
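The consume-then-publish flow of the storage mutator can be sketched as follows. To keep the example self-contained, a small in-memory bus stands in for Kafka; the topic names, the JSON message shape, and the `op` values are all hypothetical.

```python
import json
from collections import deque

# Hypothetical topic names.
MUTATION_REQUEST_TOPIC = "kg.mutation.requests"
MUTATION_EVENT_TOPIC = "kg.mutation.events"


class InMemoryBus:
    """Minimal stand-in for a message bus with named topics."""

    def __init__(self):
        self._topics = {}

    def publish(self, topic, message):
        self._topics.setdefault(topic, deque()).append(message)

    def poll(self, topic):
        queue = self._topics.get(topic)
        return queue.popleft() if queue else None


def consume_mutations(bus, graph):
    """Apply pending mutation requests to `graph` and re-publish each applied one."""
    applied = 0
    while (raw := bus.poll(MUTATION_REQUEST_TOPIC)) is not None:
        mutation = json.loads(raw)
        if mutation["op"] == "upsert_node":
            graph[mutation["guid"]] = mutation["properties"]
        elif mutation["op"] == "delete_node":
            graph.pop(mutation["guid"], None)
        else:
            continue  # unknown operation: skipped in this sketch
        # Mutation publisher: let downstream pipelines see what changed.
        bus.publish(MUTATION_EVENT_TOPIC, raw)
        applied += 1
    return applied


# A pipeline emits a request instead of calling the API synchronously.
bus, graph = InMemoryBus(), {}
bus.publish(
    MUTATION_REQUEST_TOPIC,
    json.dumps({"op": "upsert_node", "guid": "n1", "properties": {"name": "x"}}),
)
consume_mutations(bus, graph)
```

The writer never blocks on the knowledge graph, and the consumer can apply mutations at its own pace, which is what decouples bulk imports from online read traffic.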
Using Taxonomy to Categorize the World of Travel
Today, there are over five million Homes on Airbnb. In order to help travelers find the single best Home for their trip, we first needed to establish a deep understanding of every single Home on our platform. For example, which Homes are best for families and which allow 24-hour check-in? To support this use case, we built a rich taxonomy in our knowledge graph and applied it to categorize all of our inventory at Airbnb.
To enrich semantics in the knowledge graph, we built a taxonomy as a part of our ontology, which is the vocabulary we use to describe our inventory and the world around us. The taxonomy is a hierarchical structure that represents concepts at different levels of granularity, so that we can map higher-level concepts to very specific ones. For example, “Beachfront” in the screenshot above is a tag in our taxonomy, while its parent tag is “Nature Venue” and its grandparent tag is “Venue.”
Our taxonomy was started even before the knowledge graph, with the purpose of categorizing Experiences at Airbnb. Later, the taxonomy was migrated into the knowledge graph as a special type of node. It has since been revised to be universal and now applies to all business verticals at Airbnb (Homes, Experiences, Restaurants, etc.) as well as to other types of entities (such as places and events) that are stored inside the knowledge graph.
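The hierarchy in this example can be sketched with a simple parent map. Only the three tags named above are included; the function names are illustrative.

```python
# child tag -> parent tag (None marks a root of the hierarchy)
PARENT = {
    "Beachfront": "Nature Venue",
    "Nature Venue": "Venue",
    "Venue": None,
}


def ancestors(tag):
    """Return the chain of parent tags, nearest first."""
    chain = []
    parent = PARENT[tag]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def matches_concept(item_tags, concept):
    """True if any tag on the item is the concept itself or one of its descendants."""
    return any(tag == concept or concept in ancestors(tag) for tag in item_tags)


print(ancestors("Beachfront"))                   # ['Nature Venue', 'Venue']
print(matches_concept({"Beachfront"}, "Venue"))  # True
```

Tagging an item once with the most specific tag is enough: a query for the broader concept “Venue” still finds it by walking up the hierarchy.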
Given the fundamental role that taxonomy is playing in the knowledge graph, we are treating taxonomy nodes differently from nodes of other types. Any edits to the taxonomy need to be discussed and approved by a cross-functional team consisting of content strategists, product managers, and engineers before being executed.
In order to categorize Airbnb’s inventory, every Experience, Home, or Restaurant needs to be tagged with the relevant nodes in the taxonomy. This process has been challenging. On the one hand, human-powered categorization is expensive and hard to scale; on the other hand, automated categorization efforts require extra work to ensure accuracy. Here are a few different approaches that we have explored.
Airbnb Experiences are tagged manually by a global operations team. To facilitate this process, we built an admin tagging tool with a clean and simple UI. (In the future, our hosts might take on some of this process.)
For automated categorization, we tested several different approaches. First, we made knowledge inferences directly from inventory metadata. To categorize Homes with location amenities (such as “Beachfront” in the example above), the supply dynamics team combines a k-d tree built over Home locations with text extraction from Home descriptions and guest reviews. They also set up a feedback loop for hosts (see below) to confirm their inference results.
In addition, the AI Labs team built embeddings of Airbnb Homes from the Home’s description, the description of the Home’s neighborhood, and the host’s profile description. Based on these embeddings, they are now actively working on inferring possible missing amenities for Homes at Airbnb. The AI Labs team is going to publish a post on the details of their work.
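The post does not describe the team's pipeline in detail, so the following is only a sketch of the general technique: build a 2-d k-d tree over Home locations, then find the Homes within a radius of a point of interest (here, a beach) as candidates for a “Beachfront” tag. Coordinates are assumed to be planar and in meters, and all data and the radius are made up.

```python
import math


def build_kdtree(points, depth=0):
    """points: list of (x, y, label). Returns (point, left, right) or None."""
    if not points:
        return None
    axis = depth % 2  # alternate between x and y at each level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (
        points[mid],
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )


def within_radius(tree, center, radius, depth=0, found=None):
    """Collect every stored point whose distance to `center` is <= radius."""
    if found is None:
        found = []
    if tree is None:
        return found
    point, left, right = tree
    if math.dist(point[:2], center) <= radius:
        found.append(point)
    axis = depth % 2
    diff = center[axis] - point[axis]
    # Visit a subtree only if the search circle can reach its side of the split.
    if diff <= radius:
        within_radius(left, center, radius, depth + 1, found)
    if diff >= -radius:
        within_radius(right, center, radius, depth + 1, found)
    return found


# Toy Home locations in meters (invented).
homes = [
    (120.0, 40.0, "home_a"),
    (900.0, 850.0, "home_b"),
    (60.0, 95.0, "home_c"),
    (400.0, 30.0, "home_d"),
]
tree = build_kdtree(homes)
beach = (100.0, 50.0)
candidates = sorted(label for _, _, label in within_radius(tree, beach, 150.0))
print(candidates)  # ['home_a', 'home_c']
```

In practice the spatial candidates would only be a first signal; the text extraction and host feedback steps mentioned above would be needed to confirm or reject each tag.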
Delivering Travel Context to Users
So far, we’ve explored a few product touchpoints to deliver contextual travel insights to users in the booking flow. The following are a few of the features that are empowered by the knowledge graph at Airbnb.
Inspiring users to select a destination
In order to help users select a destination to begin their search, we launched a series of inspirational destination carousels on our homepage. We store hundreds of destination photos inside the knowledge graph and use them to deliver the most relevant travel ideas to users.
Helping users choose a Home to book
After choosing a destination, users begin the process of deciding on a Home. To help guide users, we used the knowledge graph to surface context and insight about the Homes in a destination. In some cases, we highlight popular amenities (see below), top landmarks, or interesting neighborhoods.
Providing more context about a Home
We know from research that users often want to find a Home that’s close to a specific point of interest, whether that’s Disneyland or the Louvre. In order to help users understand how a specific listing relates to key landmarks, we are using the knowledge graph to showcase what’s near a specific Home, and displaying that information on the PDP (see below).
Our work in the knowledge graph has already helped us greatly enhance and personalize searching, supply groupings, and content delivery at Airbnb. There have been challenges as well, especially in data quality and online performance.
In order to tackle these problems, we are leveraging state-of-the-art machine learning, statistics, and optimization models and algorithms. Specifically, we are building convolutional neural networks (CNNs) to vet the quality of tagging. We are also deploying contextual multi-armed bandit models to recommend personalized content in an online service. Furthermore, by utilizing product, user, and search query embedding techniques, we hope to generate new categories that are not currently available in our human-defined taxonomy. These methods are still in the experimental phase, so stay tuned!
We set out to use a knowledge graph in order to provide a consistent interface to clean, current, and complete structured data about our inventory and the world of travel at large. By serving connected and high-quality data, we believe there is a massive opportunity for the knowledge graph to improve the guest and host experiences at Airbnb. In 2019, we will continue to invest in using our knowledge graph to enrich our understanding of the world of travel (categorization) and deliver more travel content (contextualization) to each traveler at every step of their trip planning and decision process.
Thanks to the Knowledge Graph team for their contributions to this project: Lei Shi, Michael Endelman, Bohan Ren, Elizabeth Ford, Pei Xiong and Shukun Yang. Thanks to Xianjun Zhang and Veronica Wharton for their work on automated categorization. We are grateful for Jixin Bao’s support with the database and for Tong Wei’s generous help in answering general infrastructure questions.
We appreciate Michael Endelman, Joy Zhang and Xiaohan Zeng’s help in proofreading this post.
Airbnb’s Trip Platform team is constantly looking for talented engineers to join the team! If you enjoyed reading this post and would like to work on projects that help travelers feel at home anywhere, please check out our open positions and send in your application!