Scaling Knowledge at Airbnb

Feb 25, 2016

by Chetan Sharma and Jan Overgoor


One responsibility of the Data team at Airbnb is to scale the company's ability to make decisions using data. We democratize data access to empower all employees to make data-informed decisions, give everybody the ability to use experiments to correctly measure the impact of their decisions, and turn insights about user preferences into data products that improve the experience of using Airbnb. Recently, we've started to tackle a different type of problem: as an organization grows, how do we make sure that an insight uncovered by one person transfers effectively beyond its intended recipient? Internally, we call this scaling knowledge.

When our team consisted of just a handful of people, it was easy to share and discover research findings and techniques. But as our team has grown, issues that were once minor have become more significant. Take the case of Jennifer, a new data scientist looking to expand on work produced by a colleague on the topic of host rejections. Here’s what we’d see happening:

  1. Jennifer asks other people on the team for any previous work, and is sent a mixed bag of presentations, emails, and Google Docs.

Based on conversations with other companies, this experience is all too common. As an organization grows, the cost of transmitting knowledge across teams and across time increases. An inefficient and anarchic research environment raises this cost, slowing down analysis and the speed of decision making. Thus, a more streamlined solution can expedite the rate at which decisions are made and keep the company nimble atop a growing base of knowledge.


As we saw this problematic workflow play out over and over, we realized that we could do better. As a team, we got together and agreed on five key tenets for our data science research:

  • Reproducibility — There should be no opportunity for code forks. The entire set of queries, transforms, visualizations, and write-up should be contained in each contribution and be up to date with the results.

With these tenets in mind, we surveyed the existing tools that had solved these problems in isolation. We noticed that R Markdown and IPython notebooks solved the issue of reproducibility by marrying code and results. GitHub provided a framework for a review process, but wasn't well adapted to content beyond code and writing, such as images. Discoverability was usually based on folder organization, though sites such as Quora were structuring many-to-one topic inheritance with tags. Learning was based on whatever code had been committed online, or on personal relationships.

Together, we combined these ideas into one system. Our solution combines a process around contributing and reviewing work, with a tool to present and distribute it. Internally, we call it the Knowledge Repo.

At the core there is a Git repository, to which we commit our work. Posts are written in Jupyter notebooks, R Markdown files, or plain Markdown, but all files (including query files and other scripts) are committed. Every file starts with a small amount of structured metadata, including author(s), tags, and a TLDR. A Python script validates the content and transforms the post into plain text with Markdown syntax. We use GitHub's pull request system for the review process. Finally, a Flask web app renders the Repo's contents as an internal blog, organized by time, topic, or contents.
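To make the structure concrete, here is a minimal sketch of what the header check in such a validation script could look like. The specific field names (`title`, `authors`, `tags`, `tldr`) and the parsing logic are illustrative assumptions, not Airbnb's actual schema or implementation:

```python
# Sketch of a front-matter validator for a Knowledge Repo-style post.
# Field names and format below are assumptions for illustration only.

REQUIRED_FIELDS = {"title", "authors", "tags", "tldr"}

def parse_front_matter(text):
    """Parse a '---'-delimited header of simple 'key: value' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("post must start with a '---' front-matter block")
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the header
            return meta
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    raise ValueError("unterminated front-matter block")

def validate(meta):
    """Reject posts that omit any required metadata field."""
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return True

post = """---
title: Host Rejections Deep Dive
authors: jennifer
tags: hosts, rejections
tldr: Rejection rates vary strongly by market.
---
# Analysis
...
"""

meta = parse_front_matter(post)
validate(meta)
print(meta["title"])
```

Running a check like this on every commit is what lets the feed reliably show an author, tags, and a TLDR for each post.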

On top of these tools, we have a process focused on making sure all research is high quality and consumable. Unlike engineering code, low quality research doesn’t create metric drops or crash reports. Instead, low quality research manifests as an environment of knowledge cacophony, where teams only read and trust research that they themselves created.

To prevent this from happening, our process combines the code review of engineering with the peer review of academia, wrapped in tools that make it all go at startup speed. As in code reviews, we check for code correctness and adherence to best practices and tooling. As in peer reviews, we check for methodological improvements, connections with preexisting work, and precision in expository claims. We typically don't aim for a research post to cover every corner of investigation, instead preferring quick iterations that are correct and transparent about their limitations. Our tooling includes internal R and Python libraries to maintain on-brand aesthetic consistency, functions to integrate with our data warehouse, and file processing to fit R and Python notebook files into GitHub pull requests.
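As one illustration of the kind of file processing mentioned above, a notebook can be flattened into plain Markdown so that a GitHub pull request shows a human-readable diff instead of raw notebook JSON. This is a simplified standard-library sketch of that idea, not the actual Airbnb tooling:

```python
import json

def notebook_to_markdown(ipynb_text):
    """Flatten a Jupyter .ipynb (nbformat 4 JSON) into plain Markdown.

    Markdown cells pass through; code cells become fenced blocks.
    Cell outputs are dropped here for brevity; a real converter
    would keep them so results stay reviewable next to the code.
    """
    fence = "`" * 3  # built dynamically to avoid a literal fence here
    parts = []
    for cell in json.loads(ipynb_text).get("cells", []):
        source = "".join(cell.get("source", []))
        if cell.get("cell_type") == "markdown":
            parts.append(source)
        elif cell.get("cell_type") == "code":
            parts.append(fence + "python\n" + source + "\n" + fence)
    return "\n\n".join(parts)

# A minimal two-cell notebook, inlined for demonstration.
ipynb = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Host rejections"]},
        {"cell_type": "code", "source": ["print('hello')"]},
    ],
})

md = notebook_to_markdown(ipynb)
print(md)
```

Because the converted file is plain text, reviewers can comment line by line on queries and prose in the pull request, just as they would on engineering code.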

Figure 1 — A screenshot of the knowledge feed showing the summary cards for two posts.
Figure 2 — An example of a post examining gap days in host acceptance decision making.

Together, this provides great functionality around our knowledge tenets:

  • Reproducibility — The entirety of the work, from the queries against our core ETL tables, to the transforms, visualizations, and write-up, is contained in one file: a Jupyter notebook, R Markdown file, or Markdown file.

The Knowledge Repo houses a large variety of content. The bulk of the work consists of deep dives aimed at answering non-trivial questions, but examinations of experiment results not covered by our experiment reporter are also common. Other posts are written purely to broaden the base of people who do data analysis, including writeups of new methodologies, examples of a tool or package, and tutorials on SQL and Spark. Our public data blog posts also now live in the Knowledge Repo, including this one. In general, the heuristic is: if it could potentially be useful to someone in the future, post it!


The Knowledge Repo is still a work in progress. The small team that is working on it continuously addresses feature requests. We are working towards adoption by all teams in the company that do research, such as Qualitative Researchers who don’t use GitHub. To that end, we are testing an internally built review process on an in-browser Markdown editing app. Another possible feature is a moderator for proposing and upvoting research topics. We are also looking into making it easier to fork existing posts and continue working on them as new posts.

One issue we are working on is the synthesis of the work done by individuals. Now that we have an archive for our knowledge threads, we have an opportunity to weave them into a shared understanding of Airbnb's ecosystem. We put a high premium on employees owning their own strategic impact, and want to enable everyone to develop a broad and deep understanding of any side of the business. We are currently iterating on the best way to keep content current and structured while the company rapidly evolves and team structures change.

Especially in the nascent field of Data Science, different data teams are likely reinventing the wheel regularly. In sharing our process we hope to inspire other organizations to work towards addressing the kinds of problems we are trying to solve, and share their learnings so we can collaboratively produce best practices.

Check out all of our open source projects and follow us on Twitter: @AirbnbEng + @AirbnbData

The Airbnb Tech Blog
