How Airbnb manages to monitor customer issues at scale

By Alvin Sng & Simon Hachey


Our story begins with a question:

How do you discover new, growing problems from your customers right now?

To most people, this seems like a straightforward question that every company should be able to answer. However, we found that it wasn’t so simple.

Of course, we had data on customer issues stored in our database, but we didn’t have a good way to answer the question. We had no way to surface problems that were just starting to trend, and we couldn’t see them in real time. Most of the trends we discovered were found on an ad hoc basis, which meant that by the time we discovered a trend, it was too late to act on it. So with that in mind, we began to devise a solution.

To provide some background, since Airbnb began, we’ve handled 80 million guest arrivals and we’re quickly growing. With our rapid growth, our engineering team is finding ways to tackle the new and challenging problems that arise. A large part of Airbnb’s operation relies on having customer service agents to handle the high volume of incoming questions from our hosts and guests — check out this previous post by Emre Ozdemir. One of our challenges is to understand this large volume of tickets and detect trends or unexpected problems as they occur in real time. We need a way to monitor and alert whenever we see an increase in folks calling us about a certain issue.

What we built

We built a web-based service that computes trends in all our tickets and visually displays the top trends at any given time in history. It considers several different attributes about a ticket such as issue type, browser version, user country, subject line, source, and more. With this data, it analyzes a time series of each attribute across all tickets and ranks them using an algorithm to detect spikes or trends. We actually run two different algorithms side-by-side to make improvements while still having a previous baseline to compare against.

How we made it work

The Infrastructure

The infrastructure required to make this scale involved a few different pieces. First, we needed a data store where tickets could be stored and queried quickly for our data crunching. We ran Elasticsearch to handle this. Next, we needed a web service that could accept requests containing new ticket data. This became our Node.js app, which served our web app and processed incoming tickets.

We also ran a separate job instance that was constantly computing ticket trends. It would query the Elasticsearch instance for ticket data and store the results into our Redis instance. Redis maintained a cache of ticket trend results for our Node.js app to render on the web. The entire front-end was built in React, making it easy to develop a rich UI to display the data.

The data store

We decided to stream all tickets into an Elasticsearch cluster in real time, as they are created. This data is then consumed by our batch jobs, which run at regular intervals to compute ticket trends. Having the tickets stored in Elasticsearch makes it easy to scale and perform aggregate queries on our data set. The tickets’ fields were indexed depending on the type of data stored.
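As a rough illustration of the kind of aggregate query this makes possible, the sketch below builds a request body that counts tickets per hour for one attribute value. The helper name and the field names (`created_at`, `issue_type`) are hypothetical, not our actual schema, and `fixed_interval` assumes a recent Elasticsearch version (older releases used `interval`).

```python
from datetime import datetime, timedelta, timezone


def hourly_count_query(field, value, end, hours=168):
    """Build a search body that counts matching tickets per hour.

    Assumptions for illustration: `created_at` is the ticket creation
    timestamp and `field` is an indexed keyword attribute.
    """
    start = end - timedelta(hours=hours)
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}},
                    {"range": {"created_at": {
                        "gte": start.isoformat(),
                        "lt": end.isoformat(),
                    }}},
                ]
            }
        },
        "aggs": {
            "tickets_per_hour": {
                "date_histogram": {
                    "field": "created_at",
                    "fixed_interval": "1h",
                    # Also return empty hours between the first and last bucket.
                    "min_doc_count": 0,
                }
            }
        },
    }


query = hourly_count_query(
    "issue_type", "login", datetime(2016, 1, 8, tzinfo=timezone.utc), hours=24
)
```

One body like this per attribute value could then be batched into a single multi-search request.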

The job worker

The role of our job worker is to query our data store and compute a trend score for each set of tickets. We did this on a separate instance from our web workers because we didn’t want long-running processing to increase the latency of incoming API requests carrying new ticket data. The resulting trend data is then sent to Redis, which allows our other web instances to fetch that data.

How we detected trends

Trend detection was core to making this all work. We start by running a multi-search query against Elasticsearch to get a time series of ticket counts for every ticket attribute that we want to consider for trends, ending at the time of interest. We then apply our scoring model to each time series, sort the results, and return all trending attributes above a minimum threshold. These results become the trends for that time period.
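The score, sort, and threshold steps can be sketched as follows. This is a minimal illustration rather than our production code: `rank_trends`, the attribute labels, and the `last_vs_mean` scorer are hypothetical, and the scorer is a deliberately simple placeholder for the real scoring model.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rank_trends(
    series_by_attribute: Dict[str, Sequence[float]],
    score: Callable[[Sequence[float]], float],
    min_score: float,
) -> List[Tuple[str, float]]:
    """Score every time series, keep those at or above min_score, sort descending."""
    scored = [(name, score(counts)) for name, counts in series_by_attribute.items()]
    trending = [(name, value) for name, value in scored if value >= min_score]
    trending.sort(key=lambda item: item[1], reverse=True)
    return trending


def last_vs_mean(counts: Sequence[float]) -> float:
    """Placeholder scorer: the latest count minus the mean of the earlier counts."""
    if not counts:
        return 0.0
    history = counts[:-1]
    baseline = sum(history) / len(history) if history else 0.0
    return counts[-1] - baseline


# Hypothetical hourly counts keyed by "attribute:value".
series = {
    "issue_type:login": [4, 5, 4, 5, 20],
    "country:FR": [3, 3, 3, 3, 3],
}
trends = rank_trends(series, last_vs_mean, min_score=5.0)
```

Passing the scorer in as a function makes it straightforward to run two scoring algorithms side by side over the same time series.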

The scoring model’s job is to adjust for periodicity (e.g. daily fluctuations in counts), remove noise and smooth the graph, and then decide if there is a spike. In order to smooth and adjust for periodic trends, we transform the graph into the frequency domain using a Fourier transform, find the peak frequency, and discard all other frequencies except for those in a close band to the peak. This produces a smoothed graph (shown in input-ifft) of the periodic component of our ticket counts. By subtracting this from our original graph, we can get an estimate of how much our graph deviates from its expected value. The final score is determined by looking at a few factors in the final adjusted graph, like a change in max and total ticket volume over time.
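A minimal sketch of the smoothing step, using only the Python standard library, might look like the following. The function names and the `band` parameter are hypothetical. Keeping the zero-frequency (mean) term alongside the band around the peak is an assumption made here so that the residual is centered near zero. The final scoring step is not shown.

```python
import cmath
import math


def dft(values):
    """Naive O(n^2) discrete Fourier transform; fine for short hourly series."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse transform, returning only the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def periodic_component(counts, band=1):
    """Keep the peak frequency and its close neighbours, then transform back."""
    n = len(counts)
    if n < 2:
        return [float(c) for c in counts]
    spectrum = dft(counts)
    half = n // 2
    # Peak among the non-zero frequencies (index 0 is the mean, not a period).
    peak = max(range(1, half + 1), key=lambda k: abs(spectrum[k]))
    kept = [0j] * n
    kept[0] = spectrum[0]  # assumption: keep the mean so the residual is centered
    for k in range(max(1, peak - band), min(half, peak + band) + 1):
        kept[k] = spectrum[k]
        kept[n - k] = spectrum[n - k]  # mirrored bin keeps the output real
    return inverse_dft(kept)


def residual(counts, band=1):
    """Deviation of each count from the smoothed periodic estimate."""
    expected = periodic_component(counts, band)
    return [c - e for c, e in zip(counts, expected)]


# Hypothetical data: two days of hourly counts with a daily cycle and one spike.
counts = [10 + 5 * math.sin(2 * math.pi * t / 24) for t in range(48)]
counts[40] += 30
deviation = residual(counts)
spike_hour = max(range(len(deviation)), key=lambda t: deviation[t])
```

In practice a library FFT would replace the naive transform here; the naive version only keeps the sketch dependency-free.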

After computing the scores for each ticket field for each hour, we sorted them and stored the top scores in Redis for consumption by the frontend.
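One way to picture this step is a cache entry per hour holding the highest-scoring attributes. In the sketch below a plain dictionary stands in for Redis, and the key format (`trends:<hour>`), the `top_n` cutoff, and the function names are all assumptions made for illustration.

```python
import json
from datetime import datetime
from typing import Dict, List, MutableMapping, Tuple


def trend_key(hour: datetime) -> str:
    """Hypothetical cache key: one entry per hour of computed trends."""
    return "trends:" + hour.strftime("%Y-%m-%dT%H")


def store_top_scores(
    cache: MutableMapping[str, str],
    hour: datetime,
    scores: Dict[str, float],
    top_n: int = 20,
) -> None:
    """Sort scores in descending order and cache the top_n as a JSON string."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    cache[trend_key(hour)] = json.dumps(ranked[:top_n])


def load_top_scores(
    cache: MutableMapping[str, str], hour: datetime
) -> List[Tuple[str, float]]:
    """Read back the cached trends for an hour, or an empty list if none exist."""
    raw = cache.get(trend_key(hour))
    return [tuple(pair) for pair in json.loads(raw)] if raw else []


cache: Dict[str, str] = {}  # in-memory stand-in for the Redis instance
store_top_scores(cache, datetime(2016, 1, 7, 13), {"a": 1.0, "b": 3.0, "c": 2.0}, top_n=2)
top = load_top_scores(cache, datetime(2016, 1, 7, 13))
```

Keying entries by hour means a lookup for any past point in time is a single cache read.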

Web UI

Once our computed data was in Redis, it became very fast for us to display it. We added the ability to look at past trends for any point in time and also to look at the most trending items over a longer range of time.

Stories

Our dashboard has already had an impact in detecting issues quickly since its release. As an example, we noticed a spike in users reporting that they could not see their listing in search. New users often have these sorts of issues when first starting on the platform, so individually our customer service agents didn’t think much of it. Additionally, since it wasn’t a full failure where no listings were being returned (but rather a subtle edge case), it wasn’t immediately obvious to the engineering team that there was an issue. However, because we saw a spike in tickets, we were confident that an issue existed and needed fixing. Our engineers were quick to fix the problem, and after a while we observed the corresponding drop in tickets in our tool, assuring us that the issue was actually fixed. Without this dashboard, this incident could have lasted days or even weeks, putting more strain on our agents and frustrating our users.

What we learned

We’ve been using this dashboard at Airbnb for over six months and have uncovered things that would have been much harder to discover otherwise. We’ve been able to catch many spikes, ranging from subtle bugs to small problems with the potential to get big. Our ticket dashboard doesn’t replace our existing monitoring systems for outages and system errors, but it serves as a catch-all for issues that those systems miss.

This new ticket monitoring system has proven to be an invaluable tool that we believe every large company should have. It has reduced our users’ frustration by helping us prioritize the most pressing issues to fix immediately. We estimate that this ticket dashboard has reduced our overall ticket volume by 3%. Sometimes it is crazy to see how a two-person hackathon project ends up saving Airbnb customers a ton of time and getting things back on track quickly.

If you have any questions or comments, leave them in the space below.


Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData