Aerosolve: Machine learning for humans
Have you ever wondered how Airbnb’s price tips for hosts work?
In this dynamic pricing feature, we show hosts the probability of getting a booking (green for a higher chance, red for a lower chance), or predicted demand, and allow them to easily price their listings dynamically with a click of a button.
Many features go into predicting the demand for a listing, among them seasonality, the unique features of a listing, and price. These features interact in complex ways and can result in machine learning models that are difficult to interpret. So we set about building a package that produces machine learning models that facilitate interpretation and understanding. This is useful for us as developers, and also for our users; the interpretations map to the explanations we provide to our hosts on why the demand they face may be higher or lower than they expect.
Introducing Aerosolve: a machine learning package built for humans.
We have been operating on the belief that humans partnering with a machine in a symbiotic way can exceed the capabilities of humans or machines alone.
From the project’s inception we have focused on improving the understanding of data sets by assisting people in interpreting complex data with easy-to-understand models. Instead of hiding meaning beneath many layers of model complexity, Aerosolve models expose data to the light of understanding.
For example, we are able to easily determine the negative correlation between the price of a listing in a market and the demand for the listing just by inspecting the image below. Rather than passing features through many deep hidden layers of non-linear transforms, we make models very wide, with each variable or combination of variables modeled explicitly using additive functions. This makes the model easy to interpret while still maintaining a lot of capacity to learn.
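To make the "wide, additive" idea concrete, here is a minimal Python sketch. It is not Aerosolve's code or API: the class name, the equal-width bucketing, and the plain SGD update are assumptions chosen for illustration, and the training data is synthetic. Each feature gets its own step function, and the score is simply the sum of those functions, so each feature's contribution can be read off directly.

```python
import math


def bucket(value, lo, hi, n_buckets):
    """Map a numeric value to one of n_buckets equal-width bins on [lo, hi]."""
    frac = (value - lo) / (hi - lo)
    return min(n_buckets - 1, max(0, int(frac * n_buckets)))


class AdditiveModel:
    """Hypothetical additive model: logit = bias + sum_j f_j(x_j),
    where every f_j is a step function stored as one weight per bucket."""

    def __init__(self, ranges, n_buckets=10):
        self.ranges = ranges          # feature name -> (lo, hi)
        self.n_buckets = n_buckets
        self.weights = {name: [0.0] * n_buckets for name in ranges}
        self.bias = 0.0

    def contributions(self, example):
        """Per-feature terms of the sum; this is what makes the model readable."""
        out = {}
        for name, (lo, hi) in self.ranges.items():
            idx = bucket(example[name], lo, hi, self.n_buckets)
            out[name] = self.weights[name][idx]
        return out

    def predict(self, example):
        z = self.bias + sum(self.contributions(example).values())
        return 1.0 / (1.0 + math.exp(-z))

    def sgd_step(self, example, label, lr=0.1):
        """One logistic-regression gradient step on the active buckets."""
        err = label - self.predict(example)
        self.bias += lr * err
        for name, (lo, hi) in self.ranges.items():
            idx = bucket(example[name], lo, hi, self.n_buckets)
            self.weights[name][idx] += lr * err


# Synthetic data (made up): cheap listings get booked, expensive ones do not.
model = AdditiveModel({"price": (0.0, 200.0)}, n_buckets=4)
data = [({"price": float(p)}, 1 if p < 100 else 0) for p in range(0, 200, 5)]
for _ in range(50):
    for example, label in data:
        model.sgd_step(example, label)

cheap = model.predict({"price": 20.0})
expensive = model.predict({"price": 180.0})
```

Plotting `model.weights["price"]` against the bucket index would give a demand-versus-price curve of the kind the paragraph describes, without any hidden layers to unpick.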
The red line encodes the general belief before looking at the data, or the prior. In this case we generally believe that the demand decreases with increasing price. We are able to inform the model of our prior beliefs in Aerosolve by adding them to a simple text configuration file during training. The black curve is the belief of the model after learning from billions of data points. It corrects any assumptions of the person working with the model with actual market data, while allowing human beings to feed back their initial beliefs about a variable.
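The interplay between a prior and the data can be mimicked with a simple shrinkage estimate. This is only an illustrative Python sketch: it does not reproduce Aerosolve's configuration file format or its training procedure, and the function names, the `prior_strength` parameter, and all numbers are made up for the example.

```python
def linear_prior(n_buckets, start, end):
    """Prior belief as a straight line across buckets, from `start` to `end`."""
    step = (end - start) / (n_buckets - 1)
    return [start + i * step for i in range(n_buckets)]


def blend_with_data(prior, bucket_means, bucket_counts, prior_strength=10.0):
    """Treat the prior as `prior_strength` pseudo-observations per bucket and
    average it with the observed per-bucket means."""
    posterior = []
    for p, mean, n in zip(prior, bucket_means, bucket_counts):
        posterior.append((prior_strength * p + n * mean) / (prior_strength + n))
    return posterior


# Belief before seeing data: demand falls as price rises.
prior = linear_prior(5, 1.0, -1.0)

# Hypothetical observations per price bucket (mean demand score, sample count).
observed_means = [0.8, 0.7, 0.1, -0.2, -0.9]
observed_counts = [1000, 1000, 0, 1000, 10]

posterior = blend_with_data(prior, observed_means, observed_counts)
```

Buckets with plenty of data end up close to the observed means, while the empty third bucket simply keeps the prior value. That is the behaviour the paragraph describes: the data corrects the human's assumption where it can, and the assumption fills in where data is thin.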
We also took great care to model unique neighborhoods around the world by creating algorithms to automatically generate local neighborhoods based on where Airbnb listings are located. These differ from handmade neighborhood polygons in two ways. Firstly, they are automatically generated, so we are able to construct them quickly for new markets that have just opened up. Secondly, they are built in a hierarchical manner, so we are able to quickly accumulate statistics that are point-like (e.g. listing views) or polygonal (e.g. search boxes) in a scalable way.
The hierarchy also lets us borrow statistical strength from parent neighborhoods, as they fully contain their child neighborhoods. These Kd-tree constructed neighborhoods are not visible to users but are used to compute local features for the machine learning models. In the figure below, we demonstrate the ability of the Kd-tree structure to automatically create local neighborhoods. Notice the care we have taken in informing the algorithm that it should not cross large bodies of water. Even Treasure Island has a neighborhood of its own. To avoid sudden changes along a neighborhood boundary, we take care to smooth the neighborhood information in a multi-scale manner. You can read more about this kind of smoothing, and see it visually, in the Image Impressionism demo of Aerosolve on GitHub.
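A minimal Python sketch of hierarchical neighborhoods built with a Kd-tree style split follows. It is an assumption-laden illustration rather than Aerosolve's implementation: the class name, the "split the wider side at the median" rule, and the `max_points` stopping rule are choices made for the example, the coordinates are made up, and the water-avoidance and multi-scale smoothing steps are left out entirely.

```python
class Neighborhood:
    """One tree node: a bounding box and the number of listings inside it."""

    def __init__(self, points, max_points=2):
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        self.box = (min(xs), min(ys), max(xs), max(ys))
        self.count = len(points)
        self.axis = None
        self.split = None
        self.children = []
        if len(points) <= max_points:
            return  # small enough: this is a leaf neighborhood
        width = self.box[2] - self.box[0]
        height = self.box[3] - self.box[1]
        axis = 0 if width >= height else 1  # cut across the wider side
        ordered = sorted(points, key=lambda p: p[axis])
        split = ordered[len(ordered) // 2][axis]  # median coordinate
        left = [p for p in points if p[axis] < split]
        right = [p for p in points if p[axis] >= split]
        if not left:
            return  # too many identical coordinates to split further
        self.axis, self.split = axis, split
        self.children = [Neighborhood(left, max_points),
                         Neighborhood(right, max_points)]

    def path(self, point):
        """Neighborhoods containing `point`, from the root down to a leaf."""
        node, out = self, [self]
        while node.children:
            side = 0 if point[node.axis] < node.split else 1
            node = node.children[side]
            out.append(node)
        return out


# Made-up listing coordinates: two clusters far apart along the x axis.
listings = [(0, 0), (1, 0), (0, 1), (1, 1),
            (10, 0), (11, 0), (10, 1), (11, 1)]
root = Neighborhood(listings)
counts = [n.count for n in root.path((0.2, 0.9))]
```

Every query point maps to a chain of nested neighborhoods. A statistic that is noisy in the smallest one (few listings) can be backed off to its parent, which is one way to read "borrowing statistical strength" from the hierarchy.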
Because every listing is unique in its own special way, we built image analysis algorithms into Aerosolve to account for the detail and loving care the hosts have put into decorating their homes. We trained the Aerosolve models on two kinds of training data. On the left we have trained the model on scores given by professional photographers and on the right the model was trained on organic bookings. The professional photographers tend to prefer pictures of ornate, brightly lit living rooms, while the guests seem to prefer warm colors and cozy bedrooms.
We take many other things into account when computing the demand, including local events. For example, in the image below we can detect increased demand for places to stay in Austin during the SXSW festival, and could perhaps ask hosts to consider opening their homes during a high-demand period.
Some features, such as seasonal demand, are naturally spiky. Other features, such as number of reviews, generally should not exhibit the same kind of spikiness. We smooth these latter features using cubic polynomial splines while preserving end point spikiness using Dirac delta functions. For example, in the relationship between number of reviews and 3 stars (out of 5), there is a big discontinuity between no reviews and one review.
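The idea of a smooth curve plus an end point spike can be sketched in Python as a least-squares fit with one extra indicator column. This is a simplification under stated assumptions, not Aerosolve's spline code: a single cubic polynomial stands in for the spline, the "delta" is a plain indicator on the first x value, the function names are hypothetical, and the data is synthetic.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_cubic_with_delta(xs, ys):
    """Fit y ~ c0 + c1 t + c2 t^2 + c3 t^3 + delta * [x == xs[0]],
    with t = x / max(xs) to keep the normal equations well conditioned."""
    scale = float(max(xs))
    rows = []
    for x in xs:
        t = x / scale
        rows.append([1.0, t, t * t, t ** 3, 1.0 if x == xs[0] else 0.0])
    k = 5
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    *poly, delta = solve(ata, aty)
    return poly, delta, scale


def evaluate(x, poly, delta, scale, x0):
    """Smooth cubic everywhere, plus the spike at the end point x0 only."""
    t = x / scale
    smooth = sum(c * t ** i for i, c in enumerate(poly))
    return smooth + (delta if x == x0 else 0.0)


# Synthetic data: a smooth trend for 1..10 reviews and a sharp drop at 0 reviews.
reviews = list(range(11))
scores = [0.2] + [0.6 + 0.02 * x - 0.001 * x * x for x in range(1, 11)]

poly, delta, scale = fit_cubic_with_delta(reviews, scores)
at_zero = evaluate(0, poly, delta, scale, reviews[0])
at_five = evaluate(5, poly, delta, scale, reviews[0])
```

Without the indicator column the cubic would have to bend toward the outlier at zero reviews and distort the rest of the curve. With it, the jump between no reviews and one review is absorbed by `delta` and the smooth part stays smooth.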
Finally, after all the feature transformations and smoothing, all this data is assembled into a pricing model with hundreds of thousands of interacting parameters to provide a dashboard for hosts to inform themselves on the probability of getting a booking at a given price.
Please check out Aerosolve on GitHub. There you can find demos on how to apply Aerosolve to your own modeling, such as teaching the algorithm how to paint in the pointillism style. There is also an income prediction demo based on US census data.
Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData
Originally published at nerds.airbnb.com on June 4, 2015.