Discover San Diego! Constructing an Airbnb Recommendation System With Machine Learning

10 min read · Nov 3, 2022

Airbnb is an online marketplace that is focussed on connecting people who rent out their homes with people who are looking for accommodations around the world. If you’re like me and love to travel, Airbnb provides a cheaper alternative to hotels while also offering an eccentric experience that adds value to vacations.

I personally love traveling to the west coast and finding unique Airbnb listings. Sometimes I find myself wanting to go back, but I have trouble finding a similar experience. Airbnb currently has no system in place to recommend homes similar to ones I have previously stayed in.

The Goal: Develop a machine learning recommendation system that recommends listings similar to ones I have stayed in. Additionally, what can we learn from the text descriptions of Airbnb listings?

End Result: I developed and deployed a web application that provides recommendations for San Diego Airbnb listings and similar listings based on user inputs.

Discover San Diego! Developed by Eric Au via Streamlit.

A preview of my web application that gives recommendations based on previous Airbnb stays in San Diego.

For more on the technical design and construction of this application, see my GitHub repository (linked at the end of this article). Now, let’s take a deeper dive into the machine learning and analytical concepts.

Machine Learning Concepts

For this analysis, the following machine learning concepts were practiced:

Recommendation System:

A recommendation system helps users find compelling content in a large mass of data. For example, a machine learning recommendation model determines how similar listings are to one another, then provides recommendations based on the calculated similarities.

A recommendation engine can display items that users might not have been able to search for on their own. This is the basis of most recommendation systems and the inspiration for this project.

Cluster Analysis:

Clustering is an unsupervised learning method in which information about groups is gained from unlabeled data. It allows machines to pick up on similar or dissimilar patterns that would otherwise be difficult to discern.

Clustering allows us to find patterns we normally couldn’t discern ourselves. Photo by Amit Lahav on Unsplash

We can use clustering as a means to group Airbnb listings when it comes to finding recommendations of similar listings.

Natural Language Processing (NLP):

As a side exploration, I also wanted to practice some natural language processing analysis when it came down to assessing text in Airbnb listings.

Natural language processing, or NLP, makes it possible for computers to understand human language. NLP analyzes the grammatical structure of sentences and uses algorithms to find and extract the meaning of individual words.

Photo by Find Experts at Kilta.com on Unsplash

We are most accustomed to NLP through virtual assistants like Alexa or Siri, which translate spoken language into meaning that machines can understand. I performed NLP analysis on Airbnb listing descriptions to gain some insight into how Airbnb listings are typically described in the market.

Data Understanding

The dataset for this project consists of over 13,000 rows of data for San Diego Airbnb listings as of August 2019 and is publicly sourced from data.world via Inside Airbnb.

There are 75 features, which in general consist of the following:

unique listing IDs and URLs
text descriptions of the listing (name, summary, space, neighborhood overview, amenities, house rules, city, neighborhood, property type, room type, bed type, etc.)
text descriptions of the host (host name, about, response time, location, etc.)
numerical descriptions (host response rate/time, number of bathrooms, bedrooms, accommodation, price, number of stays, number of reviews, review scores, etc.)
binary values (instant bookable, license requirements, host identity verification status, etc.)

Data Cleaning & Preprocessing

The steps below describe the major data cleaning and preprocessing performed before analysis.

Handling Missing Values: There were many missing values in the dataset. For example, host_response_time had over 2,100 rows with missing entries. Since this is an ordinal categorical column, these missing values were imputed (or filled in) with an 'N/A' value to represent Airbnb hosts who have not responded to guests. Missing numerical values, such as those in security_deposits, were imputed with the value 0 (assuming that a security deposit was not needed for those listings).
Encoding Categorical Features and Values: Categorical features were split into ordinal and nominal features. Ordinal features (columns where the values have a structured order) consisted of host_response_time and cancellation_policy and were encoded using an OrdinalEncoder. Nominal features (columns where values have no order of precedence) consisted of all other categorical features (e.g., property_type) and were one-hot encoded.
Standard Scaling: All other numerical columns consisting of integer and float values were subsequently scaled using a StandardScaler.

The above preprocessing steps were then incorporated in a column transformer consisting of numerical, ordinal, and nominal features which takes in the original dataset and produces a preprocessed dataset ready for analysis.
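The imputation, encoding, and scaling steps above can be sketched roughly as follows. This is a minimal illustration on a toy table rather than the project's actual code: the four rows, the response-time category names, and their ordering are assumptions I made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy stand-in for the listings data (the real dataset has 75 columns).
listings = pd.DataFrame({
    "host_response_time": ["within an hour", None, "within a day", "within a few hours"],
    "security_deposits": [100.0, None, 0.0, 250.0],
    "price": [120.0, 80.0, 300.0, 95.0],
    "property_type": ["House", "Apartment", "House", "Condominium"],
})

# Impute missing values: 'N/A' for the ordinal column, 0 for the deposit.
listings["host_response_time"] = listings["host_response_time"].fillna("N/A")
listings["security_deposits"] = listings["security_deposits"].fillna(0)

# Assumed ordering, from no response to fastest response.
response_order = [
    "N/A", "a few days or more", "within a day",
    "within a few hours", "within an hour",
]

preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), ["security_deposits", "price"]),
        ("ordinal", OrdinalEncoder(categories=[response_order]), ["host_response_time"]),
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["property_type"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

preprocessed = preprocessor.fit_transform(listings)
print(preprocessed.shape)  # rows x (scaled + ordinal + one-hot columns)
```

One-hot encoding is what makes the column count grow: each distinct value of a nominal feature becomes its own column, which is how 75 raw features can expand to a few hundred.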

In total, the final preprocessed dataset consisted of 13,039 listings and 240 features.

Clustering Analysis

A Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction technique was leveraged to create a clustering visualization of all data points following preprocessing.

With more complex datasets, it becomes increasingly difficult to visualize data in a multi-dimensional space. UMAP produces a low-dimensional projection of the data that preserves as much of the original topological structure as possible.

Cluster labels were generated using MiniBatch KMeans, iterating over the preprocessed dataset with different numbers of clusters to determine the optimal number. A total of 5 unique cluster groups were generated, with a label assigned to each individual listing.
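As a rough sketch of that search, the loop below fits MiniBatchKMeans for several candidate cluster counts and keeps the best one. The selection criterion is not described above, so the silhouette score used here is my assumption, and the synthetic three-blob data simply stands in for the preprocessed listings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the preprocessed matrix: three well-separated blobs.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([center + rng.normal(size=(100, 2)) for center in centers])

# Fit one model per candidate k and score the resulting labels.
scores = {}
for k in range(2, 8):
    model = MiniBatchKMeans(n_clusters=k, random_state=42, n_init=3)
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # assumed criterion

# Keep the k with the highest score and assign a label to every row.
best_k = max(scores, key=scores.get)
final_model = MiniBatchKMeans(n_clusters=best_k, random_state=42, n_init=3)
final_labels = final_model.fit_predict(X)
print(best_k, np.bincount(final_labels))
```

An elbow plot of `inertia_` against k is a common alternative to the silhouette score for the same purpose.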

Below is a snapshot of the San Diego Airbnb listings embedding via UMAP. In the interactive version, users can hover over each data point to see the features associated with that listing and the cluster label assigned to it.

Clustering labels were assigned to each listing. A UMAP embedding visualization was then generated using the Bokeh visualization package.

Based on some additional EDA, each cluster group can be summarized as follows:

Cluster Label 0 (Red) — Favorable high end listings

Favorable and wide range of review rating. Most expensive listings and mostly consist of entire home room types.

Cluster Label 1 (Orange) — Favorable highly rated & moderately priced listings

Popular group, generally > 90 review ratings, relatively inexpensive. Mostly houses or private rooms, wide range of property types.

Cluster Label 2 (Yellow) — Favorable moderately priced diverse listings

Most popular group, mostly favorable ratings. Relatively low priced. Wide range of property types.

Cluster Label 3 (Green) — Favorable and least expensive listings

Popular group and wide range of review rating. Least expensive group. Wide range of property types.

Cluster Label 4 (Purple) — Unfavorable listings

Least popular group and lowest rated listings.

Building the Recommendation Engine With Cosine Similarity

Now let’s get to the fun part! Recall that the goal is to provide recommended Airbnb listings that are most similar to a previous listing a user has stayed in. In order to construct a recommendation, the mathematical concept of cosine similarity was leveraged.

In short, cosine similarity measures how similar two vectors are by taking the cosine of the angle between them. The smaller the angle (and therefore the smaller the cosine distance, which is 1 minus the cosine similarity), the more similar two items are to one another.

For two items, cosine similarity measures how closely their feature vectors point in the same direction. Cosine similarity values closer to 1 suggest closer resemblance.

In order to calculate cosine similarity, the preprocessed dataset and the user-selected listing need to be converted to 2-dimensional arrays of shape (number of rows) x (number of columns). The selected listing is a single row, while the preprocessed dataset has one row per listing.

Once converted to 2-D arrays, we can calculate cosine similarity using the following equation, where A is the feature vector of the user-selected listing and B is the feature vector of a listing in the preprocessed dataset (the calculation is repeated for every listing):

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

Cosine similarity values for all listings are sorted, and the top 5 most similar listings are returned as recommendations.

First, a listing is selected and passed into the recommendation pipeline. The pipeline then outputs the top 5 most similar listings, sorted by highest cosine similarity, as shown below:

Recommendations for Airbnb listings are ordered by most to least similar. Note that the top listing has a similarity of 1 because it is the original listing we are attempting to find recommendations for.
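To make the ranking step concrete, here is a minimal NumPy sketch of cosine similarity between one selected listing and every listing, followed by a top-n sort. The function names and the small feature matrix are hypothetical and only for illustration; scikit-learn's `cosine_similarity` in `sklearn.metrics.pairwise` computes the same quantity.

```python
import numpy as np

def cosine_similarity_to_all(selected, listings):
    """Cosine similarity between one listing (1 x n) and every listing (m x n).

    Assumes no row is all zeros, which would cause a division by zero.
    """
    selected = np.atleast_2d(selected)
    dots = listings @ selected.T  # dot product with each row, shape (m, 1)
    norms = np.linalg.norm(listings, axis=1, keepdims=True) * np.linalg.norm(selected)
    return (dots / norms).ravel()

def top_n_similar(selected_index, listings, n=5):
    """Return (row index, similarity) pairs, most similar first."""
    sims = cosine_similarity_to_all(listings[selected_index], listings)
    order = np.argsort(-sims, kind="stable")[:n]
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical preprocessed features: one row per listing.
features = np.array([
    [1.0, 0.0, 2.0],
    [1.0, 0.5, 2.0],
    [0.0, 3.0, 0.0],
    [2.0, 1.0, 1.0],
    [-1.0, 0.0, -2.0],
])

recommendations = top_n_similar(0, features, n=3)
for index, similarity in recommendations:
    print(index, round(similarity, 3))
```

Because the selected listing is compared against a dataset that still contains it, it comes back first with a similarity of 1. Dropping that first row leaves only the true recommendations.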

A Simplified Recommendation System

Focussing on a user-friendly application, a simplified recommendation system was also constructed using the same cosine similarity concept. However, instead of incorporating all 240 features in the preprocessed dataset, the simplified preprocessed dataset was reduced to 147 features.

This meant the following features were retained for the simplified dataset: neighborhood, property type, room type, accommodation, bathrooms, beds, nightly price, and review score rating.

Features selected for the simplified recommendation engine were chosen for user interpretability and convenience.

With the remaining features in the simplified dataset, user inputs are fed into the recommendation engine to produce similar recommendations.

Check out the recommendation engine in action for yourself with this link!

Natural Language Processing Analysis

In addition to the numerical datatypes present in Airbnb listings, text data can also be found in the descriptive columns and provide extra insight into a particular listing.

These descriptive text features were isolated from the overall dataset in order to perform an NLP analysis. Text data is generally messy, and as with any other dataset, several steps are required to clean it into a manageable and workable format.

A snippet of the text features present for each individual listing. Note the missing values and overall inconsistencies with the format of the text values.

Normalization & Tokenization

We first want to normalize the text data so that it is consistent throughout and noise is reduced.

Stop words: One technique to normalize data is to remove stop words, the very common words in a language such as “the”, “an”, “is”, and “in”. Because these words do not provide any valuable insight into the analysis, removing them reduces the computation required and creates a more manageable dataset.

Lemmatization: The remaining words are then lemmatized, or reduced to their root word. For instance, lemmatizing the word Caring would return Care. Thankfully, the WordNetLemmatizer class in the NLTK package helps us quickly apply lemmatization to the dataset.

Tokenization: As a final step, the remaining text is tokenized, or segmented into a list of individual words. Effectively, we will have created a list of the remaining significant words that is ready for analysis.

As a final data cleaning measure, missing values found for text features have been imputed with a blank space.
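The cleaning steps above can be sketched in plain Python. This is a simplified illustration, not the project's pipeline: the stop word set is a tiny hand-picked sample, and the `TOY_LEMMAS` lookup table is a hypothetical stand-in for a real lemmatizer such as NLTK's WordNetLemmatizer. The text is also split into words up front, since stop word removal needs individual words to work with.

```python
import re

# Tiny illustrative stop word list; real lists (e.g. NLTK's) are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "to", "and", "of", "from", "with"}

# Hypothetical lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {
    "caring": "care",
    "hosts": "host",
    "steps": "step",
    "beaches": "beach",
    "views": "view",
}

def clean_text(text):
    """Lowercase, split into words, drop stop words, and reduce words to a root form."""
    if text is None:
        text = " "  # missing descriptions are treated as a blank space
    words = re.findall(r"[a-z']+", text.lower())
    words = [word for word in words if word not in STOP_WORDS]
    return [TOY_LEMMAS.get(word, word) for word in words]

tokens = clean_text("The caring hosts are steps from the beaches, with ocean views!")
print(tokens)  # ['care', 'host', 'step', 'beach', 'ocean', 'view']
```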

Compared to the pre-cleaned dataset, we now have a more manageable text dataset to work with!

Once the text data has been cleaned, we can finally make some observations regarding each individual text feature. By creating word clouds, we can easily visualize patterns in text that would otherwise be difficult to discern.

A word-cloud generated for ‘Summaries’ for Airbnb listings

A word-cloud generated for ‘Descriptions’ for Airbnb listings
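Word clouds typically size each word by how often it appears, so the numbers behind them are simple token counts. A minimal sketch with a few made-up token lists (not taken from the dataset):

```python
from collections import Counter

# Hypothetical cleaned token lists, one per listing.
token_lists = [
    ["beach", "view", "ocean", "wifi"],
    ["beach", "parking", "wifi"],
    ["hidden", "gem", "beach"],
]

# Count every token across all listings.
counts = Counter(token for tokens in token_lists for token in tokens)
top_words = counts.most_common(2)
print(top_words)  # [('beach', 3), ('wifi', 2)]
```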

Findings:

Most listings tend to be described as scenic and picturesque by the beach (or some variation of paradise).
A lot of listings are ironically “hidden”.
Must have wifi, tv, parking, and large beds!
The hosts must have a lot of spare time to rent out Airbnbs as their side jobs are also involved in entertainment.

Sentiment Analysis

A sentiment analysis was also performed to gather further insights about how Airbnb listings are generally described. The process of understanding sentiment scores is described as follows:

TextBlob: A Python library that scores the sentiment of text based on the words it contains, taking into account nearby modifiers such as negations and intensifiers.
Sentiment Labels: Each word in a corpus is labeled in terms of polarity and subjectivity.
Polarity: How positive or negative a word is; -1 is most negative, +1 is most positive.
Subjectivity: How subjective, or opinionated a word is; 0 is fact, +1 is an opinion.

Takeaways

Airbnb listings tend to be positive when it comes to descriptions and summaries. This makes sense: hosts want to encourage people to stay at their Airbnb, and a positive description helps. However, these descriptions tend to be grounded in opinion.
Factually based columns such as access, notes, and transit are unsurprisingly factual.
Interestingly, amenities are scored as very opinionated. One would expect amenities to be more grounded in facts.

Next Steps & Recommendation System Limitations

Having constructed a recommendation system based on content-based filtering driven by user inputs, there are a few limitations to note as part of this project:

URL links are not entirely up to date. Since the data for this recommendation system is from 2019, there are instances where the URL links to the actual Airbnb listings no longer exist. However, conceptually, the recommendation system is still effective when it comes to analysis of other content-based features.
While the dataset contained textual data, the sentiment analysis was limited to the hosts' descriptions of their listings. Moving forward, I’d like to incorporate a sentiment analysis of reviews written by users who have previously stayed at a listing. With this analysis, there is potential to create a recommendation system of ideal Airbnb listings based on the sentiment of user reviews.

For more information or any questions, please feel free to reach out through the links below:

GitHub Repository: https://github.com/eric8395/airbnb_recommendations

LinkedIn: Eric Au