# Digging Into Airbnb Datasets

What’s the point of having quintillions of bytes of data if we’re not going to use it?

You don’t need to be a genius or an expert data scientist to be able to do some cool things with data.

All you need is a little bit of proficiency with a popular language like Python, the almighty Google Search engine, and a few open-source libraries to do some damage.

**That said, I’ll show you some ways we can uncover some interesting things from publicly available data.**

Below, I’ll give you a flavor of the kind of thinking and processes that I put into my data analysis & machine learning projects.

*If you’re interested in seeing all the code that made this happen, check out my Jupyter Notebook on **my GitHub**.*

The other day, I stumbled across a really neat website that periodically scrapes Airbnb data for cities around the world. I pulled together listings data from three major US cities:

- San Francisco, CA
- New York, NY
- Austin, TX

I looked at the data, then sat for a few minutes and came up with a few questions to get this analysis started…

- What are the strongest predictors for Airbnb listing price?
- Do hosts with many properties give better or worse service than hosts with only one?
- Do reviews matter when considering price?

The first question requires some machine learning to answer, but the other two can be answered with a little good ole’ *Exploratory Data Analysis*, so we’ll start with those two.

Let’s dive in!

## Do hosts with many properties give better or worse service than hosts with only one?

To get started, I found a feature that gives the number of listings associated with each listing’s host. After some initial data cleaning and exploring, I plotted this feature’s distribution and noticed some unexpected values.

This didn’t seem right to me, but I have no way of verifying if some of these numbers are incorrect.

In order to compare ratings between different hosts with different numbers of properties, I made a new feature that puts each listing’s host into 5 bins.
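Here is a minimal sketch of what that binning could look like with `pd.cut`. The column names (`host_listings_count`, `review_scores_rating`), the sample values, and the bin edges are all assumptions for illustration and may not match the ones in the notebook.

```python
import pandas as pd

# Made-up sample rows; the column names are assumptions for illustration.
listings = pd.DataFrame({
    "host_listings_count": [1, 1, 2, 3, 7, 15, 60, 1, 4, 120],
    "review_scores_rating": [98, 95, 93, 90, 88, 91, 85, 97, 92, 80],
})

# Hypothetical bin edges and labels: five bins by number of host listings.
bin_edges = [0, 1, 2, 5, 20, float("inf")]
bin_labels = ["1", "2", "3-5", "6-20", "21+"]

# pd.cut uses right-closed intervals by default: (0, 1], (1, 2], (2, 5], ...
listings["host_size_bin"] = pd.cut(
    listings["host_listings_count"], bins=bin_edges, labels=bin_labels
)

# Average rating per bin, ready to be compared across host sizes.
mean_rating_by_bin = (
    listings.groupby("host_size_bin", observed=True)["review_scores_rating"].mean()
)
print(mean_rating_by_bin)
```

Explicit edges keep the single-listing hosts in their own bin, which is the group the comparison hinges on.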

Then, I grouped them into a styled DataFrame. Green boxes are maximum row values and red boxes are minimum row values.

So right away it looks like hosts with only 1 unit have higher ratings, on average, than all of the others, but it’s hard to tell if the difference is *statistically significant*.

One way we can measure this is by performing an independent t-test comparing average ratings between hosts with only 1 unit and all other hosts.

SciPy’s `ttest_ind` is useful for this. Below is a plot of the t-distribution and the `ttest_ind` test result.
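For reference, a call to `ttest_ind` might look roughly like the sketch below. The ratings here are synthetic numbers drawn from normal distributions, not the Airbnb data, and passing `equal_var=False` (Welch's variant, which doesn't assume equal variances) is my assumption rather than something confirmed from the notebook.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups of ratings (not the real data).
rng = np.random.default_rng(0)
single_unit_ratings = rng.normal(loc=95, scale=5, size=500)
multi_unit_ratings = rng.normal(loc=93, scale=6, size=800)

# Two-sided independent t-test; equal_var=False gives Welch's t-test.
t_stat, p_value = stats.ttest_ind(
    single_unit_ratings, multi_unit_ratings, equal_var=False
)

print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
if p_value < 0.05:
    print("The difference in mean ratings is significant at the 5% level.")
```

A positive t-statistic means the first group's mean is higher than the second's.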

**Conclusion:** This test provides convincing evidence that the average ratings of hosts with 1 unit are higher than those of hosts with more than 1 unit.

*Full disclosure: this test doesn’t really hold up, because it’s highly unlikely our listings are independent of one another (an assumption of the independent t-test), but this is an example of how I would perform hypothesis testing to see if two groups are different.*

## Do reviews matter when considering price?

One way of tackling this is to segment our data into bins based on how recent each listing’s last review is, from most to least recent.

Using these bins, the “price”, and the “is_superhost” features (Superhosts are experienced and highly rated hosts), we can create a neat set of boxplots after also segmenting on room type.
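As a rough sketch of the binning step, assuming a `last_review` date column and a known scrape date (both hypothetical here, as are the bin edges, labels, and sample rows), the segments could be built like this. Instead of drawing the boxplots, it prints median prices per segment.

```python
import pandas as pd

# Made-up sample rows; real column names in the scraped data may differ.
listings = pd.DataFrame({
    "last_review": pd.to_datetime(
        ["2019-06-20", "2019-05-01", "2018-11-15", "2017-03-02", None, "2019-06-28"]
    ),
    "price": [120.0, 95.0, 150.0, 210.0, 180.0, 80.0],
    "is_superhost": [True, False, True, False, False, True],
})

# Hypothetical reference date standing in for the day the data was scraped.
scrape_date = pd.Timestamp("2019-07-01")
days_since_review = (scrape_date - listings["last_review"]).dt.days

# Illustrative recency bins; listings without any review get their own label.
recency = pd.cut(
    days_since_review,
    bins=[-1, 30, 180, 365, float("inf")],
    labels=["under 1 month", "1-6 months", "6-12 months", "over 1 year"],
)
listings["review_recency"] = (
    recency.cat.add_categories("no reviews").fillna("no reviews")
)

# Median price per recency bin and Superhost status (what a boxplot would show).
median_price = (
    listings.groupby(["review_recency", "is_superhost"], observed=True)["price"]
    .median()
)
print(median_price)
```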

**Conclusion:** Just on visuals alone, it doesn’t look like price is *dramatically* different between units with more recent and less recent reviews, but we *do* see that units with no recent reviews have slightly higher prices than others (look at the tops/3rd quartiles of the boxes), which is a bit peculiar. We also see that Superhosts have higher prices than non-Superhosts for about 9 dodged comparisons (compare the heights of the dark blue boxes with the light blue boxes). These are some interesting finds showing there *might* be a relationship between last review and price and between Superhost status and price, but we can’t derive causation from this.

Remember, correlation is not causation!

## What are the strongest predictors for Airbnb listing price?

We can answer this in a few different ways. Here are two potential approaches…

- Perform Recursive Feature Elimination with Cross-Validation (RFECV) to find out the most important features.
- Train an algorithm like a Random Forest Regressor that calculates a “feature_importances_” attribute and inspect it to see what it comes up with.

Let’s do both!

I’ll spare you the details on how I preprocessed the data. If you’re curious, check out the Jupyter Notebook for the details.

Here’s the TL;DR version of what I did next…

- Cleaned data & removed some missing values
- Imputed remaining missing values
- Encoded categorical variables with OneHotEncoder
- Performed feature scaling with StandardScaler
- Performed initial Feature Selection with ANOVA
- Further reduced dimensionality by performing Principal Component Analysis
- Performed RFECV with Linear Regression and trained a Random Forest Regressor (both trained on our principal components to predict price)
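The preprocessing steps in that list can be sketched as a single scikit-learn pipeline. This is a minimal illustration on synthetic data, not the notebook's code: the column names, the median imputation strategy, the use of `f_regression` as the F-test for feature selection, and the values of `k` and `n_components` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the listings data (column names are assumptions).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, n).astype(float),
    "bedrooms": rng.integers(0, 5, n).astype(float),
    "number_of_reviews": rng.integers(0, 300, n).astype(float),
    "room_type": rng.choice(["Entire home/apt", "Private room", "Shared room"], n),
    "state": rng.choice(["NY", "CA", "TX"], n),
})
X.loc[::10, "bedrooms"] = np.nan  # inject some missing values to impute
y = 40 * X["accommodates"] + rng.normal(0, 20, n)  # toy price target

numeric_cols = ["accommodates", "bedrooms", "number_of_reviews"]
categorical_cols = ["room_type", "state"]

# Impute numeric columns and one-hot encode categorical ones.
# sparse_threshold=0 forces a dense array so every later step accepts it.
impute_and_encode = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

pipeline = Pipeline([
    ("impute_encode", impute_and_encode),
    ("scale", StandardScaler()),                    # standardize all features
    ("select", SelectKBest(f_regression, k=6)),     # F-test feature selection
    ("pca", PCA(n_components=4, random_state=0)),   # reduce dimensionality
])

X_pca = pipeline.fit_transform(X, y)
print(X_pca.shape)
```

Wrapping the steps in a `Pipeline` means the same fitted transformations can later be applied to unseen data with `pipeline.transform`.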

So now it’s time to interpret our most important features from the RFECV and Random Forest algorithms.

- RFECV trains a model over and over, eliminating the weakest features each round and using cross-validation to find the number of features beyond which the model doesn’t improve. It’s a neat way of boiling down to your most important features. In this case, RFECV left us with 5 principal components, and we can access their coefficients through the “estimator_.coef_” attribute of the trained RFECV object.
- For our Random Forest Regressor, we can access a useful attribute called “feature_importances_”, which scores how useful/important each feature was to what the model “learned.”
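To make those two attributes concrete, here is a small sketch on synthetic "principal components" where I planted the signal in columns 2 and 6. The data, the `cv=5` and `scoring="r2"` settings, and the forest size are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the principal components; only columns 2 and 6
# actually drive the toy price target.
rng = np.random.default_rng(0)
X_pca = rng.normal(size=(300, 8))
y = 50 * X_pca[:, 2] + 30 * X_pca[:, 6] + rng.normal(0, 5, 300)

# RFECV: recursively drop the weakest feature, cross-validating each subset.
selector = RFECV(LinearRegression(), step=1, cv=5, scoring="r2")
selector.fit(X_pca, y)

kept_components = np.flatnonzero(selector.support_)  # indices that survived
kept_coefs = selector.estimator_.coef_               # one coefficient per kept index
for idx, coef in zip(kept_components, kept_coefs):
    print(f"component {idx}: coefficient {coef:.2f}")

# Random Forest: importances sum to 1; higher means more useful for splits.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_pca, y)
importance_ranking = np.argsort(forest.feature_importances_)[::-1]
print("components by importance:", importance_ranking)
```

Note that `estimator_.coef_` only covers the features RFECV kept, so it needs to be paired with `support_` to map coefficients back to component indices.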

I decided to pick three components to dig into.

From this analysis, I noticed that both the RFECV algorithm and the Random Forest algorithm “agreed” that principal component 3 was important. In addition, I picked principal component 7 because it had the highest coefficient from RFECV and principal component 2 because it had the highest calculated feature importance from the Random Forest.

In order to interpret what these components mean, we need to inspect their coefficients for each of our original features. But in order to better understand them, we will map them to their original feature names and plot them.
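One way to do that mapping is to put `pca.components_` into a DataFrame whose columns are the original feature names. The sketch below uses synthetic data with a planted "size" factor; the feature names and the helper variable names are assumptions, not the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features share a latent "size" factor, two are noise.
rng = np.random.default_rng(1)
size = rng.normal(size=200)
X = pd.DataFrame({
    "accommodates": size + rng.normal(0, 0.3, 200),
    "bedrooms": size + rng.normal(0, 0.3, 200),
    "beds": size + rng.normal(0, 0.3, 200),
    "number_of_reviews": rng.normal(size=200),
    "minimum_nights": rng.normal(size=200),
})

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))

# Each row of components_ holds one component's coefficient per input feature.
loadings = pd.DataFrame(
    pca.components_,
    columns=X.columns,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)

# Sort one component's coefficients and keep the extremes for plotting.
n_extremes = 2  # the article's plot uses 10; this toy data has only 5 features
pc1_sorted = loadings.loc["PC1"].sort_values()
pc1_extremes = pd.concat([pc1_sorted.head(n_extremes), pc1_sorted.tail(n_extremes)])
print(pc1_extremes)
```

From here, `pc1_extremes.plot.barh()` would give a horizontal bar chart of the coefficients. Keep in mind that the sign of a principal component is arbitrary, so it's the relative signs and magnitudes within a component that carry meaning.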

Rather than cluttering this post with plots for all three of these principal components, I’ll show you an example plot that I analyzed, then explain what we can learn from these components with respect to predicting price.

Here’s what the coefficients for principal component 7 look like in my plot.

We have the coefficients’ values on the X-axis and the top 10 and bottom 10 features (sorted by coefficient value) on the Y-axis.

Again, if you want to see all 3 plots, please refer to the Jupyter Notebook. Below, I’ll explain my interpretation of these components to figure out what kinds of latent features they are capturing.

**Principal Component 7:** This component has a very strong positive association with one room type (`Entire home/apt`) and a very negative association with another (`Private room`). It is also positively associated with `accommodates` (a discrete variable for the number of guests), `bedrooms`, `beds`, `family/kid friendly`, and `guests_included`. Therefore, it seems like this component measures the size of listings/units, pricing *larger* listings *higher* than smaller listings.

**Principal Component 3:** This component has a relatively large association with New York and room type and has negative associations with some amenities, so I interpret this as measuring New York apartment listings as predictors of the unit price. Consequently, this punishes non-NY listings because our feature value for `state_NY` is *negative* (due to standardization with `StandardScaler`) for non-NY listings. This doesn't provide very much value since it's focused on one particular state, and this probably stems from the fact that this dataset is heavily imbalanced in favor of New York listings.

**Principal Component 2:** This component has relatively *high* amenities-related coefficients, and since these are *luxury* amenities (`wine cellar`, `sun deck`, `steam room`, `rooftop`, etc., all things I would expect to find in a very *luxurious* listing), this component seems to measure luxurious amenities, valuing listings higher if they have them.

**Conclusion:** So, to sum up these findings, by performing RFECV with Linear Regression and Random Forest Regression on the dataset’s principal components, we analyzed 3 important components and found that they are tied to size, New York apartments, and luxurious amenities as strong predictors of price. However, it’s important to note that the New York apartments component is probably not a very generalizable predictor since it’s so specific to this dataset.

Whew! I know that this escalated quickly. We went from simple exploratory data analysis to dimensionality reduction and two machine learning algorithms!

Let’s sum up our findings from these datasets.

- We found that hosts with only one listing are rated *slightly* higher than hosts with more than one listing.
- We saw that units with no recent reviews are priced slightly higher than units with recent reviews, and as a bonus, we saw that Superhosts are typically more expensive.
- We found that the size of the property, whether or not the property is an apartment in NY, and luxurious amenities were all strong predictors of the price.

Publicly available datasets can be a goldmine of information and a great playground to practice your data analysis & machine learning skills!

I hope this has given you some ideas on how you can dig into data and find some cool and interesting things.

**What will you uncover next?**