Data Storytelling+Methodology Used-Airbnb, NYC Analysis using Python, Tableau, and Excel.

Rohit Batra

17 min readMar 17, 2022

📄 Purpose: #Methodological Document for Data Analysis on Airbnb, NYC

👤 Team: Rohit Batra & Saydain Sheikh

🗓️ Date: 27/02/2022

A. Methodology Approach

1. Research Problem

● For the past few months, Airbnb has seen a major decline in revenue due to the lockdown imposed during the pandemic.

● Now that the restrictions have started lifting and people have started to travel more. Hence, Airbnb wants to make sure that it is fully prepared for this change.

2. Business Understanding

Airbnb is an American company based in San Francisco, California. It operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. The platform is accessible via the website and mobile app.

After all, being an online marketplace for hosting personal homestays and private apartments in the majority, the company had two types of customers. One who hosts their place and the another who books the place for a particular time is the end consumer utilizing the hosted place. Airbnb earns commission from both ends and hence must make sure both of its customers are able to generate value from their business. They also must make the hosted place offered on their platform provide the best services at reasonable prices and lookout for the best technology to ease out the booking process for the end consumer without hassle.

3. Type of Data required to Analyze

A decline in the revenue could be for two major reasons, either the sites hosted on the platform are not able to provide a better user experience or there could be a competitor in the market capturing the market share. Keeping the above in mind, we first try to work on the first reason as that is something internal to the company and can have the data in hand to identify the reasons behind the plummeting of the revenue.

Hence, we use the information of the hosted places on the platform to see where and what can be done to improve the end-consumer experience. The data would majorly include the location and region of the hosted places, in our case we are targeting Borough (New York City) — the Bronx, Brooklyn, Manhattan, Queens, and Staten Island, followed by their host's details, prices of the hosted sites and reviews received by the end consumer.

4. How was the Data was acquired? (Assumption)

● The provided data is captured from the CRM tool used by Airbnb to manage their customers that are hosting sites on their platform.

● The reviews provided in the data frame are assumed to be positive as it is not mentioned whether they are negative or positive reviews.

5. Whom are we presenting?

● Data Analysis Managers: These people manage the data analysts directly for processes and their technical expertise is basic.

● Lead Data Analyst: The lead data analyst looks after the entire team of data and business analysts and is technically sound.

● Head of Acquisitions and Operations, NYC: This head looks after all the property and hosts acquisitions and operations. Acquisition of the best properties, price negotiation, and negotiating the services the properties offer falls under the purview of this role.

● Head of User Experience, NYC: The head of the user experience looks after the customer preferences and handles the properties listed on the website and the Airbnb app. Basically, the head of the user experience tries to optimize the order of property listing in certain neighborhoods and cities in order to get every property the optimal amount of traction.

6. Recommendations

● One-to-one interaction with some property owners in Staten Island, Queens, and the Bronx to identify their challenges for being fully functional for a maximum number of days in a year and allow a booking of more than 10 days of minimum night stay.

● Create some sort of interaction between the Top 5 hosts to share their experience with the rest of the community for better improvement and value-generating ideas.

● Provide discounted commission rates to property owners on keeping the minimum night stay booking window for more than 10 days and property functional for a maximum number of days in a year.

B. Method of Analysis along with code:

1. Data Understanding and Preparation

Before we start the basic understanding of the data in hand, we imported relevant libraries available in Python. Below are the libraries that we imported,

# Importing Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use(‘seaborn-dark-palette’)
from scipy import stats
import datetime as dt
import plotly
import plotly.express as px

# Ignore warnings

import warnings
warnings.filterwarnings(“ignore”)
pd.set_option(‘display.max_columns’, None)
pd.set_option(‘display.max_rows’, None)

We started with Understanding the Data in hand provided by running basic functions to load and interpret the variables, data types of the variables, dimensions, and size of the data frame. Below is the code used for the same.

# load the dataset

airbnb = pd.read_csv(“AB_NYC_2019.csv”)
airbnb.head()

# Dimensions

airbnb.shape

# Data-Types

airbnb.info

2. Variables in the data frame:

The above understandings lead us to perform basic Numeric and Categorical analysis in depth by using the following function along with some basic wear and tear,

# Numeric Analysis

airbnb.describe()

# Analyzing categorical values

airbnb.select_dtypes(include=[‘object’]).describe()

3. Handling Missing Values and Outliers:

● Then we moved to handle missing values and outliers in the data frame. Starting with the missing values, we identified two columns having an equal percentage of missing values which were last_review and reviews_per_month of around 20.56%. And also, the other two columns had quite minimal missing values which were host_name of 0.4% and name of the place of 0.3%.

● Then we analyzed the values missing in last_review and reviews_per_month carrying NaN values on purpose, meaning they are not missing at random as these hosted sites/places have not received any reviews from the customers. Hence, these places would be least preferred by the future customers and would also be facing bad business from our side.

● Then we identified that we have 16 places and 21 hostnames that are missing and then we cross-check them with the help of their IDs to verify whether they are missing at random or by chance.

● After analyzing, it seems like these values (hostname and their place name) are missing by chance hence we need to collect this information from the Host acquisition and operations team. But for now, we left these rows blank.

● Finally, we just imputed the missing values of reviews_per_month with a 0.

Below is the code we used to identify missing values. We also imported missingno library to do the same.

# Checking missing values columns

import missingno as msno
msno.bar(airbnb)

# Checking missing values percentages

def null_values(airbnb):
return round((airbnb.isnull().sum()/len(airbnb)*100).sort_values(ascending = False),2)
null_values(airbnb)

Post analyzing and treating the missing values accordingly we treated the spread in the data frame i.e. outliers. Below is the code we used to identify the spread of the outliers.

# Extracting Numeric columns:

int_cols = airbnb.select_dtypes(include=[‘int64’, ‘float64’]).columns

# Tagging them:

list(enumerate(int_cols))

# Plotting the spread of outliers:

plt.figure(figsize=([20,22]))
for n,col in enumerate(int_cols):
plt.subplot(5,2,n+1)
sns.boxplot(airbnb[col], orient = “h”)
plt.xlabel(“”)
plt.ylabel(“”)
plt.title(col)
plt.tight_layout()

The method we used to treat them was by capping them by 10% Below is the code for the same.

# Capping (statistical) outliers

# outlier treatment for price:

Q1 = airbnb.price.quantile(0.10)
Q3 = airbnb.price.quantile(0.90)
IQR = Q3 — Q1
airbnb = airbnb[(airbnb.price >= Q1–1.5*IQR) & (airbnb.price <= Q3 + 1.5*IQR)]

# outlier treatment for minimum_nights:

Q1 = airbnb.minimum_nights.quantile(0.10)
Q3 = airbnb.minimum_nights.quantile(0.90)
IQR = Q3 — Q1
airbnb = airbnb[(airbnb.minimum_nights >= Q1–1.5*IQR) & (airbnb.minimum_nights <= Q3 + 1.5*IQR)]

# outlier treatment for minimum_nights:

Q1 = airbnb.number_of_reviews.quantile(0.10)
Q3 = airbnb.number_of_reviews.quantile(0.90)
IQR = Q3 — Q1
airbnb = airbnb[(airbnb.number_of_reviews >= Q1–1.5*IQR) & (airbnb.number_of_reviews <= Q3 + 1.5*IQR)]

# outlier treatment for reviews_per_month:

Q1 = airbnb.reviews_per_month.quantile(0.10)
Q3 = airbnb.reviews_per_month.quantile(0.90)
IQR = Q3 — Q1
airbnb = airbnb[(airbnb.reviews_per_month >= Q1–1.5*IQR) & (airbnb.reviews_per_month <= Q3 + 1.5*IQR)]

# outlier treatment for calculated_host_listings_count:

Q1 = airbnb.calculated_host_listings_count.quantile(0.10)
Q3 = airbnb.calculated_host_listings_count.quantile(0.90)
IQR = Q3 — Q1
airbnb = airbnb[(airbnb.calculated_host_listings_count >= Q1–1.5*IQR) &
(airbnb.calculated_host_listings_count <= Q3 + 1.5*IQR)]

Looks like we were able to manage the outliers, enough to analyze the information in EDA.

4. Feature Selection / Engineering

The most important step of our Data Preprocessing was to convert some of the numeric features into categorical variables by creating bins of them. Yet post-conversion, we kept the numeric one’s handy tool for analyzing. Below is the table of all the variables that were engineered for our further analyses.

5. Analyzing Methods:

a. Univariate Analysis:

We started our general Univariate Analysis of Numeric and Categorical columns. For numeric columns, we used a Distribution plot from seaborn and for categorical columns, we used a Countplot from the same library seaborn. Below are the codes for the same.

# Extracting and Tagging the Numeric Columns:

int_cols = airbnb.select_dtypes(include=[‘int64’, ‘float64’]).columns
list(enumerate(int_cols))

# Plotting the Numeric Variables Distribution:

int_cols = airbnb.select_dtypes(include=[‘int64’, ‘float64’]).columns
plt.figure(figsize=[20,18])
for n,col in enumerate(int_cols):
plt.subplot(4,2,n+1)
sns.distplot(airbnb[col])

# Checking the count of Neighborhood Groups

plt.figure(figsize=[12,7])
sns.countplot(airbnb.neighbourhood_group)
plt.title(‘Neighborhood — Locations’, fontdict={‘fontsize’: 20, ‘fontweight’: 5, ‘color’: ‘Green’})
plt.show()

Similarly we used the above countplot code for the rest of the categorical plots created.

b. Bi-Multivariate Analysis:

Here we first plotted a pairplot of all the numeric columns using seaborn library in Python itself. Below is the code for the same.

# Plotting the pairplot

sns.pairplot(airbnb)
plt.show()

Parallel to we created all other Bivariate and Multivariate plots using Excel and Tableau.

6. Matrix used for Analysis

In order to measure our analysis, we created a 2x2 Matrix to provide us a direction while creating graphs using different Dimensions and Measures. This matrix involved the values needed to create the graphs with the combinations of,

- Categorical & Numerical

- Categorical & Categorical

- Numerical & Numerical

- Numerical & Categorical

This turns out to be a road map for us, which helps in identifying which all dimensions and measures have been consolidated to get the insights from the data. Below is the Matrix.

7. Evaluation of Methods

The above matrix was evaluated at every step by creating relevant questions to see what we are trying to extract from the raw data. More importantly, to extract the relevant information that we want to recommend to our target audience. Below is the list of some questions that we curated to drive the above matrix for creating graphs.

Evaluating Questions

Which locations are getting more traction?

2. Which locations are price and review-sensitive?

3. What are the pricing ranges preferred by end customers?

4. What type of properties are preferred by the customers?

5. Which properties are available for more days in a year and in which location?

6. In what time period the properties have received more or fewer numbers of reviews?

7. What are the most popular localities and properties in New York currently?

8. Which properties and room types have more or less minimum night stay?

9. How many sites are hosted by a single host and what are its success metrics?

10. Which hosts have received better reviews?

11. Which are the locations that are not performing well based on reviews and other parameters?

12. Which are the room types that are not performing well?

13. Which parameter makes the customer prefer the property and provide a review?

14. Is there any correlation between the prices and reviews or other parameters?

15. Which location has properties functioning for more than 300 days in a year or less than 50 days?

C. Findings & Insights

1. Basic Data Interpretation:

● There are 16 columns and 48895 rows in the data frame.

● There are 3 floats, 7 integers, and 6 objects data type values in the data frame.

● There seem to be many columns with missing values.

● Need to check the reason behind the missing values and some feature engineering needed too.

2. Variables in the data frame:

3. Numeric and Categorical Analysis:

● Latitude and Longitude obviously belong to New York City as we have the data from the same.

● We can see prices starting from 0 dollars going up to 10 grand in dollars for hosting a place. Question: why is there a 0-dollar price for a hosted place on Airbnb and for which place?

● There seems to be a huge variance in minimum_nights, number_of_reviews, reviews_per_month, and calculated_host_listings_count columns. Some things need to be checked in detail too under these columns.

● Manhattan seems to be the friendliest neighborhood group and Williamsburg is the most known location underneath that.

● Hillside Hotel is the top-hosted place which is the Entire home/apt room type.

● This place is hosted by Michael.

4. Numerical Univariate Analysis:

Findings:

● The Highest price range seems to be between 30 dollars to 150 dollars per day stay for most of the sites hosted.

● Still, we can see there are many sites that cost more than 200 dollars per day and can even go up to 500 dollars.

12. Categorical Univariate Analysis:

Findings:

● Most of the time last reviews were not provided when we see Day wise. Next, the majority of times it was provided on the 6th and 7th day of the month followed by the 1st and last day of the month.

● The 6th month of the year i.e June seems to receive most of the last reviews in all years followed by the 5th month.

● Initial years which were 2011 till 2014, last reviews were negligible. After that, it is slowly going up and most of the last reviews are received in the recent years of the data that is 2019 and 2018.

● Most of the sites hosted have less than 100 days of availability in comparison to all 365 days. Also, the majority of them have provided 0 days availability which has to be cross-checked by the Hosting Acquisition and Operations teams to know the reason.

● The majority of the hosts have less than 2 sites hosted by them on the platform.

● Most of the sites have received less than 2 reviews per month which indicates a bad customer experience offered by the majority of sites.

● Also, the Majority of the sites have received less than 50 reviews to date which is kind of less as per social norms.

● The majority of the sites provide less than 10 nights' stay at a time.

● The majority of the sites hosted are either Private rooms or Entire apartments but very less Shared rooms.

● Brooklyn and Manhattan are dominating when it comes to listed hosting followed by Queens.

13. Multivariate Analysis:

# Numeric Correlation Analysis

# Overview of the Major Boroughs (New York City) based on Price Range

# Types of Rooms available for N number of days in a year based on their price range.

# Number of Places hosted by a single host based on their Avg Price and Neighborhood

# an average number of reviews given to places based on the number of days available in a year.

# Price range preferred by the customers

# Name of the Host who has received the highest number of reviews

# Name of the Locations that have received the highest number of the Reviews

# Name of the Hosted Places that have received the highest number of reviews and lies in the High Price range

# Room Types preferred more by Customers

# Room Types available for more minimum night stay contradicting their price range

# Years which have impacted more in customer reviews

# Months in which the end customer are active more in providing reviews

# Top Neighborhoods providing higher number of Minimum Night stay

# Top Locations providing higher number of Minimum Night stay

# Properties contribution more on the Platform

# Locations contributing more on the Platform

# Locations contributing less on the Platform

# Does a higher number of Minimum night stay receive the highest number of reviews

Important Findings:

1. There seems to be no positive or any type of correlation between the numerical variables.

2. Manhattan is the only Neighborhood in the Borough that lies in offering the Highest Price range properties on the platform followed by others with a Medium Price range on average. Prices offered above 120$ on average are a High Price, between 80$ to 120$, the Medium Price range, and less than 80$ to be considered Low Price range property.

3. Having a high price range, Entire home/apt types of rooms are available for less than 100 days on average followed by Private rooms on an average of 105 days and Shared rooms around 155 days on average being the lowest in price.

4. Manhattan has the highest number of places listed around more than 10 by a single host with an average price of 230 $ followed by Brooklyn with an average price of 108$. On the other hand, all the hosts have less than 2 properties listed in either of the Borough on an average price range between 80 $ to 170 $.

5. Brooklyn has received the highest number of reviews based on the availability to stay open for more than 200 days in a year. This is followed by Staten Island and then the Bronx. On the other hand, there are some sites in Staten Island which are not open for a single day at all and hence could be the reason they have received very low reviews from the end consumer. We need to check which are these places and what issues are they facing?

6. Majority of the customers prefer a price range of 120$ to 130$ on average for a stay. As most of them have provided a good number of reviews between this price range.

7. Michael, David, Alex, John, and Daniel are the Top 5 hosts that seem to have received the highest number of reviews for their listed sites and have also sites listed with a High price range.

8. Staten Island — Silver Lake, Staten Island — Richmondtown, Staten Island — Eltingville, Staten Island — Huguenot, and Brooklyn — Manhattan Beach are the Top 5 locations with Low Price ranges that have received the highest number of reviews on average being the lowest in the Price range. On the contrary, Queens — Neposit, Manhattan — NoHo, Manhattan — Tribeca, Staten Island — Willowbrook, and Manhattan — Flatiron District is the highest in the Price range and have received a low number of reviews.

9. ““WELCOME TO BROOKLYN”” PARK SIDE VIEW STUDIO APT”” , ““Oasis on the Park””, ““HELLO BROOKLYN”” PARK SIDE VIEW NEWLY RENO APT””, ““Comfy Home Away From Home/Multiple rooms”, “”LOVE BROOKLYN”” Newly Renovated Studio APT”” and ““Cozy Retreat”” in North Crown Heights”” are the Top 6 listed places that have received the highest number of reviews.

10. On average Entire home/apt types are preferred more by the customers followed by Private rooms and then Shared Rooms. Mostly because they are also available for a higher number of minimum night’s stay window booking as compared to Private and Shared rooms.

11. “Modern Duplex — Central Chelsea!!!” in Manhattan-Chelsea, “Spacious & Bright 3BRs Near Subways, Parks, Shops” in Brooklyn-Cobble Hill, “NYC LUXURY3 BEDROOMS IN MIDTOWN EAST & GYM& BALCONY” in Manhattan-Murray Hill, “An Artist’s Inspiration: Sun-Soaked Chelsea Loft” in Manhattan-Chelsea and “Upper West Side elegance. Riverside” in Manhattan-Upper West Side are the Top 5 hosted places with highest price offerings.

12. “Brooklyn-Williamsburg”, “Brooklyn-Bedford-Stuyvesant”, “Manhattan-Harlem”, “Brooklyn-Bushwick” and “Manhattan-Upper West Side” are some places providing the highest number of minimum nights window for bookmaking Manhattan and Brooklyn are the top neighborhoods in offering maximum-minimum nights stay.

13. The average number of reviews started increasing exponentially after 2015–2016. And the majority of the customers provide a higher number of reviews either between the months of May till July or at the starting of the year which shows the higher booking window in a year.

14. There are 5766 properties that are open for more than 300 days a year. Around 2286 of them are from Brooklyn followed by Manhattan of around 1947 properties. And on the other hand, the properties that stay open for less than 50 days a year belong to Queens or Staten Island.

15. We can confirm that the greatest parameter for any customer to prefer a property and provide a review is having a maximum or minimum night stay window booking and their probability of being open for more days in a year to some extent.

4. Tools Used

- Python is used for Data Understanding, Pre-processing, and general Univariate and Multivariate Analysis.

- Tableau & Excel were used for in-depth Bi-Multivariate Analysis.

5. Background of our Team

- Professional and Academic Experience:

○ Rohit Batra:

● Prolific experience in entrepreneurial marketing and sales.

● Expertise in lead generation, B2B institutional sales and marketing, and creating long-term revenue streams across Information Technology.

● My interest in Data Science stems from working on projects involving Ecommerce Analytics. Actively looking for a change in the Data Science Field.

● I blog about Data Science topics on medium: https://medium.com/@rohit_batra

● LinkedIn link: https://www.linkedin.com/in/rohit-batra/

○ Saydain Sheikh:

● He is a graduate of Rutgers University, majoring in Mathematics.

● A complete fresher with initial experience in Data Administration at Fatima Hospital, Gorakhpur.

● LinkedIn link: https://www.linkedin.com/in/saydain-sheikh/

- Contribution to the project:

○ Rohit: Worked on analysis to present the Head of Acquisition/Operations and User experience.

○ Saydain: Worked on analysis to present to the Lead Data Analyst and Data Analysis manager.

“Both of us have worked on the Methodological Document”

- Location:

○ Rohit is from Mumbai

○ Saydain is from Uttar Pradesh

- Email

○ rohitbatra027@gmail.com

○ saydains786@gmail.com

End Notes:

The real juice of any data project is to evaluate whether the insights shared are clear and actionable. We all love to hear stories, and how appealing could it be if someone could help you understand meticulous data insights with the help of stories.

Data storytelling is an important soft skill to develop as a data professional but there’s not much information to guide you on where to start. Data scientists are known for dotting the i’s and crossing the t’s but most of us face the challenges of showcasing our work with the help of a story.

If you would like to put a glance at a complete Data Storytelling project, a complete case study can be found in my GitHub repo.

Thank You!