In-Depth Exploratory Data Analysis on Airbnb NYC Listings 2019 Dataset

Kshitija Chilbule
Published in Python’s Gurus
13 min read · Jun 2, 2024

Introduction

Airbnb is an online marketplace that connects people who want to rent out their homes with people looking for accommodation in the same area. New York City (NYC) is the most populous city in the United States and one of the most popular tourist and business destinations in the world.

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service used all over the world. With millions of listings on the platform, data analysis has become crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers’ and providers’ behavior on the platform, implementing innovative additional services, guiding marketing initiatives, and much more.

In this Data Analysis project, I immersed myself in the Airbnb NYC Listings 2019 dataset, conducting a thorough analysis to extract valuable insights.

Note: The dataset is taken from Kaggle. The link for the dataset is given below

Data Summary

id: Unique ID for each property listing.

name: Name of each property listing.

host_id: Unique ID for a host who has listed the property on Airbnb.

host_name: Name of the host.

neighbourhood_group: Name of each NYC borough: Manhattan, Brooklyn, Queens, Bronx, and Staten Island.

neighbourhood: Area within each NYC borough.

latitude, longitude: Coordinates of each listed property.

room_type: Type of room listed: Private room, Entire home/apt, or Shared room.

price: Price of the listing.

minimum_nights: Minimum number of nights that must be booked for the listing.

number_of_reviews: Number of reviews for each listed property.

last_review: Date of the most recent review of the listing.

reviews_per_month: Number of reviews per month.

calculated_host_listings_count: Number of listings each host owns.

availability_365: Number of days in the year the listing is available for booking.

Let’s dive in…

Import all the necessary libraries

The first step is to import the essential Python libraries: NumPy and Pandas for array and dataframe operations, and Matplotlib and Seaborn for data visualization.

# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Importing and Reading the dataset

During this step, we’ll load the dataset and examine its attributes. Since our dataset is in CSV format, we’ll use the Pandas read_csv() function to load it.

# Importing the dataset
df = pd.read_csv("Airbnb.csv")
# Reading the dataset
df.head()
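
As an optional variation on the loading step, here is a minimal sketch that reads a CSV from an in-memory string and parses the review date as a datetime. The two sample rows are hypothetical stand-ins for Airbnb.csv, and the parse_dates argument is my assumption, not something used in the rest of this article (which keeps last_review as plain text).

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for Airbnb.csv
csv_text = """id,name,price,last_review
1,Cozy room,80,2019-05-21
2,Loft,150,
"""

# parse_dates converts the column to datetimes; the empty field becomes NaT
listings = pd.read_csv(io.StringIO(csv_text), parse_dates=["last_review"])

print(listings.shape)   # (rows, columns)
print(listings.dtypes)  # column types after parsing
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path.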

Shape of the dataset

df.shape

Unique columns in the dataset

df.columns

Data types of attributes in the dataset

df.dtypes

Information of the dataset

df.info()

Statistical Description of the dataset

df.describe()

Observation: Listing prices range from $0 to $10,000.

Checking for Duplicated records in the dataset

In this step, I will check for any duplicate records within the dataset. If duplicates are identified, I will proceed to remove them.

df.duplicated().sum()

Observation: The dataset does not contain duplicated records.
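
Our dataset had no duplicates, so nothing had to be removed. For reference, here is a small sketch of what the check and the removal might look like when duplicates do exist. The toy frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the second one exactly
toy = pd.DataFrame({
    "host_id": [1, 2, 2, 3],
    "price": [100, 80, 80, 120],
})

# duplicated() flags rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(deduped))
```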

Checking for Null/Empty Values in the dataset

In this step, we first check whether the dataset contains any null values. If null values are found, we will replace them with suitable placeholder values.

df.isnull().sum()

Observation:

As we can see, the dataset does contain null/empty values:

  • The “name” feature contains 16 null values
  • The “host_name” contains 21 null values
  • The “last_review” contains 10052 null values
  • The “reviews_per_month” also contains 10052 null values

Replacing the null values with appropriate values

# Fill missing values column by column (assignment avoids chained inplace calls)
df['name'] = df['name'].fillna('Other Hotel')
df['host_name'] = df['host_name'].fillna('other')
df['last_review'] = df['last_review'].fillna('Not Reviewed')
# Use a numeric 0 (not the string '0') so the column stays numeric
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
# Checking if all null values removed or not
df.isnull().sum()

Observation: All the null/empty values have been replaced with appropriate values.
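
The same replacement can also be written as a single fillna call with one placeholder per column. This is only a sketch on hypothetical toy rows; the point it illustrates is that filling reviews_per_month with a numeric 0 keeps the column numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with the same kinds of gaps as the real dataset
toy = pd.DataFrame({
    "host_name": ["Ann", np.nan, "Raj"],
    "last_review": ["2019-05-01", np.nan, np.nan],
    "reviews_per_month": [0.5, np.nan, np.nan],
})

# One placeholder per column; the numeric column gets a number, not a string
fill_values = {
    "host_name": "other",
    "last_review": "Not Reviewed",
    "reviews_per_month": 0,
}
filled = toy.fillna(value=fill_values)

print(filled.isnull().sum())
print(filled.dtypes)
```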

Removing unnecessary attributes

With our goal in mind, this step drops the columns that are not needed for the analysis.

df.drop(['id', 'name', 'last_review'], axis = 1, inplace = True)
df.columns

Observation: The dataset is now free of the unnecessary columns.

Addressing Some Analytical Questions

Question 1: Which 10 host IDs have the highest number of listings?

# Each row is one listing, so value_counts() gives listings per host
top_10_host_IDs = df['host_id'].value_counts().iloc[:10]
top_10_host_IDs
# Visualizing the top 10 host IDs with the highest number of listings
plt.figure(figsize=(12, 6))
ax = top_10_host_IDs.plot(kind='bar', color='grey')
for bars in ax.containers:
    ax.bar_label(bars)
plt.title('Top 10 Host IDs with the Highest Number of Listings')
plt.xlabel('Host IDs')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y')
plt.tight_layout()
plt.show()
# Percentage of all listings held by each of the top 10 host IDs
hostidPer = (top_10_host_IDs / len(df)) * 100
hostidPer

Observation:

  1. The top host ID has 327 listings, which is about 0.67% of all listings (each row in the dataset is a listing, not a booking).
  2. The 10th host ID in the top 10 has only 52 listings.
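
To double-check percentages like these, value_counts(normalize=True) returns shares directly. The sketch below uses hypothetical toy host IDs, then repeats the arithmetic for the top host. The row count of 48,895 is my assumption (it is consistent with 21,661 listings being 44.3% of the total); read the real value from df.shape.

```python
import pandas as pd

# Hypothetical toy data: host 7 owns 3 of the 10 listings
toy = pd.DataFrame({"host_id": [7, 7, 7, 8, 8, 9, 10, 11, 12, 13]})

# normalize=True returns fractions of all rows; multiply by 100 for percent
share_pct = toy["host_id"].value_counts(normalize=True).mul(100).round(1)
top_share = share_pct.iloc[0]

# Same arithmetic for the top host: 327 listings out of an assumed 48,895 rows
ASSUMED_TOTAL_ROWS = 48895  # assumption, check df.shape
article_share = round(327 / ASSUMED_TOTAL_ROWS * 100, 2)

print(share_pct.head(3))
print(article_share)
```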

Question 2: Which 10 host names have the highest number of listings?

# Getting value counts
top_10_host_names = df['host_name'].value_counts().iloc[:10]
top_10_host_names
# Visualizing the top 10 host names with the highest number of listings
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=top_10_host_names.index, y=top_10_host_names.values, palette="viridis")
for bars in ax.containers:
    ax.bar_label(bars)
plt.title("Top 10 Host Names with the highest number of listings", fontsize=16)
plt.xlabel("Host Name", fontsize=12)
plt.ylabel("Number of Listings", fontsize=12)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# Percentage of all listings for the top 10 host names
hostnamePer = (top_10_host_names / len(df)) * 100
hostnamePer

Observation:

  1. Listings whose host is named Michael total 417, about 0.85% of all listings. Because the top single host ID has only 327 listings, this count necessarily combines several different hosts who share the name.
  2. The name David is in second place with 403 listings.

Question 3: What types of rooms are offered under the host name with the most listings, and what is the price range for these rooms?

# All listings whose host name is Michael
michael = df[df['host_name'] == "Michael"]
# Room types offered
michael['room_type'].unique()
# Total listings
michael['room_type'].count()
# Count of listings for each room type
michael['room_type'].value_counts()
# Price description
michael['price'].describe()

Observation:

Listings under the host name Michael, the name with the most listings, cover all room types: 251 Entire home/apts, 152 Private rooms, and 14 Shared rooms. Prices for these listings range from 25 to 1,700 dollars.
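
Since host_name is a first name rather than a unique key, it is worth checking how many distinct host IDs sit behind a name before treating it as one person. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: two different hosts are both called Michael
toy = pd.DataFrame({
    "host_id":   [101, 101, 202, 303, 303],
    "host_name": ["Michael", "Michael", "Michael", "Dana", "Dana"],
})

# Number of distinct host IDs behind each host name
hosts_per_name = toy.groupby("host_name")["host_id"].nunique()

print(hosts_per_name)
```

On the real dataframe the same groupby could be run on df to see how many hosts share the name Michael.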

Question 4: Which neighbourhood group has the highest number of listings?

# Getting value counts
neightop = df['neighbourhood_group'].value_counts()
neightop
# Visualizing neighbourhood groups by number of listings
plt.figure(figsize=(12, 6))
ax = neightop.plot(kind='bar', color='skyblue')
for bars in ax.containers:
    ax.bar_label(bars)
plt.title('Neighbourhood Groups with the Highest Number of Listings')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y')
plt.tight_layout()
plt.show()
# Percentage of listings for each neighbourhood group
neighbourhood_grpPer = (neightop / len(df)) * 100
neighbourhood_grpPer
# Visualizing using pie chart
neightop.plot(kind = 'pie', figsize = (8,8), fontsize = 15, autopct = '%1.1f%%')
plt.title("Neighbourhood Group", fontsize = 15)
plt.show()

Observation:

  1. Among all the neighbourhood groups, Manhattan has the highest number of listings: 21,661, which is 44.3% of all listings.
  2. Brooklyn ranks second with 20,104 listings, or 41% of the total.
  3. Staten Island has the fewest listings, only 0.76% of the total.

Question 5: Which Neighbourhood Group has the maximum price range for rooms?

plt.figure(figsize = (15,6))
sns.boxplot(x = df['price'])
plt.show()
df['price'].describe()
# Probability Density Function
# (histplot with kde=True replaces the deprecated sns.distplot)
plt.figure(figsize = (15,6))
sns.histplot(df['price'], kde = True, stat = 'density', color = 'grey')
plt.title("Probability Distribution", fontsize = 15)
plt.xlabel('Price', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.show()
# Calculating Interquartile Range
# First quartile (Q1)
Q1 = df['price'].quantile(0.25, interpolation = 'midpoint')

# Third quartile (Q3)
Q3 = df['price'].quantile(0.75, interpolation = 'midpoint')

# Interquartile range (IQR)
IQR = Q3 - Q1

print('The IQR is', IQR)
# The lower fence is measured from Q1, the upper fence from Q3
print('The minimum value is', (Q1 - (1.5 * IQR)))
print('The maximum value is', (Q3 + (1.5 * IQR)))

Observation: Most listings are priced below the upper fence of 334 dollars (Q3 + 1.5 × IQR), with a mean price of 153 and a median of 106.
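
The outlier rule used here is the usual 1.5 × IQR rule: the lower fence is Q1 - 1.5 × IQR and the upper fence is Q3 + 1.5 × IQR. Below is a small sketch of that rule. The helper name tukey_fences and the toy prices are hypothetical, and it uses NumPy's default linear interpolation rather than the midpoint method.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return (lower, upper) outlier fences based on the interquartile range."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical nightly prices with one extreme value
prices = np.array([40, 60, 70, 100, 110, 150, 180, 200, 5000])

lower, upper = tukey_fences(prices)

# Keep only the prices that fall inside the fences
inliers = prices[(prices >= lower) & (prices <= upper)]

print(lower, upper)
print(inliers)
```

A negative lower fence simply means that no low price counts as an outlier.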

df_new = df[df['price'] < 334 ]
df_new.head()
df.groupby(['neighbourhood_group'])['price'].describe().T.reset_index()

Observation:

  1. Prices in the Bronx neighbourhood group range from 0 to 2,500.
  2. Prices in the Brooklyn neighbourhood group range from 0 to 10,000.
  3. Prices in the Manhattan neighbourhood group range from 0 to 10,000.
  4. Prices in the Queens neighbourhood group range from 10 to 10,000.
  5. Prices in the Staten Island neighbourhood group range from 13 to 5,000.

plt.figure(figsize = (15,6))
sns.violinplot(data = df_new, x = 'neighbourhood_group', y = 'price')
plt.title('Density and distribution of prices for each neighbourhood_group', fontsize = 15)
plt.grid()
plt.show()
# One price histogram per neighbourhood group
# (histplot with kde=True replaces the deprecated sns.distplot)
plt.figure(figsize = (16,15))
groups = ['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx']
for i, group in enumerate(groups, start = 1):
    plt.subplot(3, 2, i)
    subset = df_new[df_new['neighbourhood_group'] == group]
    sns.histplot(x = subset['price'], kde = True, stat = 'density')
    plt.title(group, fontsize = 15)
plt.show()

Observation:

  1. Manhattan has the highest price range, with a median price of 150 dollars per night, followed by Brooklyn with a median of 90 dollars per night.
  2. Queens and Staten Island appear to have very similar distributions, and the Bronx is the cheapest of them all.
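
The medians can also be read straight from a groupby instead of from the plots. Here is a minimal sketch, assuming the same cap of 334 dollars used for df_new. The toy prices are hypothetical and are not the real medians.

```python
import pandas as pd

# Hypothetical toy prices; 9000 plays the role of an outlier
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Bronx", "Bronx", "Bronx"],
    "price": [120, 150, 9000, 50, 60, 70],
})

PRICE_CAP = 334  # upper fence from the IQR step

# Drop outliers first, then take the median price per group
trimmed = toy[toy["price"] < PRICE_CAP]
medians = (
    trimmed.groupby("neighbourhood_group")["price"]
    .median()
    .sort_values(ascending=False)
)

print(medians)
```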

Question 6: Which 10 neighbourhoods have the highest number of listings?

top_10_neighbourhoods = df['neighbourhood'].value_counts().iloc[:10]
top_10_neighbourhoods
# Visualizing the top 10 neighbourhoods with the highest number of listings
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=top_10_neighbourhoods.index, y=top_10_neighbourhoods.values, palette="autumn")
for bars in ax.containers:
    ax.bar_label(bars)
plt.title("Top 10 Neighbourhoods with the highest number of listings", fontsize=16)
plt.xlabel("Neighbourhood", fontsize=12)
plt.ylabel("Number of Listings", fontsize=12)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# Percentage of all listings for the top 10 neighbourhoods (multiply by 100, not 1000)
NeighbourhoodsPer = (top_10_neighbourhoods / len(df)) * 100
NeighbourhoodsPer

Observation:

  1. The Williamsburg neighbourhood has the highest number of listings, 3,920, which is about 8.0% of all listings.
  2. Bedford-Stuyvesant follows with 3,714 listings, about 7.6% of the total.

Question 7: Which room type has the highest number of listings?

# Getting the value counts
df['room_type'].value_counts()
# Visualizing using Count Plot
ax = sns.countplot(x = 'room_type', data = df, palette="Set2")
for bars in ax.containers:
    ax.bar_label(bars)
plt.show()
# Percentage of listings for each room type
room_typeShare = (df['room_type'].value_counts() / len(df)) * 100
room_typeShare
plt.figure(figsize=(12, 6))
plt.pie(room_typeShare, labels=room_typeShare.index, autopct='%.0f%%', startangle=140)
plt.title("Listings by Room Type", fontsize=14)
plt.axis('equal')
plt.show()

Observation:

Entire home/apt has the highest number of listings, 25,409, which is 52% of the total. Private room follows closely behind with 22,326 listings, or 46% of the total. Shared rooms have the fewest listings.

Question 8: What is the average price for each room type?

# Average price for each room type
avg_price = df.groupby('room_type')['price'].mean()
# Visualizing average price for each room type
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=avg_price.index, y=avg_price.values, palette="summer")
for bars in ax.containers:
    ax.bar_label(bars)
plt.title("Average price for each room type", fontsize=16)
plt.xlabel("Room Type", fontsize=12)
plt.ylabel("Price", fontsize=12)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Observation:

  1. The average price for the Entire home/apt room type is 211.79 dollars.
  2. The average price for the Private room type is 89.78 dollars.
  3. The average price for the Shared room type is 70.12 dollars.

Question 9: What are the average minimum nights for different room types?

# Average minimum-night requirement for each room type
avg_min_nights = df.groupby('room_type')['minimum_nights'].mean()
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=avg_min_nights.index, y=avg_min_nights.values, palette="winter")
for bars in ax.containers:
    ax.bar_label(bars)
plt.title("Average Minimum Nights by Room Type", fontsize=16)
plt.xlabel("Room Type", fontsize=12)
plt.ylabel("Minimum Nights", fontsize=12)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Observation:

  1. Entire home/apt listings require a minimum stay of about 9 nights on average.
  2. Private room listings require a minimum stay of about 5 nights on average.
  3. Shared room listings require a minimum stay of about 6 nights on average.
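
These figures are averages of the minimum_nights column, so a few long-stay listings may pull them up. As a sketch, named aggregation can put the mean and the median side by side, together with the average price from Question 8. The toy rows are hypothetical and the output column names are mine.

```python
import pandas as pd

# Hypothetical toy rows; one entire home requires a 30-night minimum
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "price": [200, 250, 80, 100, 60],
    "minimum_nights": [3, 30, 1, 5, 2],
})

# Named aggregation: output column = (input column, function)
summary = toy.groupby("room_type").agg(
    avg_price=("price", "mean"),
    avg_min_nights=("minimum_nights", "mean"),
    median_min_nights=("minimum_nights", "median"),
)

print(summary)
```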

Question 10: What are the 10 most common availability_365 values, and how many listings have each?

ax = df['availability_365'].value_counts().iloc[:10].sort_index().plot(kind = 'bar', figsize = (12,6), color = 'grey', fontsize = 10)
for bars in ax.containers:
    ax.bar_label(bars)
plt.xticks(rotation = 0)
plt.xlabel('Availability 365', fontsize = 15)
plt.ylabel("Listings")
plt.show()

Question 11: What is the average number of reviews for each room type?

df.groupby(['room_type'])['number_of_reviews'].mean()

Question 12: Which Neighbourhood group got the highest number of reviews?

# Note: .count() counts the listings in each group; .sum() would total the reviews
df.groupby(['neighbourhood_group'])['number_of_reviews'].count().sort_values(ascending=False)

Observation:

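The Manhattan neighbourhood group ranks first, followed by Brooklyn with 20,104. Note that .count() returns the number of listings in each group rather than the number of reviews, which is why Brooklyn’s figure matches its listing count from Question 4; totalling the reviews would require .sum() instead.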
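
The difference between counting rows and totalling reviews is easy to see on a small example. This is a sketch on hypothetical toy data, not the real review totals.

```python
import pandas as pd

# Hypothetical toy data: Manhattan has 2 listings, Brooklyn has 3
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "number_of_reviews": [10, 0, 50, 5, 1],
})

grouped = toy.groupby("neighbourhood_group")["number_of_reviews"]

listing_rows = grouped.count()   # number of listings per group
total_reviews = grouped.sum()    # number of reviews per group

print(listing_rows)
print(total_reviews)
```

Running the .sum() version on the real dataframe would show which neighbourhood group actually received the most reviews.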

Key Findings:

  • The top host ID has 327 listings, about 0.67% of all listings.
  • Listings under the host name Michael total 417, about 0.85% of all listings, followed by David with 403.
  • Listings under the host name Michael, the name with the most listings, cover all room types: 251 Entire home/apts, 152 Private rooms, and 14 Shared rooms. Prices for these listings range from 25 to 1,700 dollars.
  • Listings under the host name David, the name with the second-highest number of listings, also cover all room types: 214 Entire home/apts, 184 Private rooms, and 5 Shared rooms. Prices for these listings range from 25 to 2,000 dollars.
  • Among all the neighbourhood groups, Manhattan has the highest number of listings, 21,661, which is 44.3% of the total. Brooklyn ranks second with 20,104 listings, or 41% of the total. Staten Island has the fewest listings, only 0.76% of the total.
  • Manhattan has the highest price range, with a median price of 150 dollars per night, followed by Brooklyn with a median of 90 dollars per night.
  • The Bronx is the cheapest of all the neighbourhood groups.
  • The Williamsburg neighbourhood has the highest number of listings, 3,920, which is about 8.0% of all listings.
  • Entire home/apt has the highest number of listings, 25,409, which is 52% of the total. Private room follows closely behind with 22,326 listings, or 46% of the total. Shared rooms have the fewest listings.
  • Surprisingly, Private rooms got more reviews on average than Entire home/apts.
  • Grouping by neighbourhood group puts Manhattan first, followed by Brooklyn with 20,104. Because the code uses .count(), these figures are listing counts rather than review totals.

Below attached is my Github Link:

Thanks for Reading !!

