Navigating Your Data: A Step-by-Step Guide to Exploratory Data Analysis
What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) is a crucial first step in any data science project. It allows one to understand the data's characteristics, uncover patterns, and identify potential issues before diving into more advanced analysis. In this post, I will walk through the key steps of EDA using a real-world case study: predicting hotel cancellations. This example will help illustrate how EDA can be applied in a practical context.
Step 1: Understand Your Data
Before diving into analysis, it’s essential to get a clear picture of what your data represents. For our hotel cancellation prediction project, we will use a dataset that includes various features related to hotel bookings, such as:
- Booking information (e.g., arrival date, length of stay, average price per room)
- Cancellation status (e.g., cancelled or not cancelled)
- Identify the data source: Understand where your data comes from and how it was collected. In our case, we are working with a publicly available hotel booking dataset.
- Review the data dictionary: If available, study the data dictionary to understand what each variable represents.
- Check the data structure: Examine the number of rows and columns, column names, and data types.
# For reading and manipulating data
import pandas as pd
# Load the hotel cancellation dataset
df = pd.read_csv('hotel_cancellations.csv')
# Display basic information
print(df.info())
print(df.head())
Step 2: Clean and Preprocess the Data
Data cleaning is crucial for accurate analysis. For our hotel dataset, we need to ensure that the data is ready for exploration:
- Handle missing values: Identify and address missing data.
- Deal with duplicates: Remove or flag duplicate entries.
- Correct data types: Ensure each column has the appropriate data type.
- Address outliers: Identify and decide how to handle extreme values.
# Check for missing values
print(df.isnull().sum())
# Remove duplicates
df = df.drop_duplicates()
# Convert data types
df['arrival_date'] = pd.to_datetime(df['arrival_date'])
# Ensure booking_status is categorical
df['booking_status'] = df['booking_status'].astype('category')
As the earlier df.info() output showed, the dataset contains 36,275 rows and 10 columns.
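The cleaning steps above do not yet address outliers. As a minimal sketch, assuming avg_price_per_room is a numeric column (it appears in the feature list in Step 1), the IQR rule can flag extreme values:
# Flag outliers in average room price using the IQR rule
q1 = df['avg_price_per_room'].quantile(0.25)
q3 = df['avg_price_per_room'].quantile(0.75)
iqr = q3 - q1
outlier_mask = (df['avg_price_per_room'] < q1 - 1.5 * iqr) | (df['avg_price_per_room'] > q3 + 1.5 * iqr)
print(f"Number of price outliers: {outlier_mask.sum()}")
Whether to cap, drop, or keep such values depends on the business question; for cancellations, unusually high prices may themselves be informative.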
Step 3: Descriptive Statistics
Let’s get some summary statistics for our numerical variables. For our hotel dataset, we want to understand the distribution of features like lead time, total nights, and average price.
- Central tendency: Calculate mean, median, and mode.
- Dispersion: Examine range, variance, and standard deviation.
- Distribution shape: Look at skewness and kurtosis.
# Generate descriptive statistics
print(df.describe())
# You can also print a statistical summary of all the features, including categorical ones
df.describe(include='all').T
# For categorical variables
print(df['market_segment_type'].value_counts())
print(df['booking_status'].value_counts(normalize=True))
This will give us an overview of the central tendencies and distributions of our numerical features, as well as the frequency of categories in our categorical variables.
For example, the average lead time is approximately 85 days, with a standard deviation of about 86 days. This indicates a high variability, suggesting that some bookings are made very close to the arrival date, while others are made well in advance. The substantial difference between the 75th percentile (126 days) and the maximum value (443 days) suggests the presence of potential outliers. In terms of market segmentation, the majority of bookings fall under the ‘Online’ category.
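Skewness and kurtosis are not included in describe(), but pandas can compute them directly; here is a quick sketch for lead_time using the same column names as above:
# Skewness and kurtosis of lead time
print(df['lead_time'].skew())   # positive value indicates a right-skewed distribution
print(df['lead_time'].kurt())   # high value indicates heavy tails (more extreme values)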
# Count bookings made on the same day as arrival (lead time of zero)
(df['lead_time'] == 0).sum()
This returns the number of bookings with a lead time of zero days: 1,297 in total. A lead time of zero means the booking was made on the same day as the arrival date. Such a high number of last-minute bookings suggests that a significant portion of bookings occur spontaneously or under urgent circumstances, which is an important insight to carry into our analysis and feature engineering.
Step 4: Visualize Distributions
Creating visualizations helps to understand the distribution of your variables. The most common visualizations are:
- Histograms: For continuous variables like lead time and total nights.
- Bar charts/Pie charts: Ideal for visualizing categorical variables, such as booking type (online or offline) or cancellation status.
- Box plots: To identify outliers and compare distributions.
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram for lead time
plt.figure(figsize=(10,6))
sns.histplot(df['lead_time'], bins=30)
plt.title('Distribution of Lead Time')
plt.show()
As observed in the table created with the describe() method, the histogram distribution shows that the lead time is right-skewed, with most bookings occurring closer to the check-in date but a significant number extending far in advance. The skewness indicates that while the average lead time is around 85 days, a large portion of bookings are made with a shorter lead time, and a few extreme values increase the average. This insight can help identify potential booking patterns and outliers that may affect hotel operations and strategies.
# Bar plot for market_segment_type
plt.figure(figsize=(10, 6))
df['market_segment_type'].value_counts().plot(kind='bar')
plt.title('Distribution of Market Segment Types')
plt.xlabel('Market Segment Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# Pie chart for booking_status
plt.figure(figsize=(8, 8))
df['booking_status'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Booking Status Distribution')
plt.ylabel('')
plt.show()
Visual representations enable us to quickly identify trends, patterns, and insights that might be challenging to discern in raw data or text. For example, the pie chart for booking status reveals that 67.2% of bookings were not canceled, while 32.8% were.
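To cover the box plot mentioned above, here is a minimal sketch comparing the lead time distribution across booking statuses, which makes both the outliers and the difference between the two groups visible:
# Box plot of lead time by booking status
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='booking_status', y='lead_time')
plt.title('Lead Time by Booking Status')
plt.show()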
Step 5: Explore Relationships
Investigate relationships between variables, especially those that might influence cancellations:
- Correlation analysis: For numerical variables.
- Scatter plots: To visualize relationships between lead time and cancellation.
- Pair plots: For a quick overview of multiple relationships.
- Stacked bar plots: To visualize the distribution of booking statuses across different categorical variables, such as market_segment_type, and understand how various segments contribute to cancellations.
# Stacked bar plot for market_segment_type and booking_status
plt.figure(figsize=(10, 5))
pd.crosstab(df['market_segment_type'], df['booking_status'], normalize='index').plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Booking Status by Market Segment Type')
plt.xlabel('Market Segment Type')
plt.ylabel('Proportion')
plt.legend(title='Booking Status')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
- The cancellation rate is higher for the online market segment (33.91%) compared to the offline segment (29.95%). This difference, though not enormous, could be significant depending on the context.
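The correlation analysis mentioned in the list above can be sketched with a heatmap over the numerical columns (numeric_only=True keeps pandas from raising an error on the categorical columns):
# Correlation matrix of numerical features
corr = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Between Numerical Features')
plt.tight_layout()
plt.show()
Features that correlate strongly with one another may be redundant later in modeling; sns.pairplot on a handful of columns offers a similar quick overview.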
Step 6: Analyze Patterns Over Time
If your data includes a time component, such as booking dates, it’s essential to analyze trends over time:
- Time Series Plots: Use these to visualize trends in bookings and cancellations across different months.
- Seasonal Decomposition: This technique helps identify and isolate seasonal patterns in cancellations.
Our dataset includes the arrival_date column, which has been converted to the datetime data type. This enables us to extract and analyze additional temporal information. Let's delve into how bookings and cancellations fluctuate over time.
# Extract month and day of week from arrival_date
df['arrival_month'] = df['arrival_date'].dt.month
# Plot bookings by month
plt.figure(figsize=(10, 6))
monthly_bookings = df.groupby('arrival_month')['booking_status'].value_counts(normalize=True).unstack()
monthly_bookings.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Booking Status by Month')
plt.xlabel('Month')
plt.ylabel('Proportion')
plt.legend(title='Booking Status')
plt.xticks(rotation=0)
plt.show()
- Grouping and counting: df.groupby('arrival_month')['booking_status'] groups the dataset by the arrival_month column. For each month, it counts the occurrences of each unique value in the booking_status column (e.g., 'Canceled', 'Not_Canceled').
- Normalization: value_counts(normalize=True) calculates the proportion of each booking_status category within each month. The normalize=True argument converts the counts into proportions, making it easier to compare relative frequencies across months.
- Unstacking: .unstack() reshapes the result so that each booking_status category becomes a separate column, with arrival_month as the index. This structure is ideal for a stacked bar chart, where each month's total booking proportion is divided between canceled and not-canceled statuses.
- Plotting: monthly_bookings.plot(kind='bar', stacked=True, ax=plt.gca()) creates a stacked bar plot. The kind='bar' argument specifies a bar chart, and stacked=True stacks the bars for each month on top of each other, showing the proportion of each booking_status within that month.
- Plotting on existing axes: ax=plt.gca() ensures that the plot is rendered on the current active axes, which were created by the plt.figure() call. This maintains control over the figure's size and layout.
This will help us identify any seasonal patterns and their effects on bookings and cancellations. Here, July has the highest cancellation rate.
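The seasonal decomposition mentioned at the start of this step could be sketched with statsmodels (an assumption, as it is not used elsewhere in this post) after aggregating bookings into a daily count series:
from statsmodels.tsa.seasonal import seasonal_decompose

# Daily number of bookings, indexed by arrival date
daily_bookings = df.set_index('arrival_date').resample('D').size()

# Decompose into trend, seasonal, and residual components (weekly seasonality assumed)
decomposition = seasonal_decompose(daily_bookings, model='additive', period=7)
decomposition.plot()
plt.show()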
Step 7: Feature Engineering
Based on your findings, create new features that might be useful:
- Combine existing features: Create meaningful combinations of variables.
- Bin continuous variables: Create categorical variables from continuous ones.
- Extract information: For example, extract day of week from dates.
# Example 1: Bin continuous variables
# Categorize lead time
df['lead_time_category'] = pd.cut(df['lead_time'],
                                  bins=[0, 7, 30, 90, float('inf')],
                                  labels=['Last Minute', 'Short Term', 'Medium Term', 'Long Term'])
# Example 2: Combine existing features
# Price per person
df['price_per_person'] = df['avg_price_per_room'] / df['no_of_adults']
# Example 3: Combine existing features
# Total nights
df['total_nights'] = df['no_of_weekend_nights'] + df['no_of_week_nights']
# Example 4: Bin continuous variables
# Categorize price
df['price_category'] = pd.qcut(df['avg_price_per_room'], q=4,
                               labels=['Budget', 'Economy', 'Standard', 'Luxury'])
# Example 5: Extract information
# Extract day of week from arrival_date
df['arrival_day_of_week'] = df['arrival_date'].dt.day_name()
- Lead Time Category: We’ve categorized the lead time into four groups. This can help us understand if booking behavior differs based on how far in advance the reservation was made.
- Price per Person: This feature might reveal insights about the type of bookings (e.g., single travelers vs. families).
- Total Nights: By combining weekend and weekday nights, we get a clearer picture of the overall length of stay.
- Price Category: We’ve divided the prices into quartiles, which can help us understand if cancellation behavior differs across price ranges.
- Arrival Day of the Week: Extracts the day of the week from the arrival_date, providing insights into booking patterns related to specific days.
Next, we analyze the new features created.
# Analyze some of the new features
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='total_nights', hue='booking_status')
plt.title('Booking Status by Total Nights')
plt.xticks(rotation=90)
plt.show()
These new features and visualization provide a more nuanced view of our data, potentially uncovering patterns that weren’t immediately apparent in the raw data. They can guide our further analysis and model development for predicting hotel cancellations.
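The same kind of check can be applied to the other engineered features; for example, a quick look at cancellation proportions within each lead time category, reusing the groupby pattern from Step 6:
# Cancellation proportions within each lead time category
print(df.groupby('lead_time_category')['booking_status']
        .value_counts(normalize=True)
        .unstack())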
Recall that we have a significant number of same-day bookings (lead time = 0), so we should adjust our lead time categorization to include these as a separate category.
df['lead_time_category'] = pd.cut(df['lead_time'],
                                  bins=[-1, 0, 7, 30, 90, float('inf')],
                                  labels=['Same Day', 'Last Minute', 'Short Term', 'Medium Term', 'Long Term'])
We can also do some feature engineering for same-day bookings. We could create a binary feature for same-day bookings:
df['is_same_day_booking'] = (df['lead_time'] == 0).astype(int)
This feature could be useful in predictive models, as same-day bookings might have different characteristics or cancellation patterns.
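A quick way to check whether same-day bookings really do cancel at a different rate is to compare booking status proportions for the two groups (a sketch using the columns defined above):
# Cancellation proportions for same-day vs. advance bookings
print(df.groupby('is_same_day_booking')['booking_status']
        .value_counts(normalize=True)
        .unstack())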
Conclusion
By following these steps, we have gained a deep understanding of the hotel cancellation dataset, setting a solid foundation for more advanced analysis and modeling.
Remember, EDA is an iterative process. As you uncover new insights, you may want to revisit earlier steps or explore new angles. The goal is to develop a deep understanding of your data, which will inform your modeling decisions and ultimately lead to more accurate predictions of hotel cancellations in this case.