EDA Project: An Analysis of Family Tours in India (Part-1)

Ayesha sidhikha
Jun 17, 2024


Introduction

  • Travel and tourism in India are vibrant sectors, offering a plethora of experiences for families seeking memorable adventures.
  • In this blog, we embark on a journey to explore the intricacies of family tours in India through a comprehensive analysis.
  • Our goal is to segment customers based on their travel preferences, analyze spending patterns, and evaluate satisfaction levels.
  • With a focus on enhancing customer experience and driving business growth, our study delves into the realm of personalized travel offerings.

Problem statement

  • Segment customers based on their preferences for different types of travel experiences (adventures, nature, hill stations, water activities, religious sites) and analyze their spending patterns and satisfaction levels.

Goal

  • To enhance customer experience and drive business growth through personalized travel offerings.

Focus

  • Personalization
  • Customer Satisfaction
  • Business Growth

Objective

Data Overview:

The data is represented in a Pandas DataFrame with a total of 894 entries (rows) and 15 columns.

Package Names: The names of the travel packages.

Prices Before Discount: The original prices of the packages before any discount (non-null in 892 entries).

Prices After Discount: The prices of the packages after discount (non-null in 893 entries).

Days and Nights: The duration of the travel package in terms of days and nights (numeric data).

Cities: The cities included in the travel package.

Activities:

  • Adventures: The number of adventure activities available (non-null in 667 entries).
  • Nature: The number of nature-related activities available (non-null in 795 entries).
  • Hill Station: The number of hill station activities available (non-null in 441 entries).
  • Water Activities: The number of water activities available (non-null in 162 entries).
  • Religious: The number of religious activities available (non-null in 280 entries).

Star Hotels: Information about the hotels, possibly the star ratings.

Travellers: The count of travelers (non-null in 886 entries, stored as float64).

Ratings: The ratings associated with the travel packages (non-null in 882 entries, stored as float64).

Reviews: The number of reviews for the travel packages (non-null in 882 entries, stored as float64).

Data Cleaning

Data Types:

  • The DataFrame contains a mix of data types, including object (strings), int64 (integer), and float64 (floating-point numbers).

Missing Values:

  • Some columns have missing values, such as ‘Prices Before Discount’, ‘Prices After Discount’, ‘Travellers’, ‘Ratings’, ‘Reviews’ and various activity-related columns.

Memory Usage:

  • The DataFrame occupies approximately 104.9 KB of memory.
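
The inspection described above can be sketched roughly as follows. The small DataFrame here is a made-up stand-in for the real dataset (the package names and values are illustrative assumptions, not rows from the original data); only the pandas calls are the point.

```python
import numpy as np
import pandas as pd

# Tiny illustrative stand-in for the real dataset (values are assumptions).
df = pd.DataFrame({
    "Package_Names": ["Goa Getaway", "Kerala Backwaters", "Shimla Escape"],
    "Prices_Before_Discount": ["25,000", None, "18,500"],
    "Travellers": [4.0, np.nan, 2.0],
    "Ratings": [4.5, 4.0, np.nan],
})

# Prints dtypes, non-null counts per column, and the memory footprint.
df.info()

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Memory usage in KB, counting the actual size of string values.
memory_kb = df.memory_usage(deep=True).sum() / 1024
print(f"{memory_kb:.1f} KB")
```

On the full dataset, the same calls would report the non-null counts and memory figure quoted in the overview.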

Price Column Formatting

  • The ‘Prices_Before_Discount’ and ‘Prices_After_Discount’ columns are processed to remove commas and ensure numerical consistency.
  • This is achieved by applying a lambda function to each column using the apply method.
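
A minimal sketch of this step, assuming the prices arrive as strings such as "25,000" (the sample values and the helper name strip_commas are illustrative assumptions). The final pd.to_numeric call is an extra step added here so the result is numeric rather than a comma-free string.

```python
import pandas as pd

# Illustrative sample; the real columns hold the scraped price strings.
df = pd.DataFrame({
    "Prices_Before_Discount": ["25,000", None, "18,500"],
    "Prices_After_Discount": ["21,999", "15,000", None],
})

def strip_commas(value):
    """Remove thousands separators, leaving missing values untouched."""
    if pd.isna(value):
        return value
    return str(value).replace(",", "")

for col in ["Prices_Before_Discount", "Prices_After_Discount"]:
    # Apply a lambda to every value in the column, as described above.
    df[col] = df[col].apply(lambda v: strip_commas(v))
    # Convert the cleaned strings to numbers; missing values become NaN.
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

Missing prices are deliberately left as NaN at this stage so that they can be imputed later.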

Drop Unnecessary Columns

  • The columns ‘Unnamed: 0.1’ and ‘Unnamed: 0’ are dropped from the DataFrame using the drop method with axis=1.
  • Together, these steps load the data, inspect its structure, and prepare it for further analysis by cleaning and transforming specific columns.
  • They aim to improve the overall quality and consistency of the data for subsequent processing.
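
Dropping the two leftover index columns might look like this (the sample frame is an illustrative assumption; the column names come from the text above):

```python
import pandas as pd

# Illustrative frame with the two leftover index columns.
df = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "Package_Names": ["Goa Getaway", "Shimla Escape"],
})

# axis=1 tells pandas to drop columns rather than rows.
df = df.drop(["Unnamed: 0.1", "Unnamed: 0"], axis=1)

print(df.columns.tolist())
```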

Missing Values Imputation

Numerical Columns

  • Data Inspection: Before imputation, the distributions of the original numerical columns (“Prices_Before_Discount”, “Prices_After_Discount”, “Travellers”, “Reviews”, and “Ratings”) are examined.
  • Imputation Techniques: Several imputation techniques are explored: mean, median, and mode imputation, as well as bfill, ffill, and interpolation.
  • Technique Selection: After comparing the distributions of the original and imputed data for each technique, the method that most closely matches the original distribution is selected for each numerical column.
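
One way to run such a comparison is sketched below. It compares summary statistics instead of plots, which is a simplification of the visual comparison described in this post, and the sample ratings are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative column with two missing values.
s = pd.Series([4.5, np.nan, 4.0, 3.5, np.nan, 5.0, 4.0], name="Ratings")

# Each candidate technique fills the gaps in a different way.
candidates = {
    "mean": s.fillna(s.mean()),
    "median": s.fillna(s.median()),
    "mode": s.fillna(s.mode().iloc[0]),
    "ffill": s.ffill(),
    "bfill": s.bfill(),
    "interpolate": s.interpolate(),
}

# Put the summary statistics side by side with the original column.
summary = pd.DataFrame(
    {name: filled.describe() for name, filled in candidates.items()}
)
summary["original"] = s.describe()

print(summary.round(2))
```

The technique whose mean, spread, and quartiles stay closest to the "original" column is the one that disturbs the distribution least.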

Selected Imputation Techniques

  • Prices_Before_Discount and Prices_After_Discount: Missing values are filled with the median value of each respective column.
  • Travellers, Reviews, and Ratings: Missing values are filled using the forward fill (ffill) method, which propagates the last valid observation forward.
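
Applied to a small illustrative frame (the values are assumptions), the selected techniques could be written as:

```python
import numpy as np
import pandas as pd

# Illustrative numeric data with gaps.
df = pd.DataFrame({
    "Prices_Before_Discount": [25000.0, np.nan, 18500.0, 40000.0],
    "Prices_After_Discount": [21999.0, 15000.0, np.nan, 35000.0],
    "Travellers": [4.0, np.nan, 2.0, 6.0],
    "Reviews": [120.0, 80.0, np.nan, 45.0],
    "Ratings": [4.5, np.nan, 4.0, 5.0],
})

price_cols = ["Prices_Before_Discount", "Prices_After_Discount"]
ffill_cols = ["Travellers", "Reviews", "Ratings"]

# Price columns: fill gaps with each column's own median.
for col in price_cols:
    df[col] = df[col].fillna(df[col].median())

# Remaining columns: carry the last valid observation forward.
# Note: ffill cannot fill a gap in the very first row.
df[ffill_cols] = df[ffill_cols].ffill()

print(df)
```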

Categorical Columns

  • For categorical columns with missing values, rather than imputing with the mode (most frequent value), missing values are imputed with the label “Not Available”.

Reasoning

  • The choice of “Not Available” serves to explicitly identify and retain the information that certain data points were originally missing.
  • This approach maintains transparency in the dataset, ensuring that the imputed values do not skew the original distribution of the categorical variables.
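
A short sketch of this idea follows. The column list is an assumption made for illustration, since the post does not say exactly which columns were treated as categorical.

```python
import pandas as pd

# Illustrative categorical data with gaps (column choice is an assumption).
df = pd.DataFrame({
    "Cities": ["Goa", None, "Shimla"],
    "Star_Hotels": [None, "4 Star", "3 Star"],
})

categorical_cols = ["Cities", "Star_Hotels"]

# Use an explicit label instead of the mode, so that originally
# missing entries stay identifiable after imputation.
df[categorical_cols] = df[categorical_cols].fillna("Not Available")

print(df)
```

Because "Not Available" is its own category, later counts and charts show how much data was missing instead of inflating the most frequent value.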

Type Casting

Data Type Conversion

  • The following columns have been converted to the int64 data type: Prices_Before_Discount, Prices_After_Discount, Ratings, Travellers, and Reviews.

Conversion Method

  • The specified columns have been converted to the int64 data type using the astype method in pandas.
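
The conversion could be sketched as below, on illustrative values. Two behaviours of astype are worth keeping in mind: it raises an error if any NaN remains, so it has to run after imputation, and it truncates decimals, so a rating of 4.5 becomes 4.

```python
import pandas as pd

# Illustrative, already-imputed data (no missing values left).
df = pd.DataFrame({
    "Prices_Before_Discount": [25000.0, 18500.0],
    "Prices_After_Discount": [21999.0, 15000.0],
    "Ratings": [4.5, 4.0],
    "Travellers": [4.0, 2.0],
    "Reviews": [120.0, 80.0],
})

int_cols = [
    "Prices_Before_Discount",
    "Prices_After_Discount",
    "Ratings",
    "Travellers",
    "Reviews",
]

# Cast every listed column from float64 to int64 in one call.
df = df.astype({col: "int64" for col in int_cols})

print(df.dtypes)
```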

Conclusion for Data Preprocessing

  • In conclusion, data preprocessing involved addressing missing values, aggregating data, and ensuring data types were appropriate for analysis.
  • Visualization aided in understanding the distribution of key variables, both in their original and imputed states.

These steps are crucial for ensuring the quality and reliability of the dataset before proceeding with any further analysis or modeling.

Preserving Pricing Diversity

Non-Treatment of Outliers

1. Different Packages Have Different Prices: 

  • Travel Triangle offers many types of travel packages, each with its own features and prices. 
  • The price of a package depends on what it offers, so a price that looks unusual for one package can be perfectly reasonable for another.

2. Showing Real Market Prices: 

  • Some unusual prices might actually be normal in the market. Keeping these prices in our data helps us understand the real range of prices people pay for travel packages.

3. Not Confusing People’s Choices:

  • Ignoring unusual prices could make it seem like there are fewer choices than there really are. 
  • Keeping all prices in our data helps everyone see all the options without being misled.
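
To make the idea concrete, the sketch below flags unusually high or low prices but leaves every row in place. The 1.5 × IQR rule used here is a common convention chosen for illustration, not necessarily how outliers were identified in this project, and the prices are made-up values.

```python
import pandas as pd

# Illustrative prices, including one premium package.
prices = pd.Series(
    [12000, 15000, 18500, 21000, 25000, 30000, 250000],
    name="Prices_After_Discount",
)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag the outliers for awareness, but do not drop or cap them.
is_outlier = (prices < lower) | (prices > upper)

print(f"Flagged {is_outlier.sum()} of {len(prices)} prices as outliers")
print(prices[is_outlier])
```

The flagged premium package stays in the data, so the full price range remains visible in later analysis.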

Conclusion

  • In conclusion, by preserving outliers in pricing data, we ensure a genuine portrayal of the diverse pricing landscape on the Travel Triangle platform.
  • This approach maintains data integrity and transparency, providing an accurate reflection of the real range of prices available to travelers.

This blog addressed missing values through strategic imputation methods, preserved pricing outliers for authenticity in Travel Triangle’s diverse pricing landscape, and optimized data integrity with precise type casting.

In Part 2, we will delve into how we selected and used the Fyn Palette, created with Coolors, for our data visualizations. This palette was generated to enhance the clarity and professionalism of our analysis.

Thank you for reading; I hope you found it informative and valuable!😍😍😊😊

Ayesha sidhikha

Enthusiastic learner passionate about data science and generative AI. Committed to tackling challenges and continuous growth in technology.