Boosting Bookings with Data: Enhancing the Airbnb Experience

Ashwin Shah
42 min read · Oct 31, 2023

Photo by Clay Banks on Unsplash

Welcome to a data-driven journey through the world of Airbnb bookings! In this blog post, we’ll embark on a fascinating exploration of how data insights can transform the way you book and experience accommodations on Airbnb. We’ll delve into the results of an extensive Exploratory Data Analysis (EDA) that unlocks the hidden gems within Airbnb’s vast dataset. Our mission is to interpret these findings, turning them into actionable insights for Airbnb’s marketing and user experience teams.

As we journey through this data-rich landscape, you’ll discover not only what drives user behavior and booking decisions but also how to leverage this knowledge to boost bookings and enhance the overall Airbnb experience. Whether you’re a host seeking to attract more guests or a traveler looking for the perfect stay, the recommendations derived from our data analysis will empower you to make more informed decisions.

In this post, we won’t just stop at insights; we’ll also explore practical strategies for targeting specific user segments, optimizing booking processes, and elevating user engagement. By the end of this journey, you’ll be equipped with the tools to supercharge your Airbnb adventures and bring your travel dreams closer to reality. Let’s begin our quest to enhance your Airbnb experience with the power of data!

We will use the Airbnb dataset, which is available on Kaggle, to perform our analysis.

Let’s get started!

The complete notebook is available on Google Colab.

Table of Contents

  1. Understanding Business Case
  2. Data Set Introduction
  3. Exploratory Data Analysis (EDA)
  4. Feature Engineering
  5. Univariate Analysis
  6. Bivariate Analysis
  7. Multivariate Analysis
  8. Feature Importance
  9. Trends and Patterns
  10. Business Recommendations

1. Understanding Business Case

Airbnb, a trailblazer in the sharing economy, has transformed the travel accommodation landscape. To improve user experience, we explore a comprehensive Kaggle dataset covering user info, demographics, web sessions, and stats. Our aim is to unearth insights on user behavior, preferences, and booking factors, guiding Airbnb’s marketing and user experience enhancement. From the data set I could deduce not only the requirements but also the purpose.

I came up with a possible business requirement and purpose.

2. Data Set Introduction

Let’s kick off our data-driven adventure by getting to know the star of the show — the dataset. We’ll uncover its origins, the time span it covers, and other key context. This dataset is our window into the world of Airbnb, so let’s take a peek behind the curtain and set the stage for our data exploration.

If you haven’t snagged the data from Kaggle yet, now’s the time! It’s the key to unlocking the fascinating world of Airbnb. Let’s dive in and start our data-driven adventure. Get the data here.

Unlocking the Airbnb Treasure Trove!🌍🧳

Our adventure takes place in the USA, where a fascinating group of users are planning their dream getaways. But, you see, there’s a twist — there are 12 possible outcomes, or shall we say, 12 dream destinations. You’ve got your classic favorites like the ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, and ‘DE’. Then there are the adventurers heading to ‘AU’ down under, while others are in for a bit of a mystery — ‘NDF’ (no destination found). Oh, and let’s not forget the curious travelers in ‘other’, the ultimate wanderers!

But, here’s where the real fun begins! 🎉

We’re not just looking at data; we’re time-travelers too! Our training and test sets are split by dates, and in the test set, we get to predict the future — all the new users with first activities after 7/1/2014. Who says you can’t predict the future?

Now, here’s a quick guide to our treasure map, I mean, our files:

train_users.csv: Meet our training set of users, complete with user IDs, dates, and all the travel details. It’s like the traveler’s diary!

  • id: user id
  • date_account_created: the date of account creation
  • timestamp_first_active: timestamp of the first activity (can be earlier than date_account_created)
  • date_first_booking: date of first booking
  • gender
  • age
  • signup_method
  • signup_flow: the page a user came to sign up from
  • language: international language preference
  • affiliate_channel: what kind of paid marketing
  • affiliate_provider: where the marketing is (e.g., Google, Craigslist, other)
  • first_affiliate_tracked: the first marketing the user interacted with before signing up
  • signup_app
  • first_device_type
  • first_browser
  • country_destination: target variable you are to predict

test_users.csv: The future is in our hands! This is where we predict the destiny of new users after 7/1/2014.

sessions.csv: Explore the web sessions log for users. This is where the action happens, quite literally!

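  • user_id: to be joined with the column id in the users table
  • action
  • action_type
  • action_detail
  • device_type
  • secs_elapsed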

countries.csv: A summary of our destination countries and their secret hideouts. Where in the world are our users headed?

age_gender_bkts.csv: A glimpse into the age groups, gender, and preferred countries. What’s your dream group?

sample_submission.csv: The secret recipe for submitting your predictions. This is your chance to be the travel oracle!

So, grab your compass, your data-driven mindset, and let’s embark on this adventure. The Airbnb treasure trove awaits! 🗺✈️🌟

Navigating the Data Labyrinth

Picture this: 16 features in the user table and a whopping 10 million records in the session table. It’s like stepping into a data jungle! At first, it felt like an amusement park — thrilling and overwhelming. Where to even begin?

But then, something magical happened. I took a deep breath, calmed my racing thoughts, and decided to break this colossal challenge into bite-sized pieces. 🍽️

First Booking: I started with the cornerstone — the first booking. It’s the beginning of every traveler’s journey, and understanding it was crucial.

User Behavior and Preferences: Like pieces of a puzzle, I started collecting insights into user behavior and preferences. It was like peeking into their travel dreams.

Most Preferred Action Types and Details: The session table was chatty, but I knew there were gems hidden within. I sifted through to find the most preferred action types and details.

User Behavior Insights: With each iteration, I dove deeper into the user behavior ocean. Patterns started to emerge, and I was onto something big.

Lead Time: The time between planning and booking — it’s a crucial piece of the puzzle. I started examining lead times, uncovering booking patterns.

Booking Times and Patterns: The clock held secrets. Booking times and patterns revealed themselves, like stars in the night sky.

Destinations Travel: Ah, the dream destinations! I started connecting the dots between users and their dream getaways.

User Demographics: It was time to get to know our users. Demographics shed light on who they are and what they seek.

Booking Channels and Devices: I examined the channels and devices users used, like a digital detective piecing together the story.

And then, the next challenge surfaced — the elusive target variable, “country_destination.” It was like chasing a ghost in the data. The session table was chatty, but it didn’t easily give up its secrets.

So, I had a light bulb moment: focus on dates! “booking_date” became my guiding star. It held the key to understanding when bookings happened.

With each step, I was getting closer to answers, but it wasn’t enough. I needed to find the features that directly or indirectly influenced the target column. The “session” table held untapped potential, and merging it with the “user_table” was like discovering hidden treasures.

My journey continues, full of challenges and surprises, but I’m ready to explore this data labyrinth. 🗺️🔍✨

3. Exploratory Data Analysis (EDA)

Imagine you’re a data detective with a magnifying glass, a fedora, and a sense of adventure. EDA is like your treasure hunt in the data jungle. 🕵️‍♂️🌴

The Fun Part: You get to uncover the data’s secrets, like a detective solving a mystery. You examine, visualize, and play with the data to reveal its hidden patterns and stories.

The Professional Part: But make no mistake, this is a serious investigation. EDA helps you understand your data inside out. You’re not just playing; you’re making data-driven decisions with style.

So, put on your detective hat, grab your magnifying glass (or code editor), and let’s dive into the data adventure. It’s a mix of fun and professionalism, just like any great detective story! 🔍📊🧐

Load Libraries:

The Additional Section has details about these libraries and their usage.

Time to Unleash the Data Magic!

We’re all set to dive into the data ocean, but here’s the kicker — we’re not just importing numbers and text; we’re bringing time to the party! 🎉

As we load our CSV files, we’ll sprinkle a bit of data magic to ensure that our dates don’t get lost in translation. That’s right, we’re parsing those dates, giving them the respect they deserve. Why? Because time is essential in our journey to understand the world of Airbnb.

So, hold onto your hats, folks, because it’s about to get timey-wimey in here! 🕰✨

df_test_users = pd.read_csv("test_users.csv", parse_dates=["date_account_created", "date_first_booking", "timestamp_first_active"])
df_train_users_2 = pd.read_csv("train_users_2.csv", parse_dates=["date_account_created", "date_first_booking", "timestamp_first_active"])
df_countries = pd.read_csv("countries.csv")
df_sessions = pd.read_csv("sessions.csv")
df_sample_NDF = pd.read_csv("sample_submission_NDF.csv")
df_age_gender = pd.read_csv("age_gender_bkts.csv")

Let’s Peek Behind the Data Curtains!

It’s time to unveil the secrets hidden within our data tables, and we promise it’s not just a boring old spreadsheet. 🎭

Shape: Picture this as the size of our data family — how many rows and columns they’ve got. It’s like finding out how big our treasure chest is!

Info: Think of this as our data’s ID card. It tells us what’s inside, like the data types and memory usage. It’s a bit like our data’s personality profile!

Head Records: It’s like flipping through the first few pages of a book. These records give us a taste of what’s inside. Just enough to make us curious and excited to dive deeper!

Our data tables are not just numbers and text; they’re like little stories waiting to be discovered. Let’s get started on this data adventure! 🕵️‍♂️📚✨

Similarly, we can run the same checks on each data frame to get started with our data treasure.
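As a rough sketch of this first look, the snippet below prints the shape, info, and head of every data frame in one pass. The `peek` helper, the `frames` dictionary, and the toy rows are made up for illustration; in the notebook you would pass the data frames loaded above instead.

```python
import pandas as pd

# Toy stand-ins for the real data frames (values are made up).
frames = {
    "train_users": pd.DataFrame({"id": ["a1", "b2"], "age": [34.0, None]}),
    "sessions": pd.DataFrame({
        "user_id": ["a1", "a1", "b2"],
        "secs_elapsed": [319.0, 67753.0, 301.0],
    }),
}

def peek(name, df, n=5):
    """Print the shape, column info, and first n rows of a data frame."""
    print(f"--- {name} ---")
    print("Shape:", df.shape)
    df.info()
    print(df.head(n))
    return df.shape

# Run the same three checks on every frame and keep the shapes for reference.
shapes = {name: peek(name, df) for name, df in frames.items()}
```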

Data Frame Synopsis

Embracing the Data Challenge!

Our data adventure is about to get exciting! 🚀

The real stars of our show are the ‘sessions’ and ‘train user’ datasets. They hold the keys to our treasure trove of insights. But here’s the twist — there’s a thrilling mismatch between these datasets, a puzzle to solve!

With a whopping 10 million session records, we’re in for an epic adventure. It’s not for the faint-hearted, but we’re ready to conquer it with data-driven prowess.

Buckle up, because this is where the fun and the challenge come together in perfect harmony. 📊🔍🚁

from IPython.display import display

def cat_colums(df):
    """Show the percentage share of each value in every object-type column."""
    for col in df.select_dtypes("object").columns:
        shares = (df[col].value_counts(normalize=True) * 100).round(1)
        if len(df[col].unique()) < 15:
            print("-"*25, f"\n{col} Column\n", "-"*25)
            display(shares)
        else:
            # Too many categories: show only the 15 most frequent values.
            print("-"*25, f"\n{col} Column (Top 15)\n", "-"*25)
            display(shares[:15])

The above function goes through the object-type columns and displays the percentage share of each value in descending order; for columns with 15 or more unique values, only the top 15 are shown.

Snapshot of the output for a few columns

Similarly, you can run it on all the data frames to get first-hand insights into the data.

The Data Detective’s Journey: Initial Insights

Our journey into the Airbnb data landscape has uncovered some fascinating discoveries. 🕵️‍♂️✨

Signup Method Analysis: It seems like “Basic” is the reigning champ, accounting for 75% of all signups. In fact, it and its buddy “Facebook” make up 99% of all signups. It’s the most popular duo in town!

Language Spoken Analysis: “en” takes the lead as the most spoken language, but don’t underestimate “zh,” “fr,” and “es” — they’re on a linguistic world tour!

Affiliate Channel Analysis: “Direct,” “SEM-Brand,” and “SEM-Non-Brand” are the traffic captains, guiding over 90% of website visitors. They know the way!

Affiliate Providers Analysis: When it comes to affiliate providers, “Direct” and “Google” are the dynamic duo at the top. They’ve got their affiliate game strong.

First Affiliate Tracked Analysis: While “Untracked” takes the lead, “Linked” and “OMG” sneak in with significant contributions. It’s a game of tracking, and they’re all in!

Signup Method by Device Analysis: “Web” takes the throne, with “iOS” as its loyal companion. Together, they lead the way in the signup parade.

First Device Type Analysis: “Mac” and “Windows Desktop” rule the device kingdom, with “iPhone” and “iPad” as their trusty sidekicks.

Country Destination Analysis: “NDF” (No Destination Found) is the mysterious explorer’s choice, followed by “US.” Looks like most users are here for a taste of adventure.

“Other” has a surprise entry in the top three, with “FR,” “IT,” “GB,” and “ES” following as the most preferred destinations.

Browser Usage Analysis: “Chrome” leads the browsing brigade, closely followed by “Safari” and “Firefox.” The “-unknown-” browser is the chameleon that can become “NaN.”

Gender Analysis: “-unknown-” rules the gender kingdom, but we’ll give it a makeover into “NaN.”

Action Column: “Show” takes the spotlight, indicating active property hunting. Users are on the prowl, and it’s a positive sign!

Action Type: “View,” “Data,” and “Click” are the MVPs, showing users are deep into property hunting mode. Messaging? Not so much.

Action Detail Column: “View Search Results” leads the way, followed by the mysterious “03” category. Even with limited info, it’s a frequent flyer.

Device Type: “Mac” and “Windows” are the tech titans, with “iPhone” in hot pursuit. Together, they conquer 90% of the device kingdom.

Gender Distribution: Gender equality is real in our data kingdom. It’s a balanced arena for all genders.

Country Destination: Our destinations are an equal-opportunity adventure. All genders have their passports ready!

Initial Findings

Age Group Analysis: Age knows no bounds; our counts are evenly spread among different age groups. A diverse community indeed.

Data Cleaning: Our data’s all spick and span. No messy spaces or lowercase capers here!

Duplicate Rows: The ‘sessions’ table has 252,536 duplicates, but the others are pure and pristine.

Missing Values Summary: Some missing pieces in our puzzle, but we’ll figure it out. “date_first_booking” and “age” have their secrets, but they’ll reveal them in time.

Outliers: Ah, the outliers! Our intriguing data also harbors some statistical rebels. They march to their own beat, waiting for us to understand their unique stories.

Our data detective journey continues, and the mysteries are waiting to be unraveled. Let’s keep exploring! 📊🕵️‍♂️🗺

Embarking on Data Quality Expedition

Now, we’re in the phase of unraveling data mysteries. 🕵️‍♂️

Detecting Duplicates: Think of it as finding identical twins in the data — duplicates that need to be resolved to avoid confusion.

Spotting Nulls: It’s like searching for hidden treasure chests. Null values can hold the key to missing information, and we’re on a quest to find and fill those gaps.

Outlier Hunt: Outliers are like rare gems in the data. We’re not just identifying them; we’re preparing to treat them with care and precision.

It’s a data adventure, full of challenges and rewards. Let’s continue our quest to ensure the data is in its best shape! 🔍📊🔥

In this step, we will work on cleaning the data.

Detecting Duplicates

Duplicate count for all Data frames

Only the sessions dataset has duplicates, around 2.5% of its rows, and these can be dropped without affecting the overall data.

Null value and outlier imputation might result in more duplicates. Therefore, we will drop the duplicates a little later.
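For readers who want to reproduce the duplicate count, here is a minimal sketch. The `duplicate_report` helper and the toy rows are my own illustration, not the notebook’s code; it counts rows that are exact copies of an earlier row in each frame.

```python
import pandas as pd

# Toy frames (made-up rows); the last sessions row repeats the first one.
frames = {
    "train_users": pd.DataFrame({"id": ["a1", "b2"], "age": [34, 51]}),
    "sessions": pd.DataFrame({
        "user_id": ["a1", "a1", "a1"],
        "action": ["show", "search", "show"],
        "secs_elapsed": [10.0, 20.0, 10.0],
    }),
}

def duplicate_report(frames):
    """Return the number and percentage of fully duplicated rows per frame."""
    rows = []
    for name, df in frames.items():
        n_dup = int(df.duplicated().sum())
        rows.append({
            "frame": name,
            "duplicates": n_dup,
            "pct": round(100 * n_dup / len(df), 2),
        })
    return pd.DataFrame(rows).set_index("frame")

report = duplicate_report(frames)
print(report)
```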

Outlier Hunt

Let’s check for outliers in our two major data frames, train_users and sessions.

Output of train_user data set description

Notable Findings from the train_user Dataset

Gender Diversity: The dataset exhibits four unique gender types, with the majority categorized as “unknown.”

Age Discrepancy: An age anomaly is noted, with instances where the year is recorded as 2014. Additionally, the mean and standard deviation values surpass the 75th percentile, suggesting the presence of significant outliers.

Signup Methods: Among the various signup methods, only three are in use, with “Basic” emerging as the preferred choice.

Signup Flow Variability: The “signup_flow” feature displays substantial outliers, as indicated by the proximity of mean and standard deviation to the maximum value.

Language Diversity: The dataset encompasses 25 unique language types, with “English” being the most prominent.

Affiliate Dominance: “Direct” emerges as the dominant affiliate channel and provider.

Signup App Preferences: Among the signup app methods, three types are used, with “Web” being the most favored.

First Device and Browser Choices: “Mac” is the leading choice for the first device type, while “Chrome” is the favored browser.

Country Destination Diversity: The dataset encompasses 12 unique country destination types, with “NDF” representing the majority of contributions.

Output from sessions data set description

Notable Findings from the Sessions Dataset

Diverse Actions: The dataset boasts a total of 359 unique actions, showcasing the multitude of user interactions.

Action Typology: Within the actions, there are 10 distinct action types, shedding light on the various user behaviors.

Action Details: Delving deeper, there are 155 unique action details, offering nuanced insights into user activities.

Device Variety: Users accessed the platform via 14 different devices, indicating a diverse range of technologies used.

Outliers in Time: The “secs_elapsed” feature displays significant outliers, as evidenced by the mean and standard deviation values that greatly exceed the 75th percentile value.

Handling Corrupted Age Data in the age Feature

The age feature in the train_users dataset contains corrupted data, such as the value ‘1900’. To address this issue, we can impute the actual age by subtracting this value from the year the account was created. This assumption is based on the idea that some users may have only provided their birth year when filling in their age.

Steps to Impute Age

  1. Identify the corrupted age values (between 1900 and 2000) in the age feature.
  2. Extract the year from the date_account_created column, which represents the year the account was created.
  3. Calculate the actual age by subtracting the corrupted age value (the birth year) from the account creation year.
  4. Replace the corrupted age values with the calculated actual ages.

This data preprocessing step will help us handle and correct the corrupted age data in the age feature for more accurate analysis and modeling.

A custom function performs the above steps.
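The function itself is not reproduced in this post, so here is a minimal sketch of the steps above. The name `fix_birth_year_ages` and the toy rows are hypothetical; the 1900 to 2000 range comes from step 1, and the sketch assumes date_account_created has already been parsed as a datetime column.

```python
import numpy as np
import pandas as pd

def fix_birth_year_ages(df, low=1900, high=2000):
    """Treat `age` values in [low, high] as birth years and convert them to ages."""
    df = df.copy()
    # Missing ages compare as False here, so they are left untouched.
    is_birth_year = df["age"].between(low, high)
    account_year = df["date_account_created"].dt.year
    # age = year the account was created - birth year
    df.loc[is_birth_year, "age"] = (
        account_year[is_birth_year] - df.loc[is_birth_year, "age"]
    )
    return df

# Toy rows (made up): the second user entered a birth year instead of an age.
users = pd.DataFrame({
    "age": [35.0, 1980.0, np.nan],
    "date_account_created": pd.to_datetime(
        ["2013-05-01", "2014-01-15", "2012-09-30"]
    ),
})
fixed = fix_birth_year_ages(users)
print(fixed["age"].tolist())
```

Values such as 2014 fall outside the range and are left as they are here; the age policy step that follows takes care of them.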

We are not done with Age imputation yet as there is another critical issue to be fixed.

Age Data Compliance with User Policy

In adherence to Airbnb’s user policy, which stipulates that users must be a minimum of 18 years old, we are committed to upholding data quality and compliance. To achieve this, we shall identify and manage records where user ages fall below 18 or exceed 120.

The Age Alignment Procedure

Age Policy Enforcement: Our journey begins with a meticulous examination of the dataset. We will identify records where user ages do not adhere to the policy — those below 18 or above 120.

Data Cleansing

In line with our commitment to compliance and data quality, we will transform these non-compliant age values into NaN (missing values). This process reflects our dedication to aligning the data with both policy requirements and analytical needs.

By executing these steps, we ensure that the age data conforms to Airbnb’s user policy while maintaining the integrity and quality essential for robust analysis and modeling. 📜🔍🚀
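A minimal sketch of this cleansing step, assuming the age column is numeric. The helper name `enforce_age_policy` and the toy ages are made up; the 18 and 120 limits are the ones stated above.

```python
import numpy as np
import pandas as pd

def enforce_age_policy(df, min_age=18, max_age=120):
    """Set ages below min_age or above max_age to NaN."""
    df = df.copy()
    out_of_range = (df["age"] < min_age) | (df["age"] > max_age)
    df.loc[out_of_range, "age"] = np.nan
    return df

# Toy ages (made up): 5 and 2014 violate the policy, the rest are kept.
users = pd.DataFrame({"age": [5.0, 18.0, 42.0, 120.0, 2014.0]})
cleaned = enforce_age_policy(users)
print(cleaned["age"].tolist())
```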

Taming Data Mavericks with the IQR Shield

In the realm of data adventures, there are often rogue points that dare to deviate from the norm. But fear not, for we have a trusty shield — the Interquartile Range (IQR) method — to protect our data integrity. 🛡️📊

The Outlier Conqueror:

To address these outliers, we consider the following steps:

1. Identify Outliers: Use appropriate statistical techniques or visualizations (e.g., box plots or scatter plots) to identify data points that significantly deviate from the mean.

2. Outlier Handling: You can choose to handle outliers in one of the following ways:

  • Trimming: Remove the extreme outliers from the dataset.
  • Transformation: Apply mathematical transformations to the data (e.g., log transformation) to reduce the impact of outliers.
  • Binning: Group values into bins to reduce the impact of extreme values.
  • Imputation: Replace outliers with more typical values.

The Outlier’s Fate: We will be trimming our outliers, as these are extreme values, to maintain the sanctity of our data in our data adventure.

Meet the Percentiles: Our journey begins by getting acquainted with the 25th percentile (Q1) and the 75th percentile (Q3) of our data, marking the boundaries of the ordinary, where IQR = Q3 - Q1.

The IQR method is our steadfast companion, less sensitive to extreme data, and a guardian of data integrity. It ensures that our data remains a reliable and trustworthy ally in our analytical exploits. 📊🏰🌟

Custom function to remove outliers from the data set

def outlierresolution(data, column):
    # quartileMinMax (not shown in this post) supplies the lower and upper bounds
    lower, upper = quartileMinMax(data, column)
    shapebefore = data.shape[0]
    # Keep rows strictly inside the bounds; rows with NaN in the column are dropped too
    data = data[(data[column] > lower) & (data[column] < upper)]
    shapeafter = data.shape[0]
    outlierpercentage = np.round((1 - (shapeafter / shapebefore)) * 100, 2)
    print(f" {column} Data has {outlierpercentage} percentage outliers \n\n Before {shapebefore} \n After {shapeafter} \n")
    return data
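The quartileMinMax helper called above is not shown in this post. A minimal sketch of what it might look like, assuming the conventional 1.5 × IQR fences (the multiplier `k` is my assumption, not something the notebook confirms):

```python
import pandas as pd

def quartileMinMax(data, column, k=1.5):
    """Return the (lower, upper) IQR fences for data[column]."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    # Assumed fences: Q1 - k * IQR and Q3 + k * IQR
    return q1 - k * iqr, q3 + k * iqr

# Toy values (made up): 1000 is the obvious rebel.
toy = pd.DataFrame(
    {"secs_elapsed": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]}
)
lower, upper = quartileMinMax(toy, "secs_elapsed")
kept = toy[(toy["secs_elapsed"] > lower) & (toy["secs_elapsed"] < upper)]
print(lower, upper, len(kept))
```

Because the fences are built from quartiles rather than the mean, a single extreme value such as 1000 barely moves them.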

In our data journey, we’ve encountered a peculiar time traveler known as the “secs_elapsed” feature. This traveler doesn’t quite follow the rules, with a mean value of around 19,879.04 and a standard deviation of about 89,926.69. These values are like signposts pointing to significant outliers, those daring data points that defy the norm. 🕰️🚀

Executing outlier removal for the secs_elapsed column of the sessions data frame and holding the output in a new data frame.

df_sessions2=outlierresolution(df_sessions, 'secs_elapsed')

Execution of Imputation

Armed with the knowledge of duplications and outlier treatment, we’re ready to execute imputation techniques. It’s time to decide how to handle these data doppelgängers, where we will banish these from our dataset using below code.

We dropped a little over 2.5% of the records, a very small number considering the 10M rows.
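The dropping step is not reproduced here, so this is a minimal sketch using pandas’ drop_duplicates on a toy frame (the rows are made up). It keeps the first copy of each fully identical row and reports how much was removed.

```python
import pandas as pd

# Toy sessions frame (made up): the second row is an exact copy of the first.
sessions = pd.DataFrame({
    "user_id": ["a1", "a1", "a1", "b2"],
    "action": ["show", "show", "search", "show"],
    "secs_elapsed": [10.0, 10.0, 20.0, 10.0],
})

before = len(sessions)
# Keep the first occurrence of each fully identical row.
deduped = sessions.drop_duplicates().reset_index(drop=True)
dropped_pct = round(100 * (before - len(deduped)) / before, 2)
print(f"Dropped {before - len(deduped)} of {before} rows ({dropped_pct}%)")
```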

Spotting Nulls

Rescuing the “-unknown-” Mysteries

In our data narrative, there are moments when enigmatic values, signified as “-unknown-”, make appearances. To shed light on these mysteries and ensure our data’s health and vitality, it’s time to replace these “-unknown-” enigmas with NaN (Not a Number). 🧐📊

The “-unknown-” Enigma Unveiled:

  1. Detective Work: Our adventure commences with a bit of detective work, locating every nook and cranny where these “-unknown-” characters have taken refuge.
  2. The Great Imputation: With our detective hats firmly in place, we perform the great imputation, swapping out those “-unknown-” figures with NaN. This transformation signifies that these entries are missing or devoid of valid information.
  3. The Grand Strategy: Imputing with NaN serves as our grand strategy, ensuring that these mysteries are handled uniformly as missing data. This consistency sets the stage for diverse data analyses and modeling techniques to deal with them effectively.

Custom function to execute the above strategy for all the columns containing ‘-unknown-’ values.
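Since the function is not shown in this post, here is a minimal sketch of the strategy. The name `unknown_to_nan` and the toy rows are hypothetical; the sketch first lists the columns that contain the marker and then swaps it for NaN everywhere.

```python
import numpy as np
import pandas as pd

def unknown_to_nan(df, marker="-unknown-"):
    """Replace every occurrence of `marker` with NaN and list the affected columns."""
    # Detective work: which columns contain the marker at all?
    affected = [col for col in df.columns if (df[col] == marker).any()]
    print("Columns containing", marker, ":", affected)
    # The great imputation: swap the marker for NaN in the whole frame.
    return df.replace(marker, np.nan), affected

# Toy rows (made up).
users = pd.DataFrame({
    "gender": ["-unknown-", "FEMALE", "MALE"],
    "first_browser": ["Chrome", "-unknown-", "Safari"],
    "age": [30.0, 41.0, 25.0],
})
cleaned, affected = unknown_to_nan(users)
print(cleaned.isna().sum())
```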

Charting the Course:

After the “-unknown-” makeover, we chart the course for subsequent analyses or machine learning modeling, equipped with a dataset that treats missing or unknown data with the respect they deserve. This meticulous process paves the way for more reliable and precise results in our data odyssey. 🕵️‍♂️📊🗺️

Once you have imputed missing values, it’s important to check for and handle any duplicate entries to ensure data quality and consistency.

The sessions dataset still has some duplicates. Let’s drop them as before.

4. Feature Engineering

In the vast landscape of data, we embark on a thrilling expedition — Feature Exploration. 🌟📊

Feature Engineering Expedition Highlights:

  1. Features Unveiled: Our journey commences with the grand revelation of the dataset’s features. Each one is like a unique treasure waiting to be explored.
  2. In-Depth Analysis: We delve into the intricate details, uncovering the nuances and secrets that each feature holds.
  3. Relationships Discovered: Along the way, we uncover the relationships between features, unveiling the hidden connections that shape the data’s story.
  4. The Quest for Insights: Our mission is to extract valuable insights, guiding us in making informed decisions and driving our analytical endeavors forward.

Deciphering Time’s Hidden Secrets

In the realm of data, time is a mystic force that holds the key to insights. It’s time for our Data Time Travelers to embark on a journey of Date Time Features Exploration, where we unlock the secrets concealed within the timestamps. 🕰️🌟

The Chronological Quest:

  1. Year Unveiling: We commence our expedition by unraveling the years hidden within the timestamps, allowing us to understand long-term trends and patterns.
  2. Month Mysteries: Next, we delve into the monthly mysteries, seeking to discover the ebb and flow of data behaviors across different months.
  3. Day of the Week Discovery: The days of the week hold their own stories. We decipher these stories to reveal how data behaves on different days.
  4. Hourly Insights: The hours and minutes are like the fine brushstrokes on the canvas of time. Exploring them brings out the granular details of data behaviors.

As we split time into its constituent parts, we unveil a tapestry of insights that guide our data analysis with precision and depth. It’s a journey where the art of data meets the science of time. 📅🔍🌌
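The four steps of the quest can be sketched as follows. The helper `add_date_parts` and the derived column names are my own naming, not the notebook’s, and the two timestamps are made up; the column must already be a datetime.

```python
import pandas as pd

def add_date_parts(df, column):
    """Add year, month, day-of-week, and hour columns derived from a datetime column."""
    df = df.copy()
    df[f"{column}_year"] = df[column].dt.year
    df[f"{column}_month"] = df[column].dt.month
    df[f"{column}_dayofweek"] = df[column].dt.dayofweek  # Monday=0 ... Sunday=6
    df[f"{column}_hour"] = df[column].dt.hour
    return df

# Toy timestamps (made up).
users = pd.DataFrame({
    "timestamp_first_active": pd.to_datetime(
        ["2014-01-01 21:36:00", "2013-07-04 08:05:00"]
    ),
})
users = add_date_parts(users, "timestamp_first_active")
print(users.T)
```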

Some records have ‘NaT’ values in the date-time columns, which causes a lot of issues when creating and plotting graphs and when analyzing the data. Hence, let’s impute all records having this value with 0.

Let’s create columns that hold the lead time between the account creation date, the first active timestamp, and the booking date. This will help us identify user behavior and activities.
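A minimal sketch of these lead-time columns on toy rows. The three `lead_*` column names are hypothetical, and the sketch leaves the lead time as NaN when there is no booking rather than imputing a value.

```python
import pandas as pd

# Toy rows (made up); the third user has not booked yet.
users = pd.DataFrame({
    "date_account_created": pd.to_datetime(
        ["2014-01-10", "2014-02-01", "2014-03-05"]
    ),
    "timestamp_first_active": pd.to_datetime(
        ["2014-01-10 09:30:00", "2014-01-28 18:00:00", "2014-03-05 12:00:00"]
    ),
    "date_first_booking": pd.to_datetime(["2014-01-10", "2014-02-15", None]),
})

# Compare calendar dates only, so the time of day of the first activity is ignored.
first_active_date = users["timestamp_first_active"].dt.normalize()

# Days from account creation to first booking (NaN when there is no booking).
users["lead_created_to_booking"] = (
    users["date_first_booking"] - users["date_account_created"]
).dt.days
# Days from first activity to first booking.
users["lead_active_to_booking"] = (
    users["date_first_booking"] - first_active_date
).dt.days
# Days from account creation to first activity
# (negative means the user was active before creating the account).
users["lead_created_to_active"] = (
    first_active_date - users["date_account_created"]
).dt.days

print(users.filter(like="lead_"))
```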

In the majority of cases (the 50th percentile), the time difference between the first activity and the booking date is 0. In a few cases it took much longer, and it is quite possible that some people take a long time to make a booking decision.

The negative number indicates that the majority of customers were first active on the application, tried out a few things, and then created an account. In a few cases it took much longer, and it is quite possible that some people took some time to create an account.

Handling Null Values in the Dataset

In the dataset, there are missing values in the first_affiliate_tracked and age features. To ensure the dataset is ready for analysis and modeling, it’s important to address these null values appropriately.

Handling Null Values in first_affiliate_tracked

Identify Null Values: Locate all records with missing values in the first_affiliate_tracked feature.

Imputation with “untracked”: Replace these missing values with “untracked.” This is a valid approach when no specific tracking information is available for these entries.

Handling Null Values in age

Identify Null Values: Locate all records with missing values in the age feature.

Imputation with Median Value: To impute missing ages, calculate the median age from the available data and fill in the missing values with this median. Imputing with the median is a common practice as it minimizes the impact of outliers and helps maintain the data’s central tendency.

Data Integrity: Ensure that the imputed values align with the nature of the data and do not introduce biases into your analysis or modeling.

Handling null values in these features ensures that you have a complete and consistent dataset for your data analysis or modeling tasks.
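Both imputations can be sketched in a few lines. The toy rows are made up; the fill value “untracked” and the use of the median follow the description above.

```python
import numpy as np
import pandas as pd

# Toy rows (made up) with gaps in both features.
users = pd.DataFrame({
    "first_affiliate_tracked": ["linked", np.nan, "omg", np.nan],
    "age": [25.0, np.nan, 40.0, 31.0],
})

# No tracking information available: label these rows "untracked".
users["first_affiliate_tracked"] = users["first_affiliate_tracked"].fillna("untracked")

# The median ignores missing values and is robust to outliers.
median_age = users["age"].median()
users["age"] = users["age"].fillna(median_age)

print(users)
```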

Handling the Session Data Frame with Duplicates and Large Records

Managing a session data frame with duplicates and a large number of records can be challenging. To ensure data quality and manageability, consider the following steps:

  1. Group by user_id: Group the session data frame by the user_id column to consolidate data for each unique user.
  2. Aggregating secs_elapsed: Calculate the mean of secs_elapsed for each user. This aggregation can provide a representative measure of time spent across multiple sessions.
  3. Enlist Strings as Lists: The sessions data frame contains string values that need to be aggregated, such as action, action type, and action detail. Collecting them as lists helps maintain the sequence and order of events for each user.
  4. Handling Duplicates: Ensure that duplicates are handled during the grouping process, and remove any remaining duplicates afterwards.
  5. Data Size Consideration: With a large dataset, it’s essential to optimize memory usage and processing speed. Aggregating data can significantly reduce the size and complexity of the dataset.
  6. Data Integrity: Verify the data integrity and correctness of the aggregation results to ensure that your data accurately represents the user sessions.

Handling the session data frame with these steps allows you to obtain a more manageable and structured dataset for further analysis or modeling, especially when dealing with a large volume of records and duplicates.

Custom code rationalizes the different browsers by merging the most infrequently used browsers into one category and all mobile browsers into another.
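The notebook’s version is not reproduced here, so the following is only a sketch of the idea. The `rationalize_browsers` helper, the rule that a browser counts as mobile when its name mentions “Mobile” or “Android”, and the `min_share` cut-off are all my assumptions; the toy counts are made up.

```python
import pandas as pd

def rationalize_browsers(series, min_share=0.05):
    """Collapse mobile browsers into 'Mobile' and rare browsers into 'Other'."""
    # Assumption: a browser is "mobile" if its name mentions Mobile or Android.
    is_mobile = series.str.contains("Mobile|Android", case=False, na=False)
    merged = series.mask(is_mobile, "Mobile")
    # Share of each remaining browser after the mobile merge.
    share = merged.value_counts(normalize=True)
    rare = share[share < min_share].index
    return merged.mask(merged.isin(rare), "Other")

# Toy browser column (made-up counts).
browsers = pd.Series(
    ["Chrome"] * 10
    + ["Safari"] * 5
    + ["Mobile Safari"] * 3
    + ["Chrome Mobile"] * 2
    + ["SeaMonkey", "Opera"]
)
result = rationalize_browsers(browsers, min_share=0.1)
print(result.value_counts())
```

Merging the mobile browsers before computing shares keeps small mobile browsers from being swept into “Other” on their own.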

Rolling the session records up to one row per unique user_id reduces the record count to roughly 1/100th of the session table, which is far easier to manage than 10M rows.

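# Collect each user's string values into a list, preserving the event order
enlist = lambda x: x.to_list()
agg_methods = {'secs_elapsed': 'mean', 'action': enlist, 'action_type': enlist, 'action_detail': enlist, 'device_type': enlist}
grouped_sessions = df_sessions.groupby('user_id').agg(agg_methods)
# reset_index returns a new frame, so assign it back to keep user_id as a column
grouped_sessions = grouped_sessions.reset_index()
grouped_sessions.info()
grouped_sessions.head()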

We also create a new column that lets us identify whether the lead is still a prospect or a customer (has made at least one booking).
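A minimal sketch of such a prospect-or-customer flag, assuming a user counts as a customer once date_first_booking is filled in. The column name `lead_status` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); the second user has never booked.
users = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "date_first_booking": pd.to_datetime(["2014-02-15", None, "2013-11-02"]),
})

# A filled-in first booking date marks a customer; a missing one marks a prospect.
users["lead_status"] = np.where(
    users["date_first_booking"].notna(), "customer", "prospect"
)
print(users)
```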

Age Grouping for Trend Insights

In the quest to uncover trends across different age groups, we’re about to create a new feature — Age Group. Think of it as organizing our data audience into distinct categories, each with its own unique story to tell. 📈📊

Age Grouping Process:

  1. Categorizing Ages: Our journey begins by categorizing ages into meaningful groups. This segmentation helps us understand how different age brackets behave.
  2. Trend Identification: With age groups in place, we can now identify trends and patterns that vary across these categories. It’s like studying different chapters of a story.
  3. Insightful Analysis: This feature empowers us to perform insightful analyses specific to each age group, revealing their preferences, behaviors, and tendencies.

Age grouping isn’t just about creating a new feature; it’s about adding a dimension to our data that unlocks a deeper understanding of the diverse audience it represents. 🕵️‍♂️📊📈

We create a new feature, age group, to bin customers by age, which helps us identify trends across age groups. The country filter helps us check for destinations other than NDF, US and OTHER.
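The binning and the country filter could be sketched as follows. The bin edges and labels are assumptions (the post only mentions groups such as 31–45, 46–65 and over 65), and the sample rows are made up.

```python
import pandas as pd

# Assumed bin edges and labels; adjust them to match the notebook's groups.
AGE_BINS = [0, 18, 30, 45, 65, 120]
AGE_LABELS = ["0-18", "19-30", "31-45", "46-65", "65+"]

users = pd.DataFrame({
    "age": [17, 25, 38, 52, 70, None],
    "country_destination": ["NDF", "US", "FR", "other", "IT", "GB"],
})

# pd.cut uses right-closed intervals by default, so age 45 falls in "31-45".
users["age_group"] = pd.cut(users["age"], bins=AGE_BINS, labels=AGE_LABELS)

# Country filter: keep destinations other than NDF, US and other.
abroad = users[~users["country_destination"].isin(["NDF", "US", "other"])]
print(abroad[["age_group", "country_destination"]])
```

Rows with a missing age simply get a missing age group, so they can be handled separately later.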

The Dance of Data Harmony

In the world of data, there’s a magical phenomenon known as Data Correlation — a dance where variables reveal their connections. 📊🌟

Correlation Chronicles:

  1. Variable Pairs: Our dance begins by pairing up different variables, inviting them to the floor to see how they move in harmony.
  2. The Synchronization: As the music plays, we observe how these variables move together. Are they in sync or do they have their own rhythm?
  3. The Strength of Connection: We measure the strength of their connection. Is it a strong tango or a gentle waltz?
  4. Patterns Revealed: In this dance, patterns and relationships between variables come to life. Some move together, some lead, and others follow.

The Art of Insights:

Data Correlation is not just a dance; it’s a masterclass in understanding the interplay between variables. It helps us make informed decisions, guide our analysis, and unveil the hidden stories within the data. 🕺💃🔗

Unraveling the Correlation Mysteries

In our journey through the data landscape, we’ve stumbled upon the intricate world of Correlation Mapping. It’s like a treasure map with clues, revealing the connections between data features. 🗺️🔍

The correlation map does not show any strong relationships. Nonetheless, it provides the insights below.

  1. Inherent Correlations: Year, month, and day of the week are inherently related. For example, the year influences the month, and the month influences the day of the week. This can result in high correlations among these features.
  2. Correlation Among Differences: When you take the differences between date components (e.g., the time elapsed between two events), these differences can exhibit correlations. For instance, the difference in years between two dates may correlate with the difference in months or days.
  3. Interpretability: While these correlations may exist, they might not provide clear and meaningful insights for analysis or modeling. High correlations among date components may not convey substantial information about the relationships between variables.
  4. Dimensionality: Breaking down dates into components increases the dimensionality of the dataset, which can impact the interpretability of correlation matrices.

Keep in mind that the choice of date representation and the handling of correlations should align with the objectives of your analysis or modeling.
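To make the point about date components concrete, here is a small sketch that splits a date column into parts and computes their correlation matrix. The column names and sample dates are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Made-up dates standing in for a column such as date_account_created.
dates = pd.Series(pd.to_datetime(["2013-05-14", "2013-11-02", "2014-03-21", "2014-06-30"]))

# Break each date into numeric components.
parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "week_day": dates.dt.dayofweek,  # Monday = 0
})

# DataFrame.corr() computes pairwise Pearson correlation by default.
corr = parts.corr()
print(corr.round(2))
```

Every extra component adds a row and a column to this matrix, which is the dimensionality cost mentioned above.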

The Data Fusion Ballet

In the world of data, we’re about to witness a spectacular performance — the merging of the Session and Train User datasets. It’s like a grand ballet where two datasets gracefully intertwine to create a harmonious whole. 🩰📊

Merging Movements:

  1. Data Alignment: Our performance begins with aligning the two datasets, ensuring that they’re ready to dance in perfect harmony.
  2. Data Fusion: As the music starts, we merge the Session dataset with the Train User dataset. It’s a moment of magic when their attributes come together.
  3. Unified Insights: The result is a unified dataset that carries the essence of both worlds. It’s like a duet where each dataset plays a crucial role.
  4. Infinite Possibilities: With this merged dataset, we unlock a world of possibilities for analysis and modeling. It’s a powerful ensemble, ready to tell us a compelling data story.

The fusion of these datasets is not just a performance; it’s a symphony of data where insights harmonize, creating a rich and layered narrative. 🎵💫📊

We merge the Session and Train User datasets using a left join.
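A minimal sketch of the left join, assuming the Train User frame sits on the left with key `id` and the aggregated sessions on the right with key `user_id`. The sample rows are made up.

```python
import pandas as pd

users = pd.DataFrame({"id": ["u1", "u2", "u3"], "gender": ["FEMALE", "MALE", "FEMALE"]})
grouped_sessions = pd.DataFrame({"user_id": ["u1", "u3"], "secs_elapsed": [12.5, 3.0]})

# A left join keeps every user, even those without any session records.
merged = users.merge(grouped_sessions, how="left", left_on="id", right_on="user_id")
print(merged)
```

Users without sessions end up with missing values in the session columns, which is worth keeping in mind for the outlier and feature steps that follow.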

Let’s check for outliers in the merged dataset.
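One common way to flag outliers is the interquartile range (IQR) rule. The sketch below is an assumption about how such a check might look, not the notebook's actual method; the helper name `iqr_outliers` and the sample ages are hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside the Tukey fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up ages with one implausible entry.
ages = pd.Series([25, 31, 34, 38, 42, 47, 2014])
mask = iqr_outliers(ages)
print(ages[mask])
```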

Crafting Behavioral Insights

In our quest to understand customer behavior, we’re about to craft a set of new features that will illuminate the path ahead. It’s like forging a new tool to uncover the intricate patterns hidden within our data. 🔍📊

Feature Crafting Magic:

  1. New Feature Alchemy: Our journey begins with the alchemy of code. We create new features that count the unique string lengths within each row, for each column.
  2. Unveiling Behavior: These newly crafted features act like torches, shining a light on customer behavior. They reveal how customers interact with our data, what lengths of strings they prefer, and where patterns emerge.
  3. A World of Insights: With these features in hand, we open the door to a world of insights. We can now delve deeper into understanding customer behavior, drawing rich and meaningful conclusions.

This feature crafting isn’t just about code; it’s about empowering our data with the tools to decode the intricacies of customer behavior. 🛠️🎨🕵️‍♂️
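One way to read this step, sketched under assumptions: for each list-valued column produced by the session aggregation, count the distinct values in each row. The `<column>_unique` names and the sample rows are hypothetical.

```python
import pandas as pd

# Made-up rows shaped like the aggregated session data (one list per user).
grouped = pd.DataFrame({
    "action": [["show", "show", "create"], ["search"]],
    "device_type": [["Mac Desktop", "iPhone"], ["Mac Desktop", "Mac Desktop"]],
})

# For each list column, store the number of distinct values in each row.
for col in ["action", "device_type"]:
    grouped[f"{col}_unique"] = grouped[col].apply(lambda items: len(set(items)))

print(grouped[["action_unique", "device_type_unique"]])
```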

5. Univariate Exploration in the Data Symphony

In the grand orchestra of data, it’s time for a solo performance — Univariate Analysis. Imagine each variable as a unique instrument, and we’re about to listen to their individual melodies. 🎶📊

Univariate Showcase

Variable Spotlight: Our performance begins by shining a spotlight on a single variable at a time. Each variable gets its moment to shine on the stage.

Melodic Insights: We listen to the solo, observing the distribution, range, and characteristics of each variable. It’s like appreciating the nuances of each instrument.

Data Notes: As the melodies play, we take notes on the patterns, outliers, and behaviors. These insights help us understand the variable’s story.

The Bigger Picture: Univariate analysis is a crucial part of the symphony. It’s like examining each musical note before we create a harmonious composition with all the variables.

This solo exploration enriches our understanding of the data, allowing us to appreciate the unique qualities of each variable before they come together in the grand data orchestra. 🕵️‍♂️🎻📊

User Demographic Analysis

  1. Female users outnumber male users.
  2. The average age is the same across gender types.

  1. The 31–45 age group, followed by the 46–65 group, is the most inclined to use the app and make bookings.
  2. Even people over 65 use the app and like to travel, which is a good sign. Some of these accounts may have been created by children on behalf of their parents.

  1. People generally prefer to travel within the US, and a few prefer Europe as the next best option.
  2. Apart from domestic travel, customers prefer France, Italy and Great Britain as their first choices for international destinations.

  1. Interestingly, people prefer to create accounts on weekdays.
  2. Across the years, May to October are the peak months for account creation.

Almost 75% of actions and action types took barely more than a second, which also suggests that the app is stable and performs well.

6. Bivariate Analysis

Bivariate analysis — it’s like a dynamic duo exploring the data realm! 🕵️‍♂️🕵️‍♀️

Two’s Company In this thrilling data adventure, we’ll examine the connections and patterns between pairs of variables. It’s all about understanding how two different factors influence each other. 🤝🔍

Spotting Relationships We’ll uncover the hidden relationships, dependencies, and correlations that exist within the data. Think of it as Sherlock Holmes and Watson, but with data! 🕵️‍♂️🔍🧐

Seeing the Bigger Picture By studying the interactions between these pairs of variables, we gain a deeper insight into the story the data is trying to tell. It’s like solving a mystery, one correlation at a time. 📖🔍

Get Ready to Explore! Our data journey is about to get even more intriguing as we dive into the world of bivariate analysis. Get ready to uncover the secrets that data pairs hold! 🌟📊

In both scenarios, whether the customer prefers domestic or international travel, the patterns below are common.

  1. Female users show a stronger preference for travel than male users.
  2. France, Italy and Great Britain are preferred across gender types, matching the age group preference.

  1. France, Italy and Great Britain are preferred across age groups.
  2. The 19–45 age group has the highest preference for European countries.

  1. Interestingly, people prefer weekends for their first booking, in contrast to their weekday preference for account creation.
  2. Again, March to June are the peak months for first booking activity.
  3. Year on year, usage grows exponentially, an important indicator that people like the app and refer it to others.
  4. 6 PM to 6 AM is the peak time of day for first bookings.

Uncovering Source Preferences

In our data journey, we’re embarking on a quest to understand the preferences of different source groups. 🚀

Destination: Preferred Targets Our destination is to find the most preferred targets for each source group, all while calculating the percentages to uncover the landscape of choices. 🌟

Example: Age Groups (Source) Picture it as travelers of different ages (source) choosing their preferred paths (targets) for a journey. We’ll calculate the percentage of each choice and unveil the top n destinations with the highest percentage and count. 🗺️📊

Adventure Begins! Let’s dive into the data wilderness and reveal the favored destinations of our diverse source groups. 🌍💼

Below is the custom code to get the most preferred target data based on source data.

def getonegroupcounts(df_train_users_2, id, source_col, target_col, n=3):
    # Group the DataFrame by source column and target column and count users
    grouping = df_train_users_2.groupby([source_col, target_col])[id].count().reset_index()
    grouping = grouping.rename(columns={id: 'count'})
    # Calculate the percentage within each source group
    grouping['percentage'] = grouping.groupby(source_col)['count'].transform(lambda x: (x / x.sum()) * 100).round(0)
    # Find the n most preferred targets for each source group
    preferred_response = grouping.sort_values(by='count', ascending=False).groupby(source_col).head(n)
    preferred_response = preferred_response.sort_values(by=[source_col, 'count'], ascending=[True, False])
    return preferred_response

df = df_train_users_2
id = 'id'  # name of the user id column (note: shadows the built-in id())
base_column = 'age_group'
getonegroupcounts(df, id, base_column, 'country_destination')

Similarly, we can get information about each target column based on base_column = age_group. We can also use gender or country destination as the source column.

getonegroupcounts(df, id,base_column,'affiliate_provider')
getonegroupcounts(df, id,base_column,'first_browser')
getonegroupcounts(df, id,base_column,'gender')
getonegroupcounts(df, id,base_column,'first_affiliate_tracked')
getonegroupcounts(df, id,base_column,'affiliate_channel')
getonegroupcounts(df, id,base_column,'language')
getonegroupcounts(df, id,base_column,'signup_method')
getonegroupcounts(df, id,base_column,'signup_flow')

7. Multivariate Analysis

Unraveling the Data Enigma: Multivariate Analysis - where data detectives gather all the clues for the full story! 🕵️‍♂️🕵️‍♀️🕵️

A Full Cast of Characters In this thrilling data adventure, we’re not just dealing with pairs; we’ve got a whole ensemble of variables to analyze. It’s like a detective team assembling for a complex case. 🕵️‍♂️🕵️‍♀️🔍

Detective Work in Full Swing We’re on a mission to uncover the intricate web of connections, dependencies, and interactions between multiple variables. It’s akin to solving a high-stakes mystery, where every piece of evidence matters. 🧩🔍🔦

Cracking the Case As data detectives, our goal is to piece together the complete story — to understand how various factors influence each other and the bigger picture they paint. It’s all about finding the truth hidden within the data. 📖🔍🧐

Gather Your Team Just like a detective needs a trusty team, we’ll rely on data analysis techniques and tools to crack this data case wide open. Get ready for a multivariate analysis adventure like no other! 🌟📊🔍

  1. Looking at destination preference across gender and age group, female travelers to France are on average younger than male travelers, but as the average age increases, females prefer Great Britain. Males, on the contrary, show no such preference.
  2. Interestingly, users in the “others” group prefer Italy and Canada more. These are people who are still on the fence, exploring and trying out the app; once they make up their minds, I think they fill in all their personal details.
  3. We can say that people who fill in all their details are more serious about travelling and would be the prime customer type for this application.

Exploring Sources

In our data adventure, we’re diving deep into the details with not one, but two sources! 🌟

Destination: Preferred Targets for Dual Sources Our goal is to discover the most preferred destinations for each dynamic duo of source groups. 🌍

Example: Age Group (Source1) and Gender (Source2) Imagine travelers of various ages (Source1) and genders (Source2) embarking on a quest to choose their ideal travel destinations. We’ll calculate the percentage of each choice, revealing the top n destinations with the highest percentage and count. 🗺️🚶‍♀️🚶

Ready for the Journey? It’s time to embark on this data expedition, where we’ll uncover the favored destinations of specific age and gender combinations. Let’s make this data adventure a memorable one! 🚀📊

Below is the custom code to get the most preferred target data based on source data1 and source data2.
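A minimal sketch of such a two-source helper, mirroring `getonegroupcounts` above. The name `gettwogroupcounts`, its signature and the demo rows are illustrative assumptions, not the notebook's exact code.

```python
import pandas as pd

def gettwogroupcounts(df, id_col, source1, source2, target_col, n=3):
    """Top-n targets (count and percentage) for each (source1, source2) pair."""
    keys = [source1, source2]
    grouping = (
        df.groupby(keys + [target_col])[id_col].count()
        .reset_index()
        .rename(columns={id_col: "count"})
    )
    # Percentage of each target within its (source1, source2) group.
    totals = grouping.groupby(keys)["count"].transform("sum")
    grouping["percentage"] = (grouping["count"] / totals * 100).round(0)
    # Keep the n most frequent targets per group, then order for readability.
    top = grouping.sort_values("count", ascending=False).groupby(keys).head(n)
    return top.sort_values(keys + ["count"], ascending=[True, True, False])

# Made-up rows for demonstration only.
demo = pd.DataFrame({
    "id": list("abcdef"),
    "age_group": ["31-45"] * 4 + ["46-65"] * 2,
    "gender": ["FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "MALE"],
    "country_destination": ["US", "US", "FR", "US", "IT", "IT"],
})
result = gettwogroupcounts(demo, "id", "age_group", "gender", "country_destination", n=1)
print(result)
```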

Similarly, we can get information about each target column based on base column = age_group and base2 column = gender. We can also use gender or country destination as a source column.

In our data expedition, we’re not just dealing with one source — we’ve got two sources bringing the excitement! 🌟

Destination: Top Picks for Dynamic Duos Our quest? Discover the most sought-after destinations for each fantastic duo of source groups, all while using the mean of the “mean seconds elapsed” column to guide us. 🗺️⏱️

For Example: Age Group (Source1) and Gender (Source2) Picture travelers of varying ages (Source1) and genders (Source2) teaming up to select their dream destinations. We’ll calculate the percentage of each choice, and then unveil the top n destinations with the highest percentage and count based on the mean seconds elapsed for each action methods.🚶‍♂️🚶‍♀️⏳

Ready for the Grand Adventure? It’s time to dive into the data landscape and uncover the favored destinations for specific age and gender combinations, all with the guidance of the mean seconds elapsed. Get ready for a data journey like no other! 🚀📊

8. Decoding the Clues: Feature Importance Analysis

Feature importance analysis — where we act as data detectives to uncover the key players in the data story! 🕵️‍♂️🕵️‍♀️🔍

The Lineup of Suspects In this thrilling data investigation, we’re tasked with identifying the most influential features in the data. It’s like sifting through a lineup of suspects to find the true culprits. 🕵️‍♂️🕵️‍♀️🔍

Evidential Analysis Our job is to collect the evidence and assess the impact of each feature on the overall data scenario. It’s akin to connecting the dots in a complex case to find the critical clues. 🧩🔍📊

Cracking the Data Case As data detectives, we aim to reveal the most important factors that shape the data’s narrative. It’s all about understanding the significant variables and their role in the bigger data picture. 📖🔍🧐

Our Trusted Tools Just like a detective relies on tools and techniques, we’ll use data analysis methods to unravel the feature importance mystery. Get ready for a data investigation that will unveil the data’s most critical players! 🌟📊🔍

Cramér’s V: Measuring the Strength of Association (Feature Importance)

Cramér’s V is a powerful metric for evaluating associations between two categorical variables. It quantifies the degree of dependence between them and is derived from the chi-squared (χ²) statistic. Cramér’s V allows us to understand the strength of this association, with interpretations as follows:

0: No association between variables.

0.1: Weak association.

0.3: Moderate association.

0.5: Strong association.

1: Perfect association.

Understanding Cramér’s V:

Contingency Table: It all starts by creating a contingency table. This table provides a visual representation of the frequencies of each combination of categories from the two categorical variables, offering insights into the relationships between them.

Chi-Squared Statistic: Cramér’s V is computed from the chi-squared (χ²) statistic. The χ² statistic measures the extent to which observed frequencies deviate from expected frequencies in the contingency table. A high χ² value suggests a substantial association between the variables.

Normalization: To obtain Cramér’s V, the chi-squared statistic is normalized by the number of observations (n) and the minimum of the number of categories in each variable minus 1. This normalization scales the value between 0 and 1, making it easier to interpret.

Interpreting Cramér’s V:

The value of Cramér’s V directly reflects the strength of the association between variables. A Cramér’s V of 0 signifies no association, while increasing values indicate stronger associations. Cramér’s V is a versatile tool applicable across various fields, including social sciences, marketing, and data analysis. It empowers you to comprehend the relationships between categorical variables and facilitates informed decision-making based on the strength of these relationships.

When interpreting Cramér’s V results, consider the following:

Cramér’s V provides a standardized measure of association, enabling easy comparisons of association strength across different pairs of categorical variables.

A higher Cramér’s V suggests a more robust relationship between variables. The closer the value is to 1, the more perfect the association.

The importance of Cramér’s V in your analysis depends on the specific context of your research or business problem. For some issues, understanding the strength of association between variables is crucial, while for others, it may hold less significance.

In essence, Cramér’s V is a valuable instrument for evaluating the strength of associations between categorical variables. Its relevance in your analysis hinges on the particular research or business challenge you aim to address.
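The calculation described above can be sketched in a few lines. This is a minimal version without bias correction that computes the chi-squared statistic by hand; the function names, the band edges in `association_label` and the sample data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1))), no bias correction."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return 0.0  # a constant variable carries no association to measure
    return float(np.sqrt(chi2 / (n * min_dim)))

def association_label(v: float) -> str:
    """Bucket V using the reference points above (band edges are an assumption)."""
    if np.isclose(v, 1.0):
        return "perfect"
    if v >= 0.5:
        return "strong"
    if v >= 0.3:
        return "moderate"
    if v >= 0.1:
        return "weak"
    return "none"

gender = pd.Series(["F", "F", "M", "M"])
same_split = pd.Series(["FR", "FR", "US", "US"])  # fully determined by gender
even_split = pd.Series(["FR", "US", "FR", "US"])  # independent of gender
print(cramers_v(gender, same_split), cramers_v(gender, even_split))
```

Note that an identifier column with one unique value per row always scores a perfect association with any target, so a V of 1 is not automatically a useful signal.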

Custom Code

print(f'no_association: {no_association}')
print(f'weak_association: {weak_association}')
print(f'moderate_association: {moderate_association}')
print(f'strong_association: {strong_association}')
print(f'perfect_association: {perfect_association}')

Association Strength (Feature Importance)

Weakly associated columns with destination country target column:

  • gender
  • signup method
  • language
  • affiliate channel
  • affiliate provider
  • first affiliate tracked
  • signup app
  • first device type
  • first browser
  • timestamp_first_active_year
  • timestamp_first_active_month
  • timestamp_first_active_day
  • timestamp_first_active_week_day
  • device_type_top_type

Moderately associated columns with destination country target column:

  • timestamp_first_active_dt
  • device_type

Strongly associated columns with destination country target column:

  • action
  • action_type
  • action_detail

Perfectly associated columns with destination country target column:

  • id
  • country_destination
  • date_first_booking_category

9. Data Detective on the Case: Discovering Trends

Trends — we’re the data detectives solving the case of hidden patterns! 🕵️‍♂️🕵️‍♀️📈

Cracking the Data Code In this thrilling data investigation, we uncover data’s hidden secrets, like detectives solving a mysterious case. It’s about spotting patterns, analyzing clues, and predicting the future. 🔍🧩📈

Solving the Data Mystery Our mission is to unveil the recurring themes and insights that lurk within the data, just like detectives revealing the truth behind the mystery. 📖📊

Predictive Prowess By studying trends, we’re equipped to make informed decisions and foresee what’s coming next — it’s like having a detective’s intuition for data. 🔮📊

Stay One Step Ahead Join us on this data detective journey as we solve the data case, stay one step ahead of trends, and make informed decisions based on our findings! 🌟📈🕵️‍♂️

First Booking Inference

  • Weekdays are the most preferred days for booking.
  • Booking is most common between March and June (historically, May to October had the most bookings, but this shifted to March to June in 2014).
  • Booking trends are similar for people who have already booked their first destination and those who have not yet booked.

User Behavior and Preferences Inference

Most Preferred Action Types and Details

  • Action type 0 is the most preferred across age and gender, regardless of booking status.
  • Action details 0 and 9 are the most preferred by all age groups and genders.
  • Action types 0 and 2 are the most preferred by all age groups and genders.

User Behavior Insights

  • Users typically become active before creating an account.
  • Over 50% of users complete actions in 0 seconds.

The top five most common actions

  • create
  • show active
  • active show
  • create header userpic
  • header userpic create

The top five most common action details

  • signup
  • create user header userpic
  • header userpic create user
  • message post
  • login header

Lead Time Inference

Lead time analysis is the same across booked customers and prospective customers.

Females take longer than males to book their first trip:

  • After the timestamp_first_active time: around 1200 seconds more
  • After account creation: around 1200+ seconds more
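A lead-time comparison like this could be computed as sketched below, assuming lead time is the gap between `date_account_created` and `date_first_booking`. The sample rows are made up and do not reproduce the figures above.

```python
import pandas as pd

# Made-up rows; the real columns come from the Train User dataset.
users = pd.DataFrame({
    "gender": ["FEMALE", "FEMALE", "MALE", "MALE"],
    "date_account_created": pd.to_datetime(
        ["2014-03-01", "2014-03-05", "2014-03-02", "2014-03-07"]),
    "date_first_booking": pd.to_datetime(
        ["2014-03-04", "2014-03-06", "2014-03-03", "2014-03-08"]),
})

# Lead time in seconds between account creation and first booking.
users["lead_secs"] = (
    users["date_first_booking"] - users["date_account_created"]
).dt.total_seconds()

# Average lead time per gender.
mean_lead = users.groupby("gender")["lead_secs"].mean()
print(mean_lead)
```

The same pattern works for the gap measured from `timestamp_first_active` once that column is parsed into a datetime.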

Booking Times and Patterns Inference

Account Creation

  • Weekdays are the preferred days for creating accounts.
  • Account creation is most common between March and June.
  • Account creation is increasing throughout the year and stabilizes around the 7th month.

Booking

  • Weekdays are the most preferred days for booking.
  • Booking is most common between March and June.
  • Booking has shifted from May to October to March to June in recent years.

First Timestamp Trends

  • Friday to Sunday, along with Tuesday, are the busiest days for using the app for the first time.
  • March to June is the most popular period for using the app for the first time.
  • The most suitable time for using the app for the first time is from 6 PM to 6 AM.

Destinations Travel Inference

Preferred travel destinations Inference

  • US (59%)
  • Other (41%)
  • France (31%)
  • Italy (17%)
  • Great Britain (14%)
  • Spain (other)

Gender preferences

  • Both males and females show a preference for travel.
  • Females are more active than males across all destinations and age groups.
  • The 31–45 age group has the highest number of bookings, with the US as the most preferred destination, followed by “Other” countries, France, Italy and Great Britain.
  • France is the most preferred destination across all age groups and genders after the US.

User Demographics Inference

Gender Distribution

  • Unknown or Null Segment: The significant proportion of users falling under the “Unknown or Null” category (44.8%) highlights the need to improve data collection and user profiling. By better understanding this group, we can enhance personalization and engagement.
  • Balanced Gender Distribution: The nearly equal gender split, with 29.5% female and 25.5% male, indicates a broad user base. To ensure inclusive services and marketing, it is essential to cater to the preferences of both genders.
  • Preference Discrepancy: The preference discrepancy, where females slightly favor non-US destinations more than males, suggests the potential for gender-specific marketing strategies to enhance engagement and bookings.

Average Age by Gender

  • Uniform Average Age: The uniform average age of 35 across all genders (Male, Female, and Other) implies that age itself may not be a primary factor influencing user preferences. Businesses should consider a more holistic approach when developing marketing strategies.

Age Analysis

  • Age Distribution: Understanding the age distribution, with 55% falling in the 31–45 age group, is pivotal for targeting marketing efforts. This group’s preferences are critical for optimizing promotions and packages.
  • Preference Alignment: The shared preference for local US destinations across genders reinforces the importance of emphasizing domestic travel options.
  • Preferred Non-US Destinations: The preference for France, Italy, and Great Britain among both genders, with a slightly stronger inclination from females, highlights the opportunity for tailoring marketing strategies to accentuate these destinations, especially for female travelers.
  • Key Age Group (31–45): The highest app usage and a preference for France, Italy, and Great Britain within the 31–45 age group are noteworthy. This group should be a focal point for marketing initiatives.

By considering these insights, businesses can enhance personalization, refine marketing strategies, and optimize their offerings to cater to the diverse preferences of their user base. This approach can lead to increased user engagement and satisfaction.

Booking Channels and Devices Inference

Marketing Messages:

  • Age-Based Targeting: Tailor marketing messages based on age groups, with social affiliate tracking for users aged 19–45 and convenience-oriented affiliate tracking for users aged 45+.

Account Creation and Booking:

  • Optimal Timing: Focus marketing and onboarding efforts on weekdays and during March to June, aligning with user preferences.

First Timestamp:

  • Strategic Days: Concentrate marketing and support efforts on Fridays, Sundays, and Tuesdays, the preferred days for first app usage.
  • Seasonal Alignment: Adapt marketing strategies and feature releases to match the peak usage period from March to June.
  • Optimal Usage Hours: Consider targeted promotions during the evening and nighttime hours, from 6 PM to 6 AM.

Signup Method:

  • Streamlined Signup: Streamline and optimize basic or Facebook signup methods, which are preferred by 99% of users.

Language:

  • English Dominance: Prioritize English language options in content and communications, as 97% of users prefer it.

Affiliate Channel and Provider:

  • Channel and Provider Focus: Strengthen partnerships with Direct and SEM-Brand channels, as well as Direct and Google providers, which enjoy 88–90% user preference.

First Affiliate Tracked:

  • Tracking Preferences: Enhance collaborations with Linked and OMG, preferred affiliate tracking methods.

Signup App, Device Type, and Browser:

  • Optimized User Experience: Optimize the user experience for preferred options such as the web, Mac desktop, and browsers like Chrome, Safari, and Firefox to boost user engagement and satisfaction.

10. Navigating the Data Frontier: Actionable Business Recommendations for Success

As we conclude our data-driven journey, we’ve unearthed vital insights that can guide your business strategy and enhance user experiences. 🚀📈

I propose the following strategic recommendations, which will help the business position itself as the go-to destination for families who want to relax, explore, and try new cuisines.

  • Weekdays are the most preferred days for booking.
  • Booking is most prominent between March and October, with the peak booking period between March and June.

Target marketing messages to age groups:

  • 19–45: emphasize social aspects
  • 45+: focus on convenience and ease of use

Focus marketing efforts on weekdays between March and June:

  • Account creation
  • Booking

Target busiest days and most popular booking period for customer support and promotions:

  • Friday to Sunday, Tuesday
  • March to June

Make it easy to sign up using Basic and Facebook

  • Offer additional benefits for these methods

Ensure website and app are available in English and offer support in other languages

  • Chinese and French

Focus on Direct and SEM-Brand affiliate channels

  • Generate traffic to website and app

Work with Direct and Google affiliate providers

  • Promote products or services

Track first affiliate affiliations to better understand user behavior

  • 52% untracked, followed by Linked and OMG

Optimize website and app for web and iOS signup

  • Offer additional benefits for these methods

Optimize website and app for Mac desktop and Windows

  • Offer support for other devices, such as Android phones and tablets

Make website and app compatible with Chrome, Safari, and Firefox

  • Offer support for other browsers, such as Microsoft Edge and Internet Explorer

Destination Customization

  • Tailor marketing campaigns and service offerings to match the distinct preferences of travelers, with a strong emphasis on the U.S., “Other” destinations, France, Italy, and Great Britain.

Gender-Neutral and Gender-Specific Marketing

  • Implement a dual approach that recognizes the shared love for travel while empowering female travelers through specialized marketing content.

Age-Group Targeting

  • Develop targeted marketing strategies, promotions, and travel packages aimed at the 31–45 age group, considering their affinity for the U.S. and other favored destinations.

Continuous Analysis

  • Regularly monitor user preferences and adapt strategies in response to evolving trends in the travel landscape.

Account Creation

  • Weekday Registrations: Users favor weekdays for creating accounts. Targeted marketing and promotions on weekdays can boost sign-ups.
  • Seasonal Peaks: Account creation surges from March to June. Align promotions with this period for maximum impact.
  • Year-Long Growth: Sustained growth throughout the year requires consistent user acquisition efforts.

Booking

  • Weekday Bookings: Similar to account creation, users prefer weekdays for booking. Focus marketing efforts and real-time support on these days.
  • Seasonal Booking Peaks: Bookings peak between March and June. Tailor deals and offers to this seasonal trend.
  • Changing Patterns: Recent shifts from May to October to March to June indicate evolving preferences. Stay agile in adapting marketing strategies.

First Timestamp Trends

  • Peak Usage Days: Friday to Sunday and Tuesday are the busiest for first-time app usage. Plan customer support and promotions accordingly.
  • Seasonal Usage: Concentrate efforts between March and June to align with heightened user activity.
  • Optimal Usage Hours: Evening and nighttime usage (6 PM to 6 AM) should be optimized with tailored features and promotions.

Gender-Based Lead Time Disparities

  • Delayed Female Bookings: Females take longer than males to book their first trip. This insight implies a need for gender-specific approaches in the booking process to address potential barriers or concerns that may lead to delayed bookings for female users.
  • Timestamp-First Active: The approximately 1200-second delay in booking after the timestamp_first_active time for females underscores the importance of timely and personalized engagement strategies with female users. Targeted notifications and incentives could prompt quicker booking decisions.
  • Account Creation Impact: The extended lead time of around 1200+ seconds after account creation for females suggests a possible correlation between the account setup process and booking delays. Streamlining the account creation process, along with providing clear booking pathways, may help reduce this time gap.

Preferred Action Types and Details

  • Universal Preference for Action Type 0: Regardless of age and gender, action type 0 stands out as the most preferred choice for users, irrespective of their booking status. Leveraging this insight, we should ensure seamless access and execution of action type 0 to facilitate user engagement.
  • Action Details 0 and 9: The universality of action details 0 and 9 among all age groups and genders highlights their significance. These details should be integrated into our user interface design and workflow to cater to users’ preferences effectively.
  • Action Types 0 and 2: The common preference for action types 0 and 2 across age groups and genders underscores their importance. We should prioritize these action types in our service offerings and marketing strategies.
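To check whether one action type dominates across groups, session rows can be joined to user attributes and cross-tabulated. This is a minimal sketch with made-up rows; the column names (`id`, `user_id`, `gender`, `action_type`) and the integer-encoded action types are assumptions.

```python
import pandas as pd

# Made-up users and sessions; column names and encoded action types are assumptions.
users = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "gender": ["FEMALE", "MALE", "FEMALE"],
})
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3", "u3"],
    "action_type": [0, 2, 0, 0, 9, 0, 0],
})

# Attach each session row to its user's gender.
merged = sessions.merge(users, left_on="user_id", right_on="id", how="inner")

# Rows: gender, columns: action type, values: number of session rows.
counts = pd.crosstab(merged["gender"], merged["action_type"])

# Most frequent action type for each gender.
top_per_gender = counts.idxmax(axis=1)

print(counts)
print(top_per_gender)
```

The same pattern works for age groups or booking status by swapping the row variable in `pd.crosstab`.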

User Behavior Insights

  • Active Status Before Account Creation: The pattern of users becoming active before creating an account signifies the need to streamline the account creation process and provide engaging, active content early in the user journey.
  • Top Five Common Actions: Understanding the top five most common actions, such as “Create,” “Show Active,” “Active Show,” “Create Header Userpic,” and “Header Userpic Create,” allows us to tailor our platform’s layout and navigation to facilitate these actions for users.
  • Top Five Common Action Details: Recognizing the prominence of action details like “Signup,”
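Rankings such as the top five actions and action details can be produced with a simple frequency count. The helper `top_n` below is a hypothetical name, and the column names and toy values (other than "signup") are made up for illustration.

```python
import pandas as pd

# Toy sessions log; labels are illustrative, not the real distribution.
sessions = pd.DataFrame({
    "action": ["create", "show", "create", "active", "show", "create", "header_userpic"],
    "action_detail": ["signup", "view_listing", "signup", "signup",
                      "view_listing", "signup", "profile_pic"],
})


def top_n(frame: pd.DataFrame, column: str, n: int = 5) -> pd.Series:
    """Return the n most frequent values in a column, most common first."""
    return frame[column].value_counts().head(n)


print(top_n(sessions, "action"))
print(top_n(sessions, "action_detail"))
```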

Preferred Booking Days

  • Weekday Preference: The preference for booking on weekdays suggests that users are actively planning their trips during their typical workweek. This insight should guide the timing of marketing campaigns and promotional offers, with a particular focus on weekdays.

Seasonal Booking Peaks

  • Seasonal Patterns: Historically, bookings are most common between March and June, and in 2014 the peak window shifted from May through October to March through June. This highlights the significance of seasonality in travel bookings and offers an opportunity to align marketing efforts with peak booking periods through tailored promotions and packages for travelers planning trips during these months.

Consistency Across Booking Status

  • Consistent Trends: The similarity in booking trends for users who have already booked their first destination and those who have not yet booked indicates that these trends are not isolated to a specific user group. This consistency underscores the importance of considering these trends when developing marketing strategies, as they apply universally.

Click for Google Colab Complete Notebook Link

Next Steps — The Detective’s Toolkit:

Now, as we venture deeper into our data investigation, it’s time to unveil our secret weapons! Just like the clever detective who uses every trick in the book to crack the case, we’ll be deploying the likes of XGBoost, LightGBM, and CatBoost — our trusty sleuthing algorithms known for their unrivaled predictive accuracy.

These formidable tools are our Watson, our Sherlock, our Hercule Poirot — each with their unique flair for solving the classification mysteries hidden within our multi-category target variable. Stay tuned as we unravel the enigma, one prediction at a time! 🕵️‍♂️🔍

References:

  1. Kaggle: https://www.kaggle.com/competitions
  2. Google Colab: https://colab.research.google.com/
  3. Full Code: Google Colab Complete Notebook Link
