Feature Engineering with SQL: Painting a Canvas with Data

11 min readSep 20, 2023

Today, we are embarking on a journey through the fascinating world of data analysis, where we will unlock the secrets behind one of the most vital processes in data science: Feature Engineering. This is the art of selecting and transforming variables to create a rich, detailed dataset that can provide deeper insights and drive more effective decisions.

Before we dive into the heart of our topic, let’s take a moment to understand what feature engineering is and why it’s such a big deal. Imagine you are an artist, and you are given a palette of colours. Feature engineering is the process of mixing and combining those colours in various ways to create a beautiful painting — a portrait of your data that tells a compelling story.

In the realm of data science, feature engineering is used to refine and optimize raw data, helping to reveal hidden patterns and subtle nuances that can be leveraged to create more effective algorithms and models. It involves creating new features or modifying existing ones to better represent the underlying patterns in the data.

The Applications and Tools

Beyond crafting personalized perks for a travel startup’s customer segmentation, which is our central theme today, feature engineering has a wide range of applications, such as:

Fraud Detection: Helping in identifying unusual patterns and anomalies, thus playing a pivotal role in fraud detection systems.
Healthcare: Aiding in predicting diseases by analyzing various health parameters, thus being a lifesaver in the truest sense.
AI and Machine Learning: Acting as the backbone in the development of many AI and machine learning models, enhancing predictive accuracy and optimizing results across industries.

As for the tools, while we are showcasing the prowess of SQL today, it isn’t alone in this journey. Here are some other tools that stand tall in the field of feature engineering:

Python: Accompanied by libraries like Pandas and Scikit-learn, it offers a rich ecosystem for data analysis and feature engineering.
R: Especially revered in the statistical analysis landscape, it’s another fantastic tool in the arsenal.
Excel and Tableau: These tools aid in the preliminary stages of data exploration and visualization, making the process of feature engineering more intuitive and insightful.

Now, with a fresh palette of knowledge about feature engineering, let’s proceed to use SQL to paint a masterful portrait of data.

Understanding the Dataset

To begin with, let’s understand the context of the SQL query by taking a closer look at TravelTide — a fictional travel booking platform. The platform has initiated a rewards program to increase customer engagement. This program offers various exciting perks including free hotel meals, free checked bags, no cancellation fees, exclusive discounts, and a 1-night free hotel with a flight.

To tailor these perks effectively, we focus on a chosen cohort of users who have actively engaged with the platform, boasting more than seven sessions since January 4, 2023. This ensures a rich tapestry of substantial behavioural data to craft the rewards.

The dataset is stored in four primary tables:

Users: This table encompasses user demographics including unique ID, gender, marital status, home location, and details regarding their children, among others.
Sessions: Details of individual browsing sessions are captured here, recording vital data points such as session duration, discounts offered, and booking information, to name a few.
Flights: Here, we find specifics of the purchased flights, including the departure and return details, airline information, and fare details.
Hotels: This section contains all the information about the hotel stays booked by the users, including check-in/check-out times and the cost of stay.

With this solid foundation, we are ready to dive into the SQL query, eager to uncover the insights hidden in the data through the lens of feature engineering.

Crafting the Cohort

In the first part of the query, a cohort of users is defined based on their session data. We’re looking at users who have had more than seven sessions since a specific date. It’s like gathering a group of the most engaged travellers to understand their preferences better.

-- Cohort definition:
WITH cohort_users AS (
  SELECT user_id
  FROM sessions
  WHERE session_start > '2023-01-04'
  GROUP BY user_id
  HAVING COUNT(session_id) > 7
),

Going the Distance with Vincenty Formula

Next up in our SQL query is a series of computations involving the Vincenty formula, a precise method to calculate the distance between two points on the surface of an ellipsoid. It’s like a GPS system pinpointing the exact distance between two locations on Earth, which, in our case, are the airports from where and to where the flights are booked.

To paint a clear picture, let’s break down the various steps involved in this meticulous calculation:

Setting the Stage with VincentyDistance: In this initial step, we define constants representing the Earth’s dimensions and then convert the latitudes and longitudes of the home and destination airports from degrees to radians. These constants are based on the WGS-84 ellipsoid parameters, providing us with the most accurate representation of the Earth’s shape.

-- Using for calculating the distance between two airports later:
VincentyDistance AS (
    SELECT
        s.user_id,
        f.trip_id,
        -- Constants for WGS-84 ellipsiod parameters
        6378137.0 AS a, -- semi-major axis in meters
        6356752.3142 AS b, -- semi-minor axis in meters
        1/298.257223563 AS f, -- flattening
        -- Convert degrees to radians for latitudes and longitudes
        RADIANS(u.home_airport_lat) AS lat1,
        RADIANS(u.home_airport_lon) AS lon1,
        RADIANS(f.destination_airport_lat) AS lat2,
        RADIANS(f.destination_airport_lon) AS lon2
    FROM sessions AS s
    JOIN flights AS f ON s.trip_id = f.trip_id
    JOIN users AS u ON s.user_id = u.user_id
    WHERE s.user_id IN (SELECT user_id FROM cohort_users)
),

2. Kickstarting Calculations with VincentyInitialComputations: Here, we delve deeper, calculating the difference in longitude between the two locations and initiating the Vincenty formula with a series of trigonometric calculations involving the latitudes of the two points. This is where we start to see the formula coming to life, setting the stage for more detailed computations.

-- Using for claculating the distance:
VincentyInitialComputations AS (
    SELECT
        user_id,
        trip_id,
        a,
        b,
        f,
        lat1,
        lon1,
        lat2,
        lon2,
        -- Compute delta longitude
        lon2 - lon1 AS L,
        ATAN((1 - f) * TAN(lat1)) AS U1,
        ATAN((1 - f) * TAN(lat2)) AS U2
    FROM VincentyDistance
),

3. Diving Deeper with VincentyIntermediateComputations: This step is like the heart of the computation process, where we use the values calculated in the previous step to compute more intermediate values, including sinSigma and cosSigma. These values are derived through a series of mathematical operations involving the trigonometric attributes of the geographical coordinates. It is here that the true genius of the Vincenty formula starts to shine, paving the way for the final calculations.

-- Using for claculating the distance:
VincentyIntermediateComputations AS (
    SELECT
        user_id,
        trip_id,
        a,
        b,
        f,
        lat1,
        lon1,
        lat2,
        lon2,
        L,
        U1,
        U2,
        SQRT(
            (COS(U2)*SIN(L)) * (COS(U2)*SIN(L)) +
            (COS(U1)*SIN(U2) - SIN(U1)*COS(U2)*COS(L)) * 
(COS(U1)*SIN(U2) - SIN(U1)*COS(U2)*COS(L))
        ) AS sinSigma,
        SIN(U1)*SIN(U2) + COS(U1)*COS(U2)*COS(L) AS cosSigma
    FROM VincentyInitialComputations
),

4. The Final Lap with VincentyComputations: In this concluding step, we calculate several more parameters, including sinAlpha and cos2Alpha, which are vital in determining the distance. It is a series of intense calculations that bring us closer to finding the exact distance between the two points. The meticulous details involved in this step ensure that the formula can provide the most accurate distance, accounting for the Earth’s elliptical shape.

-- Using for claculating the distance:
VincentyComputations AS (
    SELECT
        user_id,
        trip_id,
        a,
        b,
        f,
        lat1,
        lon1,
        lat2,
        lon2,
        L,
        U1,
        U2,
        sinSigma,
        cosSigma,
        COS(U1)*COS(U2)*SIN(L) AS sinAlpha,
        SQRT(1 - COS(U1)*COS(U2)*SIN(L)*COS(U1)*
COS(U2)*SIN(L)) AS cos2Alpha,
        cosSigma - 2*SIN(U1)*SIN(U2) / 
(SQRT(1 - COS(U1)*COS(U2)*SIN(L)*COS(U1)*COS(U2)*SIN(L))) AS cos2SigmaM,
        a*1.0/(1.0-f) AS u_sq
    FROM VincentyIntermediateComputations
),

By sequentially progressing through these computational steps, each building upon the last, we are crafting a powerful tool that will allow us to find the most precise distances between airports. This series of “WITH” clauses work in harmony, like a well-conducted orchestra, to bring the Vincenty formula to life in our SQL environment, setting the stage for the grand finale where we aggregate and merge the data to paint the full picture.

The Heart of the Query: Aggregating and Merging Data

At the heart of our analytical expedition is the careful aggregation and merging of data, where we craft a rich palette of features painting a detailed picture of our users’ interactions with the TravelTide platform. Let’s dissect this segment step by step:

User Sessions and Engagement

session_count: Here we count the total number of distinct sessions a user has had, giving us a glimpse into how frequently they interact with the platform.
avg_session_duration_minute: We calculate the average session duration, in minutes, to understand the typical length of time users spend per session.
avg_page_clicks: By averaging the number of page clicks across all sessions, we get a sense of users’ browsing behaviour.

-- Aggregate and Merge Data
aggregated_data AS (
    SELECT
        s.user_id,
     -- total number of user's sessions
        COUNT(DISTINCT s.session_id) AS session_count,
     -- average duration of sessions in minute
     ROUND(AVG(EXTRACT(MINUTE FROM (session_end - session_start))),2) 
           AS avg_session_duration_minute,
     -- average number of clicks in all browsing sessions
     ROUND(AVG(page_clicks),2) AS avg_page_clicks,

Booking Behavior

total_trips: We identify the total number of trips booked, discounting cancellations, to understand users’ booking tendencies.
conversion_rate: This metric reveals the effectiveness of the platform in converting browsing sessions into bookings, offering a lens into the booking propensity of users.
cancellation_proportion: Calculating the proportion of booked trips that were cancelled helps in understanding the users’ indecisiveness or change in plans.

-- total number of booked trips
     COUNT(DISTINCT CASE WHEN NOT cancellation THEN s.trip_id END) 
          AS total_trips,
/* conversion rate, dividing the number of booked trips 
(in case of no cancellation) by total number of browising sessions */
     ROUND(
          CASE WHEN COUNT(DISTINCT s.session_id) > 0 THEN 
        1.0 * COUNT(DISTINCT CASE WHEN NOT cancellation THEN s.trip_id END)
 / COUNT(DISTINCT s.session_id) ELSE 0 END
          ,2) AS conversion_rate,
 /* Cancellation proportion, returns NULL for users who didn't book 
any trip to not get division by zero error */
     ROUND(
          1.0 * COUNT(DISTINCT CASE WHEN cancellation THEN s.trip_id END) /
          NULLIF(COUNT(DISTINCT CASE WHEN NOT cancellation THEN 
          s.trip_id END), 0),
          2) AS cancellation_proportion,

Booking and Departure Gap

avg_booking_departure_gap_days_flights/hotels: These attributes delve into the average time gap between booking and the actual departure (for flights) or check-in (for hotels), unravelling the booking habits of the users in terms of how much in advance they plan their trips. It could be useful for offering a No Cancellation Fees perk to last-minute bookers, as they might be uncertain travellers due to different reasons.

/* calculating the booking to departure time gap in seconds and 
then days by dividing by 86400 */
     ROUND(
         AVG(EXTRACT(EPOCH FROM (f.departure_time - s.session_end)) 
              / 86400), 
          2) AS avg_booking_departure_gap_days_flights,
/* As some users only booked hotels, I add another calculation considering 
hotel check_in_time and later in Python will merge these two columns */
     ROUND(
         AVG(EXTRACT(EPOCH FROM (h.check_in_time - s.session_end)) / 86400), 
          2) AS avg_booking_departure_gap_days_hotels,

Flight Details

total_flights_booked: It denotes the total number of flights booked by users, giving insight into their preference for flight bookings.
avg_flight_seats and avg_checked_bag: These fields give us the average number of seats booked and checked_bags in flights, respectively, a good indicator of group travellers.
weekend_trip_proportion: Here we focus on weekend gateways, identifying the proportion of trips that are short and span a weekend, to categorize users based on their weekend travel preferences.
round_trips_proportion: This attribute focuses on the proportion of round trips booked, shedding light on users’ preferences for return bookings.
avg_distance_flown_km: Utilizing the Vincenty formula, we calculate the average distance flown, giving a lens into the typical flight distances preferred by users.

-- total number of flights
        COUNT(DISTINCT CASE WHEN flight_booked THEN s.trip_id END) 
              AS total_flights_booked,
-- average number of booked flight seats
        ROUND(AVG(seats),2) AS avg_flight_seats,
-- average number of checked bags in flights
        ROUND(AVG(checked_bags),2) AS avg_checked_bags,
/* Weekend trips proportion, to distinguish weekened gateway 
   travelers, when the departure time is on Fridays or Saturdays, 
   and return_time is on Sundays or Mondays and the duration of 
   the trip is less than three days */
          ROUND(
          CASE WHEN COUNT(DISTINCT CASE WHEN NOT cancellation 
          THEN s.trip_id END) > 0 THEN
          1.0 * COUNT(DISTINCT CASE WHEN EXTRACT(DOW FROM departure_time) 
          IN (5,6) AND return_flight_booked IS TRUE 
          AND EXTRACT(DOW FROM return_time) IN (0,1) 
          AND EXTRACT(DAY FROM (return_time - departure_time)) < 3
          THEN f.trip_id ELSE NULL END) / 
                  COUNT(DISTINCT CASE WHEN NOT cancellation THEN 
          s.trip_id END) ELSE 0 END,
          2) AS weekend_trip_proportion,
-- Round trips proportion, users who booked two ways flights
          ROUND(
              CASE WHEN COUNT(DISTINCT CASE WHEN flight_booked 
              THEN s.trip_id END) > 0 THEN
              1.0 * COUNT(DISTINCT CASE WHEN return_flight_booked 
              THEN s.trip_id END) /  
                COUNT(DISTINCT CASE WHEN flight_booked THEN s.trip_id END) 
              ELSE 0 END,
                  2) AS round_trips_proportion,
/* Vincenty formula for average distance between airports 
 (average distance flown in km) */
          ROUND(
             AVG(
               v.a * ATAN2(
                 SQRT((v.cos2Alpha)*(v.sinSigma*v.sinSigma)),
                 v.cosSigma - v.f*v.cos2Alpha*(v.cos2SigmaM)
                  )/1000)::NUMERIC, 2
                ) AS avg_distance_flown_km,

Hotel Details

total_hotels_booked: It denotes the total number of hotels booked by users, giving insight into their preference for hotel bookings.
avg_hotel_rooms: This field gives us the average number of rooms booked in hotels, portraying the users’ preference for comfort and space, and also the indicator of group travellers.
avg_stay_duration_day: We calculate the average duration of hotel stays to understand users’ preferences for shorter or longer stays.

-- total number of booked hotels
     COUNT(DISTINCT CASE WHEN hotel_booked THEN s.trip_id END) 
            AS total_hotels_booked,
-- average number of rooms in booked hotels
     ROUND(AVG(rooms),2) AS avg_hotel_rooms,
-- average duration of hotel stays in days
     ROUND(AVG(EXTRACT(DAY FROM (check_out_time - check_in_time))),2) 
           AS avg_stay_duration_day

Financial Details

This part would be useful later for distinguishing price-sensitive customers to offer them the Exclusive Discount perk.

avg_flight_price_usd and avg_hotel_price_usd: These capture the average amount spent on flight bookings and hotel stays, respectively, offering insights into users’ spending patterns.
avg_flight_discount_amount and avg_hotel_discount_amount: Here we look at the average discount amount availed for flight and hotel bookings, respectively.
discounted_flight_proportion and discounted_hotel_proportion: These reveal the proportion of bookings where users availed discounts, indicating their penchant for savings.

-- average flight price 
     ROUND(AVG(base_fare_usd),2) AS avg_flight_price_usd,
-- average hotel price
     ROUND(AVG(hotel_per_room_usd),2) AS avg_hotel_price_usd,
-- average flight discount amount
     ROUND(AVG(flight_discount_amount),2) AS avg_flight_discount_amount,
-- average hotel discount amount
     ROUND(AVG(hotel_discount_amount),2) AS avg_hotel_discount_amount,
-- discounted flights proportion
     ROUND(SUM(CASE WHEN flight_discount THEN 1 ELSE 0 END) :: NUMERIC
     / COUNT(*),2) AS discounted_flight_proportion,
-- discounted hotel proportion 
     ROUND(SUM(CASE WHEN hotel_discount THEN 1 ELSE 0 END) :: 
     NUMERIC / COUNT(*),2) AS discounted_hotel_proportion,

Merging the Insights

In this segment of the query, we perform a series of LEFT JOIN operations to merge data from different tables based on specific common identifiers.

FROM sessions AS s
    LEFT JOIN flights AS f ON s.trip_id = f.trip_id
    LEFT JOIN hotels AS h ON s.trip_id = h.trip_id
   LEFT JOIN users AS u ON s.user_id = u.user_id
    LEFT JOIN VincentyComputations AS v ON s.trip_id = v.trip_id
    WHERE s.user_id IN (SELECT user_id FROM cohort_users)
    GROUP BY s.user_id
)

Users' interactions and preferences are meticulously recorded to create a personalized rewards program.

The Final Picture: Selecting the Features

In the final part of the query, we bring together all the features we created into a final dataset, including additional information from the users’ profiles. This final dataset is like a gallery of portraits, with each portrait providing a detailed depiction of a user, ready to be explored and understood in depth.

-- Final Selection
SELECT 
  -- user demographic information
    u.user_id,
    u.sign_up_date,
  EXTRACT(YEAR FROM AGE(u.birthdate)) AS age,
    u.gender,
    u.married,
    u.has_children,
    u.home_country,
    u.home_city,
    -- browsing sessions info
    ad.session_count,
    ad.avg_session_duration_minute,
    ad.avg_page_clicks,
    -- booking behaviour
    ad.total_trips,
    ad.conversion_rate,
    ad.weekend_trip_proportion,
    ad.cancellation_proportion,
    ad.avg_booking_departure_gap_days_flights,
    ad.avg_booking_departure_gap_days_hotels,
    -- booked flights info
    ad.total_flights_booked,
    ad.round_trips_proportion,
    ad.avg_flight_price_usd,
    ad.avg_flight_discount_amount,
    ad.discounted_flight_proportion,
    ad.avg_flight_seats,
    ad.avg_checked_bags,
    ad.avg_distance_flown_km,
    -- booked hotels info
    ad.total_hotels_booked,
    ad.avg_hotel_price_usd,
    ad.avg_hotel_discount_amount,
    ad.discounted_hotel_proportion,
    ad.avg_hotel_rooms,
    ad.avg_stay_duration_day
FROM users AS u
JOIN aggregated_data AS ad ON u.user_id = ad.user_id;

Conclusion: The Artistry Behind the Query

We have used SQL to perform feature engineering on raw data, which has helped us uncover valuable insights for customer segmentation. The marketing team can now leverage these insights to create personalized experiences for each group of travellers, making their journeys more memorable and enjoyable.

Remember, behind every great data-driven decision, there is a great feature engineering process. It is the unsung hero in the data science world, working quietly behind the scenes to create a rich, detailed, and nuanced view of the world.

🔍 Check out the full SQL query on my GitHub Gist

If you’re eager to delve deeper into the world of customer segmentation, I invite you to read my article, “Customer Segmentation with Fuzzy Segmentation: Using SQL and Python.” In this article, we dive into the fascinating field of fuzzy segmentation, building upon the final dataset resulting from the SQL query explained in this article.

Thank you for journeying with us through this artistic exploration of data. Stay curious, and happy data painting!