Building ETL Pipeline With SQL

Enhancing Flight Data Management

Published in

Art of Data Engineering

4 min readDec 12, 2023

Effective data management is critical in the aviation industry for informed decision-making, operational efficiency, and a smooth passenger experience.

This project focuses on creating and improving Extract, Transform, and Load (ETL) pipelines using SQL capable of handling massive volumes of flight-related data.

Objective

The primary objective of this project is to

Implement ETL pipelines tailored to the aviation sector.
Extract valuable insights from diverse data sources.
Transform raw information into meaningful metrics.
Load the refined data into a structured and efficient data model.
Enhance data quality and reliability.

Key components

Extract phase: This ensures the extraction of relevant information while considering factors such as temporal aspects and data relationships.

Transform phase: This phase involves data manipulation and aggregation. It aims to meet basic reporting requirements and also provides in-depth analytics and performance insights.

Load phase: This involves efficiently loading the transformed data into a structured database.

Extract

This involves extracting relevant information, including flight ID, flight number, departure and arrival airports, departure and arrival date time, airline code, and airline name. Here, I used the JOIN operator to combine information from the flights_table and airlines_table.

Also, I extracted information about airports, including airport code, airport name, city, and country, from the airports_table.

Information about airlines, including airline code, airline name, and country, from the airlines_table was also extracted.



WITH RankedFlights AS (
    SELECT
        flight_id,
        flight_number,
        departure_airport_code,
        arrival_airport_code,
        departure_datetime,
        arrival_datetime,
        airline_code,
        ROW_NUMBER() OVER (PARTITION BY airline_code ORDER BY departure_datetime) AS flight_rank
    FROM
        flights_table
    WHERE
        departure_datetime >= '2021-01-01' AND
        arrival_datetime <= '2021-12-31'
)

SELECT
    rf.flight_id,
    rf.flight_number,
    rf.departure_airport_code,
    rf.arrival_airport_code,
    rf.departure_datetime,
    rf.arrival_datetime,
    rf.airline_code,
    a.airline_name,
    rf.flight_rank
FROM
    RankedFlights rf
JOIN
    airlines_table a ON rf.airline_code = a.airline_code;

-- Extracting airport information with the number of incoming and outgoing flights
WITH AirportFlights AS (
    SELECT
        a.airport_code,
        a.airport_name,
        a.city,
        a.country,
        COUNT(DISTINCT f1.flight_id) AS outgoing_flights,
        COUNT(DISTINCT f2.flight_id) AS incoming_flights
    FROM
        airports_table a
    LEFT JOIN
        flights_table f1 ON a.airport_code = f1.departure_airport_code
    LEFT JOIN
        flights_table f2 ON a.airport_code = f2.arrival_airport_code
    GROUP BY
        a.airport_code, a.airport_name, a.city, a.country
)

SELECT
    af.airport_code,
    af.airport_name,
    af.city,
    af.country,
    af.outgoing_flights,
    af.incoming_flights
FROM
    AirportFlights af;

-- Extracting airline information with the average delay for each airline
WITH AvgDelayPerAirline AS (
    SELECT
        airline_code,
        AVG(arrival_delay) AS avg_arrival_delay,
        AVG(departure_delay) AS avg_departure_delay
    FROM
        flights_table
    WHERE
        arrival_delay IS NOT NULL AND
        departure_delay IS NOT NULL
    GROUP BY
        airline_code
)

SELECT
    a.airline_code,
    a.airline_name,
    a.country,
    adpa.avg_arrival_delay,
    adpa.avg_departure_delay
FROM
    airlines_table a
LEFT JOIN
    AvgDelayPerAirline adpa ON a.airline_code = adpa.airline_code;

Data transformation

This involves manipulating and transforming the extracted data to meet the requirements of the destination system.

Here, I used CTEs to combine data from the flights_table, airlines_table, and airports_table to create a denormalized view of the flight data.

I aggregated and summarized the transformed data, calculating metrics such as total flights, average arrival and departure delays, maximum and minimum delays, and counts of delayed and on-time flights for each airline and city pair.

Also, the CASE statements were used to categorize flights as delayed or on-time based on the arrival delay.


WITH TransformedFlights AS (
    SELECT
        f.flight_id,
        f.flight_number,
        f.departure_airport_code,
        f.arrival_airport_code,
        f.departure_datetime,
        f.arrival_datetime,
        f.airline_code,
        a.airline_name,
        f.arrival_delay,
        f.departure_delay,
        a.country AS airline_country,
        ap1.city AS departure_city,
        ap1.country AS departure_country,
        ap2.city AS arrival_city,
        ap2.country AS arrival_country
    FROM
        flights_table f
    JOIN
        airlines_table a ON f.airline_code = a.airline_code
    JOIN
        airports_table ap1 ON f.departure_airport_code = ap1.airport_code
    JOIN
        airports_table ap2 ON f.arrival_airport_code = ap2.airport_code
    WHERE
        f.departure_datetime >= '2021-01-01' AND
        f.arrival_datetime <= '2021-12-31'
)

-- Aggregating and summarizing flight data
SELECT
    airline_code,
    airline_name,
    airline_country,
    COUNT(DISTINCT flight_id) AS total_flights,
    AVG(arrival_delay) AS avg_arrival_delay,
    AVG(departure_delay) AS avg_departure_delay,
    MAX(arrival_delay) AS max_arrival_delay,
    MIN(departure_delay) AS min_departure_delay,
    departure_city,
    arrival_city,
    COUNT(DISTINCT CASE WHEN arrival_delay > 0 THEN flight_id END) AS delayed_flights,
    COUNT(DISTINCT CASE WHEN arrival_delay <= 0 THEN flight_id END) AS on_time_flights
FROM
    TransformedFlights
GROUP BY
    airline_code, airline_name, airline_country, departure_city, arrival_city;

Loading phase

This involves inserting the transformed data into the destination database or data warehouse. Here, it involves inserting the data into a star schema or a fact-dimension model.

Here, ‘fact_flights’ is a fact table containing key metrics about flights; ‘dim_airlines’ is a dimension table containing information about airlines; and ‘dim_airports’ is a dimension table containing information about airports.
The query below demonstrates the loading process into the star schema.



-- Loading transformed flight data

-- Loading into the fact table
INSERT INTO fact_flights (
    flight_id,
    flight_number,
    departure_airport_code,
    arrival_airport_code,
    departure_datetime,
    arrival_datetime,
    airline_code,
    arrival_delay,
    departure_delay
)
SELECT
    flight_id,
    flight_number,
    departure_airport_code,
    arrival_airport_code,
    departure_datetime,
    arrival_datetime,
    airline_code,
    arrival_delay,
    departure_delay
FROM
    TransformedFlights;

-- Loading into the airline dimension table (assuming it doesn't exist yet)
INSERT INTO dim_airlines (airline_code, airline_name, country)
SELECT DISTINCT
    airline_code,
    airline_name,
    airline_country
FROM
    TransformedFlights
ON CONFLICT (airline_code) DO NOTHING; -- Using ON CONFLICT to avoid duplicate entries

-- Loading into the airport dimension table
INSERT INTO dim_airports (airport_code, airport_name, city, country)
SELECT DISTINCT
    airport_code,
    airport_name,
    departure_city,
    departure_country
FROM
    TransformedFlights
UNION
SELECT DISTINCT
    airport_code,
    airport_name,
    arrival_city,
    arrival_country
FROM
    TransformedFlights
ON CONFLICT (airport_code) DO NOTHING; -- Using ON CONFLICT to avoid duplicate entries

Building an ETL (Extract, Transform, Load) pipeline with SQL offers a powerful and efficient solution for managing and transforming data within organizations. The use of SQL provides efficient interaction with databases, making it a versatile tool for handling diverse data sources and formats.

Thank you for reading! If you found this article interesting, do consider following me and subscribing to my latest articles. Catch me on LinkedIn and follow me on Twitter