Building ETL Pipeline With SQL

Enhancing Flight Data Management

Nnamdi Samuel
Art of Data Engineering
4 min readDec 12, 2023

--

Photo by Avel Chuklanov on Unsplash

Effective data management is critical in the aviation industry for informed decision-making, operational efficiency, and a smooth passenger experience.

This project focuses on creating and improving Extract, Transform, and Load (ETL) pipelines using SQL capable of handling massive volumes of flight-related data.

Objective

The primary objective of this project is to

  • Implement ETL pipelines tailored to the aviation sector.
  • Extract valuable insights from diverse data sources.
  • Transform raw information into meaningful metrics.
  • Load the refined data into a structured and efficient data model.
  • Enhance data quality and reliability.

Key components

Extract phase: This ensures the extraction of relevant information while considering factors such as temporal aspects and data relationships.

Transform phase: This phase involves data manipulation and aggregation. It aims to meet basic reporting requirements and also provides in-depth analytics and performance insights.

Load phase: This involves efficiently loading the transformed data into a structured database.

Extract

This involves extracting relevant information, including flight ID, flight number, departure and arrival airports, departure and arrival date time, airline code, and airline name. Here, I used the JOIN operator to combine information from the flights_table and airlines_table.

Also, I extracted information about airports, including airport code, airport name, city, and country, from the airports_table.

Information about airlines, including airline code, airline name, and country, from the airlines_table was also extracted.



WITH RankedFlights AS (
SELECT
flight_id,
flight_number,
departure_airport_code,
arrival_airport_code,
departure_datetime,
arrival_datetime,
airline_code,
ROW_NUMBER() OVER (PARTITION BY airline_code ORDER BY departure_datetime) AS flight_rank
FROM
flights_table
WHERE
departure_datetime >= '2021-01-01' AND
arrival_datetime <= '2021-12-31'
)

SELECT
rf.flight_id,
rf.flight_number,
rf.departure_airport_code,
rf.arrival_airport_code,
rf.departure_datetime,
rf.arrival_datetime,
rf.airline_code,
a.airline_name,
rf.flight_rank
FROM
RankedFlights rf
JOIN
airlines_table a ON rf.airline_code = a.airline_code;

-- Extracting airport information with the number of incoming and outgoing flights
WITH AirportFlights AS (
SELECT
a.airport_code,
a.airport_name,
a.city,
a.country,
COUNT(DISTINCT f1.flight_id) AS outgoing_flights,
COUNT(DISTINCT f2.flight_id) AS incoming_flights
FROM
airports_table a
LEFT JOIN
flights_table f1 ON a.airport_code = f1.departure_airport_code
LEFT JOIN
flights_table f2 ON a.airport_code = f2.arrival_airport_code
GROUP BY
a.airport_code, a.airport_name, a.city, a.country
)

SELECT
af.airport_code,
af.airport_name,
af.city,
af.country,
af.outgoing_flights,
af.incoming_flights
FROM
AirportFlights af;

-- Extracting airline information with the average delay for each airline
WITH AvgDelayPerAirline AS (
SELECT
airline_code,
AVG(arrival_delay) AS avg_arrival_delay,
AVG(departure_delay) AS avg_departure_delay
FROM
flights_table
WHERE
arrival_delay IS NOT NULL AND
departure_delay IS NOT NULL
GROUP BY
airline_code
)

SELECT
a.airline_code,
a.airline_name,
a.country,
adpa.avg_arrival_delay,
adpa.avg_departure_delay
FROM
airlines_table a
LEFT JOIN
AvgDelayPerAirline adpa ON a.airline_code = adpa.airline_code;

Data transformation

This involves manipulating and transforming the extracted data to meet the requirements of the destination system.

Here, I used CTEs to combine data from the flights_table, airlines_table, and airports_table to create a denormalized view of the flight data.

I aggregated and summarized the transformed data, calculating metrics such as total flights, average arrival and departure delays, maximum and minimum delays, and counts of delayed and on-time flights for each airline and city pair.

Also, the CASE statements were used to categorize flights as delayed or on-time based on the arrival delay.


WITH TransformedFlights AS (
SELECT
f.flight_id,
f.flight_number,
f.departure_airport_code,
f.arrival_airport_code,
f.departure_datetime,
f.arrival_datetime,
f.airline_code,
a.airline_name,
f.arrival_delay,
f.departure_delay,
a.country AS airline_country,
ap1.city AS departure_city,
ap1.country AS departure_country,
ap2.city AS arrival_city,
ap2.country AS arrival_country
FROM
flights_table f
JOIN
airlines_table a ON f.airline_code = a.airline_code
JOIN
airports_table ap1 ON f.departure_airport_code = ap1.airport_code
JOIN
airports_table ap2 ON f.arrival_airport_code = ap2.airport_code
WHERE
f.departure_datetime >= '2021-01-01' AND
f.arrival_datetime <= '2021-12-31'
)

-- Aggregating and summarizing flight data
SELECT
airline_code,
airline_name,
airline_country,
COUNT(DISTINCT flight_id) AS total_flights,
AVG(arrival_delay) AS avg_arrival_delay,
AVG(departure_delay) AS avg_departure_delay,
MAX(arrival_delay) AS max_arrival_delay,
MIN(departure_delay) AS min_departure_delay,
departure_city,
arrival_city,
COUNT(DISTINCT CASE WHEN arrival_delay > 0 THEN flight_id END) AS delayed_flights,
COUNT(DISTINCT CASE WHEN arrival_delay <= 0 THEN flight_id END) AS on_time_flights
FROM
TransformedFlights
GROUP BY
airline_code, airline_name, airline_country, departure_city, arrival_city;

Loading phase

This involves inserting the transformed data into the destination database or data warehouse. Here, it involves inserting the data into a star schema or a fact-dimension model.

Here, ‘fact_flights’ is a fact table containing key metrics about flights; ‘dim_airlines’ is a dimension table containing information about airlines; and ‘dim_airports’ is a dimension table containing information about airports.
The query below demonstrates the loading process into the star schema.



-- Loading transformed flight data

-- Loading into the fact table
INSERT INTO fact_flights (
flight_id,
flight_number,
departure_airport_code,
arrival_airport_code,
departure_datetime,
arrival_datetime,
airline_code,
arrival_delay,
departure_delay
)
SELECT
flight_id,
flight_number,
departure_airport_code,
arrival_airport_code,
departure_datetime,
arrival_datetime,
airline_code,
arrival_delay,
departure_delay
FROM
TransformedFlights;

-- Loading into the airline dimension table (assuming it doesn't exist yet)
INSERT INTO dim_airlines (airline_code, airline_name, country)
SELECT DISTINCT
airline_code,
airline_name,
airline_country
FROM
TransformedFlights
ON CONFLICT (airline_code) DO NOTHING; -- Using ON CONFLICT to avoid duplicate entries

-- Loading into the airport dimension table
INSERT INTO dim_airports (airport_code, airport_name, city, country)
SELECT DISTINCT
airport_code,
airport_name,
departure_city,
departure_country
FROM
TransformedFlights
UNION
SELECT DISTINCT
airport_code,
airport_name,
arrival_city,
arrival_country
FROM
TransformedFlights
ON CONFLICT (airport_code) DO NOTHING; -- Using ON CONFLICT to avoid duplicate entries

Building an ETL (Extract, Transform, Load) pipeline with SQL offers a powerful and efficient solution for managing and transforming data within organizations. The use of SQL provides efficient interaction with databases, making it a versatile tool for handling diverse data sources and formats.

Thank you for reading! If you found this article interesting, do consider following me and subscribing to my latest articles. Catch me on LinkedIn and follow me on Twitter

--

--

Nnamdi Samuel
Art of Data Engineering

Data Engineer💥Voracious Reader and a Writer || Chemical Engineer