From an email attachment to Airline business insights using AWS (PART 2)

Emiliano Conza
8 min read · Mar 28, 2023


An end-to-end data pipeline on AWS (step-by-step tutorial)

Introduction

Welcome back to Part 2 of this step-by-step tutorial on building an end-to-end data pipeline on AWS, aimed at collecting data from an imaginary Airline in the form of an email attachment, then processing and analyzing it to derive business insights.

AWS E2E Data Pipeline

In Part 1, we covered the Data Ingestion and Collection step and part of the Data Processing step, mirroring a real-life scenario in which an imaginary Airline sends flight-level data to our organization via email. We used Amazon WorkMail, Amazon SES, Amazon S3 and AWS Lambda to retrieve the email, extract and process the attachment, and save the data to the destination S3 bucket. If you haven’t read Part 1 yet, I highly recommend you go back and check it out before diving into this article, as it provides the context for what we will be discussing here.

In this Part 2, with the attachment now available in an S3 bucket on AWS, we have the opportunity to gather insights from it by building a serverless data analytics architecture for data processing and visualization.

In particular, we will use AWS Glue to crawl the S3 bucket, transform the data and catalog it. We will then connect Amazon Athena to S3 for ad-hoc querying and dataset creation, and use Amazon QuickSight for dashboard design.

So without further ado, let’s dive into it!

#1 — AWS Glue configuration

As per the AWS documentation, “AWS Glue is a scalable, serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.”

The first step is to set up a Glue crawler and automatically populate the Glue Data Catalog. Note: while not covered in this article, in such cases it is a best practice to also use a Glue ETL job to convert the .csv file stored in S3 into Parquet, a columnar format that will improve Athena query performance.
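
For reference, here is a minimal sketch of what such a Glue ETL job could look like (a standard Glue PySpark script). The source and target prefixes are placeholders and would need to match your own bucket layout:

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the .csv attachment data from the source prefix (placeholder path).
csv_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://airline-email-extract-attachment/airline/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write it back as Parquet to a separate prefix for Athena to query (placeholder path).
glue_context.write_dynamic_frame.from_options(
    frame=csv_dyf,
    connection_type="s3",
    connection_options={"path": "s3://airline-email-extract-attachment/airline-parquet/"},
    format="parquet",
)

job.commit()
```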

Crawler configuration steps below:

  1. Navigate to AWS Glue in the console and click on Crawlers -> Create crawler.
  2. Define a crawler name and choose S3 as the data source to be crawled. Enter the path of the S3 bucket containing the email attachment data (in my case, the airline-email-extract-attachment bucket).
  3. Create an IAM role so the crawler can access the S3 bucket.
  4. Create a database. This will be the database in the Glue Data Catalog where the table generated by the crawler will be stored.
  5. Choose a frequency. For this exercise I will use ‘On-demand’; the right choice depends on the use case. An option used in event-driven architectures is to fire a Lambda function as soon as the data lands in the S3 bucket (using an S3 event notification) and have the Lambda invoke the crawler right away. However, this is outside the scope of this article.
  6. The crawler is now created. Click on ‘Run crawler’ to trigger the run (a scripted equivalent using boto3 is sketched right after this list).
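
For those who prefer to script the same setup, here is a minimal boto3 sketch; the crawler name, IAM role, database name and S3 path are placeholders, and the role must already exist with read access to the bucket:

```python
import boto3

glue = boto3.client("glue")

# All names below are placeholders used for this walkthrough.
glue.create_crawler(
    Name="airline-attachment-crawler",
    Role="GlueCrawlerS3Role",   # IAM role with read access to the source bucket
    DatabaseName="airline_db",  # Glue Data Catalog database created for the table
    Targets={"S3Targets": [{"Path": "s3://airline-email-extract-attachment/airline/"}]},
)

# With an 'On-demand' schedule, we simply start the crawler ourselves.
glue.start_crawler(Name="airline-attachment-crawler")
```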

The run completed successfully. The ‘View Log’ feature provides access to the CloudWatch logs behind the crawler run.

Going to Data Catalog -> Tables, you can verify that the table has been created. It will have the same name as the S3 folder (in my case, airline).

Clicking on the table name, you can see the different columns and their data types. If a data type is wrong, you can update it; this will create a new version of the table.
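
The same check can also be done programmatically; a small boto3 sketch, assuming the placeholder database name airline_db used above:

```python
import boto3

glue = boto3.client("glue")

# Print the schema the crawler inferred for the 'airline' table.
table = glue.get_table(DatabaseName="airline_db", Name="airline")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```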

#2 — Amazon Athena configuration

With the Glue crawler run successfully completed and a table created in the Glue Data Catalog, we can now use Athena for ad-hoc querying. As per the official documentation, “Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives”.

Steps below:

  1. Navigate to Amazon Athena in the console and open the query editor.
  2. Before running your first query, you need to set up a query result location in Amazon S3. I created an airline-test-athena-output folder in the existing bucket and referenced it in the Athena settings.
  3. Select the database from the Athena dropdown menu and run a sample query. If everything is configured properly, the results show up at the bottom of the editor and are saved to the output folder set previously (a scripted equivalent using boto3 is sketched right after this list).
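
The same query can also be submitted programmatically; a minimal boto3 sketch, reusing the placeholder database name from the Glue step and the output folder configured above:

```python
import boto3

athena = boto3.client("athena")

# Database, table and output location follow the names used earlier in this tutorial;
# adjust them to your own setup.
response = athena.start_query_execution(
    QueryString="SELECT * FROM airline LIMIT 10",
    QueryExecutionContext={"Database": "airline_db"},
    ResultConfiguration={
        "OutputLocation": "s3://airline-email-extract-attachment/airline-test-athena-output/"
    },
)
print("Submitted query:", response["QueryExecutionId"])
```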

#3 — Amazon QuickSight configuration

As per AWS documentation, “Amazon QuickSight powers data-driven organizations with unified business intelligence (BI) at hyperscale”. Let’s use it to build a dashboard and gather some business insights from the flight data attachment.

Steps below:

  1. Navigate to QuickSight in the console and sign up if necessary.
  2. Go to Datasets -> New dataset and select Athena as the data source. Give it a name (I called mine AthenaDS_test_airline) and click Create. At this point, you are prompted to choose the table (in my case, airline). Once done, you can finish the dataset creation.
  • Note: you can import the data into SPICE for quicker analytics. SPICE (Super-fast, Parallel, In-memory Calculation Engine) is the robust in-memory engine that Amazon QuickSight uses. This is in contrast to the Direct Query approach, in which the Athena query is triggered every time you create, edit or view the dashboard.
    • In real life, it is largely a trade-off between performance and advanced features such as SageMaker ML data augmentation (SPICE) versus cost savings (Direct Query). For this exercise I selected Direct Query (a boto3 sketch showing where this choice lives follows right after this list).
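
If you prefer to script the dataset creation, here is a minimal boto3 sketch showing where the SPICE vs Direct Query choice lives (the ImportMode parameter). The account ID, data source ARN, IDs and column list are all placeholders and would need to match your own Athena data source and table schema:

```python
import boto3

quicksight = boto3.client("quicksight")

# All identifiers below are placeholders for this walkthrough.
quicksight.create_data_set(
    AwsAccountId="123456789012",
    DataSetId="airline-dataset",
    Name="AthenaDS_test_airline",
    ImportMode="DIRECT_QUERY",  # use "SPICE" to import the data into SPICE instead
    PhysicalTableMap={
        "airline-table": {
            "RelationalTable": {
                "DataSourceArn": "arn:aws:quicksight:eu-west-1:123456789012:datasource/athena-ds",
                "Schema": "airline_db",
                "Name": "airline",
                # Only a couple of columns are listed for brevity; in practice the
                # full table schema is required here.
                "InputColumns": [
                    {"Name": "flightdate", "Type": "STRING"},
                    {"Name": "flightnumber", "Type": "STRING"},
                ],
            }
        }
    },
)
```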

With the dataset created, it is now time to prepare the data for visualization. As stated at the beginning of the article, the final goal of the dashboard is to provide insights into the On-Time-Performance KPIs, checking in particular the D15 (Departure Delay ≥ 15 minutes) and A15 (Arrival Delay ≥ 15 minutes) metrics. Four main steps are needed to achieve this:

  1. Data exploration and cleansing
  2. Calculated fields creation
  3. Dashboard visuals design
  4. Dashboard publication

Data exploration and cleansing

  • Navigate to Datasets, select the previously created dataset and click on ‘Edit Dataset’.
  • Explore the data and check the data types. In particular, check the format of the flight dates and the flight departure/arrival times, which are saved as strings. Also, the departure/arrival fields only show the time, without the date.
  • A solution is to create a datetime field from the strings by using the parseDate function, concatenating the flight date with the time. Click on Add calculated field and create the datetime for each relevant metric (the scheduled arrival time is shown in the snapshot below); example expressions are sketched right after this list.
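
As an illustration, here is a sketch of the kind of expression involved. It is written in QuickSight’s calculated-field syntax but held in Python strings, simply to keep all code samples in this article in one language (such strings can also be supplied programmatically as calculated columns through the QuickSight API). The column names and the date format are assumptions about the attachment schema:

```python
# Hypothetical QuickSight calculated-field expressions: build proper datetimes by
# concatenating the flight date string with the scheduled/actual time strings.
# {FlightDate}, {ScheduledArrTime} and {ActualArrTime} are assumed column names,
# and 'yyyy-MM-dd HH:mm' an assumed format; adjust both to the actual data.
# In QuickSight these would be saved as calculated fields, e.g. ScheduledArrDT
# and ActualArrDT (assumed names, reused in the next sketch).
scheduled_arrival_dt = (
    "parseDate(concat({FlightDate}, ' ', {ScheduledArrTime}), 'yyyy-MM-dd HH:mm')"
)
actual_arrival_dt = (
    "parseDate(concat({FlightDate}, ' ', {ActualArrTime}), 'yyyy-MM-dd HH:mm')"
)
```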

Once done with the above, save and click on “Publish & Visualize” to finalize the dataset editing and open the dashboard.

Calculated fields creation

  • The KPIs to be created and shown are the ‘A15 OTP%’ and the ‘D15 OTP%’, which represent, respectively, the share of flights arriving and departing within fifteen minutes of schedule (i.e., the flights not counted as A15 or D15 delays) out of the total number of flights flown.
  • Click on Add -> Add calculated field.
  • Create the D15 and A15 OTP% KPIs:
  • Use the dateDiff function to also calculate the arrival and departure delays in minutes (example expressions are sketched right after this list).
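
Here is a sketch of those calculated fields, again in QuickSight expression syntax held in Python strings; all field names are the assumed ones carried over from the previous step:

```python
# Hypothetical delay calculations in minutes. {ScheduledDepDT}, {ActualDepDT},
# {ScheduledArrDT} and {ActualArrDT} are assumed names for the datetime calculated
# fields created in the previous step; verify dateDiff's argument order (and whether
# a signed difference is needed) against your own data.
dep_delay_minutes = "dateDiff({ScheduledDepDT}, {ActualDepDT}, 'MI')"
arr_delay_minutes = "dateDiff({ScheduledArrDT}, {ActualArrDT}, 'MI')"

# On-time share: flag a flight as 1 when its delay is under 15 minutes, 0 otherwise,
# then average the flag across flights; format the result as a percentage in the visual.
# {DepDelayMinutes} and {ArrDelayMinutes} reference the delay fields above by the
# (assumed) names they were saved under in QuickSight.
d15_otp_pct = "avg(ifelse({DepDelayMinutes} >= 15, 0, 1))"
a15_otp_pct = "avg(ifelse({ArrDelayMinutes} >= 15, 0, 1))"
```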

Dashboard visuals design

  • The D15 and A15 OTP% metrics can be shown using the KPI visual type. Under “Conditional formatting” you can set a color and a custom icon (in this case the arrow) when a certain condition/threshold is met. For D15 OTP% I chose a target of 85%, which is why the 90% result shows as green.
  • Note: in the real world, historical data would also be available. In that case, the KPI visual also helps to highlight the day-over-day (DoD) and week-over-week (WoW) deviations, providing additional insight to look at.
  • The top offenders for A15 and D15 delays can be visualized in a simple pivot table, adding flight number and Origin/Destination airports as additional fields.
  • For the Arrival and Departure delays, a stacked bar chart can be used to check the distribution of the delays among Origin and Destination airports.

Dashboard publication

  • Once the design is finalized, click on “Share -> Publish Dashboard” at the top right.
  • The final result is shown below. Our imaginary airline is at target for D15 OTP% while being below target for the A15 OTP%. FCO airport (Fiumicino) is among the top offenders for both departure and arrival delays.

#4 — Tutorial wrap-up and final remarks

This marks the conclusion of the step-by-step tutorial on building an E2E data pipeline on AWS, with the goal of deriving business insights from a simple email attachment containing flight data from an imaginary Airline. I hope you have found this exercise informative, and I encourage you to get started with AWS and experiment on your own. Thank you for taking the time to read this article; any feedback in the comments would be greatly appreciated!

Final remark: the above exercise is intended solely to help you familiarize yourself with AWS services and get hands-on with the cloud. A production deployment would require additional considerations that are outside the scope of this article. Once you are done, do not forget to delete the AWS resources used in this tutorial, so as not to incur extra costs.


Emiliano Conza

BI Engineer | Cloud technology enthusiast | Piano and guitar player. Opinions here are my own.