From an email attachment to Airline business insights using AWS (Part 1)

Emiliano Conza
8 min read · Mar 27, 2023


An end-to-end data pipeline on AWS (step-by-step tutorial)

Email is one of the most common and convenient communication tools used by individuals and businesses alike. When dealing with external parties, it is common to receive email attachments containing important data points that need further processing.

However, with the increasing volume of email attachments being sent and received, processing them manually can be a time-consuming and tedious task. This is where automation comes in: cloud computing platforms like Amazon Web Services (AWS) offer powerful tools for automating email attachment extraction and for building an end-to-end data analytics pipeline that turns the processed data into business insights.

It is also worth mentioning that, in the current IT era, most applications hosted across different environments communicate through APIs, with data transferred directly between systems in a structured, standardized format, without the need for manual processing. However, not all use cases are suitable for API-based communication.

In fact, in many companies, exchanging data through email attachments is a much more widespread practice than many people may realize. Ultimately, the choice between email attachments and API-based communication will depend on the specific needs of your organization and the type of information being exchanged.

By the end of this article, you will have a clear understanding of the benefits of using AWS services for automating email attachment processing, and how to get started with implementing these services in your own organization.

The entire process mirrors a real-world scenario of an imaginary airline sending flight-level data to your organization by email: once ingested and collected using Amazon WorkMail, Amazon SES, and Amazon S3, the data will be processed with an AWS Lambda function, catalogued using AWS Glue, and finally analyzed and visualized using Amazon Athena and QuickSight, resulting in a dashboard focused on the airline's OTP (On-Time Performance) KPIs. The final architecture diagram is shown below:

AWS E2E Data Pipeline

Note: Part 1 of this article covers only the data ingestion and collection piece. Check out Part 2 for the data processing, analytics, and visualization.

Let’s jump into it.

#1 — Get an Airline sample flight dataset from ChatGPT

The first step is to get some dummy data to use as our email attachment. Since we are mirroring the use case of an airline sending us flight data, I asked ChatGPT to generate a sample dataset with EU flight-level data, clearly stating the required columns. Alternatively, you could use platforms like Kaggle to access open-source airline datasets.

I then saved the data as a .csv file. ChatGPT generated flights from different airlines; however, for the sake of this article, we can assume the data belongs to the same imaginary airline and refers to the same day (05MAR23). In a real-world scenario, the airline would send an email with the flight data attached on a daily basis; here we will process only the 05MAR email/data. The file has been named '2023_03_05_Airline.csv'. In real life, the expectation is that the airline keeps this standard naming convention every day, which helps with proper partitioning of the data in the S3 data lake.
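
For reference, here is what an excerpt of such a file could look like. The columns and values below are purely illustrative assumptions on my side (scheduled vs. actual times are the fields that matter for the OTP KPIs later on):

```
flight_number,flight_date,origin,destination,scheduled_departure,actual_departure,scheduled_arrival,actual_arrival
AL101,2023-03-05,FCO,CDG,08:10,08:25,10:20,10:31
AL102,2023-03-05,MAD,AMS,09:40,09:38,12:15,12:10
AL103,2023-03-05,LIS,MUC,11:05,11:47,15:10,15:52
```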

#2 — Amazon WorkMail configuration

The second step of the E2E data pipeline is to create a dedicated email server as the recipient of the daily email sent by the imaginary airline, and to process that email to extract the attachment with the flight data. AWS comes to the rescue with Amazon WorkMail and Amazon SES (Simple Email Service).

As per the AWS documentation, Amazon WorkMail is a secure, managed business email and calendar service with support for existing desktop and mobile email client applications. For our use case, I'm going to create a dedicated email server. Follow the steps below to replicate it:

If you haven’t done it already, log into the AWS console and search for the WorkMail service.

  1. Select ‘Create Organization’ to provide email addresses to groups of users in your company.
  2. Select an email domain. The options are: a Route 53 domain, an external domain (e.g., hosted on a provider like godaddy.com), or a free test domain. For the sake of this exercise, I selected the free test domain option and set test-airline as the alias.
  3. Click on the newly created organization test-airline and go to Users -> Create user. Choose a user name and set the primary email address for this user. This is the address to communicate to the airline and where their emails will be received. In my case, the created address is flights-data@test-airline.awsapps.com. (A scripted boto3 equivalent of these steps is sketched after this list.)
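
If you prefer scripting over clicking, a rough boto3 equivalent of these console steps could look like the sketch below. Take it as an assumption-laden sketch rather than a drop-in script: the alias, user name, and password are placeholders, WorkMail is only available in a few regions, and the organization takes a little while to become Active after creation:

```python
import boto3

# WorkMail is only available in selected regions (e.g. eu-west-1, us-east-1, us-west-2)
workmail = boto3.client("workmail", region_name="eu-west-1")

# 1. Create the organization; with only an Alias, a free test domain
#    ({alias}.awsapps.com) is provisioned automatically
org = workmail.create_organization(Alias="test-airline")
org_id = org["OrganizationId"]

# 2. Create the user (password is a placeholder; use a strong secret)
user = workmail.create_user(
    OrganizationId=org_id,
    Name="flights-data",
    DisplayName="Flights Data",
    Password="ChangeMe-123!",
)

# 3. Enable the mailbox by registering the user with an email address
workmail.register_to_work_mail(
    OrganizationId=org_id,
    EntityId=user["UserId"],
    Email="flights-data@test-airline.awsapps.com",
)
```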

#3 — Amazon SES configuration

Following the creation of the email server in Amazon WorkMail, AWS automatically creates the test-airline.awsapps.com domain in SES and verifies its identity. Under ‘Email receiving’, it also creates an active rule set named INBOUND_MAIL.
Here we will add a new rule to the existing active rule set that delivers received emails to a predefined S3 bucket (for raw data storage). Steps to follow:

  1. Click on ‘Create rule’ under Configuration: Email Receiving -> Active Rule sets.
  2. The rule requires a rule name and a recipient. The recipient is the flights-data@test-airline.awsapps.com address created above.
  3. Select the action ‘Deliver to S3 bucket’, and choose an existing bucket or create a new one directly from the form. I have called mine ‘airline-email-test’.
  4. Follow the remaining steps (click Next) and save the rule. You have now created the connection between SES and S3: a copy of each incoming email to the created address will be delivered to the airline-email-test S3 bucket in raw format. (A boto3 sketch of this rule follows the list.)
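
The same rule can also be created programmatically with the SES (v1) client in boto3. A minimal sketch, assuming the INBOUND_MAIL rule set already exists and that the bucket policy allows ses.amazonaws.com to write to the bucket (the console form sets that policy up for you; the rule name below is my own placeholder):

```python
import boto3

ses = boto3.client("ses", region_name="eu-west-1")  # SES v1 (classic) client

# Add a rule to the active rule set that WorkMail created
ses.create_receipt_rule(
    RuleSetName="INBOUND_MAIL",
    Rule={
        "Name": "deliver-airline-email-to-s3",  # placeholder rule name
        "Enabled": True,
        "Recipients": ["flights-data@test-airline.awsapps.com"],
        "Actions": [
            {"S3Action": {"BucketName": "airline-email-test"}}
        ],
    },
)
```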

#4 — S3 Event Notification + AWS Lambda configuration

As stated above, the message copied to the S3 bucket is in a raw MIME format that is not directly readable. A Lambda function can be used to read the email, extract the attachment, and save the file (the flight data attachment) into another S3 bucket, which will serve as the destination bucket from which to derive the analytics architecture.
The first step is to create a trigger (event notification) that calls a Lambda as soon as the file is delivered to the S3 bucket. Steps below:

  1. Go to the S3 service and open the bucket created to receive the incoming email from SES (in my case airline-email-test).
  2. Go to Properties -> Event notifications and click on ‘Create event notification’.
  3. Create the event, selecting Put and Post under ‘Object creation’ as the event types. Then select Lambda function as the destination, which we will create right now (see the second step below). A boto3 sketch of this notification setup follows the list.
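
For reference, the following boto3 sketch mirrors what the console does here, assuming the Lambda from the next step already exists (the account ID and ARN are placeholders). Note the add_permission call: the console grants S3 permission to invoke the function implicitly, but in code you have to do it yourself:

```python
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

# Placeholder ARN: replace the region and account ID with your own
function_arn = "arn:aws:lambda:eu-west-1:123456789012:function:extract-email-attachment"

# Allow S3 to invoke the function (done implicitly by the console)
lambda_client.add_permission(
    FunctionName="extract-email-attachment",
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::airline-email-test",
)

# Trigger the Lambda on Put/Post object-created events
s3.put_bucket_notification_configuration(
    Bucket="airline-email-test",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:Put", "s3:ObjectCreated:Post"],
            }
        ]
    },
)
```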

The second step is to create a Lambda function to process the raw data and save the flight data attachment to a destination S3 bucket (which you should also create; mine is named airline-email-extract-attachment). For the Lambda, I used a Python runtime and the ‘email’ library to process the raw email and the attachment. Steps below:

  1. Navigate to Lambda in the AWS Console.
  2. Go to Functions and click on Create function. Provide a name (in my case extract-email-attachment) and a runtime (in my case Python 3.9).
  3. Important: at this point, give the Lambda appropriate permissions to read from the raw data S3 bucket (GetObject) and write the results to the destination S3 bucket (PutObject). I created an IAM role for this, following the steps explained in this article.
  4. Copy the code from my GitHub repository at this link. Code highlights below (a minimal sketch of the same logic follows this list):
    - Get the S3 object contents based on the bucket name and object key
    - Use the email library to extract the message content and attachment from the S3 object delivered by SES
    - Construct the destination key from the airline name, file name, and original file extension, then upload the attachment to the destination S3 bucket. This way, if emails come from different airlines, the attachments are stored in S3 under different ‘folders’, one per airline.
  5. With the Lambda created, go back and complete the S3 event notification from the previous step, mapping this Lambda as the destination of the event.
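
The actual code lives in the repository linked above; as a reference, a minimal sketch of the same logic could look like this (the destination bucket name and the way the airline name is parsed out of the YYYY_MM_DD_{AirlineName}.csv convention are my assumptions):

```python
import email
import urllib.parse

import boto3

s3 = boto3.client("s3")
DESTINATION_BUCKET = "airline-email-extract-attachment"  # assumed destination bucket

def lambda_handler(event, context):
    # 1. Get the S3 object contents based on bucket name and object key
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    raw_email = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # 2. Parse the raw MIME message that SES delivered to the bucket
    message = email.message_from_bytes(raw_email)

    # 3. Walk the MIME parts, pick out attachments and upload each one
    for part in message.walk():
        filename = part.get_filename()
        if not filename:
            continue  # skip non-attachment parts (plain body, headers, ...)
        # YYYY_MM_DD_{AirlineName}.csv -> the last token is the airline name
        airline = filename.rsplit(".", 1)[0].split("_")[-1]
        destination_key = f"{airline}/{filename}"
        s3.put_object(
            Bucket=DESTINATION_BUCKET,
            Key=destination_key,
            Body=part.get_payload(decode=True),
        )
```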

#5 — Final Check: send the email with the attachment and enjoy the automated ingestion/collection

After completing all the steps above, it is now time to test our architecture by sending the email with the flight data attached and checking if the attachment has been properly processed and saved into the S3 destination bucket.

  • Send the email with the attachment to your previously defined receiving email address; in my case, flights-data@test-airline.awsapps.com. The attachment naming convention follows this rule: YYYY_MM_DD_{AirlineName}.csv
  • Navigate to the S3 destination bucket to check whether the email has been processed and the attachment extracted and saved (a quick boto3 check is sketched below). Feel free to use CloudWatch Logs to see what's happening in the background.
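
If you'd rather verify programmatically than in the console, a quick listing of the destination bucket (name assumed as above) should show the extracted attachment:

```python
import boto3

s3 = boto3.client("s3")

# List the extracted attachments, stored under one prefix per airline
response = s3.list_objects_v2(Bucket="airline-email-extract-attachment")
for obj in response.get("Contents", []):
    print(obj["Key"])  # e.g. Airline/2023_03_05_Airline.csv
```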

The email attachment has been successfully processed! The S3 destination bucket (in my case named airline-email-extract-attachment) now shows the .csv flight data attachment under the prefix Airline/, which will differ for each airline sending us flight data.

#6 — Wrap-up and next steps

This concludes Part 1 of the exercise. I hope you had fun and could successfully follow the step-by-step process to extract email attachments using AWS. With the attachment now available in an S3 bucket, in Part 2 of this article we will build a serverless data analytics architecture for data processing and visualization, gathering insights from the collected flight data. I'll see you there!

Final remark: the above exercise is intended solely to help you familiarize yourself with AWS services and get hands-on with the cloud. A production deployment would require additional considerations that are outside the scope of this article. If you don't plan to move on to Part 2, don't forget to delete the AWS resources used in this tutorial so as not to incur extra costs.
