Serverless Transactional Email Setup with Analytics using AWS SES, Glue, Athena, QuickSight

Pratik Bhopal
Aug 18, 2019 · 6 min read

With AI and analytics everywhere, businesses today want to track consumer behavior as much as possible. And when it comes to emails, they track not only their marketing emails but also their transactional emails. Mailchimp is a popular email marketing tool, and those who prefer not to upload their client or lead data to another software company have open-source options like Sendy. For transactional emails there are also various ready-to-use options on the market, like SendGrid and Sendinblue, with analytics built in. I couldn’t find any open-source tool for this and am not sure if one exists.

Problems with available solutions

  • Unlike marketing emails, transactional emails contain sensitive data, and companies may be wary of using third-party services for them.
  • Apart from security and sensitive data, cost is also a factor that needs to be considered.
  • Some tools retain data only for a few days; if you want to store data permanently, you’ll need to call their APIs and build a platform to store and analyze it, which is a significant amount of work given that you are already paying the vendor for their platform.
  • Maintaining your own email servers can be an overhead, and one may be more comfortable using the services provided by cloud giants like AWS, Google, and Microsoft instead, but these don’t come with any email analytics features.

Let’s see how we can address each of these problems using a combination of tools provided by AWS:

SES is one of the most popular AWS services for sending emails. SES comes with basic built-in analytics, but only in aggregate: you don’t get to know, for example, whether a mail was delivered to a specific person or not. For advanced tracking, create a configuration set and select a destination where you want event logs to be pushed. CloudWatch, SNS, and Kinesis Firehose are the three options currently available. I used SNS, as per our business requirements. Check all the events you want to track and save the configuration. Please note that I am not covering a step-by-step guide on configuring SES; various other articles are available for that.

Another important setting that needs to be configured in your software is a header sent with each mail. You need to pass the name of the configuration set while sending mail through SES, which lets SES identify which configuration set the email belongs to.

$this->headerLine('X-SES-CONFIGURATION-SET', 'test')

You can also pass custom tags to identify the software or source from which the mail was initiated. For example, your email-triggering source could be CRM software, ERP software, intranet software, your website, etc.

$this->headerLine('X-SES-MESSAGE-TAGS', 'Source=crm')
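If you send mail with boto3 instead of raw headers, the SES `send_email` API accepts the configuration set and tags as parameters directly. Below is a minimal sketch; `build_send_kwargs` is a hypothetical helper (not part of my setup), and the addresses are placeholders:

```python
def build_send_kwargs(sender, recipient, subject, body,
                      config_set="test", source_tag="crm"):
    """Assemble keyword arguments for ses_client.send_email()."""
    return {
        "Source": sender,
        "Destination": {"ToAddresses": [recipient]},
        "Message": {
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
        # Equivalent of the X-SES-CONFIGURATION-SET header
        "ConfigurationSetName": config_set,
        # Equivalent of the X-SES-MESSAGE-TAGS header
        "Tags": [{"Name": "Source", "Value": source_tag}],
    }

# Usage (assumes boto3 and AWS credentials are configured):
#   import boto3
#   ses = boto3.client("ses")
#   ses.send_email(**build_send_kwargs(
#       "no-reply@example.com", "user@example.com", "Welcome", "Hello"))
```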

My SNS topic is configured to push the logs to SQS, from where I pull the logs periodically and store the required details in a CSV-format log file, which is then pushed to an S3 bucket. In my case, a cron job running on my server pulls the logs from SQS, but you could use a Lambda function for this as well. My primary reason for doing it on the server side is that we didn’t want email ids pushed into the analytics engine for data-security reasons, so they are replaced with ids at the time of adding them to the log file.
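The per-event transformation can be sketched as follows. This is a minimal sketch, not my exact script: the field names follow the SES event-publishing JSON (`eventType`, `mail.messageId`, `mail.commonHeaders`, …), and `anonymize`/`event_to_row` are hypothetical helpers; adjust them if your payload or column layout differs.

```python
def anonymize(addr, id_map):
    """Replace an email address with a stable pseudonymous id."""
    if not addr:
        return ""
    if addr not in id_map:
        id_map[addr] = "id-%d" % (len(id_map) + 1)
    return id_map[addr]

def event_to_row(event, id_map):
    """Flatten one SES event notification into a pipe-delimited log line."""
    mail = event.get("mail", {})
    headers = mail.get("commonHeaders", {})
    tags = mail.get("tags", {})
    fields = [
        mail.get("messageId", ""),
        event.get("eventType", ""),
        mail.get("timestamp", ""),
        headers.get("subject", ""),
        anonymize((headers.get("to") or [""])[0], id_map),
        anonymize((headers.get("from") or [""])[0], id_map),
        event.get("click", {}).get("link", ""),  # only Click events carry a link
        (tags.get("Source") or [""])[0],         # custom tag set via X-SES-MESSAGE-TAGS
    ]
    return "|".join(fields) + "|"
```

The cron job (or Lambda) would poll SQS, run each message body through `event_to_row`, append the lines to the log file, and upload it to S3.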

Using Athena, I read the log files present in S3 with plain SQL.

My table DDL:

CREATE EXTERNAL TABLE `csv_emails_log`(
`msg_id` string,
`event` string,
`event_timestamp` string,
`subject` string,
`to` string,
`from_email` string,
`link_clicked` string,
`tag` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://your-log-bucket/email-logs/'; -- placeholder: the bucket/prefix holding the CSV logs

Sample CSV log file:

id-1|Click|2018-11-20 22:30:28|Email subject 1||||crm|
id-2|Send|2018-11-19 22:30:28|Email subject 2||||crm|
id-1|Open|2018-11-19 21:30:28|Email subject 1||||crm|

You’ll notice that I query using month and year fields, but those fields are not present in the table DDL I shared. That’s because I use AWS Glue to partition the data and convert it from CSV into Parquet format. That’s an important step in production because Athena charges are based on the amount of data scanned per query; with partitioned Parquet data, the data scanned goes down drastically compared to running a similar query on non-partitioned CSV files.
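For illustration, a partition-pruned query might look like this. The table name `parquet_emails_log` is an assumption (it would be the table the Glue crawler creates over the Parquet output); `year` and `month` are the partition keys:

-- Athena scans only the parquet files under year=2018/month=11/,
-- not the whole dataset
SELECT event, count(*) AS total
FROM parquet_emails_log
WHERE year = 2018 AND month = 11
GROUP BY event;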

I’ll share my PySpark code below, which I use to convert CSV to Parquet along with partitioning:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from awsglue.dynamicframe import DynamicFrame
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "raw_data", table_name = "your_csv_file_table", transformation_ctx = "datasource0")
rec_df = datasource0.toDF()
if len(rec_df.head(1)) != 0:
    # Register the DataFrame as a temp view so it can be queried with Spark SQL
    rec_df.createOrReplaceTempView("tempTable")
    # Derive the partition columns from the event timestamp
    rec_sql_df = spark.sql("SELECT *, year(event_timestamp) as year_p, month(event_timestamp) as month_p, day(event_timestamp) as day_p FROM tempTable")
    mapped_dyF = DynamicFrame.fromDF(rec_sql_df, glueContext, "mapped_dyF")
    applymapping1 = ApplyMapping.apply(frame = mapped_dyF, mappings = [("msg_id", "string", "msg_id", "string"), ("event", "string", "event", "string"), ("event_timestamp", "string", "event_timestamp", "timestamp"), ("subject", "string", "subject", "string"), ("to", "string", "to", "string"), ("from_email", "string", "from_email", "string"), ("link_clicked", "string", "link_clicked", "string"), ("tag", "string", "tag", "string"), ("year_p", "int", "year", "int"), ("month_p", "int", "month", "int"), ("day_p", "int", "day", "int")], transformation_ctx = "applymapping1")
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://path_to_store_parquet_files","partitionKeys": ["year","month","day"]}, format = "parquet", transformation_ctx = "datasink4")
job.commit()  # commit so Glue job bookmarks record what has been processed
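One practical note: Athena only sees partitions it knows about, so after the job writes new year/month/day folders you need to load them, for example with the following (the table name `parquet_emails_log` is an assumption; a scheduled Glue crawler achieves the same thing):

-- Register newly written partitions with the catalog
MSCK REPAIR TABLE parquet_emails_log;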

Athena and AWS Glue are big topics in themselves and beyond the scope of this article; the steps above may not be clear to readers unfamiliar with these services.


For the email analytics dashboard, I use AWS QuickSight. To access data in QuickSight, you first need to create a dataset, where you’ll see various sources to pull data from. After selecting Athena, select the table from which you want to pull the data.

Next, create a new analysis and select the data source. Display all the key metrics needed and save your analysis as a dashboard.

The idea of this article was to share my thoughts on why and how you can set up a serverless transactional email pipeline, not to be a step-by-step tutorial. I took a Udemy course, “AWS Serverless Analytics: Glue, Redshift, Athena, QuickSight”, to get an understanding of how Athena, Glue, and QuickSight work together to get this running. You can reach out to me if you have any doubts or get stuck anywhere in setting this up.
