Do you like challenges? Of course, everyone does. Ever since I completed the Cloud Resume Challenge, I had been looking forward to the next one. Forrest Brazeal announced on Twitter last month that A Cloud Guru had started a successor to the Cloud Resume Challenge, with monthly projects called the #cloudguruchallenge. September's challenge is “Event-Driven Python on AWS,” which runs until the 15th of October 2020.
The challenge is to create an automated ETL process that retrieves COVID-19 data from two sources, validates and merges it, performs the transformation, and saves it to a database. Finally, the database is hooked up to a BI service to generate a visualization.
I took my time to look up all the information I would need for the challenge. Most of my thought went into the final dashboard and its database compatibility. Finally, I decided on the services and resources below to complete the challenge.
- AWS Lambda using Python run time: Extract, Transform, Load
- DynamoDB & Streams: Database to hold and stream the updated data
- Google Sheets & Tableau Public: Dashboard for visualization
- Amazon SNS and Amazon SES: Notifications to Subscribers
- Terraform: Infrastructure-as-Code
- CI/CD: AWS CodePipeline, AWS CodeBuild
- GitHub: Source control for code changes
Architecture Diagram -
AWS Lambda using Python: Extract, Transform, Load -
As mentioned in the challenge, data is retrieved from the New York Times and Johns Hopkins datasets. AWS Lambda, with the Python runtime, gets the job done. The Lambda function is invoked daily by a CloudWatch scheduled event to retrieve and process new data. The Pandas library was used to clean, manipulate, and process the data. The data frame guide helped me extract the data, merge only the data available in both sources, and finally filter down to US cases. An additional column was inserted to store month information. Controls were put in place to check for duplicates, validate column names, and verify that data was still present in the data frame after each merge or filter.
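The merge-and-filter step could be sketched roughly as below. This is a minimal illustration, not my exact code: the column names (`Country/Region`, `Recovered`, etc.) are stand-ins for the real NYT and Johns Hopkins fields.

```python
import pandas as pd

def transform(nyt: pd.DataFrame, jhu: pd.DataFrame) -> pd.DataFrame:
    """Merge NYT case/death data with Johns Hopkins recovery data,
    keep only US rows, and add a month column (used later as a key)."""
    # Keep only US rows from the Johns Hopkins feed and align column names
    jhu_us = jhu[jhu["Country/Region"] == "US"][["Date", "Recovered"]]
    jhu_us = jhu_us.rename(columns={"Date": "date", "Recovered": "recovered"})

    # An inner merge keeps only the dates available in both sources
    merged = nyt.merge(jhu_us, on="date", how="inner")

    # Basic controls: the merge must yield rows, and dates must be unique
    if merged.empty:
        raise ValueError("no overlapping dates after merge")
    if merged["date"].duplicated().any():
        raise ValueError("duplicate dates found in merged data")

    # Extra column to store month information
    merged["month"] = pd.to_datetime(merged["date"]).dt.strftime("%Y-%m")
    return merged
```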
I was about to hit the deployment package size limit when I packaged my code, so Lambda Layers were used instead of uploading the dependencies with every deployment. The layer contains the libraries and dependencies that I can reuse across multiple Lambda functions.
DynamoDB & Streams -
The transformed data was then loaded into DynamoDB. The month column that I added was used as the partition key, and the date as the sort key, which made it easy to query the items and compare them with the data frame to process any missing data. Finally, Lambda Destinations were configured to send notifications about the loaded data to interested subscribers using Amazon SNS.
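The load step might look something like this sketch. The table name `covid-data` and the attribute names are assumptions; the key detail is that DynamoDB rejects Python floats, so numbers go in as `Decimal`, and `batch_writer` handles batching and retries.

```python
from decimal import Decimal

def to_item(row: dict) -> dict:
    """Shape one data-frame row as a DynamoDB item; "month" is the
    partition key and "date" the sort key, as described above."""
    return {
        "month": row["month"],
        "date": row["date"],
        # DynamoDB rejects Python floats, so numeric values become Decimal
        "cases": Decimal(str(row["cases"])),
        "deaths": Decimal(str(row["deaths"])),
        "recovered": Decimal(str(row["recovered"])),
    }

def load(rows, table_name="covid-data"):  # table name is a placeholder
    """Write all rows with batch_writer, which batches puts and
    retries unprocessed items automatically."""
    import boto3
    table = boto3.resource("dynamodb").Table(table_name)
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=to_item(row))
```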
I also configured Streams so I could capture the updated data from each daily run. Another Lambda function processes the stream records and sends out a notification via SNS using Lambda Destinations.
Amazon SNS and Amazon SES -
Amazon SNS was configured in Lambda Destinations on both functions, so a notification is sent on success or failure. While it did its job, SNS is designed to distribute plain notifications rather than pretty content. Who doesn’t love a fancy email? I added some HTML and CSS styling, saved it as a template, packaged it with the Python function, and used SES alongside SNS to notify subscribers when the data is processed.
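The SES side might be sketched like this, with a toy HTML template standing in for the real one. Note that with SES, the sender (and, outside of production access, the recipient) must be verified.

```python
HTML_TEMPLATE = """\
<html><body style="font-family: sans-serif">
  <h2>COVID-19 data processed</h2>
  <p><b>{date}</b>: {rows} rows loaded.</p>
</body></html>"""

def render(date: str, rows: int) -> str:
    """Fill the styled HTML template for the daily notification."""
    return HTML_TEMPLATE.format(date=date, rows=rows)

def send(html: str, sender: str, recipient: str):
    """Send the rendered HTML via SES; both addresses must be
    verified identities in SES."""
    import boto3
    boto3.client("ses").send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": "COVID-19 ETL update"},
            "Body": {"Html": {"Data": html}},
        },
    )
```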
Google Sheets & Tableau Public -
A lot of thought went into the final dashboard. Amazon QuickSight, as many know, doesn’t support DynamoDB as a data source. To get this working, I would need to push the data to S3 and use Athena and QuickSight on the S3 bucket. Moreover, you need an Enterprise account for your readers. Though there is a free trial, I didn’t want my dashboard hanging there without any updates after the trial ended. I looked at other BI tools (Qlik, Redash, Tableau) and chose Tableau Public. While Tableau Public supports very few data sources, it let me host a live dashboard without any costs involved.
Google Sheets is one of those data sources, and Tableau Public refreshes it every day at 2 AM, pulling the latest data from the sheet. I scheduled the CloudWatch event for 1:30 AM so the data is in place when Tableau Public pulls from Google Sheets at 2 AM, and the dashboard updates right away.
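Creating that 1:30 AM trigger could look like the sketch below; the rule name is a placeholder. EventBridge/CloudWatch cron expressions are in UTC and take six fields (minute, hour, day-of-month, month, day-of-week, year).

```python
def schedule_expression(hour: int, minute: int) -> str:
    """Build a CloudWatch Events cron expression (UTC):
    minute hour day-of-month month day-of-week year."""
    return f"cron({minute} {hour} * * ? *)"

def create_daily_rule(rule_name="daily-covid-etl"):
    """Create the daily 1:30 AM trigger; the rule name is a placeholder."""
    import boto3
    boto3.client("events").put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression(1, 30),
    )
```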
I used the Python package pygsheets to interact with Google Sheets and updated my Lambda Layer to include it so I could use it in my Lambda function. I came across a post from Erik Rood that explained how to connect Python to Google Sheets. Once I had the service account key, I used Parameter Store to store it. The Lambda function gets the records from DynamoDB Streams, uses the credentials from Parameter Store to connect to Google Sheets, and updates the sheet.
One of the issues I had was with the initial load. Since the same sheet was used for the bulk update, there were cases where data would be overwritten. I had to add some code to check the modified timestamp of the Google Sheet before updating it, so that it wouldn’t overwrite any of that information.
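Putting those pieces together, the update path could be sketched as below. The parameter name `/covid/gsheets-key` and spreadsheet name `covid-data` are assumptions; pygsheets can authorize from a service-account key file and exposes the sheet's last-modified time as `Spreadsheet.updated` (an RFC 3339 string), which drives the overwrite guard.

```python
from datetime import datetime

def modified_since(sheet_updated_iso: str, run_started: datetime) -> bool:
    """True if the sheet was modified after this run started, in which
    case the update is skipped so fresher data isn't overwritten."""
    modified = datetime.strptime(sheet_updated_iso, "%Y-%m-%dT%H:%M:%S.%fZ")
    return modified > run_started

def update_sheet(df, run_started: datetime):
    """Fetch the service-account key from Parameter Store, connect with
    pygsheets, and write the data frame unless the sheet changed meanwhile.
    Parameter and spreadsheet names are placeholders."""
    import boto3
    import pygsheets
    key = boto3.client("ssm").get_parameter(
        Name="/covid/gsheets-key", WithDecryption=True
    )["Parameter"]["Value"]
    # pygsheets authorizes from a key file, so stage it in Lambda's /tmp
    with open("/tmp/service_account.json", "w") as f:
        f.write(key)
    client = pygsheets.authorize(service_file="/tmp/service_account.json")
    sheet = client.open("covid-data")
    if modified_since(sheet.updated, run_started):
        return  # another writer (e.g. the bulk load) got there first
    sheet.sheet1.set_dataframe(df, (1, 1))
```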
Terraform -
Terraform allows you to create, update, and version your AWS infrastructure. I was looking to complete the Terraform Associate certification, and this was the perfect opportunity. I used workspaces to isolate environments with a single set of configuration files. You can use a tfvars file, but I decided to pass the variables in directly, as they contained sensitive information. There are multiple options when it comes to building out infrastructure with workspaces; I used multiple workspaces with multiple accounts, one for staging and another for production. An S3 bucket was used as the backend for these workspaces. I could have isolated them further by using different S3 buckets, but I wanted to keep it simple for this project. Check out the link to learn more about deploying workspaces for different environments.
CI/CD -
I used GitHub for my repository and decided to try out AWS CodePipeline to automate the build, test, and deployment. The pipeline triggers on every push to GitHub. CodePipeline uses CodeBuild to unit test the code and deploy the infrastructure with Terraform to the staging account. A manual approval email is then sent so I can verify the infrastructure and a test dashboard in Tableau Public. Once approved, the infrastructure and code changes are deployed to the production account. I used the master AWS account to run CodePipeline and kept separate accounts for the staging and production infrastructure. I created roles in both accounts so the master account can assume them and deploy the application.
Overall, this challenge was a lot of learning and a lot of fun. Thank you, A Cloud Guru and Forrest Brazeal, for the awesome challenge. Looking forward to the next one! You can find my dashboard and repository below:
Live Dashboard: COVID-19 US DASHBOARD