In an ongoing attempt to keep myself busy with cloud tech, I came across the Cloud Resume Challenge a while back, although I only found it after the challenge had already concluded. Regardless, I still joined the Discord server and got stuck into creating my own Cloud Resume.
After a few weeks of being active on the Discord server, Forrest Brazeal, the creator of the Cloud Resume Challenge, announced the next iteration of his cloud challenge, except this time he partnered with his employer, A Cloud Guru, to create the #CloudGuruChallenge series. Every month, a new instructor from A Cloud Guru poses a challenge that touches on new topics, and you have until the following month to complete it. Once you've successfully completed it, the instructor who designed the challenge can review your project and endorse you on LinkedIn.
This month's challenge was to build an ETL process that grabs COVID-19 data from two online resources as CSV data, manipulates it and then stores it in a database. From there, we need to create a dashboard to visualise the stored data. I am not going to go into too much detail about every step of the process, because reading multiple blog posts on this topic then becomes repetitive, so I'm going to give some top-level info on a few things that challenged me or that I did differently.
Since cost was (and almost always is) my #1 concern, I went with DynamoDB, since a portion of its use is covered by the always-free tier. I knew that with the amount of data I was working with, I could stay within the free tier for a long time, even with doing full table scans daily. This was the only constant in my mind; everything else (such as the method to visualise the data) was still up for debate.
I used pandas' DataFrame object to handle all data. I read the CSVs from their URLs directly into a DataFrame. When updating the database, I also did a table scan and loaded all the data into a DataFrame. I could then do a “diff” of sorts to see what data is present in the “new” DataFrame (from the CSV) that wasn't present in the “old” DataFrame (from the DB). Going with this approach, I didn't have to keep track of previous Lambda runs to ensure they ran successfully, or check whether I had to back-fill yesterday's data: any new data that didn't previously exist would simply be written to the DB. As yet another benefit, this helped me with “initial load” versus “incremental load”, since the first run would upload all data and every subsequent call would simply update that data, if there is data to update with!
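The “diff” step above can be sketched as a pandas anti-join. This is a minimal sketch, not the post's actual code — the function name, column names and key choice are assumptions. `merge` with `indicator=True` tags each row with which side it came from, so keeping the `left_only` rows leaves exactly the records not yet in the DB:

```python
import pandas as pd

def new_rows(csv_df, db_df, keys):
    """Return the rows of csv_df whose key columns don't appear in db_df
    (a left anti-join), i.e. the records that still need to be written."""
    if db_df.empty:
        return csv_df  # initial load: the whole CSV is new
    merged = csv_df.merge(db_df[keys], on=keys, how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Example: the DB already holds 2020-10-01, the fresh CSV adds 2020-10-02.
db = pd.DataFrame({"date": ["2020-10-01"], "cases": [10]})
csv = pd.DataFrame({"date": ["2020-10-01", "2020-10-02"], "cases": [10, 12]})
delta = new_rows(csv, db, ["date"])  # only the 2020-10-02 row remains
```

Because an empty “old” DataFrame makes everything new, the same function handles both the initial and the incremental load.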
For interest's sake, this is what it looks like from the command line when I run my Lambda for the first time to populate the database. Remember to use logging in your Lambdas instead of print statements!
Infrastructure as Code
I have come to rely on Terraform for even the simplest of cloud infrastructure tasks. If you make Terraform (or IaC in general) a base habit and incorporate it early on into your project, I find it really helps you to keep growing your project/infrastructure in a manageable way. At any moment I could simply run `terraform destroy` and then `terraform apply` as a sanity check to make sure all resources I need are accounted for with Terraform and nothing was left behind.
To write my Lambda functions I made use of the Serverless Framework. Often I would use Terraform to create a resource that relied on my Lambda function existing (or vice versa) in order to function. There is a trick to incorporating Terraform and Serverless together. From the Terraform side of things, you can make use of `null_resource` and `depends_on` to have Terraform execute your `sls deploy`, which will create the Lambda (or other resource) that you need to reference. The actual magic sauce here is to then use AWS's SSM Parameter Store to store things like Lambda ARNs, DynamoDB Stream ARNs and so on. Using the Parameter Store, you can then pass data back and forth between Terraform and Serverless.
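A minimal sketch of that wiring, assuming the Serverless side writes the Lambda's ARN to a known Parameter Store path — the resource names, trigger and parameter path here are all hypothetical, not taken from the project:

```hcl
# Run `sls deploy` as part of the Terraform graph.
resource "null_resource" "sls_deploy" {
  triggers = {
    # re-deploy when the Serverless config changes (illustrative trigger)
    serverless_config = filemd5("${path.module}/serverless.yml")
  }

  provisioner "local-exec" {
    command = "sls deploy"
  }
}

# Read back the ARN that serverless.yml exported to SSM; depends_on
# makes sure the deploy has run first.
data "aws_ssm_parameter" "etl_lambda_arn" {
  name       = "/covid-etl/lambda-arn" # hypothetical parameter path
  depends_on = [null_resource.sls_deploy]
}
```

From there, `data.aws_ssm_parameter.etl_lambda_arn.value` can be referenced by any other Terraform resource.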
This step was pretty straightforward: I created a DynamoDB Stream which triggers a Lambda when data is inserted into the DB. The Lambda processes the rows that were added and pushes them to an SNS topic, which I am subscribed to, so I get an email with the rows that were added.
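A sketch of what that stream-triggered Lambda might look like — this is an assumption of the shape, not the project's code; the `TOPIC_ARN` environment variable and function names are mine:

```python
import json
import os

def extract_inserted_rows(event):
    """Pull the NewImage of each INSERT record out of a DynamoDB Stream event."""
    return [
        record["dynamodb"]["NewImage"]
        for record in event.get("Records", [])
        if record.get("eventName") == "INSERT"
    ]

def handler(event, context=None, sns_client=None):
    """Summarise newly inserted rows and publish them to an SNS topic."""
    inserted = extract_inserted_rows(event)
    if inserted:
        if sns_client is None:
            import boto3  # imported lazily so the pure logic is testable offline
            sns_client = boto3.client("sns")
        sns_client.publish(
            TopicArn=os.environ["TOPIC_ARN"],  # assumed environment variable
            Subject=f"{len(inserted)} new COVID-19 rows loaded",
            Message=json.dumps(inserted, indent=2),
        )
    return {"inserted": len(inserted)}
```

Filtering on `eventName == "INSERT"` is what keeps MODIFY and REMOVE stream records from triggering emails.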
This is what an initial load looks like, screenshot trimmed for brevity.
And then this is what an incremental load looks like.
I have also made it a habit to use git hooks to keep my code formatted and abiding by PEP 8. I use a pre-commit hook to run pylint, black and `terraform fmt`. Doing so means that if pylint or black exits with an error, I simply cannot commit my code, so I have to fix it before the world can see my embarrassing errors.
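The post doesn't show the hook itself; if you were to wire the same three tools up with the pre-commit framework, the config could look something like this (a sketch — the `rev` values are placeholders that should be pinned to real release tags):

```yaml
# .pre-commit-config.yaml — hypothetical sketch, revs are placeholders
repos:
  - repo: https://github.com/psf/black
    rev: stable # pin to a tagged release
    hooks:
      - id: black
  - repo: https://github.com/pycqa/pylint
    rev: master # pin to a tagged release
    hooks:
      - id: pylint
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: master # pin to a tagged release
    hooks:
      - id: terraform_fmt
```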
Due to the fact that all my infrastructure and code comes up with a simple `terraform apply`, I started to integrate my project with GitHub Actions to continuously deploy it, but at the time of writing I haven't finished this step with all the bells and whistles I wanted, such as running my Python tests and getting their code coverage, so I have omitted it from my Git repo.
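As a rough idea of where that pipeline was heading, a deploy workflow along these lines would cover it — everything here (file path, action versions, package name, secret names) is illustrative, not the project's actual workflow:

```yaml
# .github/workflows/deploy.yml — hypothetical sketch
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - run: pip install -r requirements.txt
      - run: pytest --cov=src # "src" is an assumed package name
      - uses: hashicorp/setup-terraform@v1
      - run: terraform init && terraform apply -auto-approve
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```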
This was a pretty cool challenge and it was right up my alley; I really enjoyed the way it mixed a lot of interesting topics into one! Thanks to Forrest Brazeal for starting the #CloudGuruChallenge, and thanks to all the guys on Discord who are great to chat with regarding cloud tech!
For posterity, here is a screenshot of most of my dashboard: