Exploring AWS Glue Part 2: Crawling CSV Files

Johannes Giorgis · Published in Acing AI
Mar 11, 2021 · 9 min read

Welcome to Part 2 of the Exploring AWS Glue series. As a recap, a lack of articles covering AWS Glue and AWS CDK inspired me to start this series to demonstrate how we can leverage Infrastructure As Code tools (CDK) to build reproducible Data Science projects.

Photo by Mariana B. on Unsplash

Our first challenge will be to write some infrastructure code that will allow us to ingest CSV files and query them on AWS.

First, we will go over the tools and prerequisites you’ll need to follow along. As this series is about using these tools, prerequisite steps such as opening an AWS account or setting up the AWS CLI aren’t covered. Next, I’ll walk through the stages I took to set up this project before wrapping up with my thoughts on how to improve it.

With that out of the way, let’s get started :)

Tools

For this first challenge, I used the following tools/services:

  • AWS CDK (Python)
  • AWS S3
  • AWS Glue
  • AWS Athena
  • AWS CLI

Pre-requisites

You should have the following setup:

  • AWS account
  • AWS CLI v2 installed and configured correctly
  • AWS CDK installed via npm i -g aws-cdk (v1.91.0 as of February 2021)

AWS CDK

Before we jump to the exciting part, we need to conceptually understand AWS CDK, and why I’m quite excited about it. From its AWS service page:

AWS Cloud Development Kit (AWS CDK) is an open source software development framework to define your cloud application resources using familiar programming languages.

The advent of AWS ushered in a period where we could spin up tens, hundreds, or even thousands of servers through a web interface in a matter of minutes. Where before we would have to work across the organization to procure servers, we were now empowered to quickly spin up what we needed and get on with inventing cool stuff :)

Managing anything more than a handful of servers gets complicated pretty quickly. This has traditionally been the domain of operations teams, but even for them, managing thousands of servers was not a trivial task. They needed a way to programmatically control and manage all this infrastructure. This gave rise to infrastructure as code tools:

  • Chef (2009)
  • Puppet (2005)
  • AWS CloudFormation (2011)
  • Ansible (2012)

These tools, however, were aimed at operations teams and system administrators rather than developers. Hence they introduced their own syntaxes/languages and workflows that naturally fit how operations teams worked. As an example, Terraform didn’t get for-loop functionality until mid-2018!

This brings us to AWS’ Cloud Development Kit (CDK). CDK allows us to use the same programming language we use for our application to define our infrastructure as code.

How cool is that?! Now we can write a full stack TypeScript/JavaScript web application (Front End, Back End, and Infrastructure) or a Python Web Service (Back End and Infrastructure).

CDK empowers developers to build Full Stack applications or services and deploy them in the cloud using the same language

We will dive deeper into AWS CDK in the next article. For now, let’s get to building!

Set Up Infrastructure via AWS CDK

All right! We’re now in the exciting part: Building Stuff!

So to recap, we want to define infrastructure that will allow us to ingest CSV files and query them on AWS. The Tools section already listed the tools/services we will use, so let’s understand why we chose them.

I don’t want to set up a database. To me, that’s the messy route. Remember, we want these personal projects to stay small and simple — we want to quickly spin stuff up and shut it down. Databases take time to deploy and introduce administrative overhead, so for now, we will skip them. This leads us naturally to a serverless solution:

  • S3: for object storage
  • Glue: managed serverless data ingestion service
  • Athena: SQL engine to help us query our data

For this project, we will create one stack — it will define our S3 and Glue crawler resources. The Glue crawler will create the tables on Athena. We will call this stack CSVCrawler.

Project Set-Up

First things first, let’s set up our project.

Let’s create the project folder csv_crawler, enter it, then initialize our Python-based CDK app.

mkdir csv_crawler
cd csv_crawler
cdk init sample-app --language python
Project Tree
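For reference, the layout cdk init generates looks roughly like this (exact file names can vary a little between CDK versions):

csv_crawler/
├── .venv/                    <- virtual environment created by cdk init
├── README.md
├── app.py
├── cdk.json
├── requirements.txt
├── setup.py
├── csv_crawler/
│   ├── __init__.py
│   └── csv_crawler_stack.py
└── tests/
    └── unit/
        └── test_csv_crawler_stack.py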

Our Project Tree should look like the above. cdk init takes care of creating everything within the parent csv_crawler directory — the app and test directories, the virtual environment directory, and our project README.md and requirements.txt files. Next, let’s activate our virtual environment, upgrade pip, and install our project dependencies from the requirements.txt file that was created. Note that by default, aws-cdk.aws-glue is not included, so we will have to install it separately.

# activate virtual environment & upgrade pip
source .venv/bin/activate
pip install --upgrade pip
# install requirements
pip install -r requirements.txt
# install missing aws-glue library
pip install aws-cdk.aws-glue
# verify the aws cli is present :)
λ which aws
/usr/local/bin/aws

I like to double-check that everything works fine within the virtual environment we just created and set up. Since cdk init creates some tests, we can run pytest to verify everything looks good.

pytest results

CDK Stack

Let’s take a look at our CSVCrawlerStack code:

CSVCrawlerStack class

Let’s walk through what this code is doing (a sketch of the full stack follows the list):

  • We import the libraries we need
  • We define our CsvCrawlerStack class
  • We create a bucket which we call ‘csv-bucket’ → this isn’t the name of the bucket, just the id of the resource within our class
  • We create an IAM role for our Glue crawler, which provides it with full access to S3
  • Lastly, we create the Glue crawler, giving it an id (‘csv-crawler’), the ARN of the role we just created for it, a database name (‘csv_db’), and the S3 target we want it to crawl
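Putting those pieces together, here is a minimal sketch of the stack, based on the CDK v1 Python API. The ids ‘csv-bucket’, ‘csv-crawler’, and the database name ‘csv_db’ come from the walkthrough above; the role’s construct id, the specific managed policies, and the csv/ target prefix are my assumptions.

from aws_cdk import core
from aws_cdk import aws_glue as glue
from aws_cdk import aws_iam as iam
from aws_cdk import aws_s3 as s3


class CsvCrawlerStack(core.Stack):

    def __init__(self, scope: core.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # 'csv-bucket' is the construct id, not the bucket name -
        # CloudFormation generates the physical bucket name for us
        bucket = s3.Bucket(self, "csv-bucket")

        # Role the Glue crawler assumes; full S3 access for now
        # (tightening this is one of the improvements listed at the end)
        crawler_role = iam.Role(
            self,
            "csv-crawler-role",  # assumed construct id
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AWSGlueServiceRole"),
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonS3FullAccess"),
            ],
        )

        # The crawler itself: it writes the tables it discovers into the 'csv_db' database
        glue.CfnCrawler(
            self,
            "csv-crawler",
            role=crawler_role.role_arn,
            database_name="csv_db",
            targets=glue.CfnCrawler.TargetsProperty(
                s3_targets=[
                    glue.CfnCrawler.S3TargetProperty(path=f"s3://{bucket.bucket_name}/csv/")
                ]
            ),
        )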

We will have to import our stack into app.py — now we have our CDK app containing our CsvCrawler stack:

CDK App
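For reference, app.py ends up looking roughly like this. The module path follows the naming that cdk init generates, and ‘csv-crawler’ is the stack name we’ll see again when we run cdk ls and cdk destroy:

#!/usr/bin/env python3
from aws_cdk import core

from csv_crawler.csv_crawler_stack import CsvCrawlerStack

# one app, one stack: csv-crawler
app = core.App()
CsvCrawlerStack(app, "csv-crawler")

app.synth()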

Deployment Time :)

We need to run cdk bootstrap to set up a bootstrap stack the first time we deploy an AWS CDK app into an environment (account/region).

Bootstrapping cdk resources into AWS environment

With our environment bootstrapped, let’s proceed to verify our stack via cdk ls and deploy it:

List our app & deploy it
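In command form, the whole deployment boils down to three commands (csv-crawler being the stack name defined in app.py):

# one-time setup of the CDK staging resources in this account/region
cdk bootstrap
# list the stacks in our app - this should print csv-crawler
cdk ls
# deploy the stack
cdk deploy csv-crawler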

Verify Infrastructure

Let’s verify our infrastructure has been deployed onto our AWS environment. We can use the AWS CLI to check for the S3 bucket and Glue crawler:

# List S3 Buckets
λ aws s3 ls
2021-02-27 10:28:00 cdktoolkit-stagingbucket-1wdkn4be1gwgw
2021-02-27 10:32:37 csv-crawler-csvbuckete3c1c3b8-1kgq8cmtaf65h
...
# List Glue Crawlers
λ aws glue list-crawlers
{
    "CrawlerNames": [
        "csvcrawler-K97ENFxAYqre"
    ]
}

So our CSVCrawlerStack class created the S3 bucket csv-crawler-csvbuckete3c1c3b8-1kgq8cmtaf65h, and the Glue crawler csvcrawler-K97ENFxAYqre.

Alternatively, we can go to the AWS Console and check both the Glue and S3 pages respectively.

Get Dataset into AWS S3

Our dataset is the Capital Bikeshare Dataset from data.world which contains several CSV files.

data.world describes itself as the Cloud-Native Data Catalog, and it is where I decided to grab the CSV files from. It has far more than just CSV files and is a great resource for datasets.

The files are downloaded as a zip archive. After unzipping it, I noticed the file names contain spaces. Since I will use a script to upload the files to S3 rather than uploading them via the AWS Console, I renamed them by replacing the spaces with hyphens. rename is a convenient command-line tool for this:

Renaming CSV files
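There are a couple of rename variants in the wild; with the Perl-based one (common on Linux and available via Homebrew on macOS), the substitution looks something like this:

# replace every space in the CSV file names with a hyphen
rename 's/ /-/g' *.csv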

I used the following bash script to upload the CSV files:

#!/usr/bin/env bash
# Upload CSV files to S3
echo "Uploading csv files to S3..."

# s3 bucket/path
s3_bucket="csv-crawler-csvbuckete3c1c3b8-1kgq8cmtaf65h"
s3_path="s3://${s3_bucket}/csv/"

# local dataset
current_dir=$(pwd)
data_set_dir="${current_dir}/data/codefordc-capital-bikeshare"

for file in $(ls $data_set_dir)
do
    echo "Uploading $file..."
    file_full_path="${data_set_dir}/${file}"
    # echo "${file_full_path}"
    # echo "aws s3 cp ${file_full_path} ${s3_path}"
    aws s3 cp ${file_full_path} ${s3_path}
done

# Verify
echo -e "\nVerify files on S3:"
aws s3 ls ${s3_path}
echo "Completed uploading csv files to S3!"

Afterward, I verified the files were present on S3:

Verifying files are on S3

Run Glue Crawler

So our setup is done — we have our data uploaded to S3, which serves as the data source for our Glue crawler. Let’s check the crawler:

Glue Crawler

Notice that the Include path is the S3 location we uploaded our files to, so we are good to proceed with running our crawler. We can select it and click ‘Run crawler’ on the Glue crawlers page:

Table added after running our crawler for the first time

See how it added a new table? It worked! Pretty cool, huh?
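If you’d rather stay in the terminal, the AWS CLI can drive the same step, using the crawler name we saw earlier:

# start the crawler
aws glue start-crawler --name csvcrawler-K97ENFxAYqre
# once it finishes, list the tables it created in csv_db
aws glue get-tables --database-name csv_db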

Query Our Data

Let’s go to Athena. Since this is the first time we’re using Athena, we need to set up a Query result location:

Initial Athena Setup

Now, we are able to query our dataset using SQL :)

NOTE: I actually faced an issue here — Athena complained about our table containing duplicate columns. I had to manually edit the table schema on Glue to fix this issue.

Athena complaining about duplicate columns in our table

After fixing the duplicate column issue, I was able to query the dataset successfully:

Querying our CSV Dataset on Athena
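The query itself is plain SQL. The crawler names its table after the prefix it scanned, so with the csv/ folder above the table name is most likely csv (an assumption on my part) — something like:

-- preview a few rows from the crawled dataset
SELECT * FROM "csv_db"."csv" LIMIT 10;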

Isn’t that pretty cool? Using the power of SQL to query CSV files without setting up a PostgreSQL, MySQL, or any other database.

Your Turn

Now what’s super cool is that you can take my code, deploy it to your own AWS environment, upload your own CSV files to the newly created S3 bucket, and crawl them to have a serverless database of your very own.

And once you’re done, you can delete the S3 files and Glue table(s), then run cdk destroy csv-crawler to destroy all the AWS resources you just created.
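Roughly, the cleanup looks like this (substitute your own bucket name for the placeholder):

# remove the uploaded CSV files (the bucket name here is a placeholder)
aws s3 rm s3://<your-csv-bucket>/csv/ --recursive
# tear down the stack's resources
cdk destroy csv-crawler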

How can we improve this?

This was a project I put together in a few hours. The main focus was to show how easily we can set up the infrastructure we need to do Data Science projects on our own. There are plenty of areas where we can further refine this project. Below are a few that crossed my mind:

  • Limit our Glue crawler’s role to access only our S3 bucket instead of having full S3 access
  • Programmatically get our data from data.world, clean it up, and upload it to our S3 bucket
  • Add command-line arguments for running the Glue crawler
  • Add resource destroy commands

If you made it this far, thanks again for reading :) Next time, we’ll do a more theoretical walk-through of what AWS CDK is and how to use it.
