Serverless Data Engineering: AWS Glue + Lambda + Athena + QuickSight
I recently attended AWS re:Invent 2018 where I learned a lot about AWS Glue, which is essentially serverless Spark. This service allows you to have a completely serverless ETL pipeline that’s based on the powerful Apache Spark framework, which is pretty cool if you ask me.
So with that knowledge, I came up with a personal project of mine that I wanted to implement Glue into, along with other serverless services on AWS. I wanted to create an ETL pipeline + Data Lake + Reporting — all serverless — to show me how rent prices are changing over time in San Diego, CA (as measured from postings on Craigslist).
This article will cover the following:
- Scraping Craigslist with Python + AWS Lambda
- Cataloging and transforming the data with AWS Glue
- Querying the data with Amazon Athena
- Visualizing the results with Amazon QuickSight
I live in San Diego, hence my interest in analyzing local rent trends. I am also a renter, which is why I’m only looking at rents rather than home prices. The housing market in San Diego has gotten really hot over the last few years and prices are ridiculous. So I want to track these prices over time and see whether this trend continues, or whether there are any seasonality effects.
I ended up building an end-to-end serverless data pipeline using AWS Lambda and Python to scrape data from Craigslist daily and store it in json format in S3. Then, AWS Glue crawls and catalogs the data in S3 and runs a simple transformation. AWS Athena queries the cataloged data using standard SQL, and Amazon QuickSight is used to visualize the data and build the dashboard. Below is a screenshot of the simple dashboard I created:
In this article, I will go through step by step how I built this pipeline so that anyone interested in replicating a similar workflow can use this as a resource.
Below is a diagram that shows the basic architecture of my pipeline. I borrowed this architecture from this awesome post by Nathan Malishev.
Scraping Craigslist with Python + AWS Lambda
Scraping data from Craigslist is actually quite easy. I used Python’s Beautiful Soup library to do this. I won’t go into detail on how to use it, but I will provide all the code below as well as a link to my GitHub repo.
If you go to your local craigslist.org site and select apts/housing under the housing section, you arrive at a page that looks something like this (note: I have filtered for the Little Italy neighborhood in San Diego):
This page contains a list of “cards” with housing info for each posting which includes just about all the information I need like rental price, square footage, neighborhood, and number of bedrooms. Additionally, if you take a close look at the URL for this page, you can easily change and filter by neighborhood by formatting the URL in the following way:
Knowing this information, it is pretty easy to provide a list of neighborhoods to search through, build a URL for each neighborhood, loop through each URL, and pull the desired information, which I then store as json in an s3 bucket. I designed my function so that I can easily update the neighborhoods to search through without having to update the function in Lambda itself: I keep a file in s3 called neighborhoods.txt, which is simply a list of neighborhood names, one per line. It looks like this:
My script has two functions to handle this: one that reads in that file from s3, parses it, and returns a list of those neighborhoods, and another that takes that list and loops through each item, returning a formatted URL for each neighborhood. Here is what those functions look like:
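The original gist is not embedded here, but a minimal sketch of those two helpers might look like the following. The bucket name, file key, and the query= URL filter are assumptions for illustration, not necessarily what my deployed function uses:

```python
import urllib.parse

BASE_URL = "https://sandiego.craigslist.org/search/apa"

def get_neighborhoods(bucket="my-rental-data-bucket", key="neighborhoods.txt"):
    """Read neighborhoods.txt from s3 and return a list of neighborhood names."""
    import boto3  # imported lazily so build_urls can be tested without AWS
    obj = boto3.resource("s3").Object(bucket, key)
    body = obj.get()["Body"].read().decode("utf-8")
    # one neighborhood per line; skip any blank lines
    return [line.strip() for line in body.splitlines() if line.strip()]

def build_urls(neighborhoods):
    """Return one formatted Craigslist search URL per neighborhood."""
    return [f"{BASE_URL}?query={urllib.parse.quote_plus(n)}" for n in neighborhoods]
```
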
Finally, I have a function that receives the list of URLs, makes a request, loops through and extracts the information I need, appends it to a list, and finally dumps the list to a json file in an s3 bucket. Here is a gist of that function:
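As a hedged sketch rather than the original gist: the parsing below assumes the 2018-era Craigslist result markup (the result-row, result-title, result-price, housing, and result-date classes), and the bucket and key names are placeholders.

```python
import json
import re

from bs4 import BeautifulSoup

def parse_listings(html, neighborhood):
    """Extract rental info from one Craigslist search results page."""
    soup = BeautifulSoup(html, "html.parser")
    rentals = []
    for row in soup.find_all("li", class_="result-row"):
        title_tag = row.find("a", class_="result-title")
        price_tag = row.find("span", class_="result-price")
        housing_tag = row.find("span", class_="housing")  # e.g. " 1br - 650ft2 "
        date_tag = row.find("time", class_="result-date")
        housing = housing_tag.get_text() if housing_tag else ""
        bedrooms = re.search(r"(\d+)br", housing)
        sqft = re.search(r"(\d+)ft", housing)
        rentals.append({
            "neighborhood": neighborhood,
            "datetime": date_tag["datetime"] if date_tag else None,
            "title": title_tag.get_text(strip=True) if title_tag else None,
            "price": float(price_tag.get_text().strip("$").replace(",", "")) if price_tag else None,
            "bedrooms": float(bedrooms.group(1)) if bedrooms else None,
            "sqft": float(sqft.group(1)) if sqft else None,
            "url": title_tag["href"] if title_tag else None,
        })
    return rentals

def get_rental_data(urls, neighborhoods, bucket="my-rental-data-bucket"):
    """Fetch each URL, parse the listings, and dump everything to one json file in S3."""
    import boto3      # imported lazily so parse_listings can be tested without AWS
    import requests
    rental_data = []
    for url, hood in zip(urls, neighborhoods):
        resp = requests.get(url)
        rental_data.extend(parse_listings(resp.text, hood))
    obj = boto3.resource("s3").Object(bucket, "craigslist_rental_data/rentals.json")
    obj.put(Body=json.dumps(rental_data, separators=(',', ':')))
```
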
One important thing to note: since we are going to use AWS Glue’s crawlers to crawl our json files, the json files need to adhere to the format required by the Hive json SerDe. More information on that can be found here. Essentially, the line in my get_rental_data() function that dumps to s3 looks like this:

obj.put(Body=json.dumps(rental_data, separators=(',', ':')))

The separators argument in the json.dumps() method strips out any spaces and newlines and collapses the json so that everything is on one line. This is a requirement for the AWS Glue crawler to properly infer the json schema.
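As a quick standard-library illustration of what separators does:

```python
import json

rental_data = [{"neighborhood": "little italy", "price": 2100.0}]

# default formatting inserts a space after each ',' and ':'
default = json.dumps(rental_data)

# separators=(',', ':') drops that whitespace so the whole
# array collapses into one compact line
compact = json.dumps(rental_data, separators=(',', ':'))
print(compact)  # → [{"neighborhood":"little italy","price":2100.0}]
```
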
After creating my function, I used the Serverless Framework to easily upload it to AWS Lambda via the command line. After a few initial setup steps, deploying your function to Lambda is as easy as running serverless deploy. If you haven’t been using this or AWS’s equivalent service, SAM, then I highly recommend these options, as they make it pretty painless to write and upload Python code to Lambda.
Lastly, I have this Lambda function scheduled to run every morning at 9am. This is pretty easy to setup within the AWS console and there are plenty of resources that show how to do this so I won’t go into that.
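For reference, a minimal serverless.yml for this setup might look like the sketch below. The service name, handler path, bucket ARN, and region are all placeholders, and the cron expression assumes the 9am run time is Pacific (17:00 UTC; cron schedules in AWS are always UTC):

```yaml
# serverless.yml -- a minimal sketch; names and ARNs are placeholders
service: craigslist-scraper

provider:
  name: aws
  runtime: python3.7
  region: us-west-2
  # the function needs read/write access to the data bucket
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:GetObject
        - s3:PutObject
      Resource: arn:aws:s3:::my-rental-data-bucket/*

functions:
  scrape:
    handler: handler.get_rental_data
    events:
      # runs daily at 17:00 UTC (9am Pacific Standard Time)
      - schedule: cron(0 17 * * ? *)
```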
Cataloging and Transforming Data with AWS Glue
Glue makes preparing, cataloging, and transforming data very easy. It’s also serverless, so you don’t need to worry about provisioning or managing any servers; you simply pay for what you use while your jobs run. Glue also generates template code for your ETL jobs in either Python or Scala, which you can edit and customize in case the job requires a little more tinkering.
For this project, I used:
- A Glue crawler to infer the json schema for my data in S3
- A Glue Job to convert the json data to parquet format
- A crawler to infer the schema from the parquet data format
Crawling data from S3 is very easy. First you need to create an IAM role that allows your crawler to access S3. This is a good resource that shows you how to properly create it. The first thing we are going to do is create a custom classifier to infer our json data’s schema. Glue comes with a ton of built-in classifiers that can automatically detect and “classify” your data’s schema, but unfortunately the way I formatted my data (an array of json objects) required that I create my own classifier to properly crawl it. This is actually pretty easy.
Navigate to Glue from the AWS console and, in the left pane, click Classifiers. Click Add Classifier, name your classifier, select json as the classifier type, and enter the following for the json path: $[*]
It should look like this:
This is the notation needed for the classifier to extract the json schema from an array, which is how our data is formatted. To learn more about creating custom json classifiers, check out this resource.
Next, we are going to create a crawler to crawl our S3 bucket. In the pane on the left, click Crawlers and then Add Crawler. Enter a name for your crawler and then click on the Description, security configuration, and classifiers (optional) section just below so that it expands. You will see a list of your custom classifiers on the left (in this case, only one). Click Add to add it to your crawler. So far, it should look like this:
Go ahead and click Next to move to the next section. For Choose a data store, select S3 and provide the path to your data in S3, for example s3://MyBucket/MyFolder, which will crawl all objects in MyFolder. Click Next, and in the Add another data store section, select No and click Next. You are now in the Choose an IAM role section. Select Choose an existing IAM role, pick the IAM role you created in one of the earlier steps from the dropdown, and click Next (make sure your IAM role has permission to access the S3 bucket you are trying to crawl). Next up is creating a schedule for this crawler. I will leave the schedule up to you, but for the purposes of just running this example, you can select Run on demand. The penultimate step is to select or create the database for the crawler’s output; I added a database called “craigslist”. You can leave all the other defaults as they are and click Next. The final section is just a review. Ensure that everything is correct and click Finish.
After creating that crawler, you will be returned to your list of crawlers and will see a prompt to run it on demand now. Go ahead and run it. It should take less than a minute, though times will vary depending on the size and number of files in your s3 bucket. Once it finishes crawling, you will notice the last column says that a table was added:
Let’s take a look at our table real quick. In the left pane, click on Databases. This will show you a list of your databases: “craigslist” in my case, as that’s what I called it. You might also see “default” and “sampledb”, which are just some sample databases that AWS might provide for you. Click on the database that you created and, on the next screen, click on Tables in craigslist. You will now see a list of tables in the “craigslist” database, which should only contain the one created by your crawler. The table name will be the name of the folder that contains the json files in s3; in my case, it’s called “craigslist_rental_data”. Click on the table and on the following screen you should see the following:
Voila! Glue properly inferred our json schema! If you recall from the Python portion, we pulled the neighborhood name, posting date, posting title, rental price, number of bedrooms, square footage, and the URL link to the posting. Not only did Glue infer the schema, it also inferred the data types! Notice how price and bedrooms, which were stored as numerical values, are cataloged as type double.
At this point, we could use Athena to query our cataloged data using standard SQL, but before doing that, we are going to create and run an ETL job to convert the data from json to parquet. What is parquet, and why use it? Apache Parquet is an optimized columnar storage format that produces much smaller file sizes than json and also allows for much faster querying via Athena, which is precisely why we will be using it. Here is a simple snippet comparing parquet efficiency with the csv file format:
This is a great article explaining more of the benefits of using parquet vs. csv or json formats.
So let’s go ahead and create that ETL job in Glue. In the left pane under “ETL”, click Jobs and add a new job. In the Job properties section, all you need to do is name the job and select the IAM role you created earlier. You can leave the defaults for everything else.
In the Data source section, you will want to select the table we crawled previously — in my case it’s called “craigslist_rental_data”. Select it and click Next.
Next you are going to select the data target where your ETL job will write its data. Select Create tables in your data target. For the Data store dropdown, select “Amazon S3”. For Format, select “parquet”. Finally, for Target path, I simply created a new folder called craigslist-rental-data-parquet in the same S3 bucket where I am storing my json files, and that is what I am using here.
Next up is the Schema section, where you can edit the source-to-target data mappings, change data types, and filter for only specific fields. Here, the only thing I’m going to do is change the target data type for the datetime field from string to date. To change data types, just click on the data type and select the appropriate type.
Click Next and review your settings. If everything looks good, click Save job and edit script. Glue will produce a PySpark ETL script that you can edit if the job needs it, or just run as-is. In this case, we just need to run it, so click Run job.
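The generated script is not reproduced here, but it looks roughly like the sketch below. The database, table, and field names match this walkthrough, while the target path is a placeholder; everything else follows Glue’s standard template, and the script only runs inside the Glue job environment, not locally.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# read the json table that the crawler cataloged
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="craigslist", table_name="craigslist_rental_data")

# map source fields to target fields, casting datetime from string to date
mapped = ApplyMapping.apply(frame=datasource, mappings=[
    ("neighborhood", "string", "neighborhood", "string"),
    ("datetime", "string", "datetime", "date"),
    ("title", "string", "title", "string"),
    ("price", "double", "price", "double"),
    ("bedrooms", "double", "bedrooms", "double"),
    ("sqft", "double", "sqft", "double"),
    ("url", "string", "url", "string"),
])

# write the mapped frame out to S3 as parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-rental-data-bucket/craigslist-rental-data-parquet"},
    format="parquet")

job.commit()
```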
This should take a minute or two to run. Once it is done, you should see some parquet files in the S3 folder you specified under Target path.
Now that we have generated parquet files, we will follow steps similar to those we used to crawl our json data in order to catalog the parquet data and create a table in Glue. Go to Crawlers in the left pane and click Add crawler to add another crawler. Our steps here will be almost identical to those for the json crawler, except for two things:
- This time we will not be using a custom classifier on the initial section (Add information about your crawler)
- In the Add a data store section, the Include path needs to point to the S3 path holding your parquet files.
All other steps should be identical to how you created the json crawler. Once you create the crawler, run it so that it adds the parquet data to your “craigslist” database (or whatever you called your database). If you navigate to the tables in your database, you should now see two tables: your original json table as well as your newly added parquet table.
Nice! Now we are ready to use Athena to run some SQL queries on our data!
Querying the Data with Amazon Athena
You can navigate to Athena from the AWS console as you would with any other service, or you can simply checkmark the table name in AWS Glue (the parquet table), click the Actions dropdown, and then click View data.
The first time you do this, you should see a disclaimer from AWS letting you know that Athena is a pay-per-query service and that running a query will incur a small charge. The cost of querying such relatively small data, especially in parquet format, will be next to nothing, so you can just acknowledge it and continue.
This action opens up a new tab and runs a simple SELECT * SQL query on your table.
Awesome! Remember, your data is actually sitting in a file on S3 yet we were able to query it using normal SQL. How cool is that?! You can use SQL to really explore and filter your data in this way without having to actually load it into a database. That is the amazing power of Athena and the benefit of cataloging your data using AWS Glue!
Visualizing the Data with Amazon QuickSight
The final piece of the puzzle is the visualization of your data. Being able to use SQL on data sitting in S3 sure is cool and powerful, but you can get a much better feel for your data with some simple visualizations, charts, and tables.
The first time you click on QuickSight, you will be greeted with a page telling you that you need to sign up for the service. You basically just click Sign up and go through the one or two steps to get your QuickSight environment created.
Once you are set up with QuickSight ready to go, you’ll need to create a data source and a few visualizations which are all pretty straightforward.
In my case, I created two data sources:
- One with all the data from my craigslist Glue database
- One aggregated version that computes moving averages for things like price and price per square foot, as well as the number of unique listings per day
I probably could have accomplished this in QuickSight with just the first data set, but to be honest, QuickSight is a bit limited in creating calculated fields compared to something like Tableau, and I just didn’t want to spend the extra time figuring it out… so we will just use SQL to get the data the way I want it.
First, on the QuickSight home page, go to:
- Manage Data.
- Next, select New data set.
- Select Athena as the data source.
- Give it a name and select Create data source.
In the next section, select the name of your Athena database from the dropdown (craigslist in my case) and select the table you want to use, which will be the parquet table we created.
Select through that, confirm, and save the new data source.
Since I am creating a second data source that uses custom SQL, I just repeat steps 1–4, but instead of selecting a table from a database as we did in the last step above, select Use custom SQL.
And then finally paste in or type out your custom SQL (which I recommend first testing in the Athena console).
Here is a snippet of the custom query I am running:
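The exact query is not shown here, but a sketch of the kind of aggregation involved might look like the following. All table and column names are assumptions based on the fields we crawled; the window functions are standard Athena (Presto) SQL:

```sql
-- 7-day moving averages of price and price per square foot,
-- plus the number of unique listings per day
SELECT dt,
       listings,
       AVG(avg_price) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
           AS price_7day_avg,
       AVG(avg_price_sqft) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
           AS price_sqft_7day_avg
FROM (
    SELECT datetime AS dt,
           AVG(price)          AS avg_price,
           AVG(price / sqft)   AS avg_price_sqft,
           COUNT(DISTINCT url) AS listings
    FROM craigslist.craigslist_rental_data_parquet
    GROUP BY datetime
) AS daily
ORDER BY dt
```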
Confirm query and save the data source.
FINALLY, the last step — creating the actual dashboard.
Go back to the QuickSight home page and find New analysis. Select one of your data sources from the list and then select Create analysis. You will now see a canvas with a blank chart box and your field list to the left. To add your second data source, click the pencil icon to the right of Data set which can be found right above the field list.
When the dialog box comes up, click on Add data set.
Select your secondary data set and click Select.
You can now create different charts that use either one of your two data sets. You can switch back and forth between them by clicking the dropdown as shown below.
I’m not actually going to go through how I pieced together my dashboard, because that could be a whole new post. This gets you far enough that you can play around with creating graphs on your own, and I will probably write another post on creating graphs, editing, and publishing a dashboard.
I mainly wanted this article to focus on the data engineering piece as it relates to Lambda, Glue, and Athena.
One last thing I wanted to mention: costs. Running the Glue crawlers every day for the month of December cost me a whopping $1.45. Glue ETL jobs are a bit more expensive since those run serverless Spark, so I only have that job running once a week, costing me $3.67 for the month. Athena, QuickSight, and Lambda all cost me a combined $0.00. And lastly, S3 costs were $0.04 for the month. So for the cost of a fancy cup of coffee, I have a serverless data engineering project pulling data from Craigslist every day and refreshing a nice dashboard each week. Granted, the data is extremely small compared to “Big Data” workloads, but it’s still much cheaper than provisioning servers and paying for Tableau licenses :).
I hope this article is helpful to others and I’d love to see what other pet projects people are using Glue for in the comments section.