AWS Glue for loading data from a file to the database (Extract, Transform, Load)

Dimitris Dovinos
codefully.io

--

I have spent a rather large part of my time coding scripts for importing data from a file into the database. It is a common feature of an application to ask the user to upload a file with data. A usual use case is sales numbers. My approach up to now has been to avoid the file upload at all costs and provide a form instead. This is fine for a low volume of data, but once there are more than a couple of rows an upload makes sense.

The upload comes with lots of problems of its own. To name a few:

  1. Data validation. Users will upload data that may not conform to what you have been expecting. For example, you may have asked for csv and the user instead uploaded tab-delimited data (or Excel!). Or, you may have been expecting the data to be in a particular order but the user switched the columns around. The list is endless, and the problem is therefore a hard one to solve.
  2. Transforming the data into the format you want. Dates should be turned into a date object of some sort, numeric strings should be parsed into numbers, and so on. There may also be logic that needs to be applied so that the data is converted into something useful for the application. For example, a product SKU may need to be supplemented by an id that is internal to your database, or an email may need to be converted to lowercase or mapped to a user id (there is a sketch of this kind of hand-rolled transformation right after this list).
  3. Workload. If this is a web application then it needs to remain responsive. This is not going to be the case if the user asks for 1M rows to be processed. You therefore need to start thinking about extra workers and queues.
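To give a sense of the kind of hand-rolled script that points 1 and 2 usually turn into, here is a minimal sketch of per-row validation and transformation. The expected header, the SKU lookup and the column names are made up for the sake of the example; they have nothing to do with the Airbnb data used later on.

    # A minimal, hand-rolled validation/transformation sketch: the kind of script
    # that Glue aims to replace. Header, SKU lookup and column names are assumed.
    import csv
    from datetime import datetime

    EXPECTED_HEADER = ["sku", "email", "sale_date", "amount"]  # assumed column order
    SKU_TO_PRODUCT_ID = {"ABC-123": 42}                        # assumed internal lookup

    def load_sales(path):
        rows, errors = [], []
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            if reader.fieldnames != EXPECTED_HEADER:
                raise ValueError(f"Unexpected columns: {reader.fieldnames}")
            for line_no, raw in enumerate(reader, start=2):    # line 1 is the header
                try:
                    rows.append({
                        "product_id": SKU_TO_PRODUCT_ID[raw["sku"]],  # SKU -> internal id
                        "email": raw["email"].strip().lower(),        # normalise the email
                        "sale_date": datetime.strptime(raw["sale_date"], "%Y-%m-%d").date(),
                        "amount": float(raw["amount"]),               # string -> number
                    })
                except (KeyError, ValueError) as exc:
                    errors.append((line_no, str(exc)))                # collect, don't crash
        return rows, errors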

AWS Glue solves part of these problems.

What is AWS Glue?

It is a ‘wrapper’ service that sits on top of an Apache Spark environment. I don’t have a clue what Apache Spark does, and I guess this is the beauty of Glue: I can use the service without needing to be an expert.

What does it do for the developer?

It does 3 things:

  1. Glue discovers the schema of your data. You provide it with a sample file (I will walk you through it) and it figures out for itself how the data is structured. The data could be csv, json, xml, or custom (grok — see github and logz.io). This is the job of the Crawler. You can also manually make changes to the detected schema. This is done just once (unless your data keeps changing shape — in which case you need to run the crawler again in order to update the schema).
  2. Based on the data schema and its source/destination, Glue will help you create a script (a job) for importing the data, transforming it and then loading it to a database. The job is the heart of the service, and Glue does a good job (no pun intended) of getting you started without any prior knowledge. Glue acts like a wizard which helps you generate a piece of code. The job creation part should be a one-off.
  3. Provided that you have established a schema, Glue can run the job, read the data and load it into a database like Postgres (or just dump it into an S3 folder). This is the step that needs to be repeated every time there is a new upload.
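Since step 3 is the recurring one, it is worth knowing that both the crawler and the job can be triggered programmatically (with boto3 in Python), so a web application can kick off an import whenever a new file lands in S3. A minimal sketch; the crawler and job names are placeholders for the ones created later in this post:

    # Trigger the crawler (only when the data shape changes) and the job with boto3.
    # The crawler and job names are placeholders for the ones you create below.
    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")   # assumed region

    glue.start_crawler(Name="listings-crawler")            # re-crawl only if the schema changed

    run = glue.start_job_run(JobName="listings-to-postgres")
    print(run["JobRunId"])                                 # handle to poll for job status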

Sample data

There are many great sources available. I found some interesting examples from Airbnb and also from AWS open data. I ended up with the ‘listings’ of Airbnb. It is a csv file with a few thousand rows (for the city of Athens) and about a dozen fields (columns), a rather typical example of what most applications have to deal with. The data file was stored in an S3 bucket named gluedatasource. The name is obviously irrelevant, but we will use it in a minute.
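For completeness, getting the file into the bucket is a one-liner with boto3 (the local file name is an assumption):

    # Upload the sample listings file to the source bucket (local file name assumed).
    import boto3

    boto3.client("s3").upload_file("listings.csv", "gluedatasource", "listings.csv")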

Extract from a sample input file. Airbnb listings for Athens.

The columns include numbers, strings, coordinates, and dates.

The big picture

The plan is to upload my data file to an S3 folder, ask Glue to do its magic and output the data to an RDS Postgres instance. The source and sink could have been different, but this seemed like a workflow that fits well with a web application.

Workflow for importing data from a CSV to a Database

Crawl it!

First, Glue has to crawl the file in order to discover the data schema, so we need to create a crawler.

Select to create a new crawler and then give it a name:

Define the path from which the crawler will get the file(s) which it should analyze. In this case, it is the top path of an S3 bucket:

The crawler will need access to the bucket, so there should be an IAM role that allows it. Chances are you won’t have one ready, so go ahead and create a new one. I called mine AWSGlueServiceRole-deletemeRole since I wanted to throw it away later:

Decide when the job will run. I wanted it to run on demand:

Decide where the crawler will store the metadata it generates. The crawler’s output is a schema of the data that it analyzes; the schema is essentially a json document. At the top of the menu on the left there is a link to the databases that can be used for the schema. These are not databases in the traditional sense; they are just a way to organize your information (perhaps the naming is not that great). I selected sampledb (which I created earlier).

Finally, review and complete the crawler creation.
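If you would rather script these steps than click through the console, the same crawler can be created with boto3. This is a sketch under the same assumptions as the previous steps (the throwaway role, the gluedatasource bucket and the sampledb database):

    # Create and start the crawler programmatically: same settings as the console wizard.
    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")   # assumed region

    glue.create_crawler(
        Name="listings-crawler",                           # placeholder name
        Role="AWSGlueServiceRole-deletemeRole",            # the throwaway role created above
        DatabaseName="sampledb",                           # catalog database for the schema
        Targets={"S3Targets": [{"Path": "s3://gluedatasource/"}]},
    )
    glue.start_crawler(Name="listings-crawler")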

Once the crawler is in place, select it and run it. It should return with something similar to this:

Detected schema

The schema reflects the file. Here is the file again for an easy comparison:

Input file

Prepare for the ‘job’!

Glue now has a schema for the data that we want to import. It still does not know, though, what to do with the data. We want to load the data into a database, so we have to create a job to do that. The job is a piece of code (I assume the kind of code that Spark users would be familiar with) that Glue helps you generate.

Before getting there you need to allow Glue access to your db. For the connection to take place you need to have the db inside a VPC (creating a VPC is beyond the scope of this post, so I won’t bother with details) with a security group that allows access to Glue.

Select the security group from the dashboard of the RDS database:

RDS dashboard

Then change the rules (inbound and outbound) to allow Glue to access the db. For the purpose of this experiment, set ALL TCP traffic from all sources (0.0.0.0/0) as both an inbound and an outbound rule. This is not safe. Change it to appropriate values for your production setup.
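If you want to script these (deliberately permissive) rules rather than edit them in the console, something along these lines would do it; the security group id is a placeholder and, again, this should be tightened for production:

    # Open all TCP traffic in and out of the RDS security group (experiment only, not safe).
    # Note: the outbound call may fail with a duplicate-rule error if the default
    # allow-all egress rule is still in place.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")     # assumed region
    rule = [{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }]

    ec2.authorize_security_group_ingress(GroupId="sg-0123456789abcdef0", IpPermissions=rule)
    ec2.authorize_security_group_egress(GroupId="sg-0123456789abcdef0", IpPermissions=rule)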

RDS connection

Glue requires that you create a connection to your database (the data sink) so that it knows how to connect to it. Select Connections (in the Databases Menu of Glue) and ‘Add connection’.

Then enter the database name, your username, and the password.

Review and submit
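The same connection can also be defined with boto3; the JDBC URL, credentials and network details below are placeholders for whatever your RDS instance actually uses:

    # Define the JDBC connection to the RDS Postgres instance (all values are placeholders).
    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")   # assumed region

    glue.create_connection(
        ConnectionInput={
            "Name": "postgres-sink",
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": "jdbc:postgresql://<rds-endpoint>:5432/<database>",
                "USERNAME": "<username>",
                "PASSWORD": "<password>",
            },
            "PhysicalConnectionRequirements": {
                "SubnetId": "subnet-0123456789abcdef0",           # subnet of the RDS instance
                "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # the group edited above
            },
        }
    )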

S3 endpoint

There is a similar issue with S3. You need to allow access to S3, which means creating a VPC endpoint. Navigate to the VPC where your RDS database resides and select Endpoints -> Create Endpoint. Basically, you are going to allow connections to be established between the VPC resources and the S3 bucket (S3 buckets do not live inside a VPC).
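As with the other steps, the endpoint can be created programmatically; this sketch assumes a gateway endpoint and uses placeholder VPC and route table ids:

    # Create a gateway VPC endpoint so resources inside the VPC (the Glue job) can reach S3.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")     # assumed region

    ec2.create_vpc_endpoint(
        VpcId="vpc-0123456789abcdef0",                     # the VPC holding the RDS instance
        ServiceName="com.amazonaws.eu-west-1.s3",          # S3 service name for the region
        RouteTableIds=["rtb-0123456789abcdef0"],           # route table(s) of that VPC
    )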

Start the job wizard

Here is the juicy part! Start the job wizard. Nearly everything is automatically populated. Make sure to go for Python and for ‘A proposed script generated by AWS’:

Then select the file that you want to parse (the crawler has automatically created a source in Databases -> Tables which you can use),

and choose where the output should be written (the connection you created earlier should show up here now).

Drum roll… the moment you have been waiting for. Map the data to the database table. On the left, you see the columns as they emerge from the crawling. On the right, you decide what the output will look like; you can also drop columns or add more.

Then Glue will generate the job code. The end result should look similar to this:

Code for completing the job.
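I will not reproduce my generated script verbatim, but it follows the standard shape of a Glue-generated PySpark job: read the source from the catalog, apply the column mappings, write through the JDBC connection. The catalog database/table, connection name, target table and mappings below are illustrative placeholders rather than the exact values from my run:

    # Rough shape of the script Glue generates. Database, table, connection and
    # mapping names are illustrative placeholders.
    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Source: the table the crawler created in the catalog database.
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="sampledb", table_name="gluedatasource", transformation_ctx="datasource0"
    )

    # Map source columns to target columns (a few illustrative listings fields).
    applymapping1 = ApplyMapping.apply(
        frame=datasource0,
        mappings=[
            ("id", "long", "id", "long"),
            ("name", "string", "name", "string"),
            ("latitude", "double", "latitude", "double"),
            ("longitude", "double", "longitude", "double"),
            ("price", "long", "price", "long"),
            ("last_review", "string", "last_review", "string"),
        ],
        transformation_ctx="applymapping1",
    )

    # Sink: write through the JDBC connection created earlier into a Postgres table.
    datasink2 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=applymapping1,
        catalog_connection="postgres-sink",
        connection_options={"dbtable": "listings", "database": "mydb"},
        transformation_ctx="datasink2",
    )
    job.commit()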

This is what you have been working for! This is the script that tells Glue how to load your data, map it, and save it to the database. Run the script and you should have a new table in the database populated with your data. After running the script, check that the new table has the correct schema. Fortunately, in my example it did look good, so I did not have to write a single line of SQL!

Postgres table, as created (and populated) by Glue.

Cost

It is all relative. Each time you run a job there is a minimum charge of $0.44. My jobs were small (about 10K rows), so I was always hitting the minimum charge. You have to decide for yourself whether this is cost effective. Consider, though, the alternative: spending time developing (or reusing) code to perform the import, and allocating resources and logic for background jobs. If you have to deal with large amounts of data (e.g. more than you would want to upload through a browser) then you also have to deal with handling large files. Overall, I’d say that Glue is cost-effective given the massive savings in development time, and it would probably be a no-brainer for larger files (I would consider anything above 5MB to be larger than what a web app has to deal with).

It would be nice if…

Glue is a developers’ tool. Before using it in any application you would have to build a layer wrapping its functionality. A tricky part that I would love to have a solution for (one that does not involve me writing it) is data validation. When users upload data they need instant feedback, along with ways to fix or ignore the data at the moment of the upload. Currently, we do this with React and various other, mostly front-end, approaches. This is hard work which I would love to have done for me!
