Serverless ETL on AWS: From XML to Redshift via Lambda and Glue
As part of building an enterprise data lake for a hot fintech startup, we had to ingest data from multiple sources, and one of those sources delivered XML. Besides loading the data into the data lake on S3, the pipeline also had to load it into Redshift, as fast as it could.
The challenge with this particular XML was that the shape of the data varied with each file, so it was difficult to arrive at a strict schema. Here is how I went about dealing with it. It's not much, but some of you might have similar problems, and I hope this helps.
Below is the high level overview:
- Normalise the unstructured XML data and convert it to a bunch of CSV files
- Trigger a Lambda function on each new S3 object and write the converted CSVs to a target bucket
- Catalog the tables using a Glue Crawler
- Run a Glue job to load these tables into a Redshift database
Time for more details now:
1. The XML and the Cleanup: The tough part was coming up with a schema for data that looks like the sample below.
Starting from scratch, and being a newbie to Python, I began with xml2csv from the xmlutils library. The problem I ran into was that the as-is conversion came out all jumbled up.
So the approach I followed to further CSVify (:D) the XML was: beautify the input files using Beautifulsoup4, convert its output into lxml trees (whose XPaths are easy to iterate over), then hard-code all the attributes tagged in the XML and nullify each one whose tag is empty. That finally landed me at this output:
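As a rough sketch of that conversion step, the snippet below parses messy XML with lxml's recovering parser (standing in for the BeautifulSoup tidy-up pass) and emits fixed-column CSV rows. The element name `record` and the field list are hypothetical; the real pipeline hard-coded the attributes found in the actual feed.

```python
import csv
import io
from lxml import etree

# Hypothetical column list; the real code hard-coded every attribute seen in the feed
FIELDS = ["id", "name", "amount"]

def xml_to_rows(xml_bytes):
    # recover=True tolerates mildly malformed XML, similar in spirit to
    # running the input through BeautifulSoup first
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(xml_bytes, parser=parser)
    rows = []
    for record in root.iter("record"):  # "record" is an assumed element name
        row = {}
        for field in FIELDS:
            node = record.find(field)
            # Nullify missing/empty tags so every CSV row has the same columns
            row[field] = node.text.strip() if node is not None and node.text else ""
        rows.append(row)
    return rows

def rows_to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping a fixed field list is what makes the varying XML land in a uniform CSV shape, at the cost of having to update the list when the feed changes.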
When all of this code was working well locally, it was time to take this to Lambda.
2. Lambda for sanitizing on the fly: The plan was to trigger the above logic on each new XML file added to the S3 bucket and write the converted CSV to another S3 bucket. Below is the Python code that became the Lambda function.
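The shape of such a handler looks roughly like this. The bucket name, key layout, and `convert_xml_to_csv` placeholder are all illustrative, not the article's actual code:

```python
import os

# Assumed name; in a real setup the target bucket comes from the function's config
TARGET_BUCKET = os.environ.get("TARGET_BUCKET", "converted-csv-bucket")

def csv_key_for(xml_key):
    # incoming/file1.xml -> incoming/file1.csv
    base, _ = os.path.splitext(xml_key)
    return base + ".csv"

def convert_xml_to_csv(xml_bytes):
    # Placeholder for the cleanup/conversion logic from step 1
    raise NotImplementedError

def handler(event, context):
    import boto3  # available in the Lambda runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        csv_text = convert_xml_to_csv(body)
        s3.put_object(Bucket=TARGET_BUCKET, Key=csv_key_for(key),
                      Body=csv_text.encode("utf-8"))
```

Wiring the S3 bucket's `ObjectCreated` event notification to this function is what makes the conversion happen "on the fly" for each upload.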
Some of the lessons I learned while working with Lambda:
Since the code had multiple dependencies, just pasting it into the Lambda console wasn't a solution, as you can't install dependencies there. Fetching objects from S3 to process them also isn't as simple as reading local files. And because you don't control the server Lambda executes on (serverless!), you can't depend on files in the /tmp/ folder surviving between invocations; treat it strictly as scratch space while a single invocation runs.
Uploading the code together with its dependencies to the Lambda function was also tricky: my package exceeded the 10 MB upload limit. A blog by Moe Aboulkheir helped accomplish the goal: build a folder containing all the dependencies plus the code on an EC2 instance that has the required Python version installed, and load that into the Lambda console.
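The packaging recipe boils down to a few commands, run on a machine with the same Python version as the Lambda runtime (the file and function names here are illustrative):

```shell
# Install the dependencies next to the handler code
mkdir package
pip install beautifulsoup4 lxml xmlutils -t package/
cp lambda_function.py package/

# Zip the contents (not the folder itself) so the handler sits at the zip root
cd package && zip -r ../deployment.zip . && cd ..

# Upload deployment.zip through the console, or via the CLI:
# aws lambda update-function-code --function-name xml-to-csv \
#     --zip-file fileb://deployment.zip
```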
Now that we have the data in CSV, it's time to load it into Redshift.
Enter Glue. It is a fully managed ETL service: we can create and run an ETL job with a few clicks in the AWS Management Console. We just need to create a crawler and point it at the locations to fetch data from. The only catch in our setup was that the crawler's built-in classifiers handled CSV/JSON best (which hopefully answers why we converted XML to CSV).
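Besides the console clicks, the crawler can also be created programmatically. A minimal sketch with boto3, where the crawler name, role ARN, bucket path, and database name are all placeholders:

```python
def crawler_config(name, role_arn, s3_path, database):
    # Glue crawler pointed at the converted-CSV bucket; all names illustrative
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_run_crawler(config):
    import boto3  # requires AWS credentials with Glue permissions
    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])
```

Once the crawler finishes, the inferred tables show up in the Glue Data Catalog, ready to be used as a job's source.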
Running a job: a Glue job is the part that actually loads the data to S3 or a JDBC target, wherever you instruct it to. You tell it which source table to read from and, if you want to load data into Redshift, set the destination to JDBC with the details of your Redshift cluster.
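Inside the job, the script reads the cataloged table as a DynamicFrame and writes it out over JDBC. A sketch of that shape is below; `awsglue` only exists in the Glue runtime, so the context is passed in, and the connection name and option values are placeholders:

```python
def redshift_connection_options(database, table, temp_dir):
    # JDBC write options for Redshift; Glue stages data in temp_dir on S3
    return {
        "dbtable": table,
        "database": database,
        "redshiftTmpDir": temp_dir,
    }

def run(glue_context, source_db, source_table, options):
    # Runs inside a Glue job, where glue_context is an awsglue GlueContext
    frame = glue_context.create_dynamic_frame.from_catalog(
        database=source_db, table_name=source_table
    )
    # "my-redshift-connection" is a placeholder Glue connection holding the
    # cluster endpoint and credentials
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=frame,
        catalog_connection="my-redshift-connection",
        connection_options=options,
        redshift_tmp_dir=options["redshiftTmpDir"],
    )
```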
And finally, once the data is loaded into Redshift, it's time to query it with your favorite tool. I used RazorSQL.
This could be enhanced further by:
- Adding a loadid column to the CSVs derived from a specific XML file, in order to join efficiently later on in Redshift
- Glue might be too powerful a tool for this simple job; the data could have been loaded into Redshift using aws-lambda-redshift-loader
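The first enhancement is a small change in the conversion step. A sketch of tagging every row with the source file it came from (the column name and value format are just one possible choice):

```python
import csv
import io

def add_loadid(csv_text, loadid):
    # Append a loadid column identifying the source XML file, so rows from
    # the same load can be joined or filtered efficiently in Redshift later
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(rows[0] + ["loadid"])       # header row
    for row in rows[1:]:
        writer.writerow(row + [loadid])          # e.g. the source XML key
    return out.getvalue()
```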
Special thanks to RK Kuppala for reviewing the article. Hope you found it useful. Happy data processing! :)