A Recommendation playthrough [2/3]: transforming data with AWS Glue

Paul Breton
Published in BeTomorrow · 11 min read · Jul 20, 2021
Credits to @guillaumebth_15094

We are aiming to show how to create a recommendation engine with AWS Personalize based on tracking data extracted from Mixpanel. To achieve this, we created a small mobile application enabling users to search, like and watch videos and channels from YouTube. Our final objective is for users to get updated recommendations as they use the application.

This article is the second of three, in which we explore how we can transform event data directly exported from Mixpanel with AWS Glue.

Our inputs are multiple JSON files (our datasets), partitioned by year, month and day. Each one contains rows with entities or interactions between entities. All this data is purely raw, and cannot be exploited as is.

Our outputs will be a transformed version of these datasets with cleaned data and enriched entities, ready to be used for any processing task (and in our article series, AWS Personalize).

If you already know what AWS Glue is, you can skip this article and go directly to the following one, A Recommendation playthrough : creating and exposing a Personalize model. If you have not read the previous one and wish to know more about user tracking, Mixpanel and its Data Pipelines, go to A Recommendation playthrough : tracking users with Mixpanel and exporting data to AWS S3.

AWS Glue icon

What is Glue ETL and what are we going to do?

AWS Glue is Amazon Web Services’ ETL tool. The service is divided into three parts: Glue Catalog, Glue Jobs and Glue Crawlers. We already saw in our previous article what Glue Catalog and Glue Crawlers can be used for. Let’s now talk about the main purpose of Glue, the Extract-Transform-Load service.

ETL is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s) (wikipedia)

ETL as a schema

In concrete terms, an ETL task is generally a Python script that takes data from one or multiple sources, applies transformations and finally outputs the result to a specified destination. For instance, we want to take data from our raw JSON datasets in S3, transform them, and create new datasets as CSV files in a new S3 destination. The Python code often relies on tools, libraries and even frameworks from well-known Python data science packages: Spark, NumPy, pandas, and so on.
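As a rough illustration, a minimal ETL script in plain PySpark could look like this (a sketch only; the bucket paths and column names are placeholders, not our actual ones):

```python
from pyspark.sql import SparkSession

# A minimal Extract-Transform-Load sketch in plain PySpark.
spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read the raw, partitioned JSON rows from S3
raw = spark.read.json("s3://my-raw-bucket/mixpanel/2021/07/")

# Transform: keep only the columns we care about and drop duplicate rows
clean = raw.select("distinct_id", "name", "time").dropDuplicates()

# Load: write the cleaned rows as CSV to the destination bucket
clean.write.mode("overwrite").csv("s3://my-clean-bucket/events/", header=True)
```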

In our case, we only have users and events between users and entities (video-liked, channel-subscribed, video-watched, …). An ideal output would be 5 different new datasets: users, videos, channels, users-videos-interactions and users-channels-interactions. To create each one, we are going to process our raw initial datasets through operations like merges, joins, drops and feature engineering. Let’s start!

Designing your Glue Job

Job creation and visual editor

Glue Jobs has two separate user interfaces to manage jobs. The first (the basic one) lets you create and edit jobs directly as a raw Python script. Although it is well suited to experienced users, we might want something quicker and easier to use, especially for simple tasks.

The second user interface, Glue Studio, is a WYSIWYG editor where you select how you want to process and shape your data. It takes the form of a graph whose nodes are sources, destinations or transformations. The output of this editor is nothing more than the Python script, with templated content. You can modify this script at any time to adapt it to your liking.

To start with Glue Studio, go to AWS Glue in the AWS console and select the “Glue Studio” tab on the left of the page. From there, choose “View jobs” to access the creation/edition panel. Let’s create our first job! Because we want something to start with, we are going to take the option of creating a basic graph with the source and destination already filled in. Here we want to extract our data from our JSON dataset in an S3 bucket, and export it as CSV files to another S3 bucket.

Starting block to create a job

We can see our editor and the nodes of our graph. Everything now is about editing these nodes’ properties to create our process.

Base template when creating a Glue Job in Glue Studio

In the “Script” tab, we can see the generated code associated with this graph. The script uses the AWS Glue SDK and the Apache Spark Python library (pyspark).

Python code corresponding to the base Glue Job template

Even if you are not familiar with Spark, you can notice the four main parts (sketched in code right after this list):

  • Job configuration, where we create the Glue job itself and associate the configuration context
  • The datasource(s), where we extract data from AWS services (Glue Data Catalog or S3) to create a dataframe. This is the Extract in ETL.
  • The transform(s), where we apply operations and do feature engineering on our dataframe. This is the Transform part.
  • Finally, the datasink(s), where we export (load) our processed dataframe to a destination (S3 bucket) in a defined format (JSON, CSV or Parquet). This is the final Load part.
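Put together, the templated script looks roughly like this (a trimmed sketch; the database, table, bucket names and field mappings below are placeholders, not our real ones):

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Job configuration: create the Glue job and its context
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Datasource (Extract): read the raw dataset referenced in the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="mixpanel_raw", table_name="events", transformation_ctx="datasource0"
)

# Transform: rename fields to the target schema
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("distinct_id", "string", "user_id", "string"),
        ("time", "long", "event_time", "long"),
    ],
    transformation_ctx="applymapping1",
)

# Datasink (Load): write the result as CSV to the destination bucket
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=applymapping1,
    connection_type="s3",
    connection_options={"path": "s3://my-clean-bucket/users/"},
    format="csv",
    transformation_ctx="datasink2",
)

job.commit()
```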

Tip: Inside Glue Studio you cannot manually edit your Python script. To do so, you have to go back to the classic AWS Glue interface and open the “Jobs” tab.

Editing your job

Back to our visual editor. In our case, we use the Glue Data Catalog we created in our previous article to configure our data source. We just have to click on the “Data source — S3 Bucket” node, select our Glue Data Catalog database and table, and everything is done! We could also have skipped the Glue Data Catalog and directly provided the S3 path to our raw data: a Glue Job is able to read multiple entries inside a directory and infer its data schema from a few of them.

Changing data source properties
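For reference, the no-Catalog alternative mentioned above would look something like this in the generated script (same context as the earlier sketch; the S3 path is a placeholder):

```python
# Hypothetical variant: read the raw JSON directly from S3, letting Glue
# infer the schema from the files it finds under this prefix
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-raw-bucket/mixpanel/"], "recurse": True},
    format="json",
    transformation_ctx="datasource0",
)
```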

Now, we just have to continue building the ETL graph. In our case, we applied a mapping to rename fields and used custom Spark SQL code to remove duplicates from our dataset, as sketched below.
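In script form, the Spark SQL step boils down to something like this (column names are illustrative, and we assume the mapping node is called applymapping1 as in the earlier sketch):

```python
from awsglue.dynamicframe import DynamicFrame

# Register the mapped frame as a temporary view and deduplicate it with SQL
applymapping1.toDF().createOrReplaceTempView("events")
deduplicated_df = spark.sql(
    "SELECT DISTINCT user_id, video_id, name, event_time FROM events"
)

# Convert back to a DynamicFrame so it can feed the destination node
sqlquery1 = DynamicFrame.fromDF(deduplicated_df, glueContext, "sqlquery1")
```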

Finally, we output the data to our destination S3 bucket. We recommend selecting the option to automatically “create a table in the Data Catalog”. By doing so, we get a strict definition of our dataset schema and can query the data in AWS Athena (or reuse this schema in other AWS services).

Glue Jobs can range from really simple to really heavy. For instance, our user extraction job has only two operations and requires a single dataset, while our video extraction job needs 4 different sources and multiple operations to transform and merge data between sources.

A simple job : User extraction

Tip: When merging data sources, you need a key on which you can join the data (just like in SQL). The problem is that events from Mixpanel often share a set of attributes which are common to every event and which you do not really care about. These columns end up duplicated. We therefore recommend cleaning your data as much as possible before joining it with another data source. For example, to create our video entity we need data from events like “video-watched”, “video-liked”, “video-disliked” and “video-consulted”. In each event we really only need one or two attributes; attributes like “mp_lib_version” or “mp_processing_time_ms” are pure metadata provided by Mixpanel that we almost never need (and there are a lot of them).
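A DropFields transform placed before the join takes care of this kind of cleaning. A sketch, assuming a source node named datasource0 and using the two Mixpanel metadata fields mentioned above (adjust the list to your own export):

```python
from awsglue.transforms import DropFields

# Remove Mixpanel metadata columns we never use, so they do not end up
# duplicated after the join
dropfields1 = DropFields.apply(
    frame=datasource0,
    paths=["mp_lib_version", "mp_processing_time_ms"],
    transformation_ctx="dropfields1",
)
```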

Below is an example of a heavier graph, with multiple data sources and only one destination. At each step we handle duplicates by renaming or dropping fields, and we finally select and aggregate data with the Spark SQL node.

A bigger job : Video extraction

Job details

Before validating your job, you need to configure the final details in the “Job Details” tab. Most of these fields can be left at their default values, but the IAM Role and Job Bookmark must be set or consciously reviewed.

Your Job’s details edition page

For the IAM Role, make sure that your role has the right policies attached (it must have access to Glue and to all S3 sources and destinations). If you want logs for your tasks, do not forget to attach the AWS CloudWatch policies to the Glue role and to create the log groups in CloudWatch.

The Job Bookmark enables your job to remember where it stopped last time (to prevent re-processing of older entries). It is an interesting feature, but in some cases it breaks your job or causes incorrect runs. We often disable this option, as our dataset is small and we prefer re-processing everything.
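If you manage jobs programmatically rather than through the console, the same setting is exposed as the --job-bookmark-option default argument. A sketch with boto3 (the job name, role and script location below are placeholders):

```python
import boto3

# Disable the bookmark so every run re-processes the whole dataset
glue = boto3.client("glue")
glue.update_job(
    JobName="user-extraction",
    JobUpdate={
        "Role": "my-glue-job-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-glue-scripts/user-extraction.py",
        },
        "DefaultArguments": {"--job-bookmark-option": "job-bookmark-disable"},
    },
)
```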

Your ETL job is now ready to start! You just need to press the Run button. You can either keep using the Glue Studio interface as a high-level one, or go back to the main AWS Glue interface and select the Jobs tab in the left panel. It provides an uglier but richer interface, where you can better see metrics, the reasons for potential failures, and so on.

AWS Glue “classical” interface. A richer one ?

Once your job has run, you can check the output either in S3 or in AWS Athena. Athena is an interactive SQL query service, able to read and write data from data lakes. It is very powerful and useful when you want to access or aggregate massive amounts of data, and it requires your data to be structured and referenced in a Glue Data Catalog. If you configured the last output node of your Glue Job in Glue Studio to create a Data Catalog entry, you will be able to query the output of your job in AWS Athena. Note, however, that Athena does not handle CSV files very well and you will have to do some tricks to get a correct output; if you want to use Athena, we recommend a Parquet or JSON output instead of CSV.

AWS Athena’s query console

Tip: If something went wrong and you activated AWS CloudWatch for this job, check the “Logs” link in your job run and select the “[…]-driver” log entry. The “Output” is the full output containing Apache Spark’s logs (which are almost never useful), and the “Error logs” contain raw log output which is hard to read and too often concerns the framework’s internal exceptions.

Bonus : I want my output to be a single file

Our Glue Job outputted partitioned data in multiple files. To make the output a single file, we have to modify the job’s script directly. In the classic user interface of your jobs, select your job and click on “Script” (as shown in the previous figure). Here you have your templated and commented script. What we need is to tell Spark to change the way it partitions our last dataframe, so that it outputs no more than one file. This is done through the repartition method, which takes an integer as input (the number of partitions) and repartitions accordingly.

At the end of this script, Glue templated the destination node as a “DataSink”. It is this node that receives the output configuration and finally writes the dataframe. Therefore, all we need to do is modify the line where the DataSink executes the write so that it takes the repartitioned dataframe instead of the original one:

From a plural output to a single one
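In script form, the change amounts to something like this (variable names follow the templated script from the earlier sketch; yours may differ, and the bucket path is a placeholder):

```python
from awsglue.dynamicframe import DynamicFrame

# Collapse the final frame into a single partition so only one file is written
repartitioned_df = applymapping1.toDF().repartition(1)
repartitioned_dyf = DynamicFrame.fromDF(repartitioned_df, glueContext, "repartitioned_dyf")

# Point the templated DataSink at the repartitioned frame instead of the original
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=repartitioned_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-clean-bucket/users/"},
    format="csv",
    transformation_ctx="datasink2",
)
```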

And here you have it ! Your output is now one single CSV file.

Tip: Be sure to make this modification after validating your whole job on the Glue Studio side. Modifying any attribute of any node in the Glue Studio version of this script erases any previous manual modification of the entire script and re-generates the templated script from scratch.

And what about the prices?

AWS Glue pricing is based on what you use, at an hourly rate (billed by the second). Resource usage is measured in DPUs: Data Processing Units. One DPU is literally a workload with 4 vCPUs and 16 GB of RAM allocated. Every Apache Spark job needs a minimum of 2 DPUs to work, and Glue allocates 10 DPUs by default. Each job run is billed for at least 1 minute. The price is around $0.44 per DPU-hour (depending on your region). Hence the following formula…

Because we like maths, the formula to estimate your Glue cost
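Written out (assuming the roughly $0.44 per DPU-hour rate mentioned above), the estimate is:

cost per run ≈ number of jobs × DPUs per job × (runtime in seconds ÷ 3,600) × $0.44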

In our case, we designed 5 different Apache Spark Glue jobs. We suppose that each one takes around 5 minutes to run (300 seconds) and consumes 5 DPUs (the DPU number is somewhat arbitrary; we might not need the 3 DPUs above the minimum). We end up with:
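5 jobs × 5 DPUs × (300 s ÷ 3,600 s per hour) × $0.44 per DPU-hour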

Which comes to roughly $0.92 for one run of all of our jobs. If we schedule them to run each time we receive data from Mixpanel, which can be on an hourly basis, this price grows quickly: for a month of hourly runs (24 × 30 = 720 runs), that is about $660.

Conclusion

This step is ideally simple, but in our case it was simple because the process itself and the entities we need to transform are simple, and because we designed beforehand what we needed and what the jobs were supposed to look like. Do not hesitate to use AWS Athena, which turns out to be really useful for checking your jobs’ inputs and outputs.

If you are interested in implementing a recommendation engine with AWS Personalize, you should go to the next article of this series : A Recommendation playthrough : creating and exposing an AWS Personalize model.

Thank you.

I hope you enjoyed this article! Please share your comments with us in the section below👇

BeTomorrow is a consulting, design and software agency. Click here to have a look at what we do!
