Serverless Analytics, Part 4: ETL, and Why You May Not Need Crawlers
This is Part 4 of a 5-part series about getting started with serverless analytics. In case you missed it, you can catch up on the earlier parts of the series through the links at the end of this article.
The basic serverless analytics architecture in this series can help you jumpstart your analytics capability. It gives you the ability to gain insights from terabytes of collected data without a huge opex commitment (i.e., without paying a huge monthly bill for that analytics solution). This is especially useful while you are still experimenting and getting started, and it keeps holding up even as you reach terabyte-level data sizes.
Where we left off from Part 3
In the previous article, we focused on the Cataloging phase of the basic serverless analytics architecture, highlighted above. The goal of the Cataloging phase is to create and maintain metadata tables (each one a data catalog table) of your data residing in S3, so that you can efficiently query them through SQL as if they were actual database tables, as opposed to just data sitting in an S3 bucket. And by efficiently query, we mean the ability to execute queries that are fast (performance-wise) and cheap (cost-wise).
We talked about how to create a data catalog table that matches the partitioning scheme of the example scenario from Part 2. We created this table manually so we could customize the column names with ease.
We then talked about creating a crawler, which is an automated mechanism to make sure our data catalog tables are updated whenever our S3 buckets get new data. Since our data catalog tables don’t actually hold data (your actual data is in S3), only metadata and partitions, there needs to be a syncing mechanism so that when physical partitions are created in S3 (e.g., new data is dumped into it at the end of the day), the new partition information is also added to our data catalog table. Crawlers often serve as the de facto syncing mechanism, but as I mentioned way back in Part 1, they aren’t strictly required. Crawlers are a good visualization of the concept of data cataloging, though, so I chose to emphasize them in the basic serverless analytics architecture.
Now, in Part 4, we’ll talk about the Transform phase, which means Extract, Transform and Load (ETL) jobs. We also finally get to revisit and (hopefully) close the issue of “Do I need crawlers at all?”
The Transform Phase
The Transform phase of the basic serverless analytics architecture is highlighted above.
Our goal here is to take data from our Raw bucket and, through an automated mechanism, transform them into a more structured and more analytics-friendly form (think “SQL-friendly” given the types of queries you or your users want to do). This transformed data is sent to a different bucket, which is labeled “Processed” above, since it contains exactly that — our processed data.
This is where Glue ETL Jobs come in. An ETL Job in Glue is exactly the automated mechanism for data transformation that we need for the Transform phase.
Glue Jobs for ETL
A Glue job can be created through AWS Glue Studio, and this is the simplest and least-effort way to create generic ETL jobs:
- Go to AWS Glue -> AWS Glue Studio -> Jobs
- Create a Job through the visual editor in Glue Studio — in our case, we can stick with the default option of “Visual with a source and target”.
- Since our Transform phase deals with bucket-to-bucket transformations (our Raw bucket being the source, and our Processed bucket being the target), we can also stick with the default Source and Target settings, both of which are “Amazon S3”.
The image above captures the default settings mentioned. We’re ready to create this new job, so we simply hit that big orange “Create” button.
You’ll find yourself in the visual editor with some pre-defined elements, thanks to the default setting of “Visual with a source and target”. That looks something like this:
The important thing to understand in the visual editor is that the nodes (the rectangles that are connected to each other) represent steps in your ETL job. You can have any number of nodes, they can be of different types (one of “source”, “action”, or “target”), and they can be connected to each other as required by your ETL job. It’s essentially a simple flow chart that lets you do high-level tinkering for the ETL job.
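By the way, behind the visual editor, Glue Studio generates a PySpark script, and each node becomes a section of that script. Here’s a rough sketch of the boilerplate every generated job starts with (treat this as illustrative, not the exact code Glue Studio will emit for you):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard setup that Glue Studio generates for every job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# One block of code per node goes here:
#   source node -> read a DynamicFrame (Extract)
#   action node -> ApplyMapping (Transform)
#   target node -> write the DynamicFrame (Load)

job.commit()
```

You rarely need to touch this directly, but knowing it’s there helps when the visual editor isn’t flexible enough.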
Let’s resume dealing with the scenario that we’ve been handling in Part 2 and Part 3 and try to see how we can create ETL jobs for it.
Our Raw bucket contains data from Sales, IT, and Customer Support, and is properly partitioned using Year/Month/Day (laid out in Part 2). We then created three data catalog tables (one for Sales data, one for IT data, and one for Customer Support data), also partitioned by Year/Month/Day. Now, our goal is to get data from Raw (Extract), process them to turn them into a more analytics-friendly structure (Transform), and then place them into the Processed bucket (Load).
As you probably guessed, the three nodes created for us by default map cleanly to the ETL process we desire:
- The source node “S3 bucket” is the Extract portion of the ETL process.
- The action node “Transform — ApplyMapping” is the Transform portion of the ETL process.
- The target node “S3 bucket” is the Load portion of the ETL process.
Let’s see how we can do that using the Glue Studio visual editor.
Extract step / the source node
In the Extract step (the source node), all we really need to configure is the data source for our ETL job.
Conveniently, since our needs are bucket-to-bucket, a quick way to get our Extract step done is simply to specify the S3 location and data format of our data. In our case, since we used “collection=<something>” as our primary partition to isolate the IT, Sales, and Customer Support data from each other, our S3 location should point to the “folder” of the collection we are currently targeting. Since our data is all CSV files in our example scenario, we just choose CSV as the data format. We now have a functioning Extract step.
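In the generated script, that plain-S3 version of the source node looks roughly like this (the bucket name is hypothetical, and this assumes the glueContext from the skeleton earlier):

```python
# Source node: read the raw CSV files straight from an S3 prefix.
# "collection=sales" isolates the Sales data, per our partitioning scheme.
raw_sales = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-raw-bucket/collection=sales/"],  # hypothetical bucket name
        "recurse": True,  # walk the year/month/day subfolders too
    },
    format="csv",
    format_options={"withHeader": True},
)
```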
However, remember that we actually made crawlers for our Raw bucket — this means we can actually do something better (and easier) than that.
Instead of our data source being an S3 location, we can choose a Data Catalog table instead, and point it to the correct Data Catalog table. This way, when we extract from source, our ETL job will already have access to metadata, and will know the correct column names. If we used a plain bucket as source, we would still get the same data, but since there wouldn’t be metadata, we would have to go through the trouble of manually changing key names (column names) in the output schema of our Extract step in order for the proper column names to be reflected in the next phase (Transform).
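The catalog-backed version of the source node is even simpler, something like this (the database and table names are hypothetical stand-ins for what we created in Part 3):

```python
# Source node: read through the Data Catalog table instead of a raw S3 path.
# Column names and types come from the catalog's metadata automatically.
raw_sales = glueContext.create_dynamic_frame.from_catalog(
    database="serverless_analytics_db",  # hypothetical catalog database name
    table_name="raw_sales",              # hypothetical name for the Sales table from Part 3
    transformation_ctx="raw_sales_source",
)
```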
Transform step / the action node
In the Transform step (the action node), what we need for now is to define the mapping of source data (the columns from the last step) to target data (i.e., how the data from the source will be formatted in the target bucket). When you click on the Transform node, you’ll see something like this:
Here’s what’s happening in that example I made:
- The source key column contains the column names inherited from the Extract step earlier. I just named them col0 to col5 for simple illustration.
- The target key column contains a textbox where I can rename what the new column names will be. I don’t have to rename them; I could just keep them matching the source keys if I wanted. In this case, I gave them different names for illustration purposes.
- The data type column allows me to specify the appropriate data type for the column.
- The drop column is where I can indicate that a particular column from source is not needed for the transformed target data, by ticking that checkbox. In the example above, I decided that col1 and col2 from the Extract step (i.e., from my raw bucket) aren’t data that I want in my processed bucket, so the Transform step strips them out.
That’s pretty much it for the Transform step — we can rename columns, specify data types, and drop columns outright.
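In script form, that whole Transform step is a single ApplyMapping call, roughly like this (the target names are hypothetical, mirroring the screenshot’s col0 to col5 example):

```python
from awsglue.transforms import ApplyMapping

# Transform node: rename columns and cast types.
# Each mapping is (source key, source type, target key, target type).
# col1 and col2 are simply left out, which is how columns get dropped.
mapped_sales = ApplyMapping.apply(
    frame=raw_sales,
    mappings=[
        ("col0", "string", "customer_id", "string"),
        ("col3", "string", "item", "string"),
        ("col4", "string", "purchase_date", "string"),
        ("col5", "string", "price", "double"),
    ],
    transformation_ctx="mapped_sales",
)
```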
Keys vs columns
A note on the “key” vs “column” terminologies: You might be wondering why the Glue Studio interface (as seen in the screenshot) calls the columns “keys” (e.g., “source key” and “target key” instead of “source column” and “target column”), even though from our Data Catalog these keys are clearly the columns. You might also be wondering why I keep calling them “columns” instead of “keys”.
While the short of it is that you can just safely consider them interchangeable terminologies, the longer answer is that since we are dealing with transforming data into an analytics-friendly form, we are usually dealing with “columnar” data formats instead of “row-based” data formats.
In a traditional database that uses a row-based data format (a “row-store”), data is stored in rows, and the different fields of data are the columns. This is what you see in an introductory database class. If you had 100 sales transactions, and each transaction had 10 fields (customer, item, discount, purchase date, price, etc.), then the data would be stored as 100 rows with 10 columns each (one per field).
In a column-store database (typical for analytics use), it’s the opposite. Since analytics is about fast aggregations, we want to minimize the amount of data we need to read. If you had those same 100 sales transactions with 10 fields each, the data would be stored as 10 rows with 100 items each. Each of the 10 rows corresponds to a field, so the first row would contain all 100 customer values, the second row all 100 item values, the third row all 100 discount values, and so on. It’s stored this way so that when we ask our analytics engine for, say, the total transaction value, all it has to do is read one row — the row that holds all 100 values of the field we need, such as the transaction price — and then immediately aggregate the data. That’s much faster than a typical row-store setup, where the engine would have had to read 100 different rows instead of just 1.
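If it helps, here’s a toy Python sketch of the same idea. It’s obviously not how a real column store lays out bytes on disk, but it captures the difference in how much data a “sum the prices” query has to touch:

```python
# The same three sales transactions, stored two different ways.

# Row-store: one record per transaction; every field travels together.
rows = [
    {"customer": "Alice", "item": "Widget", "price": 10.0},
    {"customer": "Bob", "item": "Gadget", "price": 25.0},
    {"customer": "Carol", "item": "Widget", "price": 10.0},
]
total = sum(row["price"] for row in rows)  # has to touch every record

# Column-store: one entry per field, holding all of that field's values.
columns = {
    "customer": ["Alice", "Bob", "Carol"],
    "item": ["Widget", "Gadget", "Widget"],
    "price": [10.0, 25.0, 10.0],
}
total = sum(columns["price"])  # only the "price" values are read
```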
And that’s why the columns are called “keys” in the ETL jobs — this is so it’s less ambiguous. “Key” means this is what you call the piece of data, regardless of whether your data will actually be formatted as row-stores (in which case the keys are columns) or column stores (in which case each key is actually its own row).
As for me calling them “columns”, that was just because “column” is the terminology that most people who aren’t already analytics experts will understand, and I wanted to explain the basics first in familiar terminology before delving into the nuance of keys and column-stores and row-stores.
Digression over, hopefully that made sense. Let’s move on.
Load step / the target node
In the Load step (the target node), we specify where the transformed data will be stored, and the data format for that data.
Let’s go through what’s happening in the example image above.
First, we specified the data format for our ETL output. In this case, we chose Parquet + Snappy compression. This is the sensible default you should use, unless you specifically need something else. Parquet is a columnar format (friendlier for analytics) and is splittable, so Athena doesn’t have to read the entirety of a giant Parquet file; it can read just the columns and chunks of the file that the query actually needs. Using Snappy compression also helps keep our costs low — it saves space in the bucket (lower S3 storage costs) and reduces Athena scanning costs (Athena charges per GB scanned, so you’ll save a bundle just by having your Parquet files compressed with Snappy).
Then, we specified the target bucket location. In the example above, I just specified the bucket itself, meaning everything will be placed in the root path of the bucket. If I wanted, I could have specified as many levels of folders as needed, which is useful if you had many different ETL jobs working on different types of data, and you want them to dump data into different folders (S3 prefixes).
Next, you’ll see there are options about updating the Data Catalog. I chose the option that is compatible with a pre-existing Data Catalog table — it doesn’t update the schema, but it does update partitions. This is because what I usually do is create a Data Catalog table manually (to have full control of column names and types), and then just have that table’s partitions updated by a crawler or ETL job. I haven’t explicitly mentioned creating a Data Catalog table for our processed data yet, but the steps and reasoning there are pretty much exactly the same as what we went through in Part 3.
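For reference, here’s roughly what that target node looks like in script form, using the documented getSink pattern for writing Parquet while keeping a Data Catalog table’s partitions in sync (the bucket, database, and table names are hypothetical, and the exact options Glue Studio emits depend on which catalog-update choice you picked, so treat this as a sketch):

```python
# Target node: write Parquet to the processed bucket and keep the
# existing Data Catalog table's partitions up to date on every run.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-processed-bucket/sales/",  # hypothetical target location
    enableUpdateCatalog=True,                # sync new partitions into the catalog
    partitionKeys=["year", "month", "day"],
)
sink.setCatalogInfo(
    catalogDatabase="serverless_analytics_db",  # hypothetical names
    catalogTableName="processed_sales",
)
sink.setFormat("glueparquet")  # Glue's Parquet writer; Snappy compression by default
sink.writeFrame(mapped_sales)
```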
Of course, since this ETL job’s final step involves updating the Data Catalog table (you can see we need to specify the proper partitions there), this means we don’t actually need a crawler here anymore. This is what I meant at the end of Part 3 — sometimes we don’t need crawlers, because we can just fold their job into our ETL jobs. I strongly recommend avoiding unnecessary crawlers whenever possible. You have fewer moving parts to worry about, plus you are guaranteed that your ETL job will not just extract, transform, and load data as needed, but will also properly catalog it automatically.
And that’s it! You now have a minimally viable ETL job that will happily convert your CSV files to Parquet+Snappy, save them into the target bucket (the processed bucket in our architecture diagram), and update the relevant Data Catalog table for it. You can also set a schedule for your ETL job — you’ll see a Schedules tab in your Glue Studio Visual Editor interface, and it’s a simple matter of deciding the frequency and timing.
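If you’d rather not click through the console, you can set up the same kind of schedule with a Glue trigger, for example via boto3 (the job and trigger names below are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Run the ETL job every day at 02:00 UTC, after the day's raw data has landed.
glue.create_trigger(
    Name="daily-sales-etl",                           # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",                     # Glue uses the 6-field cron syntax
    Actions=[{"JobName": "sales-raw-to-processed"}],  # hypothetical job name
    StartOnCreation=True,
)
```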
Wrap up
At this point in the series, we covered the basic serverless analytics architecture, how to get data to the Raw bucket for consolidation, how to organize them appropriately into useful partitions, how to catalog the data, and how to transform collected data into a processed form that is directly useful for our analytical queries.
We can finally do Athena queries on both our raw and processed buckets — quite the milestone! You’ve got yourself a nice, fully-working, serverless analytics solution now.
In the next part, we’ll cover some optimization techniques to help you make faster and cheaper queries for that brand new serverless analytics solution of yours!
ICYMI, here are the other parts of the Serverless Analytics series:
- Part 1: The Basic Serverless Analytics Architecture.
- Part 2: Data Collection and Understanding Partitioning.
- Part 3: 80% Faster and 90% Cheaper Queries Through Cataloging and Crawling.
Other articles you might be interested in:
- Since we’re on analytics, are you interested in Redshift? Try Optimizing Data Warehouse Performance and Storage Costs Through Distribution Styles.
- Infinite Scaling w/ DynamoDB — learn how to unleash the full power of DynamoDB for all your serverless applications.
- Basic Serverless Web App Architecture — an architecture that costs nothing! If you are interested in serverless analytics, I hope you are also interested in a fully serverless web app architecture.
UPDATE: All parts of this series are now available!
- Part 1: Cheap and Scalable Terabyte-Level Analytics
- Part 2: Collection Phase and Partitioning
- Part 3: 80% Faster and 90% Cheaper Queries
- Part 5: Query Optimization
If you liked this article, please follow me to receive notifications for new articles, and click the clap button a few times so that the Medium algorithm will know to give you more of this type of content. Thanks!