Serverless Analytics? Start here 💎 AWS Glue, Athena & QuickSight

Data Scientists hate him! Starting your big data pipeline is now easy & cost-effective!

Nathan Malishev
Localz Engineering
6 min read · May 10, 2018


Running your online analytical processing (OLAP) is now super easy to do on the AWS platform. You’ll be able to analyse trends over your data set, find anomalies or just run regular reports. I’ll show you how to get started with your AWS ALB logs in under 5 minutes.

Arsenal

AWS Glue is an ecosystem of tools that lets you easily crawl and transform your raw data sets into queryable metadata tables. AWS describes it as a ‘fully managed ETL service’.

AWS Athena is an interactive query service built on top of Presto (originally developed at Facebook). It lets you query structured or unstructured data straight out of S3 buckets. It’s massively scalable and serverless, which makes it drama-free to use. It’s a great tool for exploring datasets & running ad hoc reports.

AWS QuickSight is an interactive dashboarding tool that makes it super easy & fun to query your data. Something for all the data nerds to geek out over!

These AWS services are the fastest way to get up and running with online analytical processing. AWS’s other OLAP offerings, like Redshift and Redshift Spectrum, are designed for use after data exploration and useful data extraction have already happened, when you know which queries you’ll run consistently; because they don’t run serverless, reports are generated a lot faster, but at a higher cost. There’s also Kinesis Analytics, which is used for real-time data analysis and aggregation.

Can you have too much big data?

Flow

Despite the complex-looking diagram below, the flow is relatively simple! At its core it has 3 components: data storage in S3, data transformation in Glue, and data visualisation via Athena & QuickSight.

S3 → Glue → Athena

Let’s get started

1. First we need to generate our data set. In our case we’re going to use the AWS-generated ALB logs. Enable this in the EC2 console: under Load Balancers, select your load balancer & hit ‘Edit attributes’.

Edit attributes and select a bucket to output your logs
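If you’d rather script this step than click through the console, the same attribute change can be made with boto3. This is just a sketch; the load balancer ARN, bucket name and prefix below are placeholders for your own resources.

```python
# Sketch: enable ALB access logging to S3 with boto3.
# The ARN, bucket and prefix are placeholders, not real resources.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-alb/abc123",
    Attributes=[
        {"Key": "access_logs.s3.enabled", "Value": "true"},
        {"Key": "access_logs.s3.bucket", "Value": "my-alb-logs-bucket"},
        {"Key": "access_logs.s3.prefix", "Value": "alb"},
    ],
)
```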

2. Now that we have the data, we need to structure it enough that Athena can query it. This is where Glue comes in. First, it provides data crawlers that use built-in and/or custom classifiers to try to parse your logs. The crawler traverses your specified S3 files and groups things by classifier into metadata tables in AWS Glue. These tables are later used by Athena for querying.

First we must add a custom classifier in Glue, as strangely ALB logs are not recognised out of the box. You can use mine below. I wrote the classifier in grok, which at first seems overwhelming but really it’s just regex. This grok debugger helped out a lot.

My Grok Pattern — View the source to get the ‘Custom Patterns’
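For reference, a classifier can also be registered programmatically. The snippet below is deliberately cut down: its grok pattern only covers the first few ALB fields, so treat it as an illustration of the mechanism rather than a replacement for the full pattern in the gist above.

```python
# Illustrative sketch: register a grok classifier for ALB logs with boto3.
# The pattern only parses the first few columns of an ALB log line; a real
# classifier needs an entry for every field in the log format.
import boto3

glue = boto3.client("glue")

glue.create_classifier(
    GrokClassifier={
        "Name": "alb-logs",            # hypothetical classifier name
        "Classification": "alb-logs",  # classification tag applied to matched data
        "GrokPattern": (
            "%{DATA:type} %{TIMESTAMP_ISO8601:time} %{DATA:elb} "
            "%{IP:client_ip}:%{INT:client_port} %{GREEDYDATA:rest}"
        ),
    }
)
```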

Next create a crawler using the classifier, and set the input source to the root directory of the ALB logs we enabled before. This will parse all the log files under the given key. After you run the crawler, a table is created which stores all the metadata about the ALB logs and lets you query them.
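The console is the quickest route, but for completeness here’s roughly what the same crawler looks like created via boto3; the crawler name, IAM role, database and S3 path are all placeholders.

```python
# Sketch: create and run a Glue crawler over the ALB log prefix.
# Name, Role, DatabaseName and the S3 path are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="alb-logs-crawler",
    Role="AWSGlueServiceRole-alb",      # IAM role Glue assumes
    DatabaseName="alb_logs",            # Glue Data Catalog database
    Classifiers=["alb-logs"],           # the custom classifier from above
    Targets={"S3Targets": [{"Path": "s3://my-alb-logs-bucket/alb/"}]},
)

glue.start_crawler(Name="alb-logs-crawler")
```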

At this point you could actually query your data, but best practice is to run an ETL (Extract, Transform & Load) job to transform your data so it’s more efficient to query later on. Running an ETL and another crawler will of course cost more money, but as you will see later on, the savings when it comes to querying are well worth it.

Savings with Parquet vs CSV - Credit: https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and

3. The data format chosen is Parquet. Parquet has been selected as it’s columnar, reducing the amount of data that needs to be scanned on queries.

We transform our data set by using a Glue ETL job. Select the previously crawled ALB table as your input, and select a new, empty output directory. Edit the schema and be sure to fix any values, like adding the correct data types. This creates a Python ETL script which you can edit to fetch other keys from remote sources or clean data if you need.

If you’re transforming lots of data you can also use partition keys to output the data under a hierarchical structure, so it’s more efficient to query later.

Parquet has been selected as it’s columnar and, on large data sets, reduces both the data scanned and the query time.

In your ETL you can fix up your schema!
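The generated script is plain PySpark built on the awsglue libraries. Below is a stripped-down sketch of what it ends up looking like; the database, table, column names, partition key and output path are all placeholders that depend on your own crawler and classifier.

```python
# Sketch of a Glue ETL job: read the crawled ALB table, fix up the schema,
# write it back out as partitioned Parquet. Names are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created.
source = glueContext.create_dynamic_frame.from_catalog(
    database="alb_logs", table_name="alb"
)

# Fix up the schema: keep the fields we care about and cast to proper types.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("time", "string", "time", "string"),
        ("elb_status_code", "string", "elb_status_code", "int"),
        ("request", "string", "request", "string"),
    ],
)

# Write out as Parquet; the partition key here is illustrative only.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={
        "path": "s3://my-alb-logs-parquet/",
        "partitionKeys": ["elb_status_code"],
    },
    format="parquet",
)

job.commit()
```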

4. Okay, all done! This is where the fun begins: we get to start querying the data. In Athena you can use the query editor to start writing your SQL queries.

You can fly with Athena, if your SQL skills are up to scratch

You can run any SQL queries you have the skill to write! Great for developers and BAs alike.

A typical set of results
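Queries don’t have to go through the console either. As a sketch, here’s the same sort of query fired off with boto3; the database, table, column names and results bucket are placeholders carried over from the earlier steps.

```python
# Sketch: run an ad hoc Athena query over the Parquet table with boto3.
# Database, table, column names and the results bucket are placeholders.
import time
import boto3

athena = boto3.client("athena")

QUERY = """
    SELECT elb_status_code, count(*) AS requests
    FROM alb_parquet
    WHERE elb_status_code >= 500
    GROUP BY elb_status_code
    ORDER BY requests DESC
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "alb_logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then pull the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```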

5. The problem is I suck at SQL, and that is probably the most complicated query I can write on my own. This is where QuickSight comes in: it’s easy to use and does a great job of visualisation. Under the hood it uses Athena to run its queries, and you can keep track of them all in Athena’s history.

Doesn’t everyone love a good graph

Using QuickSight you can easily analyse the transformed data. The dashboard is drag-and-drop, with your typical x and y columns as well as a group-by feature. It’s super intuitive and satisfies my inner data nerd.

You can get some pretty awesome metrics out of QuickSight with a drag and drop!

QuickSight is an awesome tool: you can share dashboards, create custom views, set global filters & run your ad hoc queries!

At Localz we have found QuickSight a great way to uncover any abnormal trends — be sure to check it out.

Caveats

Athena Pricing

Athena is priced per query by the amount of data scanned, currently $5 per TB.

Below are two identical queries: the first run over a JSON catalog and the second over a Parquet catalog.

JSON vs Parquet. Take note of data scanned

Interestingly, I went a step further and created two identical dashboards with 7 pretty graphs, and ran them against the JSON & Parquet data sets. The JSON version scanned 1.6 GB of data, essentially the whole data set, on each query, while the Parquet version only scanned 44 MB. Our basic conversion from JSON to Parquet has cut querying costs by more than 35 times.

Our basic conversion to Parquet
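A quick back-of-the-envelope check of that claim, assuming Athena’s $5 per TB scanned pricing and the scan sizes above:

```python
# Rough cost comparison per query, assuming $5 per TB scanned.
PRICE_PER_TB = 5.0

def query_cost(bytes_scanned: float) -> float:
    return PRICE_PER_TB * bytes_scanned / 1_000_000_000_000

json_cost = query_cost(1.6e9)    # ~1.6 GB scanned per query
parquet_cost = query_cost(44e6)  # ~44 MB scanned per query

print(f"{json_cost / parquet_cost:.0f}x cheaper")  # roughly 36x
```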

Despite the extra crawler & ETL job costing an extra $0.20 (in my case), it pays for itself after 1 or 2 queries.

Running these data pipelines isn’t super cheap, but considering you can set one up in 5–10 minutes with little to no expertise, it pays dividends!

Thanks for reading! Be sure to give it a clap if you enjoyed it!

Be sure to check out Localz engineering! And if you’re interested in an exciting and fast-paced place to work, check out Jobs @ Localz.
