Data Engineering of Yelp Data Set using AWS Analytics

Durga Gadiraju
Data Engineering on Cloud
9 min read · Apr 1, 2022


As part of this article, let us see how one can perform Data Engineering on the Yelp Data Set using AWS Analytics Services such as s3, Glue Catalog, Athena, etc.

Data Engineering and Data Science using AWS Services

In this article, we will primarily demonstrate the following.

  • Ingestion of files into AWS s3 using boto3 (a Python-based library to manage AWS services and their components).
  • Creation of a Glue Crawler to crawl through all the folders and files in s3 and build a Glue Catalog Database and tables. A table will be created for each folder.
  • Running ad hoc queries using AWS Athena leveraging the Glue Catalog tables.
  • How Data Engineering Pipelines are typically built using AWS Analytics Services such as AWS EMR.

This article is part of Data Engineering on Cloud Medium Publication co-managed by ITVersity Inc (Training and Staffing) and Analytiqs Inc (Data Services Boutique)

Pre-requisites

Here are the pre-requisites for setting up the environment to analyze the Yelp Data Set using AWS Analytics Services.

Overview of boto3 to manage AWS Services

Let us get an overview of boto3 which is a Python SDK to manage AWS Services.

  • AWS exposes REST APIs to manage AWS Services.
  • Over time, AWS has exposed SDKs for popular programming languages such as Java, Python, Scala, JavaScript, etc.
  • The Python-based SDK is popularly known as boto3.

Here are some of the common use cases for boto3.

  • Manage AWS Services such as EC2, s3, etc.
  • Uploading or writing files to s3.
  • Downloading or reading files from s3.
  • Manage Glue Crawlers, Databases and Tables.
  • Interact with AWS Managed Databases such as DynamoDB.

For this demo, we will primarily use boto3 to upload files to s3, with Python as the programming language. We can also install the required packages and leverage Pandas APIs to read files from s3 as well as write files to s3.
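For example, here is a minimal sketch of uploading a local file with boto3 and reading it back with Pandas. The local path and object key are placeholders, reading from s3 with Pandas assumes the s3fs package is installed, and for brevity the file is read in one go; the larger files are handled in chunks in the next section.

import boto3
import pandas as pd

s3_client = boto3.client('s3')

# Upload a local JSON file to s3 (bucket and key are placeholders)
s3_client.upload_file(
    '/data/yelp-dataset-json/yelp_academic_dataset_tip/yelp_academic_dataset_tip.json',
    'itvyelpws',
    'yelp-dataset-json/yelp_academic_dataset_tip/yelp_academic_dataset_tip.json'
)

# Read the uploaded file back using Pandas (requires s3fs)
df = pd.read_json(
    's3://itvyelpws/yelp-dataset-json/yelp_academic_dataset_tip/yelp_academic_dataset_tip.json',
    lines=True
)
print(df.count())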

Upload files to s3 using boto3

Let us understand how to upload files to s3 using boto3. We will upload all the files related to Yelp into a folder called yelp-dataset-json. You can pick a bucket of your choice under your AWS Account.

  • First, you need to make sure that the files are downloaded from the appropriate source (Kaggle).
  • Once the files are downloaded, we can upload them using the AWS CLI or boto3.
  • Here is the function which writes a DataFrame holding a subset of the data from a JSON file to s3. The data will be compressed using gzip.
import uuid

def write_to_s3(file, df, s3_client):
    # Derive the target folder from the source file's parent directory
    dir_path = file.split('/')[-2]
    # Generate a unique suffix so that chunks do not overwrite each other
    file_name_suffix = str(uuid.uuid1())
    # Pandas writes directly to s3 (requires s3fs); the .gz extension
    # triggers gzip compression
    df.to_json(
        f's3://itvyelpws/yelp-dataset-json/{dir_path}/part-{file_name_suffix}.gz',
        orient='records',
        lines=True
    )
  • Here is the logic which reads all the JSON files from the given folder using Pandas. As the files are quite big, we read 100,000 records at a time and write them to s3 as JSON. For each chunk of 100,000 records, we invoke the write_to_s3 function.
import glob

import boto3
import pandas as pd

s3_client = boto3.client('s3')

# Pick up all the locally downloaded Yelp JSON files
files = glob.glob('/data/yelp-dataset-json/*/*.json', recursive=True)

for file in files:
    # Read each large file in chunks of 100,000 records
    json_reader = pd.read_json(file, lines=True, chunksize=100000)
    for chunk_id, df in enumerate(json_reader):
        print(f'Processing chunk {chunk_id} in {file}')
        write_to_s3(file, df, s3_client)

Data Ingestion can take some time to load into the Data Lake (s3 in this case). We can implement DIY parallelism by spawning multiple processes to write to s3, as shown in the sketch below. It is covered in more detail as part of Data Ingestion into s3 using Python Boto3, Pandas and Multiprocessing modules.
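Here is a minimal sketch of that idea using the multiprocessing module. The pool size and the per-process s3 client are assumptions; refer to the article mentioned above for the complete implementation.

from multiprocessing import Pool
import glob

import boto3
import pandas as pd

def process_file(file):
    # Each worker process creates its own s3 client
    s3_client = boto3.client('s3')
    # Read the file in chunks of 100,000 records and upload each chunk
    json_reader = pd.read_json(file, lines=True, chunksize=100000)
    for chunk_id, df in enumerate(json_reader):
        print(f'Processing chunk {chunk_id} in {file}')
        write_to_s3(file, df, s3_client)

if __name__ == '__main__':
    files = glob.glob('/data/yelp-dataset-json/*/*.json', recursive=True)
    # Process up to 4 files in parallel (pool size is an assumption)
    with Pool(4) as pool:
        pool.map(process_file, files)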

  • Once the data is uploaded to s3 in the form of compressed JSON files, run this command to validate.
aws s3 ls \
s3://itvyelpws/yelp-dataset-json/yelp_academic_dataset_business/
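The same validation can also be done from Python using boto3. Here is a minimal sketch that lists the uploaded objects for one of the folders.

import boto3

s3_client = boto3.client('s3')

# List the objects uploaded under the business folder
response = s3_client.list_objects_v2(
    Bucket='itvyelpws',
    Prefix='yelp-dataset-json/yelp_academic_dataset_business/'
)
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])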

Overview of Glue

AWS Glue is a fully managed service from AWS which has several Data Engineering components.

  • Glue Crawler — It is used to crawl the data and quickly create tables which can be used by other AWS Analytics services such as Athena, EMR, Glue Workflows, etc.
  • Glue Catalog — The metastore of Databases and Tables. It is similar to Hive or Spark Metastore.
  • Glue Jobs — Data Engineering Pipelines using wrappers on top of Apache Spark.
  • Glue Studio — A GUI based interface to work with AWS Glue Components.

Here are the steps to get started with Glue.

  • Login into AWS
  • Search for Glue and click on Glue
  • Here is the Glue Console page.

As part of this article, we will primarily focus on Glue Crawler and Glue Catalog Tables. The Glue Catalog tables that will be created as part of this article will be used via Athena for ad-hoc analysis as well as ML Model Development.

Create Glue Catalog Tables using Glue Crawler

Let us understand how to create Glue Catalog Tables using a Glue Crawler from the AWS Web Console.

  • Login into AWS Web Console.
  • Search for Glue and get into Glue Console.
  • Create a Crawler pointing to the yelp-dataset-json location in s3. The Glue Crawler will crawl all the subfolders and automatically create a table for each folder.

You will be redirected to Review all steps. Make sure to review all the steps and then click on Finish.

Now that the crawler is created, we can run it to create all the tables in the specified Database. We can also run it later by selecting the appropriate crawler and clicking on Run crawler. If you prefer to automate this step, see the sketch below.
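For those who prefer to automate the crawler setup rather than use the Web Console, here is a minimal sketch using boto3. The IAM role name is a placeholder; it needs permissions for Glue and for the s3 bucket.

import boto3

glue_client = boto3.client('glue')

# Create a crawler pointing to the yelp-dataset-json folder in s3
glue_client.create_crawler(
    Name='Yelp Dataset JSON Crawler',
    Role='GlueCrawlerRole',  # placeholder IAM role
    DatabaseName='yelp_json_db',
    Targets={'S3Targets': [{'Path': 's3://itvyelpws/yelp-dataset-json/'}]}
)

# Run the crawler to create the Glue Catalog tables
glue_client.start_crawler(Name='Yelp Dataset JSON Crawler')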

  • The structure of each table will be determined based on the JSON-formatted data in the files.
  • Here are the details of the database and the tables created by the Glue Crawler named Yelp Dataset JSON Crawler.
Database Name: yelp_json_db
Tables: yelp_academic_dataset_business, yelp_academic_dataset_checkin, yelp_academic_dataset_review, yelp_academic_dataset_tip, yelp_academic_dataset_user
Glue Catalog Tables
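These details can also be verified programmatically. Here is a minimal sketch that lists the tables in the database using boto3.

import boto3

glue_client = boto3.client('glue')

# List the tables created by the crawler in the yelp_json_db database
response = glue_client.get_tables(DatabaseName='yelp_json_db')
for table in response['TableList']:
    print(table['Name'])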

Overview of Serverless SQL using Athena

AWS Athena is a fully managed service to run SQL queries at scale for ad-hoc analysis, data processing as well as visualization.

Here are some of the capabilities of AWS Athena.

  • Run ad-hoc queries using AWS Athena Editor as part of AWS Management Console.
  • Build data engineering pipelines using programming languages like Python, Shell Scripts, etc.
  • Connect visualization tools such as Tableau, Quicksight, etc. for enterprise reporting.
  • Develop Machine Learning Models using ML Tools such as AWS Sagemaker, Data Wrangler, etc.

Athena lets us run queries against structured data via Glue Catalog Tables.

Getting Started with AWS Athena

Here is how you can get started with Athena to run queries at scale using ANSI SQL syntax. The query syntax is based on Presto, while DDL statements follow Hive QL.

  • Login to AWS Management Console
  • Search for Athena and click on the service to go to Athena.
  • Click on Explore the query editor
  • First-time users need to click on View settings to take care of configurations such as the location to save query results.
  • The s3 location can be specified by clicking on Manage under Settings.
  • Specify the relevant s3 location and, optionally, the owner of the s3 bucket, then click on Save to configure Athena. Once done, we should be able to run Athena queries.

An s3 prefix is nothing but a folder if it ends with /, or a prefix to a folder name otherwise.

Ad-hoc Data Analysis using Athena

Once the Glue Catalog tables are created, we can leverage Athena for ad-hoc analysis. Athena is a serverless SQL engine and resources are scaled up and down based on the amount of data to be processed for a given query.

Here are some of the queries which are used to analyze the data.

How many reviews do we have in the data set?

SELECT count(*)
FROM yelp_json_db.yelp_academic_dataset_review

How many unique businesses were reviewed, and what is the unique count of users or reviewers?

SELECT count(distinct business_id) AS business_count, 
count(distinct user_id) AS user_count
FROM yelp_json_db.yelp_academic_dataset_review

Get the number of reviews and the average star rating for each business.

SELECT business_id, count(*) as review_count, 
round(avg(stars), 2) as avg_stars
FROM yelp_json_db.yelp_academic_dataset_review
GROUP BY business_id
LIMIT 10

Get the number of reviews for each star rating.

SELECT stars, 
count(*) AS star_count
FROM yelp_json_db.yelp_academic_dataset_review
GROUP BY stars
LIMIT 10
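These queries can also be submitted programmatically, which is how Athena is typically used when building pipelines. Here is a minimal sketch using boto3; the output location is a placeholder and should match the s3 location configured in the Athena settings.

import time

import boto3

athena_client = boto3.client('athena')

# Submit a query against the Glue Catalog table
execution = athena_client.start_query_execution(
    QueryString='SELECT count(*) FROM yelp_academic_dataset_review',
    QueryExecutionContext={'Database': 'yelp_json_db'},
    ResultConfiguration={'OutputLocation': 's3://itvyelpws/athena-results/'}
)
query_execution_id = execution['QueryExecutionId']

# Poll until the query finishes
while True:
    state = athena_client.get_query_execution(
        QueryExecutionId=query_execution_id
    )['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Fetch and print the result rows
results = athena_client.get_query_results(QueryExecutionId=query_execution_id)
for row in results['ResultSet']['Rows']:
    print([col.get('VarCharValue') for col in row['Data']])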

Data Engineering Pipelines using AWS

There are multiple ways to build Data Engineering Pipelines using AWS Services.

  • Glue Workflows
  • Amazon Elastic Map Reduce (EMR)
  • Lambda Functions

Let us get into a few details related to Data Engineering Pipelines using AWS Elastic MapReduce (EMR).

  • EMR is a fully managed service from AWS.
  • It has all the Big Data services such as HDFS, YARN, Spark, Hive, Hue, etc. installed and pre-configured.

Here are some of the key features of EMR.

  • APIs to manage clusters — We can integrate with any scheduling tool using APIs to provision on-demand clusters when needed.
  • Well integrated with other AWS Services such as Glue Catalog, Lambda Functions, Step Functions, etc.
  • Ability to scale up and scale down based on the amount of data being processed.

Once the data is analyzed, we can develop Data Engineering Pipelines based on the needs and deploy them on EMR. The jobs can be scheduled either using AWS native services such as EventBridge or using external tools such as Airflow. A minimal sketch of provisioning a transient EMR cluster from Python is shown below.
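As an illustration, here is a minimal sketch of provisioning a transient EMR cluster with a single Spark step using boto3. The release label, instance types, script location, and IAM role names are assumptions and need to be adjusted for your account.

import boto3

emr_client = boto3.client('emr')

# Provision a transient cluster that runs one Spark job and then terminates
response = emr_client.run_job_flow(
    Name='yelp-data-engineering',
    ReleaseLabel='emr-6.5.0',
    Applications=[{'Name': 'Spark'}],
    Instances={
        'InstanceGroups': [
            {'InstanceRole': 'MASTER', 'InstanceType': 'm5.xlarge', 'InstanceCount': 1},
            {'InstanceRole': 'CORE', 'InstanceType': 'm5.xlarge', 'InstanceCount': 2},
        ],
        'KeepJobFlowAliveWhenNoSteps': False,
    },
    Steps=[{
        'Name': 'process-yelp-data',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', 's3://itvyelpws/scripts/process_yelp_data.py'],
        },
    }],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])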

Conclusion

As part of this article, we were able to achieve the following:

  • Ingest data into AWS s3 leveraging boto3.
  • Create Glue Catalog Databases and Tables using a Glue Crawler.
  • Perform ad-hoc analysis of data using SQL queries leveraging AWS Athena.
