Data Engineering Project using AWS Lambda, Glue, Athena and QuickSight

Ishaan Rawat
14 min read · Sep 1, 2023


In this project, we aim to securely manage, streamline, and analyze structured and semi-structured YouTube video data based on video categories and trending metrics.

Architecture :

As can be seen in the architecture diagram, we first load the dataset into an AWS S3 bucket. To process the data properly, we need to apply certain transformations, which we do using Glue ETL and AWS Lambda. Once we have the cleaned and transformed data, we store it in another bucket. Finally, we join the two tables resulting from the above steps to create the final table, and visualize the result using AWS QuickSight.

Project Goals :

  1. Data Ingestion — Building a mechanism to ingest data from different sources.
  2. ETL System — Transforming raw data into a proper structured format.
  3. Data lake — We will be getting data from multiple sources so we need a centralized repo to store them.
  4. Scalability — As the size of our data increases, we need to make sure our system scales with it.
  5. Cloud — We can’t process vast amounts of data on a local computer, so we need to use the cloud; in this case, AWS.
  6. Reporting — Build a dashboard to visualize the processed data and answer questions about trending videos.

Technologies Used :

  1. Amazon S3: Amazon S3 is an object storage service that provides industry-leading scalability, data availability, security, and performance.
  2. AWS IAM: AWS Identity and Access Management (IAM) is a service that manages user identities and their access to AWS resources.
  3. QuickSight: Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud.
  4. AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
  5. AWS Lambda: Lambda is a computing service that allows programmers to run code without creating or managing servers.
  6. AWS Athena: Athena is an interactive query service that lets you analyze data directly in S3 using standard SQL; there is no need to load the data anywhere, it stays in S3.

Dataset Used

This Kaggle dataset contains statistics ( CSV files ) on daily trending YouTube videos collected over many months. It includes up to 200 trending videos per day for several regions, with the data for each region in its own file. The data includes the video title, channel title, publication time, tags, views, likes and dislikes, description, and comment count. It also includes a category_id field, which varies between regions; the categories themselves are defined in the JSON file associated with each region.

https://www.kaggle.com/datasets/datasnaek/youtube-new
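To get a feel for how the files fit together, here is a small local sketch ( pandas assumed installed; filenames and columns follow the Kaggle dataset ) that joins one region's CSV with its category JSON:

import json

import pandas as pd

# Trending-video statistics for one region ( the US file from the dataset )
videos = pd.read_csv("USvideos.csv")

# The matching JSON file maps each category_id to a human-readable category name
with open("US_category_id.json") as f:
    categories = pd.json_normalize(json.load(f)["items"])

# category_id in the CSV is numeric while id in the JSON is a string,
# so we cast before joining ( the same mismatch resurfaces later in Athena )
merged = videos.merge(
    categories[["id", "snippet.title"]].astype({"id": "int64"}),
    left_on="category_id",
    right_on="id",
    how="left",
)

print(merged[["title", "snippet.title", "views"]].head())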

Project :

Let's move on to implementing the project.

Step 1: Creating an Admin IAM user ( to separate from root )

  1. On the AWS Console, log in to your root account.
  2. Navigate to the AWS IAM page and click on Users and then create a new user.
  3. On the Create New User page choose the option to get console access as well and set up login credentials accordingly.
  4. On the next window, attach the AdministratorAccess policy to the user being created.

Once this is done we would be prompted to review and create the user.

Once the user is created, log out of the root account and log in as the user created with the credentials from the above steps. We then create an access key pair for this user and use it to log in from the AWS CLI.

Step 2: Setting up AWS CLI

Download the AWS CLI onto your local system and install it. Once the AWS CLI is installed, run the following command:

aws configure

Enter the access key pair as prompted; this authenticates the CLI as the user we created in the previous step.

Step 3: Uploading the data onto AWS S3

First, we would create a new S3 bucket to hold this raw data.

Now, in the terminal, navigate to the dataset folder and run the following commands:

aws s3 cp . s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics_reference_data/ --recursive --exclude "*" --include "*.json"

aws s3 cp CAvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=ca/
aws s3 cp DEvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=de/
aws s3 cp FRvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=fr/
aws s3 cp GBvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=gb/
aws s3 cp INvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=in/
aws s3 cp JPvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=jp/
aws s3 cp KRvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=kr/
aws s3 cp MXvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=mx/
aws s3 cp RUvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=ru/
aws s3 cp USvideos.csv s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics/region=us/

This uploads the dataset we downloaded from Kaggle to the S3 bucket we created for the raw data.

Transforming the .json data to parquet format

Step 4: Creating a Schema based on the data ( .json files ) using AWS Glue Crawler

For the first step, we will crawl over the folder containing .json files.

To create a Crawler :

  1. Go to AWS Glue and choose Crawlers from the sidebar menu.
  2. Click on the Create Crawler button and follow the steps of the crawler creation.
  3. Specify the data source to crawl when prompted. In our case, this is the folder containing the .json files in the raw-data bucket we created earlier.

Again, since Glue has to access S3, we need to set permissions for these resources. To do this, we create an IAM role for Glue and give it admin access ( for our convenience ).

4. Next, we create an output database, i.e. the database that stores the tables produced by the crawler run.

Once all these steps are completed, the AWS Glue crawler is created and appears in the list of available crawlers.

Once the crawler is created, we run it to generate the schema of the data store it crawls over as a table.

The resulting schema can be viewed by clicking on the Tables tab of AWS Glue and choosing the table created by the crawler run.
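For reference, the same crawler setup can also be scripted with boto3; here is a minimal sketch, in which the crawler name, role ARN, and database name are placeholders you would replace with your own:

import boto3

glue = boto3.client("glue")

# Create a crawler pointing at the raw .json reference data
glue.create_crawler(
    Name="youtube-raw-json-crawler",
    Role="arn:aws:iam::<ACCOUNT_ID>:role/<YOUR_GLUE_ROLE>",
    DatabaseName="db_youtube_raw",
    Targets={
        "S3Targets": [
            {"Path": "s3://[ YOUR BUCKET NAME ]/youtube/raw_statistics_reference_data/"}
        ]
    },
)

# Run it; the resulting table appears under the database specified above
glue.start_crawler(Name="youtube-raw-json-crawler")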

Step 5: Querying the table created using AWS Athena

Open the AWS Athena editor, choose the database ( created in the above step ) and the table, and write any SQL query you want to run on the data to get results.

There might be an error saying that no query result location is configured. To solve it, go to the Settings tab in the Athena editor and click Manage. Create an extra S3 bucket to store the temporary query results, as prompted by the page.

Running a query on this table, however, results in the error described below :

This error occurs because the items column of the data is an array of nested records, which the Glue tools are unable to read in this form. To rectify the error, we will convert this .json data into Parquet format using an AWS service called Lambda.

Step 6: Transforming the data using AWS Lambda

AWS Lambda is a serverless computing service that allows you to run code in response to events without provisioning or managing servers.

We will create a function in AWS Lambda. This can be done easily by following the Create function page; specify Python as the runtime. We also have to create an IAM role that gives the Lambda function access to S3 and Glue. Once this is done, paste the Python code from lambda_function.py in the project repository into the code tab of the Lambda console.

The code uses the pandas function json_normalize to flatten the nested JSON array into a tabular ( columnar ) form. The next lines then write this dataframe as a Parquet file to the bucket and folder specified by the environment variables.
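For reference, here is a minimal sketch of what such a handler can look like. It assumes the AWS SDK for pandas layer ( added later in this step ) and environment variable names similar to the repository's; treat it as an outline rather than the exact repository code.

import os
import urllib.parse

import awswrangler as wr
import pandas as pd

# Environment variables set on the Lambda function ( names are assumptions )
S3_CLEANSED_LAYER = os.environ["s3_cleansed_layer"]
GLUE_DB = os.environ["glue_catalog_db_name"]
GLUE_TABLE = os.environ["glue_catalog_table_name"]
WRITE_MODE = os.environ["write_data_operation"]  # e.g. "append"


def lambda_handler(event, context):
    # Bucket and key of the .json object that triggered the function ( or was passed in the test event )
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    # Read the raw reference JSON and flatten the nested "items" array into columns
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_flat = pd.json_normalize(df_raw["items"])

    # Write the flattened frame as Parquet to the cleansed bucket and
    # register/update the table in the Glue Data Catalog
    return wr.s3.to_parquet(
        df=df_flat,
        path=S3_CLEANSED_LAYER,
        dataset=True,
        database=GLUE_DB,
        table=GLUE_TABLE,
        mode=WRITE_MODE,
    )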

The values of these environment variables can be set by going to the Configuration tab and then to Environment variables.

The s3_cleansed_layer environment variable specifies the bucket in which the cleaned data will be stored. Create a new S3 bucket for this and copy its URI into the variable.

Once this is done, we test the Lambda function. Press the Test button and create a test event, using the s3-put template in the create-test window and specifying the bucket and the key in the test JSON provided by the template.
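The s3-put template generates a full S3 event; trimmed down to the fields the handler actually reads, it looks roughly like this ( bucket and key values are placeholders ):

# Trimmed s3-put style test event ( only the fields the handler reads )
test_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "[ YOUR RAW BUCKET NAME ]"},
                "object": {"key": "youtube/raw_statistics_reference_data/US_category_id.json"},
            }
        }
    ]
}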

Finally, run the test.

Certain errors that we would get:

  1. The first error that appears is the absence of certain Python packages that our code uses. To fix this we use Layers: add a layer to the Lambda function from the Layers pane of the Lambda console, choose the AWSSDKPandas-Python38 layer, and run the test again.
  2. The next error would be a timeout error. We will have to increase the timeout duration for our code. This can be done by adjusting the settings in the Configuration tab of the Lambda console.

  3. If you get any access-denied error, add the corresponding permissions to the IAM role we created for this particular Lambda function.

  4. Finally, we would get a db_youtube_cleaned not found error. To remove this error, go to the AWS Glue catalog, open the Databases tab, and create the database with the required options.

Once we follow the above steps and run the test, we get a SUCCESS response and a Parquet file is created in the bucket we created for cleaned data. A table schema is also generated under db_youtube_cleaned ( the database we created to rectify the above errors ).

Hence, we have successfully converted the nested JSON data, which the Glue tools could not read properly, into a columnar Parquet format that is easily readable.

Step 7: Create a Schema based on the data ( .csv files ) using AWS Glue Crawler. Follow the same steps as were followed for creating the crawler for the .json files.

Once this crawler is created, running it creates a table.

Step 8: Querying the table created using Athena.

However, trying a join operation between the two tables created by the above steps gives an error.

SELECT * FROM "db_youtube_cleaned"."raw_statistics" a
INNER JOIN "db_youtube_cleaned"."cleaned_statistics_reference_data" b ON a.category_id=b.id

To fix this, we need the datatypes of the join columns in the two tables to match; we will transform the data to align the schemas.

This could be solved by changing the query as follows :

SELECT * FROM "db_youtube_cleaned"."raw_statistics" a
INNER JOIN "db_youtube_cleaned"."cleaned_statistics_reference_data" b
ON a.category_id = CAST(b.id AS int)

However, performing this cast operation on a huge amount of data on every query is very expensive, so transforming the data itself is the better option.

We would do it by changing the schema of the table using the Glue Console.

On the Glue Console under the Tables option, select the cleaned_statistics_reference_data table and check its schema. Change the datatype of the column named id to bigint and save.

However, running the join query without the cast operation still leads to an error.

This occurs because Parquet files store schema metadata along with the actual data. Changing the schema in the Glue catalog does not change the schema embedded in those files, which causes a mismatch.
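You can see this mismatch for yourself by inspecting the schema embedded in a Parquet file, for example with pyarrow ( a local sketch; the filename is a placeholder for a file downloaded from the cleaned bucket ):

import pyarrow.parquet as pq

# Schema stored inside the Parquet file itself, independent of the Glue catalog;
# the id field will still show its original type until the file is regenerated
schema = pq.read_schema("cleaned_reference_data_part_0.parquet")
print(schema)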

To fix this, delete the Parquet file that was created by the earlier test of the Lambda function and then run the test event again. Once it runs, a new Parquet file is created in the cleaned bucket.

On running the Athena query again, we get no errors and a JOIN operation is possible.

Transforming the .csv data to parquet format:

Step 9: Using Glue ETL to transform data

  1. On the Glue Console, choose the ETL Jobs option from the sidebar.
  2. Choose the Spark Script Editor option and create a job.
  3. On the page that follows go to the Job details tab and fill in the IAM role for S3 and Glue access.
  4. Choose the Type of Job as Spark using Python as the programming language.
  5. Once these settings are saved, go to the code tab and copy the code in pyspark_function.py.

The source of the data is the raw_statistics table that we created using the Glue crawler on the .csv data; we specify this on line 26 of the script.

Specify the destination of the cleaned data on line 51, in the following format :

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=df_final_output,
    connection_type="s3",
    connection_options={
        "path": "s3://[ NAME OF THE S3 BUCKET YOU CREATED FOR CLEAN DATA ]/youtube/raw_statistics/",
        "partitionKeys": ["region"],
    },
    format="parquet",
    transformation_ctx="datasink4",
)

The above can also be done using the visual mode. In that alternative, a visual GUI is used to create a DAG specifying the extraction, transformation, and loading steps as nodes. Glue then generates the PySpark script based on the options you choose, instead of you writing the script explicitly.

For a better understanding of the PySpark code, read about the DynamicFrame class and the various arguments provided.

We keep only the data from the regions where the titles are in English ( i.e. Canada, the UK, and the US ). This is done to avoid errors caused by special characters in the other datasets. We do this by specifying a pushdown predicate while loading the data.
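Here is a minimal sketch of how such a pushdown predicate looks in a Glue script ( the database and table names are assumptions based on the crawlers created earlier ):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Only the partitions matching the predicate are ever read from S3
predicate = "region in ('ca', 'gb', 'us')"

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",  # assumed name of the raw database
    table_name="raw_statistics",
    transformation_ctx="datasource0",
    push_down_predicate=predicate,
)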

Once all this is done, run the spark job created.

Once the job runs, we would observe that the destination bucket and folder have all the cleaned files in parquet format.

Step 10: Crawl the cleaned .csv data created in the above step and create a table.

To do this, we create a crawler in the same way as in the earlier steps ( the data source should be the folder containing the cleaned Parquet files created in the previous step ) and run it to produce a table.

After this step, we would be able to make queries on the data using Athena.

Step 11: Apply a trigger on the Lambda function :

Currently, the Lambda function we wrote in Step 6 produces a cleaned output file only when we invoke it manually with a test event. In a real-world solution, the function should be invoked every time a raw file is added to the S3 bucket. This can be achieved using triggers on Lambda functions.

To add a trigger:

  1. Go to the Lambda function to which the trigger has to be added.
  2. Click on the Add trigger button.

  3. On the window that follows, configure the trigger: choose S3 as the source, select the raw-data bucket, set the prefix to the folder containing the raw .json files and the suffix to .json, then click Add.

Once this is done, adding any .json file to the bucket and prefix folder mentioned in the trigger invokes the Lambda function and creates clean files in its destination bucket and folder.
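The same trigger can also be wired up with boto3; the following is a sketch in which the bucket, prefix, and function ARN are placeholders. Note that this call replaces any existing notification configuration on the bucket.

import boto3

RAW_BUCKET = "[ YOUR RAW BUCKET NAME ]"
FUNCTION_ARN = "arn:aws:lambda:<REGION>:<ACCOUNT_ID>:function:<YOUR_LAMBDA_NAME>"

# Allow S3 to invoke the Lambda function
boto3.client("lambda").add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="s3-invoke-cleansing-lambda",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{RAW_BUCKET}",
)

# Invoke the function whenever a .json object lands under the raw reference-data prefix
boto3.client("s3").put_bucket_notification_configuration(
    Bucket=RAW_BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": FUNCTION_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "youtube/raw_statistics_reference_data/"},
                            {"Name": "suffix", "Value": ".json"},
                        ]
                    }
                },
            }
        ]
    },
)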

Creating the Analytics Bucket and Table:

Once the above steps are completed, we create a final table, containing a join between the two clean tables that we created.

Step 12: Using AWS Glue ETL to join the two cleaned tables created and generate final data.

We create a Glue ETL job using the Visual with a source and target option.

Click Create and on the drawing board click on the + button.

Add two S3 source nodes ( the folders containing the cleaned .json and .csv files in a parquet format ) and specify the details in the node details pane, adjacent to the drawing board.

Now add a Join Transform node and in the details pane, mention the type of join and the column on which it should be done. Also, add both the S3 source nodes as Node parents to this.

Finally, add an S3 target node. We need a new S3 bucket to store this joined data, so create the bucket and enter its URI in the details pane.

Also, choose the “Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions” option. You will be prompted to enter a database; for this, we create a database named db_analytics to store the table produced by the job run. Enter the name you want for the table and add any partition keys if you want.
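A rough script-mode equivalent of the source, join, and target nodes described above would look like this ( database, table, and bucket names are assumptions based on the earlier steps; the Data Catalog table registration is left to the job settings ):

from awsglue.context import GlueContext
from awsglue.transforms import Join
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Source nodes: the two cleaned tables registered in the earlier steps
stats = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="raw_statistics"
)
reference = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="cleaned_statistics_reference_data"
)

# Join node: inner join on category_id = id
joined = Join.apply(frame1=stats, frame2=reference, keys1=["category_id"], keys2=["id"])

# Target node: write the joined data to the analytics bucket as Parquet, partitioned by region
glueContext.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={
        "path": "s3://[ YOUR ANALYTICS BUCKET ]/youtube/analytics/",
        "partitionKeys": ["region"],
    },
    format="parquet",
)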

Now, run the job and monitor the job using the Glue Studio.

Once the job runs successfully, the data is written to the S3 bucket specified in the target node and a table is created in the db_analytics database. This table can be queried using Athena.

Now that we have created the final data, we move on to visualization.

Visualization of data Using AWS QuickSight:

Step 13: If using QuickSight for the first time, we will need to Register.

This would require a username and an email.

Once you register your account with QuickSight, you have to grant it permissions on the buckets. Go to Security and Permissions and make sure QuickSight has access to all of the buckets involved in this project ( especially the Athena query-results bucket and the analytics bucket ).

Now we need to import the dataset. Go to the Datasets tab in the QuickSight console, click New dataset, and choose Athena. From the list of tables, select the analytics table ( the final table we created ) and save the dataset.

Click on the Dataset that we just added and click Create Analysis.

Once all the above steps are followed, we are directed to the QuickSight page with a drawing board. We can create various analyses based on the graphs and visualizations provided by QuickSight, and play around with the different visuals and columns to produce meaningful and useful insights.

Finally, if you want to save or publish this board, you can do so using the buttons in the top-right corner.

So, using AWS services like Glue, Lambda, Athena, and QuickSight, we managed to securely store, streamline, and analyze structured and semi-structured YouTube video data based on video categories and trending metrics.

In doing so, we built an ETL pipeline that extracts raw data from a Kaggle dataset, applies various transformations to it, and saves the transformed data in an efficient, query-friendly format.

For more insights, you may visit the git repository of the above project: https://github.com/Ishaan-Rawat/YOUTUBE-DATA-ENGINEERING-AWS/tree/master

