Spotify ETL using Python and AWS

Gunjan Sahu · Published in Nerd For Tech · 12 min read · Jul 10, 2023

The focus of the course was a pipeline project, where I had the opportunity to work with the Spotify API and various AWS services. Although I made some minor adjustments to the code used in the project, the core concepts and structure remained intact.

This particular project is well-suited for individuals who are new to the field. Personally, I believe that starting with a guided project like this is beneficial for those who are unfamiliar with AWS services, as it allows them to grasp the fundamentals more easily. Once you have a solid foundation, you can then consider ways to enhance the project and explore additional problems and data sources to create a unique and customized experience.

Objective

Develop an ETL pipeline using Python and AWS services to obtain the artist, album, and song information from the “Discover Weekly” playlist on Spotify. This playlist is updated every Monday and consists of 30 personalized songs tailored for each user. Alternatively, you can use an IFTTT service to create an archive of this playlist. In the course, Darshil used the “Top Songs — Global” playlist, which is updated weekly.

The pipeline involves extracting data from the Spotify API, transforming the data to prepare it for analysis, loading the transformed data into Amazon S3, and querying the transformed data using Amazon Athena, utilizing AWS Glue for cataloging and metadata management.

Key Learnings

What you can learn from this project:

- How to extract data from an API
- Build an automated trigger to run data pipelines (EventBridge trigger)
- Write extract and transformation jobs (Lambda) and store raw data (S3)
- Build the correct bucket structure for data storage
- Automate the transformation job (S3 trigger) for new data
- Build an analytics layer for SQL queries and business intelligence (Athena)

Architecture Diagram

Architecture Diagram

Tools used: Python, AWS services (CloudWatch, Lambda, S3, Glue and Athena)

This is a brief overview of what each service does:

  • S3 (Simple Storage Service): Easily store and retrieve large amounts of data. Each file is called an object and data is stored in buckets.
  • Lambda: Serverless compute service to run code without managing servers. We will use Lambda to deploy the Python code that performs data extraction and transformation.
  • CloudWatch: Monitor and collect metrics from AWS resources. Can be used to monitor log files and set alarms.
  • EventBridge: Create and manage events, schedule them based on a defined pattern or cron expression.
  • Crawler: Component of AWS Glue that automatically scans and analyzes data sources to infer their schema and create metadata tables
  • Glue Data Catalog: Fully managed metadata repository provided by AWS Glue. It acts as a central repository for storing and organizing metadata about various data sources, including tables, schemas, and partitions. You can use the Glue Data Catalog without the Crawler if you already have the metadata or prefer to define and manage it manually; in that case, you can create and populate tables in the Data Catalog directly.
  • Athena: Interactive query service to analyze data stored in various sources using standard SQL queries. You can query data from the Glue Data Catalog, S3 and other supported data sources.

This project can also be implemented using services from other cloud providers such as Google Cloud Platform (GCP) and Azure. The architecture diagrams for these 2 platforms are also included in the course.

If you understand the core concepts in one cloud platform, you can apply the same knowledge to other providers, as the underlying principles are consistent.

Pre-Requisites

  1. To access the Spotify API, you need to obtain credentials to authenticate and authorize your application. Create an account on the Spotify Developer Dashboard and register your application. You will receive client credentials: a Client ID and a Client Secret.
  2. (Optional): Set up alarm on CloudWatch to send an email if there are any charges beyond USD 5. You can also set up Free Tier Usage Alerts.
  3. Create a bucket in S3. The bucket name must be globally unique, and I selected Singapore as the AWS region. This is the folder structure in the S3 bucket I created (a boto3 sketch for creating this structure follows the listing):

/discover_weekly: Main folder in the bucket
  /raw_data: Raw data is stored here
    /to_process: When the data extraction function is invoked, data extracted from the API is stored here
    /processed: When the transformation function is invoked, files in the to_process folder are copied to this folder and then deleted from to_process. We are just moving data from one folder to another.
  /transformed_data: These 3 subfolders contain the transformed dataset where basic cleaning and transformation have been applied.
    /album_data
    /artist_data
    /song_data
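
Since S3 has no real folders, the structure above is just a set of key prefixes. A minimal boto3 sketch for creating the bucket and these prefixes is shown below; the bucket name and region are placeholders, and you can just as easily create everything through the console.

```python
import boto3

# Placeholder names -- replace with your own globally unique bucket and region
BUCKET = "your-unique-bucket-name"
REGION = "ap-southeast-1"

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket (outside us-east-1, a LocationConstraint is required)
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)

# Empty objects with trailing slashes make the "folders" visible in the console
prefixes = [
    "discover_weekly/raw_data/to_process/",
    "discover_weekly/raw_data/processed/",
    "discover_weekly/transformed_data/album_data/",
    "discover_weekly/transformed_data/artist_data/",
    "discover_weekly/transformed_data/song_data/",
]
for prefix in prefixes:
    s3.put_object(Bucket=BUCKET, Key=prefix)
```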

ETL Process Overview

Jupyter Notebook is used to develop the initial Python code for data extraction and transformation. The code used in the project can be found here.

These are the overall steps in the process:

1. Extract
- Extract data from the Spotify API using the Spotipy library
- Deploy the data extraction code as a Lambda function
- Run a trigger using EventBridge to automate data extraction every Tuesday at 4 pm
- The data extract is saved in the discover_weekly/raw_data/to_process folder in the S3 bucket

2. Transform
- Run an S3 trigger when any new data is added to the discover_weekly/raw_data/to_process folder in the S3 bucket. This runs the data transformation code on Lambda
- The transformation code cleans and transforms the data to prepare 3 files for the album, artist, and songs. The data is stored in the 3 subfolders in transformed_data. Lastly, files in the to_process folder are copied to the processed folder and then deleted from to_process. We are just moving data from one folder to another.

3. Load
- The Glue crawler infers the schema when new data arrives in the 3 folders inside transformed_data
- The Glue Data Catalog manages the metadata repository
- Query the S3 data using Athena

Data Model diagram

Extract and Transform

We will combine the extract and transform steps in one section since the triggers are set up at the end, although this approach may vary depending on the specific use case and requirements.

Step 1: Create a layer in Lambda

We will use the Spotipy library for the Spotify web API. A layer can package libraries and other dependencies that you can use with Lambda functions. I uploaded the .zip file provided in the course. Select Python 3.8 as the runtime.

Add a layer for the Spotipy library

Step 2: Create a Lambda function (extract_data) to extract the data

Create extract_data function

Configuration tab (Environment variables, timeout, roles)

a. Select Environment variables and click on the Edit button. Add client_id and client_secret as the keys and add the respective values. Environment variables can be configured for a Lambda function to keep sensitive information separate from the code.

Create environment variables for client_id and client_secret

b. Select General configuration and click on the Edit button. Set the timeout to 1 min 30 seconds. This can be adjusted based on the expected runtime and nature of the task. The default timeout is 3 seconds. The timeout ensures functions do not run indefinitely and prevents potentially excessive costs.

Set the timeout

c. Select Permissions. Check that the role is set to a role which has these permissions: AmazonS3FullAccess and AWSLambdaRole. Otherwise, click on Add permission and attach the 2 policies which grant specific permissions to perform actions within AWS services.

Role in Identity and Access Management (IAM)

Code tab (Add in code, Spotipy layer, test event and deploy)

a. Add in the code for data extraction

Code explanation:
- The lambda_handler function is the entry point for the Lambda function
- It retrieves the Spotify API credentials from the environment variables and pulls the playlist data
- The filename is based on the current timestamp, and the extracted data is saved in the to_process folder
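
The exact code is available in the linked repository. As a rough sketch of that structure (the playlist URL and bucket name below are placeholders, not the ones used in the course), the function could look something like this:

```python
import os
import json
from datetime import datetime

import boto3
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials


def lambda_handler(event, context):
    # Read the API credentials from the Lambda environment variables
    client_id = os.environ.get("client_id")
    client_secret = os.environ.get("client_secret")

    # Authenticate with the client-credentials flow
    auth_manager = SpotifyClientCredentials(
        client_id=client_id, client_secret=client_secret
    )
    sp = spotipy.Spotify(auth_manager=auth_manager)

    # Placeholder playlist URL -- replace with your own playlist link
    playlist_link = "https://open.spotify.com/playlist/<your_playlist_id>"
    playlist_id = playlist_link.split("/")[-1].split("?")[0]
    data = sp.playlist_tracks(playlist_id)

    # Name the file with the current timestamp and save the raw JSON to to_process
    filename = "spotify_raw_" + datetime.now().strftime("%Y%m%d_%H%M%S") + ".json"
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="your-unique-bucket-name",  # placeholder bucket name
        Key="discover_weekly/raw_data/to_process/" + filename,
        Body=json.dumps(data),
    )
```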

b. Add the Spotipy layer

Add the spotipy layer in the function

Layer is now added in the function

c. Configure a test event. Use any event name and keep the default settings.
Click on Deploy. After deployment, you can invoke the Lambda function manually (test event) or through triggers such as CloudWatch Events, S3 events, or other AWS services.

Step 3: Create a Lambda function (transform_spotify_data) to transform the data

You can select the same role used in the previous function.

Create transform_spotify_data function

Configuration tab (timeout)
a. Select General configuration and click on the Edit button. Set the timeout to 1 min 30 seconds.

Code tab (Add in code, Pandas layer, test event and deploy)

a. Add in the code here

Code explanation:

- 3 functions (artist, album, and songs) extract specific information from the Spotify API data
- The code lists and retrieves the JSON files from the specified S3 bucket and prefix, storing the data in spotify_data and the file keys in spotify_keys
- For each JSON object, the 3 functions are called and the results are stored in Pandas dataframes. Duplicate rows are removed and the date columns are converted to datetime format
- The transformed data is exported to CSV files in the respective subfolders
- Files in to_process are copied to the processed folder and then deleted from to_process
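
Again, the full code is in the repository; the sketch below follows the same outline with a simplified, assumed subset of the Spotify response fields and placeholder bucket/prefix names:

```python
import json
from io import StringIO

import boto3
import pandas as pd

# Assumed bucket and prefixes -- adjust to match your own structure
BUCKET = "your-unique-bucket-name"
RAW_PREFIX = "discover_weekly/raw_data/to_process/"
PROCESSED_PREFIX = "discover_weekly/raw_data/processed/"
TRANSFORMED_PREFIX = "discover_weekly/transformed_data/"


def album(data):
    # Simplified subset of album fields from each playlist item
    return [
        {
            "album_id": item["track"]["album"]["id"],
            "album_name": item["track"]["album"]["name"],
            "release_date": item["track"]["album"]["release_date"],
            "total_tracks": item["track"]["album"]["total_tracks"],
        }
        for item in data["items"]
    ]


def artist(data):
    return [
        {"artist_id": a["id"], "artist_name": a["name"]}
        for item in data["items"]
        for a in item["track"]["artists"]
    ]


def songs(data):
    return [
        {
            "song_id": item["track"]["id"],
            "song_name": item["track"]["name"],
            "duration_ms": item["track"]["duration_ms"],
            "popularity": item["track"]["popularity"],
            "song_added": item["added_at"],
        }
        for item in data["items"]
    ]


def lambda_handler(event, context):
    s3 = boto3.client("s3")

    # List and load the raw JSON files waiting in to_process
    spotify_data, spotify_keys = [], []
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=RAW_PREFIX)
    for obj in listing.get("Contents", []):
        if obj["Key"].endswith(".json"):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            spotify_data.append(json.loads(body))
            spotify_keys.append(obj["Key"])

    for key, data in zip(spotify_keys, spotify_data):
        # Build dataframes, drop duplicates and convert the date columns
        album_df = pd.DataFrame(album(data)).drop_duplicates(subset="album_id")
        artist_df = pd.DataFrame(artist(data)).drop_duplicates(subset="artist_id")
        song_df = pd.DataFrame(songs(data)).drop_duplicates(subset="song_id")
        # release_date can be year-only, so coerce anything unparseable to NaT
        album_df["release_date"] = pd.to_datetime(album_df["release_date"], errors="coerce")
        song_df["song_added"] = pd.to_datetime(song_df["song_added"], errors="coerce")

        # Write each dataframe as CSV into its transformed_data subfolder
        stem = key.split("/")[-1].replace(".json", "")
        frames = [("album_data", album_df), ("artist_data", artist_df), ("song_data", song_df)]
        for name, df in frames:
            buffer = StringIO()
            df.to_csv(buffer, index=False)
            s3.put_object(
                Bucket=BUCKET,
                Key=f"{TRANSFORMED_PREFIX}{name}/{name}_{stem}.csv",
                Body=buffer.getvalue(),
            )

    # Move the raw files: copy to processed/ and delete from to_process/
    for key in spotify_keys:
        s3.copy_object(
            Bucket=BUCKET,
            CopySource={"Bucket": BUCKET, "Key": key},
            Key=key.replace(RAW_PREFIX, PROCESSED_PREFIX),
        )
        s3.delete_object(Bucket=BUCKET, Key=key)
```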

b. Add the Pandas layer: AWSSDKPandas-Python38

Add the Pandas layer in the function

Similarly, you can set the test event and deploy the function.

Step 4: Add trigger in the extract_data function to automate data extraction

Configuring an EventBridge rule with a cron expression will invoke the function to extract data from the Spotify API on the given schedule.

In the extract_data function, click on Add trigger.

Add trigger with EventBridge as the source

Set the source to EventBridge (CloudWatch Events). Create a new rule and set the rule name. For the rule type, select Schedule expression. In the example below, we have set it to cron(0 16 ? * TUE *). This expression specifies that the event should occur at 4:00 PM UTC every Tuesday.

Add trigger to the function

Step 5: Add trigger in the transform_spotify_data function to automate data transformation

Create a trigger to invoke the transform_spotify_data function whenever new objects are added to the folder (discover_weekly/raw_data/to_process)

In the transform_spotify_data function, click on Add trigger. Select the bucket. Use “All object create events” as the event type. Set the prefix to the folder where we want to monitor for new objects. Set the suffix to .json to limit the trigger’s activation to specific types of objects within the bucket.

Create trigger with S3 as the source

Load

Step 6: Create a crawler

The steps below have to be repeated 3 times. In this example, we will create a crawler for data stored in the song_data folder.

a. Set the crawler properties. I set the name to dw_song_crawler

b. Add S3 as the data source. Select the “In this account” option. Browse for the folder on S3 and add a / at the end of the path. Click on Add an S3 data source.

c. Configure security settings. Click on Create new IAM role and use the default settings. Select the role from the dropdown.

d. Set output and scheduling. Click on Add database and create one (I named it spotify_dw_db). Refresh and select the newly created database from the dropdown.

e. Review and create crawler. Click on Create Crawler.

f. Run crawler

Create crawlers for artist and album data by repeating the steps above.

In AWS Glue > Databases, select spotify_dw_db. You can see the respective tables here.

Tables in the database

To edit the schema, click on a table. You can click on Edit schema as JSON or Edit schema to make any modifications.

Click on the Advanced properties tab. Add skip.header.line.count as the key and 1 as the value to skip the first line (the header line) during data ingestion. Click on Save.

Step 7: Querying with Athena

a. Open Athena. Data source is AwsDataCatalog. Select spotify_dw_db as the database. You can view the tables along with their respective columns.

In the Query Editor, you might notice a message at the top: before running your first query, you need to configure the query result location, i.e. the S3 location where query results will be stored. This allows you to retrieve and analyze the results at a later time.

b. Click on Edit Settings. I created a new folder called queryresults. Click on Browse S3 and select this path.

c. You can now query the data.
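
If you prefer to run the same queries programmatically rather than in the console, here is a minimal boto3 sketch. The table and column names are assumptions based on the transformed_data folders (the crawler typically names tables after their source folders), and the output location is the queryresults folder configured above, with a placeholder bucket name.

```python
import time

import boto3

athena = boto3.client("athena", region_name="ap-southeast-1")

# Start a query against the Glue database created by the crawlers.
# Table and column names are assumptions -- check yours in the Glue console.
response = athena.start_query_execution(
    QueryString="SELECT song_name, popularity FROM song_data ORDER BY popularity DESC LIMIT 10",
    QueryExecutionContext={"Database": "spotify_dw_db"},
    ResultConfiguration={
        # The queryresults folder configured as the query result location
        "OutputLocation": "s3://your-unique-bucket-name/discover_weekly/queryresults/"
    },
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```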

Additional Improvements

Create a dashboard to visualize the data. AWS QuickSight can be used, but I thought of exploring Preset. You can establish a connection to AWS Athena and perform SQL queries using SQLAlchemy and the Pandas library, with the query results being stored in the specified S3 location. You can click on Test Connection to check that the URI is correct.

Example of SQLAlchemy URI: awsathena+pandas://[insert access key here]:[insert secret access key]@athena.ap-southeast-1.amazonaws.com/<schema_name>?s3_staging_dir=s3://spotifyapidata/discover_weekly/queryresults/

The S3 bucket specified is used for storing the query results and temporary files generated during query execution in Athena.
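
For ad-hoc analysis outside Preset, the same connection can be used from a notebook. Here is a minimal sketch, assuming the PyAthena driver (which provides the awsathena+pandas SQLAlchemy dialect) is installed, with placeholder credentials, bucket, and table names:

```python
# pip install "PyAthena[SQLAlchemy,Pandas]"
import pandas as pd
from sqlalchemy import create_engine

# Same URI shape as the Preset connection above, with credentials and bucket redacted
uri = (
    "awsathena+pandas://<access_key>:<secret_access_key>"
    "@athena.ap-southeast-1.amazonaws.com/spotify_dw_db"
    "?s3_staging_dir=s3://your-unique-bucket-name/discover_weekly/queryresults/"
)
engine = create_engine(uri)

# Table name is an assumption based on the transformed_data folders
df = pd.read_sql("SELECT * FROM song_data LIMIT 10", engine)
print(df.head())
```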

Once the database is connected, you can create dataset and chart. The platform seems easy to use!

Questions That I Have

While working on the project, I had a lot of questions which required me to do additional research. Here are the main questions:

1. If the data did not change and the crawler is run, will the data in Athena stay the same, or will the same records be appended to the table in the database?

Running the crawler does not modify or append the existing data in Athena. It only updates the metadata in the AWS Glue Data Catalog, which Athena uses for querying the data. Basically, it looks for new or changed files or tables in your data store, as mentioned in this Stack Overflow post.

2. How do you handle values that contain commas? By default, the Glue crawler splits the field into columns at the commas. I tried to follow the steps mentioned in this post but it did not work. I was advised to look into escaping commas in Python and transforming such values in the code itself. I have not attempted this yet, but a possible approach is sketched below.
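
I have not tested this in the project, but one possible approach is to handle the commas when the CSV files are written in the transformation step: either quote every field (which requires the table to use a quote-aware SerDe such as OpenCSVSerDe) or strip the embedded commas so the default crawler behaviour keeps working. A rough sketch with hypothetical data:

```python
import csv

import pandas as pd

# Hypothetical example row with an embedded comma
df = pd.DataFrame({"song_name": ["Hello, World", "Plain Title"], "popularity": [80, 55]})

# Option 1: quote every field so embedded commas stay inside one column
# (the table then needs a quote-aware SerDe such as OpenCSVSerDe)
df.to_csv("song_data_quoted.csv", index=False, quoting=csv.QUOTE_ALL)

# Option 2: replace commas inside string columns before writing the CSV
cleaned = df.copy()
for col in cleaned.select_dtypes(include="object").columns:
    cleaned[col] = cleaned[col].str.replace(",", " ", regex=False)
cleaned.to_csv("song_data_cleaned.csv", index=False)
```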

Conclusion

This project serves as a good introduction to ETL pipelines using Python and AWS services. You can gain hands-on experience working with the Spotify API and core AWS services such as Lambda, S3, Glue, and Athena.
