Data Harvesting simplified using Python, Flask, and SQL

Ankit Raj
Analytics Vidhya
Published in
9 min readDec 14, 2020

Ever thought of having an app that could be used as a search engine to query an API, harvest data, and perform basic ETL operations before dumping the processed data into a SQL table.

Seems a bit confusing?

Of course, as there is much processing required before that data could be used for analysis.

So, how about we build an app that handles all the data processing in the backend and keeps on populating a SQL table that we could use for further analysis.

Interesting enough? Buckle up then, it’s a long one…..

First things first, what exactly are we harvesting and where are we storing the processed data?

I build this app to harvest music data from Spotify API and store the processed data in a SQL server database. And, it goes without saying that you can modify this application as per your project requirements.

Now, before we jump into the interesting part of building the application, let’s first keep our database ready.

How should database architecture look like?

Fig 1.1. SQL Database

As we can see from Fig 1.1, we will be needing a working table: WKR_SPOTIFY_DATA, to dump the raw data after processing, a final table: SPOTIFY_DATA, to store the data with correct datatypes, and a stored procedure: BLD_SPOTIFY_DATA, to insert data into the final table from the worker table.

Let’s just take a quick look at the two table designs and the stored procedure.

Fig 1.2. Database table designs

So, our worker table just stores everything in VARCHAR format, and the final table stores them with the correct datatypes. We also have created columns for all the data that we need from Spotify, so now we have a clear idea of what we need to extract.

Fig 1.3. Stored Procedure Declaration

First, we need to create a stored procedure and we will be performing two operations with this stored procedure that are explained below.

Fig 1.4. First functionality of the Stored Procedure

This step populates the final table with the data stored in the worker table.

Fig 1.5. Second functionality of the Stored Procedure

Once the data is dumped into the final table, it is important to keep the worker table clean for the next data dump.

All right then, we have our database set up to store the processed data, it’s time to start building our app.

How will our application make connections to the API and the database?

We will be using spotipy to connect to the Spotify API and sqlalchemy to create the database connection.

Fig 2.1. Connection Manager

A simple class to handle the different connection requests will always come in handy when dealing with multiple connection requests.

Now that we have got different connections covered, as shown in Fig 2.1, it’s time to proceed to the main part of this project, and yes you guess it right, it’s time to create the architecture for extracting and processing the required information from the Spotify API request.

What functionalities do we need to harvest and process the extracted data?

Just to be clear, as the main focus of this article is on building a flask application, I won’t be going into details about each of the functionalities. I, however, will share the article in references that guided me while building this application.

But, nothing to worry about, we will still discuss each of the functionality.

Fig 3.1. Spotify Data Harvestor Construction

So, here’s what our constructor looks like, as shown in Fig 3.1., we have instantiated all the variables that we will be needing throughout the class. We will be passing the Spotify connection, database engine, artist URI, and artist name when creating an object of this class.

Fig 3.2. Function to extract the song features from an album

A pretty straight forward functionality, shown in Fig 3.2, to extract the various features of a song using an album URI. As initially shown in our database tables, we need song features like album name, track number, song id, song name, and song URI from Spotify.

Fig 3.3. Function to extract the audio features of an album songs

We surely need more information about a song track, if have to perform some kind of audio analysis, right?

Well yeah, and thankfully Spotify even got us covered in terms of the different audio features, required for any kind of audio analysis. All we need to do is to make the correct API call to Spotify.

The function shown in Fig 3.3 covers all the steps required to extract the audio features of all the tracks of an album.

Alright, so we have created the functions required to extract the features that we need, it’s time to start processing them into a pandas dataframe and finally dump them into the database worker table.

Fig 3.4.1. Function to call the extraction methods

The first part of our final method for building a database table, Fig 3.4.1, calls the extraction methods so that the spotify_albums dictionary instantiated in the constructor is ready for building our final pandas dataframe.

Just make sure not to overkill the API calls when iterating over the Spotify albums, I would suggest calling the sleep function after every 5 iterations.

Fig 3.4.2. Function to build the dataframe and dump data into worker table

We have now extracted all the song and audio features. The last thing left, before we start dumping the data into the worker table, is to use a dictionary and extend it over all the album features. We will be converting this dictionary into a pandas dataframe, as pandas provides functionality to convert the dataframe directly into a SQL table.

A few important things to note here:

  • We will be passing the name of our worker table in the database,
  • Sqlalchemy engine will be passed as the connection (don’t worry I will show you that configuration as well),
  • Make sure you keep if_exists=’append’, so that you don’t create a new table in each session and use the existing table,
  • Just keep index=False, we don’t need indexing here,
  • We need multiple data to insert, so we need to keep method=’multi’,
  • And keep the chunk to a safe size of 100, so that we don’t run into some issue while the data dump.

Now that we have got data dumped into our worker table, the next thing left is to perform a little transformation to dump the data from the worker table to the final table.

And, guess what we already have a stored procedure ready to take care of that for us. So, all we need to do is to call that stored procedure, right?

Well, let’s see how easily can we do that with the help of the sqlalchemy engine, created in the Connection Manager.

Fig 3.5. ETL Architecture for the final transformation

Alright then, we have created the entire backend required for our app.

And the good thing is that along with that all the heavy lifting is done, what’s left is just to create a web app and obviously a few more lines of code. And, trust me with a tool like flask, it will be a piece of cake.

How do we build our flask web application?

Fig 4.1. Call the dependencies

Before we start building our flask app, make sure you call all the dependencies, as shown in Fig 4.1, before we move on to the application building.

Fig 4.2. Flask App Architecture

We are now ready with all the required dependencies, we can now start building the flask app.

Let’s try to simplify the steps required for our flask app, as shown in Fig 4.2.

  • Create a flask app.
  • Define the __name__ variable to run the flask app.
  • Build a function that renders the home page on running the application, and the flask way to handle that is with a simple annotation that routes to home page denoted by ‘/’.
  • Build a function that takes input from the user in the form of a request, from the web page, and makes a call to the backend operations explained above. Our extract method does exactly that and again make sure to annotate the method with the web page extension, denoted by ‘/extract’.
  • Finally, just to give an overview of the functionality of the extract method, it takes a user input, queries the Spotify API, using the Spotify connection created in the Connection Manager, to extract all the artist URI and artist name for all the items of a track, stores them in a dictionary, as shown in Fig 4.2., calls the SpotifyDataHarvester for each item in the dictionary to dump the data into the worker table, and finally when the worker table is prepared, calls the ETL to build the final table.
  • And, now all that’s left for you is to design and built a UI for this app. You can refer to my GitHub repository, mentioned below, for the UI design that I used.

Phewww….. that was a lot of code, wasn’t it?

So, just give yourself a few moments of peace, before we jump into the good stuff.

And, I believe you already know what’s left to do.

Yes, you are absolutely right!

It’s time to give our app, a test run. So, let’s get done with that.

Final App Testing

Fig 5.1. Home Screen

Okay… so our app looks nice, and the home page loaded without any issue. Let’s try harvesting some data for a track.

Fig 5.2. Harvesting in process

So, the harvest has started, let’s take a quick look at what’s happening behind in the console screen.

Fig 5.3. Backend Processing Console Screen

Looks like the data extraction is going well and some data is already dumped into the working table. God! Don’t you love these little print statements………

Fig 5.4. Harvest complete screen

Finally, the harvest is completed, we got our sweet little completion message.

Now it’s time to see our final table is populated or not, plus the worker table should be truncated after each harvest as we discussed earlier.

Fig 5.5.1. Final Table

Ahaa.. we got our final table populated, let’s just make sure the worker table is truncated, before we wind up and celebrate all the hard work, we just did.

Fig 5.5.2. Worker Table

Here we go, the worker table is truncated, as expected, after the harvest.

God, this was a long tour, wasn’t it…?

But, thanks a lot for keeping up with me this long.

I hope this app will help ease up the mundane data harvesting process and you will rather start enjoying the data harvest…

You can find the full project on my GitHub repository mentioned below,

https://github.com/AnkitRajSri/Music_Data_Extractor

Happy harvesting…..

References:

https://rareloot.medium.com/extracting-spotify-data-on-your-favourite-artist-via-python-d58bc92a4330

--

--