The Most Advanced Lyrics Extractor Python Library Explained

Rishabh Agrawal
HackerNoon.com
4 min read · Jun 8, 2019

A few months back, I was working on a chatbot and thought of adding a cool feature to it: fetching the lyrics for the songs a user requests and delivering them through my WhatsApp chatbot. Here’s an in-depth guide to building a Multi-featured Slackbot with Python.

While searching for a Python library to extract song lyrics for my WhatsApp and Slack chatbots, I couldn’t find any library or API that would accept just a song name as its parameter. The ones I tested required the exact spelling of both the song name and the artist name in order to fetch the lyrics. And even with all this information passed in, some song lyrics still weren’t available from them.

That is when I decided to write an algorithm that fetches and scrapes song lyrics from various websites, even when the user passes in a misspelled song name.

How does this library work?

We make use of the BeautifulSoup and Requests Python libraries to scrape song lyrics from a few selected websites.

Note: If you are planning to use this library for commercial purposes then I request you to have a look at the terms and policies of these websites and scrape lyrics only from sites which allow their content to be scraped and used commercially.

We start by creating our own Google Custom Search Engine. We then add to it the URLs of the websites listed in the library’s requirements, so that the requested song names are searched only on those sites.

You are free to customize your Custom Search Engine by prioritizing any of your preferred keywords, excluding any web pages or turning on the ‘Safe Search’ feature.

Note: Please don’t turn on the ‘Search the entire Web’ feature, as the library currently cannot scrape lyrics from arbitrary sites that appear in the search results.

Now copy your engine ID and API key and pass both in when you instantiate the library’s class.

Remember: Don’t forget to replace GCS_API_KEY with your API key and GCS_ENGINE_ID with the engine ID you received.

You can then get the title and the lyrics for a song by passing the song name to the class’s method for fetching lyrics.

As soon as you request the lyrics for a song name, that name is searched on the custom search engine we just created.

If the API key and the engine ID passed while instantiating the class are correct, the song name is checked for spelling errors using Google’s suggested results. The results are fetched after auto-correcting the misspelled song name.
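To make the search step concrete, here is a minimal sketch of how a request to Google’s Custom Search JSON API can be built from the API key, the engine ID, and the song name. The function name `build_search_url` is my own for illustration and is not taken from the library; only the endpoint and the `key`, `cx`, and `q` parameters come from Google’s API.

```python
from urllib.parse import urlencode

# Public endpoint of Google's Custom Search JSON API.
CUSTOM_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, song_name):
    """Return the URL that searches the custom engine for song_name.

    Hypothetical helper for illustration; not the library's own code.
    """
    params = {
        "key": api_key,    # your GCS_API_KEY
        "cx": engine_id,   # your GCS_ENGINE_ID
        "q": song_name,    # the (possibly misspelled) song name
    }
    return f"{CUSTOM_SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("GCS_API_KEY", "GCS_ENGINE_ID", "shape of yu")
print(url)
# https://www.googleapis.com/customsearch/v1?key=GCS_API_KEY&cx=GCS_ENGINE_ID&q=shape+of+yu
```

Sending this URL with Requests, for example `requests.get(url).json()`, would give back the search results as a dictionary.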
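The spelling check can be sketched roughly like this. The Custom Search JSON API may include a `spelling` object with a `correctedQuery` field when it thinks the query was misspelled. The helper name and the sample response below are hand-written assumptions for illustration, and the library’s actual handling may differ.

```python
def corrected_query(response, original_query):
    """Return Google's suggested spelling if present, else the original query.

    Hypothetical helper; `response` is the parsed JSON of a search request.
    """
    spelling = response.get("spelling", {})
    return spelling.get("correctedQuery", original_query)


# Hand-written sample shaped like a Custom Search JSON API response.
sample_response = {"spelling": {"correctedQuery": "shape of you"}}

print(corrected_query(sample_response, "shape of yu"))  # shape of you
print(corrected_query({}, "shape of you"))              # shape of you
```

When a correction comes back, the search can be repeated with the corrected query before any result is used.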

We extract the title and the link from the first search result as it is the most relevant search result for the user query according to Google. Yeah, smart use of Google’s search algorithm. 😃
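As a small sketch of this step: in a Custom Search JSON API response, the results sit in an `items` list, and each item has a `title` and a `link`. The function name and the sample data are assumptions made up for this example.

```python
def first_result(response):
    """Return (title, link) of the top search result, or None if there is none.

    Hypothetical helper; `response` is the parsed JSON of a search request.
    """
    items = response.get("items") or []
    if not items:
        return None
    top = items[0]
    return top["title"], top["link"]


# Hand-written sample response with placeholder sites.
sample_response = {
    "items": [
        {"title": "Sample Song Lyrics", "link": "https://www.lyrics-one.example/sample-song"},
        {"title": "Sample Song (Live)", "link": "https://www.lyrics-two.example/sample-song-live"},
    ]
}

print(first_result(sample_response))
# ('Sample Song Lyrics', 'https://www.lyrics-one.example/sample-song')
```

Returning `None` for an empty result list gives the caller a clear signal to fall back to the ‘No lyrics found’ message.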

Once we get the link, we figure out the website name from the link to run the right scraper for extracting the lyrics from it.
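One way to picture this lookup is a dictionary that maps each supported domain to its scraper function. The domains and scraper functions below are placeholders I made up for the sketch; the real ones are whichever sites you added to your search engine.

```python
from urllib.parse import urlparse


def scrape_site_one(html):
    """Placeholder scraper for a hypothetical site."""
    return "lyrics from site one"


def scrape_site_two(html):
    """Placeholder scraper for another hypothetical site."""
    return "lyrics from site two"


# Hypothetical domains mapped to the scraper that understands their markup.
SCRAPERS = {
    "lyrics-one.example": scrape_site_one,
    "lyrics-two.example": scrape_site_two,
}


def pick_scraper(link):
    """Return the scraper registered for the link's domain, or None."""
    host = (urlparse(link).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.site and site as the same website
    return SCRAPERS.get(host)


scraper = pick_scraper("https://www.lyrics-one.example/sample-song")
print(scraper.__name__)  # scrape_site_one
print(pick_scraper("https://unknown.example/page"))  # None
```

Keeping the mapping in one place means that supporting a new website only needs one new scraper function and one new dictionary entry.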

We fetch the page’s HTML with Requests, parse it with BeautifulSoup, and extract the lyrics by selecting the elements with the appropriate classes.

Finally, we return the song title and lyrics to the user if they are found; otherwise we just return ‘No lyrics found for {song_name}.’
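To show what selecting content by class means, here is a sketch that pulls the text out of the first element carrying a given class. Note that I have swapped in Python’s built-in `html.parser` so the example runs without extra installs; with BeautifulSoup the same idea is roughly `soup.find("div", class_="song-text").get_text()`. The class name `song-text` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # 0 means we are outside the target element
        self.done = False   # True once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "br":
                self.parts.append("\n")  # keep line breaks between lyric lines
            if tag not in VOID_TAGS:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes and tag not in VOID_TAGS:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_by_class(html, target_class):
    """Return the text of the first element with target_class, or None."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() or None


sample_html = (
    "<html><body><h1>Sample Song</h1>"
    '<div class="song-text">First line<br>Second line</div>'
    "<p>Footer</p></body></html>"
)
print(extract_by_class(sample_html, "song-text"))
print(extract_by_class(sample_html, "old-class-name"))  # None
```

The second call shows what happens when a website renames its class: the scraper finds nothing, even though the lyrics are still on the page.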

Major Issue

One major issue with scraping from different websites like this is that the classes we rely on can be renamed or modified, or a website’s entire DOM may change. This directly affects our scraper: it fails to fetch the song lyrics and simply returns ‘No lyrics found’ even if the lyrics exist for the song name.

To tackle this issue, we have to regularly check that the scraper is working for all the added websites and update it whenever an error shows up.

Do you have any ideas on how we can resolve this? Comment your suggestions below.
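A regular check like this could be automated along the following lines. This is only a sketch under my own assumptions: every name here is hypothetical, the scrapers are simple stand-ins, and the page fetcher is passed in so that the example runs offline. In real use, `fetch` would download the live page for a song that is known to exist on each site.

```python
def find_broken_scrapers(scrapers, known_urls, fetch):
    """Run each scraper on a page known to contain lyrics; return failing sites."""
    broken = []
    for site, scraper in scrapers.items():
        try:
            lyrics = scraper(fetch(known_urls[site]))
        except Exception:
            lyrics = None  # a crashing scraper counts as broken too
        if not lyrics:
            broken.append(site)
    return broken


def fake_fetch(url):
    """Stand-in for a real download; every page uses the class 'new-name'."""
    return '<div class="new-name">First line</div>'


# Stand-in scrapers: the second one still looks for a class that no longer exists.
scrapers = {
    "lyrics-one.example": lambda html: "First line" if 'class="new-name"' in html else None,
    "lyrics-two.example": lambda html: "First line" if 'class="old-name"' in html else None,
}
known_urls = {
    "lyrics-one.example": "https://lyrics-one.example/sample-song",
    "lyrics-two.example": "https://lyrics-two.example/sample-song",
}

print(find_broken_scrapers(scrapers, known_urls, fake_fetch))
# ['lyrics-two.example']
```

Running such a check on a schedule would point to the website whose scraper needs updating, instead of waiting for users to run into ‘No lyrics found’.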

Conclusion

In this post, I tried to cover how my lyrics-extractor Python library works without requiring the artist name or a correctly spelled song name. The algorithm aims to return the most relevant song lyrics to the user. If several songs share the same name, the most popular one is picked.

However, you have some control over the search results that appear, and you can easily tweak them from your Custom Search settings to fit your needs.

Finally, please respect the websites’ servers and don’t bombard them with a lot of requests at once, or you are bound to get an error.

If you find any bugs or issues, feel free to raise an issue on GitHub. I would highly appreciate it if you raised a PR to fix a bug or add a new feature.
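One simple way to space out requests is a small throttle that enforces a minimum gap between calls. This is a sketch of the general idea and not something the library provides; the class name `Throttle` and the 0.05 second interval are arbitrary choices for the example, and a real interval would likely be much longer.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None  # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)  # pause until the interval has passed
                now = self.clock()
        self.last_call = now


throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for song_name in ["song one", "song two", "song three"]:
    throttle.wait()  # call this right before each lyrics request
elapsed = time.monotonic() - start
print(f"3 requests spread over {elapsed:.2f}s")
```

The first call goes through immediately, and each later call waits only as long as needed to keep the requests apart.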

Thank you so much for taking the time to read this! If you enjoyed it, please give it a bunch of claps as it’ll help more people see this story. And ★ this repository on GitHub if you liked this project. ❤

Rishabh Agrawal
HackerNoon.com

Trying to make your lives easier by creating better products through tech