Using Python libraries to beat APIs

One of my current projects is a bot for Discord, a popular messaging application similar in a lot of ways to Slack. I wanted my bot to add a series of functions to the vanilla Discord experience, one of them being a quick YouTube search function. The idea was to have the bot take a query from a user as input and return a link to the first result. At first I assumed the best way of going about this was to use Google’s YouTube API; however, after a little bit of searching I came across this article: Python — Search Youtube for Video. I was impressed that my problem could be solved in fewer than ten lines of code.

From here I’m going to attempt to explain how that code works. We start by importing two modules from the urllib library, request and parse, along with the re library. The request module of urllib provides us with tools to open webpages from links and retrieve their data, while the parse module helps us break URLs apart and put them back together. Finally, the re library allows us to use regular expressions to quickly search through strings for a specific piece of information we might need.

Before we get started on the actual body of code, we need to look at what a YouTube URL looks like. Here are some examples:

https://www.youtube.com/watch?v=ENcnYh79dUY
https://www.youtube.com/watch?v=PWbRleMGagU

We can see that the only thing that changes is the 11 character identifier at the end of the URL. So, if we are somehow able to retrieve those 11 characters, we can concatenate “https://www.youtube.com/watch?v=” to the front and return it. Next let’s take a look at some YouTube URLs for a results page. Assume our queries are “that’s why i gave up on music” and “ghost in a flower.” Respectively, the links would be:

https://www.youtube.com/results?search_query=that%27s+why+i+gave+up+on+music
https://www.youtube.com/results?search_query=ghost+in+a+flower

Again we can see that between the two queries, a majority of the URL stays the same; the only part that changes is the identifier after “https://www.youtube.com/results?search_query=”. This identifier seems easier to retrieve; all we have to do is percent encode our queries and concatenate that with “https://www.youtube.com/results?search_query=”. Since the results page contains the link for the first result to our query, if our program navigates to the results page, accesses its content, and retrieves the 11 character identifier for our video, we will be able to make our link.
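To make the two URL shapes concrete, here is a small sketch that uses urllib's parse module to pull out the one part that changes in each. The helper name query_params is my own invention for illustration; urlsplit and parse_qs are the standard library calls doing the work.

```python
from urllib.parse import urlsplit, parse_qs

def query_params(url):
    """Return the query-string parameters of a URL as a dict of lists.

    Hypothetical helper for illustration; the name is not from any library.
    """
    return parse_qs(urlsplit(url).query)

watch = query_params("https://www.youtube.com/watch?v=ENcnYh79dUY")
results = query_params(
    "https://www.youtube.com/results?search_query=that%27s+why+i+gave+up+on+music"
)

print(watch["v"][0])               # ENcnYh79dUY
print(results["search_query"][0])  # that's why i gave up on music
```

Notice that parse_qs undoes the percent encoding, which is the reverse of what we need to do to build a results link ourselves.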

Luckily, the libraries we imported at the start of our program help us with a majority of these tasks. urllib’s parse module has a function urlencode which, given a dictionary with a key value pair, will return a percent encoded string for our query.

query_string = urllib.parse.urlencode({"search_query": "INPUT-STRING-HERE"})
# If we replace "INPUT-STRING-HERE" with "that's why i gave up on music", query_string equals "search_query=that%27s+why+i+gave+up+on+music"
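If you want to try the encoding step on its own, a minimal sketch might look like the following. The function name build_results_url and the constant RESULTS_URL are names I'm assuming for this example; only urlencode comes from the library.

```python
import urllib.parse

# Assumed constant for this sketch: the fixed part of a results-page link.
RESULTS_URL = "https://www.youtube.com/results?"

def build_results_url(query):
    """Percent-encode the query and append it to the results-page prefix."""
    return RESULTS_URL + urllib.parse.urlencode({"search_query": query})

print(build_results_url("ghost in a flower"))
# https://www.youtube.com/results?search_query=ghost+in+a+flower
```

Spaces become plus signs and characters like the apostrophe become percent escapes, which matches the example links shown earlier.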

Now that we have query_string, we can proceed to retrieve the results page’s content by using urllib’s request module:

html_content = urllib.request.urlopen("https://www.youtube.com/results?" + query_string)

In this line, we concatenate query_string to “https://www.youtube.com/results?” and request the page the resulting string leads to. urlopen returns a response object (html_content in the code); calling .read() on it gives us the page as raw bytes, so we use the .decode() function to turn those bytes into a string Python can search (.decode() assumes UTF-8 by default). Now that we are able to read the contents, we need a way to isolate every 11 character identifier that comes after the “/watch?v=” part of every YouTube link. Regular expressions are a great solution to this problem. I won’t explain regular expressions in this post, but every 11 character identifier that comes after “/watch?v=” is captured by this regular expression: '/watch\?v=(.{11})'

We can then use the re library’s findall function to find all cases of 11 character identifiers as they appear in html_content:

search_results = re.findall(r'/watch\?v=(.{11})', html_content.read().decode())
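To see the decode and findall steps without touching the network, here is a sketch that runs them on a made-up snippet. The sample markup is invented for the demo and is an assumption on my part; a real results page looks quite different, but the regular expression behaves the same way.

```python
import re

# Stand-in for what html_content.read() would return: raw bytes, not a str.
# This markup is invented for the demo, not copied from YouTube.
sample_bytes = (
    b'<a href="/watch?v=ENcnYh79dUY">first</a>'
    b'<a href="/watch?v=PWbRleMGagU">second</a>'
)

page_text = sample_bytes.decode()  # bytes -> str, UTF-8 by default
video_ids = re.findall(r"/watch\?v=(.{11})", page_text)

print(video_ids)  # ['ENcnYh79dUY', 'PWbRleMGagU']
```

Because the pattern has one capture group, findall returns only the 11 characters inside the parentheses rather than the whole match.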

What we end up with is a list containing every 11 character identifier the regular expression matched, in the order they appear on the page. The first result is the first element in the list, since it appears before the other videos on the results page. So, we can concatenate search_results[0] to the end of “https://www.youtube.com/watch?v=” in order to finally build the YouTube link to the first result of our query. All that’s left is to either print, return, or store the value of the string.
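Putting the pieces together, the whole thing can be sketched roughly like this. This is my own assembly of the lines discussed above, not the code from the article I linked: the function names first_video_url and search_youtube are made up, and the check for an empty list is an addition of mine in case nothing on the page matches.

```python
import re
import urllib.parse
import urllib.request

def first_video_url(page_text):
    """Return the watch URL for the first identifier in page_text, or None."""
    ids = re.findall(r"/watch\?v=(.{11})", page_text)
    if not ids:
        return None  # nothing matched, e.g. if the page layout has changed
    return "https://www.youtube.com/watch?v=" + ids[0]

def search_youtube(query):
    """Fetch the results page for query and return the first video link."""
    query_string = urllib.parse.urlencode({"search_query": query})
    url = "https://www.youtube.com/results?" + query_string
    with urllib.request.urlopen(url) as response:  # this line hits the network
        page_text = response.read().decode()
    return first_video_url(page_text)
```

Calling search_youtube("ghost in a flower") should then return a link to the first result, assuming the results page still contains “/watch?v=” links in its source. I split the parsing into its own function so it can be tried on any string without making a request.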

I hope that explanation was clear; if you have any questions, feel free to contact me and I’ll do my best to clear up any doubts. It’s interesting how many solutions a problem can have; the beautiful part about code is that, while some may be better than others, there are always going to be multiple solutions to a problem. It’s good practice to use APIs when you can, but keeping alternatives in mind, such as libraries like urllib or re, can make your code shorter and more elegant.
