Downloading hundreds of PDF files manually was…tiresome.
One fine day, a question popped into my mind: why am I downloading all these files manually? That's when I started searching for an automatic tool.
This sounded like a fun automation task, and since I was eager to get my hands dirty with web scraping, I decided to give it a try. The idea was to input a link, scrape its source code for all possible PDF files, and then download them. Let's break down the steps.
Using a simple try/except block, I check whether the entered URL is valid. If it can be opened using urlopen, it is valid; otherwise, the link is invalid and the program terminates.
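A minimal sketch of that check, assuming urlopen comes from urllib.request (my script's exact error handling may differ):

from urllib.request import urlopen
from urllib.error import URLError
import sys

my_url = input("Enter a URL: ")

try:
    # If the URL opens successfully, we treat it as valid
    html = urlopen(my_url).read()
except (ValueError, URLError):
    # Malformed or unreachable link: terminate the program
    sys.exit("Invalid URL, exiting.")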
In Python, the HTML of a web page can be read like this:
html = urlopen(my_url).read()
However, when I tried to print it on my console, it wasn't a pleasant sight. To get properly formatted, human-readable HTML source code, I parsed it with BeautifulSoup, a Python package for parsing HTML and XML documents:
html_page = bs(html, features="lxml")
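Here, bs is just BeautifulSoup under an alias, which I'm assuming was imported like this:

# BeautifulSoup lives in the bs4 package
from bs4 import BeautifulSoup as bs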
Now, I had two main websites from which I occasionally downloaded PDF files. Upon evaluating the HTML of both, I realized that the content of their meta tags was slightly different. For example, one of the websites had this:
<meta content="Chemistry" property="og:title"/>
while the other website had no og:title and had this instead:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
In order to get usable meta-data, I added this:
og_url = html_page.find("meta", property="og:url")
and got something like this as a result:
<meta content="https://cnds.jacobs-university.de/courses/cs-2019/" property="og:url"/>
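Since one of the two sites has no Open Graph tags at all, find() can return None, so the lookup is worth guarding. A small sketch of that idea (my script's exact handling may differ):

og_url = html_page.find("meta", property="og:url")

if og_url is not None:
    # cnds-style pages expose their base address in the og:url meta tag
    print(og_url["content"])  # e.g. https://cnds.jacobs-university.de/courses/cs-2019/
else:
    # No og:url tag: fall back to the parsed input URL instead
    print("no og:url meta tag found")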
Parse Input URL
Next, it was time to parse and evaluate the input URL.
base = urlparse(my_url)
The results looked like this:
ParseResult(scheme='https', netloc='cnds.jacobs-university.de', path='/courses/os-2019/', params='', query='', fragment='')
Now I knew the scheme, the netloc (the main website address), and the path of the web page.
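With the parse result in hand, the scheme and netloc can be glued back together whenever a base address is needed; urlparse itself comes from urllib.parse:

from urllib.parse import urlparse

base = urlparse(my_url)
# Rebuild the site's root address from the parsed components,
# e.g. 'https://cnds.jacobs-university.de'
base_address = base.scheme + "://" + base.netloc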
Find PDF links
Now that I had the HTML source code, I needed to find the exact links to all the PDF files present on that web page. If you know HTML, you would know that the <a> tag is used for links.
First, I obtained the links using the href attribute. Next, I checked whether each link ended with a .pdf extension. If the link led to a PDF file, I further checked whether og_url was present. If og_url was present, it meant that the link came from a cnds web page, and not from Grader.
The current_links looked like p1.pdf, p2.pdf, etc. So to get a full-fledged link for each PDF file, I extracted the main URL from the meta tag's content attribute and appended the current link to it. For example, og_url["content"] looked like this:
https://cnds.jacobs-university.de/courses/cs-2019/
while the current link was p5.pdf. Appended together, they gave the exact link for a PDF file:
https://cnds.jacobs-university.de/courses/cs-2019/p5.pdf
While trying to download PDFs from another website, I realised that its source code was different, so the links had to be handled differently. Since I had already parsed the URL, I knew its scheme and netloc; appending the current link to them gave me the exact link for my PDF file.
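Putting the pieces together, the link-collecting loop might look roughly like this. This is a sketch of the logic described above rather than my exact code; it assumes the non-cnds pages link their PDFs via absolute paths, and uses urlretrieve as one way to save each file:

from urllib.request import urlretrieve

# Collect the full URL of every PDF linked on the page
pdf_links = []
for link in html_page.find_all("a"):
    current_link = link.get("href")
    if current_link and current_link.endswith(".pdf"):
        if og_url is not None:
            # cnds-style page: the base address comes from the og:url meta tag
            pdf_links.append(og_url["content"] + current_link)
        else:
            # Other pages: rebuild the base from the parsed input URL
            pdf_links.append(base.scheme + "://" + base.netloc + current_link)

# Download each PDF into the current directory, named after the file itself
for pdf_url in pdf_links:
    urlretrieve(pdf_url, pdf_url.split("/")[-1])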
And there it was! My very own note-downloading, web-scraping tool. Why waste hours downloading files manually when you can copy-paste a link and let Python do its magic?
My overall code simply ties all of these steps together; the full script is in the repository below.
Want to try it? Feel free to fork, clone, and star it on my GitHub. Have ideas to improve it? Create a pull request!
GitHub link: https://github.com/nhammad/PDFDownloader