PDF Downloads

Going through a page to find and download PDFs can sometimes be a huge pain. We can easily automate this process.

We first import the following libraries -

  • Beautiful Soup (bs4), to help us tease the good stuff out from HTML and XML
  • requests, to help us make HTTP requests to specific webpages
  • re, to recognise specific text patterns easily
  • os, to work with paths on the local file system when saving the files

import bs4
import requests
import re
import os

Next, specify the webpage you would like to download PDFs from, for example the page listing the PDFs of the Fed's FOMC minutes.

source = ['https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm']

Some string manipulation first. I need the domain address later. Simply split the string at ‘/monetarypolicy’ and take the first item that is returned.

domain = source[0].split("/monetarypolicy")[0]

You will get this.

'https://www.federalreserve.gov'
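
If you prefer not to depend on the ‘/monetarypolicy’ substring being present, an alternative sketch (not in the original post) is to derive the domain with the standard library's URL parser:

from urllib.parse import urlparse

# Parse the URL and rebuild just the scheme and host
parsed = urlparse(source[0])
domain = parsed.scheme + '://' + parsed.netloc  # 'https://www.federalreserve.gov'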

Next, we get the content of the page specified in source.

html = requests.get(source[0]).content
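
If you want to fail fast when the page itself cannot be fetched, one option (a sketch, not part of the original code) is to check the response before reading its content:

response = requests.get(source[0])
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
html = response.content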

Then we parse it with BeautifulSoup. This allows us to get at all the links whose href contains ‘pdf’ (matched with the re library) using the findAll function.

soup = bs4.BeautifulSoup(html, 'html5lib')
links = soup.findAll('a', href=re.compile("pdf"))
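
As a quick sanity check (not in the original post), you can print the first few matched hrefs before downloading anything; the exact paths will depend on the page at the time you run this:

# Peek at a handful of the matched links
for link in links[:5]:
    print(link['href'])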

Go through the links on the page and download the PDFs. The name of each file is extracted by splitting the href string.

for link in links:
    # Take the part of the href after '/files/' and before the extension as the filename
    filename = link['href'].split('/files/')[1].split('.')[0]
    # The hrefs are relative, so prepend the domain
    target = requests.get(domain + link['href'])
    # Only save the response if the request succeeded and it really is a PDF
    if target.status_code == 200 and target.headers['content-type'] == 'application/pdf':
        with open(filename + '.pdf', 'wb') as pdf:
            pdf.write(target.content)
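
If you would rather keep the downloads in their own folder, a minimal variation (my own sketch, with an assumed folder name of fomc_pdfs) uses the os module imported earlier:

# Assumed folder name, change as you like
download_dir = 'fomc_pdfs'
os.makedirs(download_dir, exist_ok=True)

for link in links:
    filename = link['href'].split('/files/')[1].split('.')[0]
    target = requests.get(domain + link['href'])
    if target.status_code == 200 and target.headers['content-type'] == 'application/pdf':
        # Save each PDF inside the download folder
        with open(os.path.join(download_dir, filename + '.pdf'), 'wb') as pdf:
            pdf.write(target.content)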

The Jupyter notebook with the code is here.
