PDF Downloads

Going through a page to find and download PDFs can sometimes be a huge pain. We can easily automate this process.

We first import the following libraries -

  • Beautiful Soup (bs4), to help us tease the good stuff out from HTML and XML
  • requests, to help us make HTTP requests to specific webpages
  • re, to recognise specific text patterns easily
  • os, to work with paths on the local file system when saving the files

import bs4
import requests
import re
import os

Next, specify the webpage you would like to download PDFs from, for example the page listing the PDFs of the Fed's FOMC minutes.

source = ['https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm']

Some string manipulation first. I need the domain address later. Simply split the string at ‘/monetarypolicy’ and take the first item that is returned.

domain = source[0].split("/monetarypolicy")[0]

You will get this.

'https://www.federalreserve.gov'
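
If you prefer not to depend on the ‘/monetarypolicy’ substring being present, an alternative sketch (not in the original post) is to derive the domain with the standard library's URL parser:

from urllib.parse import urlparse

# Parse the URL and rebuild just the scheme and host
parsed = urlparse(source[0])
domain = parsed.scheme + '://' + parsed.netloc  # 'https://www.federalreserve.gov'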

Next, we get the content of the page specified in source.

html = requests.get(source[0]).content
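
If you want to fail fast when the page itself cannot be fetched, one option (a sketch, not part of the original code) is to check the response before reading its content:

response = requests.get(source[0])
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
html = response.content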

Then we parse it with BeautifulSoup. This allows us to get at all the links whose href contains ‘pdf’ (matched with the re library) using the findAll function.

soup = bs4.BeautifulSoup(html, 'html5lib')
links = soup.findAll('a', href=re.compile("pdf"))
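
As a quick sanity check (not in the original post), you can print the first few matched hrefs before downloading anything; the exact paths will depend on the page at the time you run this:

# Peek at a handful of the matched links
for link in links[:5]:
    print(link['href'])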

Go through the links on the page and download the PDFs. The name of each file is extracted by splitting the href string.

for link in links:
    # Take the part of the href after '/files/' and before the extension as the filename
    filename = link['href'].split('/files/')[1].split('.')[0]
    # The hrefs are relative, so prepend the domain
    target = requests.get(domain + link['href'])
    # Only save the response if the request succeeded and it really is a PDF
    if target.status_code == 200 and target.headers['content-type'] == 'application/pdf':
        with open(filename + '.pdf', 'wb') as pdf:
            pdf.write(target.content)
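
If you would rather keep the downloads in their own folder, a minimal variation (my own sketch, with an assumed folder name of fomc_pdfs) uses the os module imported earlier:

# Assumed folder name, change as you like
download_dir = 'fomc_pdfs'
os.makedirs(download_dir, exist_ok=True)

for link in links:
    filename = link['href'].split('/files/')[1].split('.')[0]
    target = requests.get(domain + link['href'])
    if target.status_code == 200 and target.headers['content-type'] == 'application/pdf':
        # Save each PDF inside the download folder
        with open(os.path.join(download_dir, filename + '.pdf'), 'wb') as pdf:
            pdf.write(target.content)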

The Jupyter notebook with the code is here.
