Life Hack: Web Scraping

Web scraping has made my life SO MUCH EASIER. Yet, the process for actually extracting content from the majority of websites is never really mentioned, which makes processing that information nearly impossible

Kamron Bhavnagri
Towards Data Science


Why?

Web scraping has made my life SO MUCH EASIER. Yet, the process for actually extracting content from websites which lock their content down using proprietary systems is never really mentioned. This makes it extremely difficult, if not impossible, to reformat that information into a desirable format. Over a few years, I’ve found several (nearly) foolproof techniques to help me out, and now I’d like to pass them on.

I’m going to walk you through the process of converting a web-only book to a PDF. The idea, though, is to highlight how you can replicate or modify this process for your own circumstances!

If you have any other tricks (or even useful scripts) for tasks like these, make sure to let me know, as creating these life-hack scripts is an interesting hobby!


Reproducibility/Applicability?

The example I’m outlining is from a website which provides online-only study guides (to protect their security I’m excluding specific URLs). Along the way I’ll point out several flaws/hiccups which often come up when web scraping!

Mistakes to Avoid

I’ve made several mistakes when trying to scrape limited-access information from the web. Each mistake consumed a large amount of time and energy, so here they are:

  • Using AutoHotkey or similar to directly control the mouse/keyboard (this produces dodgy, inconsistent behavior)
  • Loading all pages and then exporting a HAR file (HAR files don’t contain the actual data and take ages to load)
  • Attempting to use GET/HEAD requests (most pages use authorization schemes which aren’t realistically reversible)

Slow Progress

It seems like it should be quick and easy to write a short script to scrape these websites, but it’s always more difficult than that. Here are some potential hurdles, with solutions:

  • Browser profile used by Selenium changing
      ◦ Programmatically find the profile
  • Not knowing how long to wait for a link to load
      ◦ Detect when the URL isn’t equal to the current one
      ◦ Or use browser JavaScript (where possible, described more below)
  • Needing to find information about the current web page’s content
      ◦ Look at potential JavaScript functions and URLs
  • Restarting a long script when it fails (see the sketch after this list)
      ◦ Reduce the number of lookups for files
      ◦ Copy files to predictable locations
      ◦ Before doing anything complex, check these files
  • Not knowing what a long script is up to
      ◦ Print any necessary output (only for steps which take considerable time and don’t have another progress metric)
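
As a concrete illustration of the restart-resilience points above, here’s a minimal sketch (the function and directory names are hypothetical, not part of the actual script below):

import os

def process_page(n, out_dir):
    out_path = os.path.join(out_dir, str(n) + '.jpg')
    # Predictable location: a cheap existence check lets a restarted
    # script skip every page it has already finished
    if os.path.exists(out_path):
        return
    # Only the slow path prints progress, so the output stays readable
    print('Fetching page', n)
    # ... the slow download/save work would go here ...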

Code

Preparation

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from PIL import Image
from natsort import natsorted

import time
import os
import shutil
import img2pdf
import hashlib

# Start Firefox, and keep an explicit wait around for page loads
# (the 30 second timeout is an arbitrary but generous default)
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 30)

# Firefox keeps its disk cache inside the active profile, so ask the
# driver where that profile lives rather than hard-coding a path
cacheLocation = driver.capabilities['moz:profile'] + '/cache2/entries/'
originalPath = os.getcwd()
baseURL = 'https://edunlimited.com'

Loading Book

# loginURL and bookURL are intentionally left out (see above); point them
# at the site's login page and the book's reader page respectively
driver.get(loginURL)
driver.get(bookURL)

# The reader bounces back to the login page until we're authenticated;
# block here until the URL has moved on
wait.until(lambda driver: driver.current_url != loginURL)
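
If the site needs a scripted login rather than a manual one, a generic Selenium form fill looks something like this (the field names and credentials are placeholders; inspect the real login page for yours):

from selenium.webdriver.common.by import By

# Hypothetical selectors: adjust to the actual login form's markup
driver.find_element(By.NAME, 'email').send_keys('you@example.com')
driver.find_element(By.NAME, 'password').send_keys('not-a-real-password')
driver.find_element(By.CSS_SELECTOR, 'button[type=submit]').click()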

Get Metadata

Quite often it is possible to find JavaScript functions which are used to provide useful information. There are a few ways you may go about doing this:

  • View the page’s HTML source (right-click ‘View Page Source’)
  • Use the web console

# These globals belong to this particular site's reader app ('app' is its
# page-level JavaScript object); every site exposes something different
bookTitle = driver.execute_script('return app.book')
bookPages = driver.execute_script('return app.pageTotal')
bookID = driver.execute_script('return app.book_id')
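
If you don’t yet know what the page’s object exposes, you can enumerate its properties straight from Selenium before committing to any of them (a quick sketch; app is this site’s global, and yours will almost certainly be named differently):

# List the page object's own enumerable properties to see what's worth reading
print(driver.execute_script('return Object.keys(app)'))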

Organize Files

Scripts often don’t perform as expected, and can sometimes take long periods of time to complete. It’s therefore quite liberating to preserve progress across the script’s runs, and one good way to achieve this is keeping your files organized!

# One folder per book; its contents double as a progress checkpoint
if not os.path.exists(bookTitle):
    os.mkdir(bookTitle)

# Resume from the highest-numbered page image already saved
if len(os.listdir(bookTitle)) == 0:
    start = 0
else:
    start = int(natsorted(os.listdir(bookTitle), reverse=True)[0].replace('.jpg', ''))
driver.execute_script('app.gotoPage(' + str(start) + ')')

os.chdir(bookTitle)

Loop Through the Book

Images are always stored in the cache, so when all else fails, just use this to your advantage!

This isn’t easy though: first we need to fully load the page, and then we need to somehow recover the image from the cache!

To make sure we always load the entire page, there are two safety measures in place:

  • Waiting for the current page to load before moving to the next
  • Reloading the page if it fails to load

Getting these two to work requires functions which guarantee completion (JavaScript state or browser responses), plus fail-safe waiting time spans. Safe time spans are a matter of trial and error, but they usually work best between 0.5 and 5 seconds.
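
The waiting pattern boils down to a bounded polling loop. Here’s a minimal sketch (the helper name and defaults are mine, not part of the script below):

import time

def wait_for(check, poll=0.5, timeout=30.0):
    # Poll `check` until it returns truthy or `timeout` seconds elapse
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return True
        time.sleep(poll)
    return False

# e.g. wait until the reader reports that the current page has loaded:
# wait_for(lambda: not driver.execute_script('return app.loading'))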

Recovering specific data directly from the hard drive’s cache is a relatively obscure topic. The key is to first locate a download link (normally easy, as it doesn’t even have to work). Then take the cache key (here, the URL prefixed with a colon), run SHA-1 over it, and uppercase the hex digest; that produces the final filename (it isn’t just one of these steps, as older sources would lead you to believe, but all of them together).
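
Concretely, for a hypothetical cached image URL (made up purely for illustration), the lookup works like this:

import hashlib

pageURL = 'https://example.com/pages/1.jpg'  # hypothetical URL
# SHA-1 of the cache key, hex-encoded and uppercased, gives the name of
# the file to look for under cache2/entries/
fileName = hashlib.sha1((':' + pageURL).encode('utf-8')).hexdigest().upper()
print(fileName)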

On a final note, make sure to clean your data (here, removing the alpha channel from the PNG page images) as you save it rather than afterwards, as it saves a second pass over the files!

for currentPage in range(start, bookPages - 1):

    # Wait for the reader to report that the current page has loaded
    while driver.execute_script('return app.loading'):
        time.sleep(0.5)

    # If the reader is still showing its placeholder image, the page
    # failed to load; ask it to try again
    while driver.execute_script('return app.pageImg') == '/pagetemp.jpg':
        driver.execute_script('app.loadPage()')
        time.sleep(4)

    location = driver.execute_script('return app.pageImg')

    # Rebuild the image's URL, derive its cache2 filename, then pull it
    # from the cache, dropping the alpha channel along the way
    pageURL = baseURL + location
    fileName = hashlib.sha1((':' + pageURL).encode('utf-8')).hexdigest().upper()
    Image.open(cacheLocation + fileName).convert('RGB').save(str(currentPage) + '.jpg')

    driver.execute_script('app.nextPage()')

Convert to PDF

We can finally get that one convenient PDF file!

finalPath = originalPath + '/' + bookTitle + '.pdf'

# Stitch every page image into a single PDF, in natural page order
with open(finalPath, 'wb') as f:
    f.write(img2pdf.convert([i for i in natsorted(os.listdir('.')) if i.endswith('.jpg')]))

Remove Excess Images

# The page images were only an intermediate step; clean them up
os.chdir(originalPath)
shutil.rmtree(bookTitle)


Thanks for READING!

This is basically the first code-centric post I’ve made on my blog, so I hope it has been useful!

Until next time, I’m signing out!
