Web Scraping with Python at Scale (Requests, BeautifulSoup, Splash & Tesseract)

Aman Ruhela
Shipsy Blog | Data Driven Logistics
3 min read · Mar 13, 2019

With data being at the heart of impactful decision making, web scraping becomes an indispensable tool, especially in the logistics space, where tracking consignments from different sources forms the backbone of many products. In this blog, I will discuss an efficient and scalable way to scrape data from different websites, with a special focus on the necessary tools.

Tools

Flask (Web development framework for Python)
Requests (Python library for HTTP requests)
BeautifulSoup (HTML parsing)
Splash (JavaScript rendering engine)
Tesseract (OCR for text-based captchas)
PM2 (Process manager for Splash)

Installation

In order to scale our application, it is important to isolate it so that other services or applications cannot affect it. Hence, let's create a Dockerfile with all the necessary tools. Details are as follows:

Choose base image: We need to install many applications that are not specific to Python, hence we selected Ubuntu 16.04.

FROM ubuntu:16.04

Tesseract: Often, websites have text-based captchas, which can be handled with image processing (OpenCV) and OCR (Tesseract). The commands below install the latest version of Tesseract.

RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository -y ppa:alex-p/tesseract-ocr
RUN apt-get update && apt-get install -y tesseract-ocr-all
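
To give a concrete idea of how these two pieces combine, here is a minimal, purely illustrative sketch of solving a text-based captcha. It assumes pytesseract and OpenCV's Python bindings are installed; the preprocessing (threshold value, page segmentation mode) always has to be tuned to the specific captcha style.

import cv2
import pytesseract

def solve_text_captcha(image_path):
    # Load the captcha and convert it to grayscale
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Binarize to strip background noise (150 is a guess, tune per captcha)
    _, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)
    # --psm 7 tells Tesseract to treat the image as a single line of text
    return pytesseract.image_to_string(binary, config='--psm 7').strip()

print(solve_text_captcha('captcha.png'))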

Splash: Before we install Splash, let's discuss it. Splash is a JavaScript rendering service which keeps an unbounded in-memory cache, i.e. it will eventually consume all the system memory. Hence it is very important to control the memory used by Splash. Though Splash has an option to limit memory (--maxrss 1000), it only checks the memory every 60 seconds. Hence, we used PM2 to restart the service and keep the memory used by Splash in check.
Note: If you need to render a lot of JavaScript-enabled web pages, consider running several instances of Splash behind a load balancer.

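To illustrate what Splash actually provides: once an instance is up (we run it on port 8050, as configured later), fetching the fully rendered HTML of a JavaScript-heavy page is a single HTTP call to its render.html endpoint. A minimal sketch with Requests (the target URL is just a placeholder):

import requests

# Ask Splash to execute the page's JavaScript and return the final HTML
response = requests.get(
    'http://localhost:8050/render.html',
    params={
        'url': 'https://example.com/tracking-page',  # placeholder URL
        'wait': 2,      # seconds to let JavaScript settle
        'timeout': 30,  # overall render timeout in seconds
    },
)
html = response.text
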
Splash Installation:
First, clone the Splash Git repository using the command:

git clone https://github.com/scrapinghub/splash/

The above command creates a splash folder with all the necessary files. Add the following lines to the Dockerfile to install Splash.

WORKDIR /home/ubuntu
ENV DEBIAN_FRONTEND noninteractive
ENV PATH="/opt/qt59/5.9.1/gcc_64/bin:${PATH}"
# Copy the Splash files into the image
COPY ./splash splash
RUN cp splash/dockerfiles/splash/provision.sh /tmp/provision.sh
RUN cp splash/dockerfiles/splash/qt-installer-noninteractive.qs /tmp/script.qs
RUN apt-get update
RUN apt-get install -y python3-pip
RUN /tmp/provision.sh \
prepare_install \
install_deps \
install_qtwebkit_deps \
install_official_qt \
install_qtwebkit \
install_pyqt5 \
install_python_deps \
install_flash \
install_msfonts \
install_extra_fonts \
remove_builddeps \
remove_extra && \
rm /tmp/provision.sh
RUN cp -r splash/. /app
RUN pip3 install /app
ENV PYTHONPATH $PYTHONPATH:/app
VOLUME [ \
"/etc/splash/proxy-profiles", \
"/etc/splash/js-profiles", \
"/etc/splash/filters", \
"/etc/splash/lua_modules" \
]

Now that we have Tesseract and Splash installed, we need to install all the Python dependencies from the requirements.txt file. The commands below copy requirements.txt into the image and install the dependencies with pip3 install -r.

RUN apt-get install -y g++
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
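
For reference, a requirements.txt covering just the Python tools named in this post might look like the following; the actual file will differ per project and should pin exact versions:

flask
requests
beautifulsoup4
pytesseract
opencv-python  # or install OpenCV via apt, depending on the base image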

In order to manage the Splash process, we need to install PM2, a Node.js-based process manager. The commands below install PM2 globally.

RUN apt-get install -y nodejs
RUN apt-get install -y npm
RUN npm config set registry http://registry.npmjs.org/
RUN npm i -g pm2
RUN ln -s /usr/bin/nodejs /usr/bin/node

Now that we have all the dependencies installed, let's see how we can run the Splash instance with PM2. Let's create a pm2-process.json file:

{"apps": [
{
"name": "splash with pm2",
"script": "/app/bin/splash",
"cwd" : "/home/ubuntu",
"interpreter": "python3",
"interpreter_args" : "",
"instances": 1,
"exec_mode" : "fork",
"args": "--port 8050 --disable-browser-caches --slots 10 --max-timeout 600 --maxrss 1000 --proxy-profiles-path /etc/splash/proxy-profiles --js-profiles-path /etc/splash/js-profiles --filters-path /etc/splash/filters --lua-package-path /etc/splash/lua_modules/?.lua",
"max_memory_restart": "1200M"
}]
}

The above pm2-process.json asks Splash to limit itself to 1000 MB (--maxrss), while max_memory_restart sets a hard limit of 1.2 GB at which PM2 restarts the process (both values can be tuned based on the use case). We will now start the Splash server and the scraping server using a shell script.

#!/bin/bash
pm2 start pm2-process.json
./server.sh
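
To show how the pieces fit together at runtime, here is a hypothetical sketch of a scraping endpoint such as the one server.sh might launch; the route, parameters, and CSS selector are illustrative, not taken from our codebase:

import requests
from bs4 import BeautifulSoup
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/scrape')
def scrape():
    url = request.args.get('url')
    # Render the page through the local Splash instance started by PM2
    rendered = requests.get(
        'http://localhost:8050/render.html',
        params={'url': url, 'wait': 2},
        timeout=60,
    )
    soup = BeautifulSoup(rendered.text, 'html.parser')
    # 'td.status' is a placeholder selector; real pages need real selectors
    statuses = [cell.get_text(strip=True) for cell in soup.select('td.status')]
    return jsonify({'url': url, 'statuses': statuses})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)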

Download the complete Dockerfile and refer to the GitRepo for other details. Currently, we scrape over 1 million web pages per day with the Docker container running via ECS on t2.medium instances, and yes, the sky is the limit.

Please feel free to comment for suggestions and feedback.
