Table of Contents
- Why dockerize your scraper
- Scraping sample
- Flask file
- Scraping script
- Run everything
- Summarizing and problem-related links
- Stackoverflow links for detailed research
Why dockerize your scraper
Web scraping is one of the most common ways to obtain data for further analysis. I used it recently and ran into a variety of issues when containerizing (i.e., using Docker) my scraping function. Since you have to deploy your scraper whenever you need the most recent data, these issues are hard to avoid. I want to summarize my solutions here to save everyone endless hours of online research.
I am using:
- Conda environment
- Pip for installing
- Docker for building images and containers
- Selenium and the chrome browser for web scraping
This works with the following versions:
- Python 3.7
I will first show each file and explain afterward why I chose this setup.
from flask import Flask
# scrape_site comes from the scraping script below (module name assumed here)
from scraper import scrape_site

URL = 'https://google.com/'

app = Flask(__name__)

@app.route('/')
def index():
    result = scrape_site(URL)
    return result

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def scrape_site(SAMPLE_URL):
    options = webdriver.ChromeOptions()
    options.headless = True
    # options.add_argument('--disable-dev-shm-usage')  # Not used, see below
    driver = webdriver.Chrome(options=options)
    driver.get(SAMPLE_URL)
    time.sleep(5)  # give the page time to load
    for t in range(10):
        driver.find_element_by_tag_name('body').send_keys(Keys.PAGE_DOWN)
    for t in range(10):
        driver.find_element_by_css_selector('.additional_data').click()
    src = driver.page_source
    parser = BeautifulSoup(src, "html.parser")
    driver.close()
    return src, parser
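The fixed time.sleep(5) is a blunt instrument: it always waits the full five seconds, and still fails when a page loads slower than that. Selenium offers WebDriverWait for this, but the underlying retry pattern can be sketched in plain Python (wait_for is a hypothetical helper, not part of the script above):

```python
import time

def wait_for(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError(f"condition not met within {timeout} seconds")
```

In the scraper you would pass a lambda that tries to find the element and returns it (or False), so the script continues as soon as the element exists instead of sleeping a fixed amount.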
First of all, we define various driver options. We want the browser to run headless in the cloud. The other flags are needed as well to avoid several issues:
- not finding elements
- having startup issues
- having tab crashing issues
The argument '--disable-dev-shm-usage' is a workable option, but I solve the problem differently in this situation, namely by adding the flags '-v /dev/shm:/dev/shm --shm-size=2gb' when calling docker run (details below).
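To confirm that the shm-related flags actually took effect inside the container, you can check the size of /dev/shm from Python. This is a quick diagnostic sketch using only the standard library (not part of the original setup):

```python
import os

def mount_size_gb(path="/dev/shm"):
    """Total size of the filesystem holding `path`, in gigabytes."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks / 1024**3
```

Inside a container started with --shm-size=2gb, mount_size_gb() should report roughly 2.0; the Docker default is only 64 MB, which is what causes Chrome's tab crashes.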
FROM python:3.7
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV FLASK_APP=app.py
ENV FLASK_ENV=development

# install system dependencies
RUN apt-get update \
    && apt-get -y install gcc make \
    && rm -rf /var/lib/apt/lists/*
# install google chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get -y update
RUN apt-get install -y google-chrome-stable

# install chromedriver
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

RUN python3 --version
RUN pip3 --version
RUN pip install --no-cache-dir --upgrade pip

WORKDIR /app
COPY ./requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--timeout", "90", "app:app"]
Note the section on installing Chrome and ChromeDriver.
When you scrape larger sites or more data, you might run into a timeout error. That's why we add '"--timeout", "90"' to the CMD line of the Dockerfile.
docker build -t YOUR_IMAGE_NAME .
docker run -v /dev/shm:/dev/shm --shm-size=2gb -d -p 80:8080 YOUR_IMAGE_NAME
Then go to your localhost and check what the response is.
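Instead of checking in the browser, a small script can hit the running container and show what comes back. This is a hypothetical helper (not from the article) that assumes the port mapping 80:8080 from the docker run command above:

```python
import urllib.request

def fetch(url="http://localhost:80/", timeout=90):
    """Request the scraper endpoint; return (status, first 200 bytes of the body)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()[:200]
```

The generous timeout mirrors the gunicorn "--timeout", "90" setting, since the scrape itself runs inside the request.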
You can inspect your progress in the Docker Dashboard.
When you have followed all the previous steps your containerized web scraper with Flask should work.
Summarizing and problem-related links
To summarize, the problems I faced were related to:
- not having ChromeDriver installed (that's why we install it separately within the Dockerfile)
- startup and scraping issues (that's why we add several Chrome driver options)
- certain elements not being found, mostly due to loading delays (that's why we add time.sleep() during the scraping)
- too small shared memory (that's why we use the shm-related flags)
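The driver options touched on above can be collected in one place. Note that only --headless and --disable-dev-shm-usage appear explicitly in this article; the others are standard Chromium switches commonly added as container workarounds, listed here as an assumption:

```python
# Chromium switches commonly used when running Chrome inside Docker.
CHROME_FLAGS = {
    "--headless": "run without a display (cloud/container)",
    "--no-sandbox": "avoid startup failure when Chrome runs as root in a container",
    "--disable-gpu": "skip GPU initialization, which has no use headless",
    "--disable-dev-shm-usage": "write to /tmp instead of a small /dev/shm",
}
```

If you prefer the shm flags on docker run, as I do here, you can drop --disable-dev-shm-usage from this set.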
Stackoverflow links for detailed research
I am not associated with any of the services I use in this article.
I do not consider myself an expert. I merely document things besides doing other things. Therefore the content does not represent the quality of any of my professional work, nor does it fully reflect my view on things. If you have the feeling that I am missing important steps or neglected something, consider pointing it out in the comment section or get in touch with me.
This was written on 28.02.2021. I cannot monitor all of my articles. There is a high probability that when you read this article the tips are outdated and the processes have changed.
I am always happy to receive constructive input on how to improve.
Daniel is an artist, entrepreneur, software developer, and business law graduate. His knowledge and interests currently revolve around programming machine learning applications and all their related aspects. To the core, he considers himself a problem solver of complex environments, which is reflected in his various projects.