Python Webscraping in a Docker Container

Daniel Deutsch
Mar 1, 2021
Source: AI-created art by the author. More examples at https://www.instagram.com/art_and_ai/; inspired by Lina Silivanova, https://unsplash.com/photos/9wLq0zC_sOE

Why dockerize your scraper

Web scraping is one of the most common ways to get data for further analysis. I used it recently and realized that containerizing a scraping function (i.e. running it with Docker) comes with a number of pitfalls. Since you usually need to deploy your scraper to get the most recent data, you will run into these issues sooner or later. I want to summarize my solution to save everyone endless hours of online research.

I am using:

  • Python
  • A Conda environment
  • Pip for installing packages (a minimal setup sketch follows after this list)
  • Docker for building images and containers
  • Selenium and the Chrome browser for web scraping
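
A minimal sketch of that local setup might look like this (the environment name "scraper" is just an example and not from the original write-up; see the Versions section below for a matching requirements.txt sketch):

conda create -n scraper python=3.7
conda activate scraper
pip install -r requirements.txt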

Versions

This works with the following versions:

(Docker version shown in a screenshot; image by author)
  • Python 3.7
  • selenium==3.141.0
  • gunicorn==20.0.4
  • flask==1.1.2
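
Based on these pins, the requirements.txt that the Dockerfile installs from might look roughly like this (beautifulsoup4 is needed by the scraping script but is not pinned in the article, so its version is left open):

selenium==3.141.0
gunicorn==20.0.4
flask==1.1.2
beautifulsoup4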

Scraping sample

I will first show the files and afterward explain why I chose this setup.

Flask file

from flask import Flask

from scraper import scrape_site  # assuming the scraping script below is saved as scraper.py

URL = 'https://google.com/'
app = Flask(__name__)

@app.route('/')
def home():
    result = scrape_site(URL)
    return result[0]  # scrape_site returns (src, parser); only the raw page source can be returned as a response

if __name__ == '__main__':
    app.run(debug=False)
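
Assuming the file is saved as app.py (which is what the Dockerfile below expects), a quick local smoke test before containerizing could be:

python app.py                          # Flask development server on http://127.0.0.1:5000/
gunicorn --bind 0.0.0.0:8080 app:app   # or roughly the same command the Dockerfile will run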

Scraping script

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


def scrape_site(SAMPLE_URL):
    options = webdriver.ChromeOptions()
    options.headless = True
    options.add_argument("window-size=1920x1080")
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-gpu')
    # options.add_argument('--disable-dev-shm-usage')  # not used, see the note below
    driver = webdriver.Chrome(options=options)
    driver.get(SAMPLE_URL)
    time.sleep(5)
    # page down a few times to scroll through the page
    for t in range(10):
        driver.find_element_by_tag_name('body').send_keys(Keys.PAGE_DOWN)
    # click the element that loads additional data, waiting between clicks
    for t in range(10):
        time.sleep(1)
        driver.find_element_by_css_selector('.additional_data').click()
    src = driver.page_source
    parser = BeautifulSoup(src, "html.parser")
    driver.close()
    return src, parser
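
For a quick check of the scraping function on its own, something like this can be appended to the scraping script (my addition; the URL and the '.additional_data' selector are placeholders, as in the code above, and need to match your real target site):

if __name__ == '__main__':
    page_source, soup = scrape_site('https://google.com/')
    print(soup.title)  # print the parsed <title> tag as a smoke test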

First of all, we define various driver options. We want the browser to run headless when we run it in the cloud. Yet we need the other flags as well to avoid several issues:

  • elements not being found
  • startup problems
  • tab crashes

The argument '--disable-dev-shm-usage' would also work, but I solve the shared-memory problem differently here, namely by adding the flags "-v /dev/shm:/dev/shm --shm-size=2gb" when calling docker run (details below).
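
For completeness, the flag-based alternative is just the line that is commented out in the scraping script above:

options.add_argument('--disable-dev-shm-usage')  # let Chrome write shared memory files to /tmp instead of /dev/shm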

Dockerfile

FROM python:3.7

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV FLASK_APP=app.py
ENV FLASK_ENV=development

# install system dependencies
RUN apt-get update \
    && apt-get -y install gcc make \
    && rm -rf /var/lib/apt/lists/*

# install google chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get -y update
RUN apt-get install -y google-chrome-stable

# install chromedriver
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

RUN python3 --version
RUN pip3 --version
RUN pip install --no-cache-dir --upgrade pip

WORKDIR /app
COPY ./requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . .

EXPOSE 8080

CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--timeout", "90", "app:app"]

Note the section on installing Chrome and ChromeDriver.
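
If you want to verify at build time that both were installed correctly, a sanity-check line in the same spirit as the RUN python3 --version checks could be added (my addition, not part of the original Dockerfile):

RUN google-chrome --version && chromedriver --version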

When you scrape larger sites or more data, you might run into a timeout error. That's why we add the "--timeout", "90" flag to the gunicorn command in the Dockerfile's CMD section.

Run everything

docker build -t YOUR_IMAGE_NAME .
docker run -v /dev/shm:/dev/shm --shm-size=2gb -d -p 80:8080 YOUR_IMAGE_NAME

Then go to localhost in your browser and check the response.
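
For example, from the command line (port 80 is assumed here because of the -p 80:8080 mapping above):

curl -i http://localhost/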

You can inspect your progress in the Docker Dashboard:

(Docker Dashboard showing that everything works as expected; image by author)

When you have followed all the previous steps, your containerized Flask web scraper should work.

Summarizing and problem-related links

To summarize, the problems I faced were related to:

  • a missing ChromeDriver (that's why we install it separately in the Dockerfile)
  • startup and scraping issues (that's why we add several Chrome driver options)
  • certain elements not being found, which is usually a loading issue (that's why we add time.sleep() calls during scraping)
  • too little shared memory (that's why we use the shm-related flags)


Disclaimer

I am not associated with any of the services I use in this article.

I do not consider myself an expert. I merely document things alongside other work. Therefore, the content does not represent the quality of my professional work, nor does it fully reflect my view on things. If you have the feeling that I am missing important steps or neglected something, consider pointing it out in the comment section or getting in touch with me.

This was written on 28.02.2021. I cannot monitor all of my articles, so there is a good chance that by the time you read this, the tips are outdated and the processes have changed.

I am always happy to receive constructive input and suggestions on how to improve.

About

Daniel is an artist, entrepreneur, software developer, and business law graduate. His knowledge and interests currently revolve around programming machine learning applications and all their related aspects. At his core, he considers himself a problem solver of complex environments, which is reflected in his various projects.
