In this guide:
- Setting up a Digital Ocean droplet with Ubuntu 16.04.
- Installing all the software and dependencies we need including a headless Chrome.
On my quest to learn, I wanted to eventually be able to write beginner-friendly guides that really help one feel like they can improve. Normally, we get hit with very long documentation and a getting-started section that shows us the surface but doesn’t teach us real-world possibilities right away, before we invest more time in the tools. This guide assumes you have limited knowledge of the command line, the Python 3 language, and HTML. Let’s consider the user story:
Notice that the data is wrapped in a <script> tag? That data is in JSON format and is rendered to HTML when the page loads. We have the option to parse the JSON data, but let’s say we want to extract the data based on what we see, that is, the generated HTML. Let’s write out the steps for how we’d do that:
- Go to www.munchery.com (be sure to check their robots.txt and terms before proceeding).
- Get through the landing page by entering an email address and zip code, and then click on the submit button to get to the Main Menu page.
- On the Main Menu page, check that the image, name and price of each dish exist.
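The robots.txt check in the first step can be done by hand, but here is a minimal sketch of doing it with Python’s standard library. The rules in SAMPLE_ROBOTS_TXT and the two paths are made up so the example runs offline; they are not Munchery’s actual rules, and can_crawl is just a helper name chosen for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only so this runs without a network.
# These are NOT Munchery's real rules.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
"""


def can_crawl(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Both paths below are illustrative, not taken from the real site.
print(can_crawl(SAMPLE_ROBOTS_TXT, "http://www.munchery.com/menus/"))
print(can_crawl(SAMPLE_ROBOTS_TXT, "http://www.munchery.com/account/settings"))
```

For a live site you would likely call set_url() and read() on the parser instead of passing in text. Either way, this only covers robots.txt; the site’s terms still need a human read.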
However, before we can do the above we need to set up our server and environment.
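As a side note on the JSON option mentioned earlier: if you only wanted the raw data, you might not need a browser at all. Below is a rough sketch using only the standard library. The sample HTML, the type="application/json" attribute, and the "dishes" key are all assumptions for illustration; the real page’s markup may look quite different.

```python
import json
from html.parser import HTMLParser


class JsonScriptParser(HTMLParser):
    """Collect the text inside every <script type="application/json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self.payloads.append("")

    def handle_data(self, data):
        # Script contents can arrive in pieces, so append rather than assign.
        if self._in_json_script:
            self.payloads[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_script = False


# Hypothetical page fragment; not copied from the real site.
SAMPLE_HTML = """
<html><body>
<script type="application/json">
{"dishes": [{"name": "Sample Dish", "price": "10.95"}]}
</script>
</body></html>
"""


def extract_json(html):
    """Return the decoded JSON payload of each matching <script> tag."""
    parser = JsonScriptParser()
    parser.feed(html)
    return [json.loads(payload) for payload in parser.payloads]


data = extract_json(SAMPLE_HTML)
print(data[0]["dishes"][0]["name"])
```

The rest of this guide sticks with the browser-driven approach, since the goal is to extract what a visitor actually sees.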
Setting Up Our Environment and Crawl
For our environment, we’ll be using a Digital Ocean (D.O.) virtual server, or what D.O. calls a droplet.
- Go to www.digitalocean.com and log in or sign up.
- If you’re new to D.O., feel free to use my referral link to get $10 for free.
- Create a $5 Droplet in D.O. with Ubuntu 16.04:
- After creating your droplet, you should get an email with your server credentials. If you set up your SSH Keys with D.O. (highly recommended), great, you can skip the next part about setting your password. Pull up your terminal and log into your server with this command, replacing “your_ip_address” with your IP address:
- It will prompt you to agree; type “yes”, then enter the password from the D.O. email (you can copy and paste it). The server will then prompt you to change your password.
- Now that you’re logged into your server, let’s update your system and install unzip. Run each of the 3 commands:
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get -y install unzip
- Now we’ll download, unpack, and install the latest Google Chrome browser. Run each of the 7 commands:
cd ~
sudo apt-get install libpango1.0-0
sudo apt-get -f install
wget -c "https://www.slimjet.com/chrome/download-chrome.php?file=lnx%2Fchrome64_54.0.2840.71.deb"
sudo dpkg -i download-chrome.php?file=lnx%2Fchrome64_54.0.2840.71.deb
sudo rm download-chrome.php?file=lnx%2Fchrome64_54.0.2840.71.deb
sudo apt-get install -y -f
The commands below install the latest Chrome, but it doesn’t work with Chromedriver 2.25 (the version this guide uses with Selenium), so skip them:
sudo dpkg -i google-chrome-stable_current_amd64.deb
sudo rm google-chrome-stable_current_amd64.deb
sudo apt-get install -y -f
We’ll then want to get Chromedriver so we can run Chrome headlessly. Get the latest Chromedriver version with:
- At the time of writing this post, the latest version is 2.25. I recommend using version 2.25 before trying the latest version (EDIT: the latest version as of 2018–01–10 is 2.34, and it works). Download 2.25 or the latest Chromedriver by running the commands below in your terminal, replacing the version if it’s different from 2.25. It will download into a new folder you’ll create, “/var/chromedriver/”. Run the 3 commands:
sudo mkdir /var/chromedriver
cd /var/chromedriver
wget "http://chromedriver.storage.googleapis.com/2.25/chromedriver_linux64.zip"
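Since the version number is the only part of that download URL that changes, here is a tiny sketch of building it in Python. The helper name chromedriver_url is made up for this example, and it assumes other 2.x releases follow the same URL pattern as 2.25.

```python
# URL pattern taken from the wget command above; only the version varies.
CHROMEDRIVER_URL = "http://chromedriver.storage.googleapis.com/{version}/chromedriver_linux64.zip"


def chromedriver_url(version="2.25"):
    """Build the 64-bit Linux Chromedriver download URL for a version."""
    return CHROMEDRIVER_URL.format(version=version)


print(chromedriver_url())        # the version this guide uses
print(chromedriver_url("2.34"))  # the newer version mentioned in the EDIT
```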
- Unzip the Chromedriver with the below:
- Now we want to get PIP and the packages needed with the following command. Go here to learn more about PIP:
sudo apt-get -y install python3-pip python3-dev build-essential libssl-dev libffi-dev xvfb
- Upgrade PIP with the command:
pip3 install --upgrade pip
- Installing Virtualenv will allow us to create a virtual environment and install any Python packages in it without affecting our system’s Python. Go here to learn more about Virtualenv:
sudo pip3 install virtualenv
- Set up a virtual environment with the following command. It will create the folder /var/venv/:
- Activate the virtual environment with:
- You should now be in your virtual environment, identified with the (venv) tag to the left. You can check the pip version with “pip -V” and the python version with “python -V”. Both should mention python 3.5.
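If you’d rather double-check from inside Python, here is a small sketch that reports the interpreter version and whether it is running in a virtual environment. The function names are my own, and the detection covers the two conventions I’m aware of (older virtualenv releases set sys.real_prefix; newer tools make sys.base_prefix differ from sys.prefix).

```python
import sys


def in_virtualenv():
    """Return True when this interpreter is running inside a virtual environment."""
    # Older virtualenv releases expose sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases point sys.base_prefix at the system Python.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def python_version():
    """Return the running interpreter's version as 'major.minor'."""
    return "{}.{}".format(sys.version_info.major, sys.version_info.minor)


print("Python", python_version())
print("In virtualenv:", in_virtualenv())
```

Inside the (venv) from this guide, you would expect this to report Python 3.5 and True.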
- Almost there! Let’s get Selenium and PyVirtualDisplay. In your venv, run:
pip install selenium==3.0.0
pip install pyvirtualdisplay==0.2.1
- Your environment is now set up. Let’s get a script in for you to run. Create a ‘crawlers’ folder and create a ‘munchery_spider.py’ with your favorite text editor. I use vim:
- Copy and paste the code below into the munchery_spider.py file. To be able to type or add to the file in vim, start by pressing the ‘a’ key. To save, press ‘esc’ and then type in ‘:wq!’ (without the single quotes, but with the colon) and press enter. The code:
Run the crawler with the command (reminder: we’re still in (venv)!):
That’s it! You’ve successfully run a crawler on Munchery.com! Your output should look like this:
Breaking Down the Crawler
I’ll break down the munchery_spider.py crawler provided above and on Github. Here again are the steps we want to carry out:
1. Go to www.munchery.com.
2. Get through the landing page by entering an email address and zip code, and then click on the submit button to get to the Main Menu page.
3. On the Main Menu Page, get the image, name and price of each dish.
In the script, it will run in this order:
- Lines #95–96: Call the MuncherySpider class and then run lines #79–91.
- Line #80: Starts the driver from lines #16–21, which opens an invisible/headless Chrome browser with an 800x600 display.
- Line #82: Runs lines #31–34 where the browser will go to the url passed. In this case, it’s “http://www.munchery.com/”. Addresses Step 1 above.
- Line #83: Runs lines #37–46, which attempt to find the class=“signup-login-form” element, type an email into the class=“user-input email” field, type the zip code into the class=“user-input zip-code” field, and finally click the element with class=“extra-large orange button” within the parent “signup-login-form” element. Addresses Step 2:
- The headless browser should now be in the Main Menu/Dishes page.
- Line #84: Runs lines #48–55, where the driver grabs all “li” elements within the parent “ul” element with class=“menu-items row”:
- Line #51: Runs lines #57–77, which parse the image, title, and price out of the selected “li” element (lines #63–65). Addresses Step 3:
- Line #86: Close the driver and browser from lines #24–28.
- Lines #99–100: Prints out the results:
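To tie the list above together, here is a condensed sketch of the same flow. It is not the original munchery_spider.py, and the line numbers quoted above do not apply to it. It assumes Chromedriver was unzipped to /var/chromedriver/chromedriver and uses the Selenium 3.0.0 and PyVirtualDisplay 0.2.1 calls installed earlier. The three selectors for the fields inside each dish are placeholders I made up; the form and menu class names come from the breakdown.

```python
URL = "http://www.munchery.com/"
CHROMEDRIVER_PATH = "/var/chromedriver/chromedriver"  # assumed unzip location

# Hypothetical selectors for the fields inside each dish <li>; inspect the
# real page to find the right ones.
IMAGE_SELECTOR = "img"
NAME_SELECTOR = ".item-name"
PRICE_SELECTOR = ".item-price"


def start_driver():
    """Start an invisible 800x600 display and a Chrome driver inside it."""
    # Imported here so the helpers below can be tried without these packages.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(800, 600))
    display.start()
    driver = webdriver.Chrome(CHROMEDRIVER_PATH)
    driver.implicitly_wait(10)  # keep retrying element lookups for up to 10s
    return display, driver


def get_past_landing_page(driver, email, zip_code):
    """Step 2: fill in the sign-up form and click the submit button."""
    form = driver.find_element_by_css_selector(".signup-login-form")
    form.find_element_by_css_selector(".user-input.email").send_keys(email)
    form.find_element_by_css_selector(".user-input.zip-code").send_keys(zip_code)
    form.find_element_by_css_selector(".extra-large.orange.button").click()


def parse_dish(li):
    """Step 3: pull the image URL, name and price out of one dish <li>."""
    return {
        "image": li.find_element_by_css_selector(IMAGE_SELECTOR).get_attribute("src"),
        "name": li.find_element_by_css_selector(NAME_SELECTOR).text,
        "price": li.find_element_by_css_selector(PRICE_SELECTOR).text,
    }


def crawl(email, zip_code):
    """Run steps 1 to 3 and return a list of dish dictionaries."""
    display, driver = start_driver()
    try:
        driver.get(URL)  # Step 1
        get_past_landing_page(driver, email, zip_code)
        items = driver.find_elements_by_css_selector("ul.menu-items.row li")
        return [parse_dish(li) for li in items]
    finally:
        # Always close the browser and the virtual display, even on errors.
        driver.quit()
        display.stop()
```

With the environment from this guide active, calling crawl() with an email address and zip code of your choice should return one dictionary per dish, assuming the page still uses these class names. The compound class names are written as CSS selectors (for example “.user-input.email”) because a class attribute with a space in it is really two classes.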
The above can be used as a base for automated front-end testing. Just as you would click around to see if your website works, you can do just that with Selenium. If you want to learn more about web crawling, I recommend checking out Scrapy. This isn’t legal advice, but keep in mind to not reproduce copyrighted content and follow some best practices. As always, happy learning!
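As a rough sketch of the front-end testing idea: once a crawler returns each dish as a dictionary, a test can assert that nothing is missing. The sample dish below is made up so the test runs on its own; in a real test you would feed in the crawler’s results instead. The names missing_fields and MenuPageTest are just for this illustration.

```python
import unittest

REQUIRED_FIELDS = ("image", "name", "price")


def missing_fields(dish):
    """Return the required fields that are absent or empty in a dish dict."""
    return [field for field in REQUIRED_FIELDS if not dish.get(field)]


class MenuPageTest(unittest.TestCase):
    # Hypothetical sample data standing in for real crawler output.
    dishes = [
        {"image": "http://example.com/a.jpg", "name": "Sample Dish", "price": "$10.95"},
    ]

    def test_menu_is_not_empty(self):
        self.assertTrue(self.dishes)

    def test_every_dish_has_image_name_and_price(self):
        for dish in self.dishes:
            self.assertEqual(missing_fields(dish), [], dish)
```

Save it to a file and run it with “python -m unittest” followed by the file name.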
If you like this guide and would like to see more, check out my blog at