Cloud Web Scraper — The Setup
If you’re new to the concept of web scraping, I’d suggest reading this article I wrote covering the basics. They’re the ideas I’ll be building on with this scraping system.
What I’ve put together started out as a somewhat clunky Python script I would run locally to scrape some of my own stats from LinkedIn and save the output to a spreadsheet. I have since worked to improve it, and it now runs on a cloud-based virtual machine (VM). This machine saves the output to a database, which can then be visualised by a second VM running a small web server.
This visualisation shows the most recent post that has been scraped, the post with the highest number of views, and the post with the highest number of reactions. On top of that there’s a simple form to insert new posts into the scraping list.
In this article I’ll detail setting up the two VMs and the database. I’ll follow it up later with separate articles for the code and scripts for scraping and visualising the data, and for setting up a web server on one of the VMs.
Setting Up The VMs
First things first, we’ll be using Oracle Cloud (sticking to the free tier offerings), so you’ll need to sign up for an account. Once your account is set up you should be met by the main page with a ‘Quick Actions’ menu. From here we’ll start setting up the VM instances.
We have two VMs to set up, both with the same settings, so I’ll only run through the setup once.
Start by clicking on the ‘Create a VM instance’ button. You’ll be brought to a page detailing the settings for the VM. The first thing to do here is to name your VM instance; to make things easier moving forward, use meaningful names. For example, I’ll be referring to mine as ‘WebScraper’ and ‘Visualiser’ throughout these articles.
Next, make sure that the ‘Shape and Type’ has the ‘Always Free Eligible’ tag beside it (assuming you want to stay within the free tier). You’ll need to open the ‘Show Shape, Network and Storage Options’ menu anyway, and if the current shape isn’t within the free tier, select different ‘Availability Domains’ until an ‘Always Free Eligible’ shape is available in the ‘Change Shape’ menu. For the ‘UK-London’ region I’ve found that ‘AD 2’ should offer the correct shape.
Also within the ‘Show Shape, Network and Storage Options’ menu, almost at the bottom, is a choice of ‘Do not assign public IP address’ or ‘Assign public IP address’. You’ll need to change this to ‘Assign public IP address’ so that we can access the VMs remotely.
After you have the IP address set to public, you’ll need to scroll to the bottom of the page to upload your public SSH key. You can use whatever tool you want to generate your public-private key pair; you’ll use these keys for your remote access. In my case I used ‘PuTTY Key Generator’. If you do as well, ensure that under the ‘Key’ dropdown menu you have selected ‘SSH-2 RSA key’.
Once your keys are generated, hit the ‘Save private key’ button and save the key somewhere safe. Then copy the public key from the box on the Key Generator window. Paste this into a plain text file and save it using the file type ‘.pub’. Hitting the ‘Save public key’ button saves the public key in a slightly different format to how it is presented in the window.
Back in the VM setup page hit the ‘Choose Files’ button, and select your public key. Once that is done hit ‘Create’ at the bottom of the page. You will need to wait a few minutes as the VM is provisioned and started up. Again, you will need to do this twice as we will be using two VMs.
Setting Up The Database
Next comes the database, whose setup is more straightforward. Hit the relevant quick action button and, once you’re redirected to the setup page, name the database as you did the VMs. Then scroll down and make sure that ‘Choose a workload type’ is set to ‘Data Warehouse’ and ‘Choose a deployment type’ is set to ‘Serverless’.
Below both of these options is the ‘Configure the Database’ menu. Here you’ll find an ‘Always Free’ slider. Make sure this is turned on (a blue background behind the slider).
Finally you’ll need to scroll down to the next section and set your ADMIN password. Make sure you use a secure password, and that you remember it. Once this is set scroll to the bottom of the page and hit the ‘Create Autonomous Database’ button.
As I said before, I’ll cover the Python, SQL, HTML, etc. in my next article. But I’ll cover the various packages and libraries needed for the WebScraper here, in case you want to try some of these concepts on your own in the meantime.
The first thing to cover, since we are using Oracle Linux (unless you changed from the default on your VMs), is that it is a Red Hat-based distribution. For the most part this distinction is inconsequential to us. What we do need to be aware of, however, is that we will not be using the usual ‘apt-get’ package handler; in Red Hat distributions the handler is instead ‘yum’. A revelation that brought a little joy and humour to my day.
Packages and Python Libraries
First I’ll cover the packages we need, and then the Python libraries (the scraping will be carried out using a Python script). To install these you’ll need to access the VM instance by SSHing into it; if you don’t know how, check out this short article I’ve written on the subject.
$ sudo yum install python3
$ sudo yum install python3-pip
$ sudo yum install firefox
$ sudo yum install xorg-x11-utils.x86_64
$ sudo yum install xorg-x11-server-Xvfb.x86_64
‘python3’ lets us execute Python code, and ‘python3-pip’ lets us install Python libraries. We’ll be using the ‘firefox’ browser to navigate LinkedIn in order to scrape the information we’re looking for.
The mysterious-looking ‘xorg-x11-utils.x86_64’ gives us access to the ‘screen’ command. This lets us start a ‘screen’, a terminal instance untethered from the terminal window, where we can set code running. That way, closing the terminal window no longer stops any running code or commands. This is very useful when it comes to testing the scraping, as the code can take longer than the connection timeout to execute, and would otherwise fail when the terminal disconnects.
Finally the second xorg, ‘xorg-x11-server-Xvfb.x86_64’, allows us to run a virtual display. This is important as the VMs we set up are headless (no monitor/other display), and Firefox needs a display to interact with.
After all of the yum installs, it’s time for the Python libraries. This is where we’ll use our ‘pip3’ tool.
$ sudo pip3 install pyvirtualdisplay
$ sudo pip3 install selenium
$ sudo pip3 install python-dotenv
$ sudo pip3 install webdriverdownloader
‘pyvirtualdisplay’ allows us to interact with our virtual display from our Python script; a basic requirement, but no less important for it. ‘selenium’ is arguably the most important library we’re installing, as it is what will take control of the browser to facilitate our scraping.
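To give a flavour of how these two fit together, here is a minimal sketch only (not the full script from the next article; the URL is just a placeholder, and it assumes the Gecko Driver set up below is already in place):

from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual display so Firefox has somewhere to render on the headless VM
display = Display(visible=0, size=(1920, 1080))
display.start()

# Launch Firefox through Selenium (relies on geckodriver being on the PATH, set up below)
browser = webdriver.Firefox()
browser.get('https://www.linkedin.com')  # placeholder URL
print(browser.title)

# Tidy up the browser and the virtual display when finished
browser.quit()
display.stop()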
The use of a ‘.env’ file is not necessary, but it makes sharing scripts and files a little cleaner. What I will include in the ‘.env’ file could simply be hard-coded into your Python script if you wish. The ‘python-dotenv’ library is what will allow us to use a ‘.env’ file with our Python script.
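As an illustration (the variable name here is purely hypothetical, not something the later articles depend on), a ‘.env’ file containing a line like SCRAPE_EMAIL=you@example.com could be read from the script as follows:

import os
from dotenv import load_dotenv

# Read the key=value pairs from the .env file sitting alongside the script
load_dotenv()

# Fetch an illustrative value; hard-coding it in the script would work just as well
email = os.getenv('SCRAPE_EMAIL')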
Another useful-but-not-essential inclusion is ‘webdriverdownloader’. For Firefox to work properly through Selenium we will require a web driver, in this case Gecko Driver. While you could download and install it manually, this library simplifies the process. After it is installed, run the following:
$ python3
>>> from webdriverdownloader import GeckoDriverDownloader
>>> GeckoDriverDownloader().download_and_install()
Once that’s finished running, you will have Gecko Driver downloaded and installed! The last thing to do is to add ‘geckodriver’ to the PATH variable. That is done as follows:
$ readlink -f geckodriver
Copy the output of this statement by highlighting it, which adds it to the clipboard. Strictly speaking, what belongs on the PATH is the directory containing ‘geckodriver’, so drop the trailing ‘/geckodriver’ from what you paste. In the next statement, when you get to ‘paste output’, simply right click the mouse and the last item copied to the clipboard will be pasted.
$ export PATH=$PATH:'paste output'
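If you want to confirm that Python can now see the driver, a quick optional check from the same shell session is:

$ python3
>>> import shutil
>>> shutil.which('geckodriver')

If this prints the path to geckodriver rather than nothing, the PATH change has worked.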
Finally, so that the Gecko Driver can be accessed later by the VM when we automate the running of our Python script, copy the PATH variable into the sudo user’s ‘crontab’ file:
$ echo $PATH
Copy the output of this as before.
$ sudo EDITOR=nano crontab -e
This will open up the sudo user’s crontab file, with nano as the editor. You can use whichever editor you are most comfortable with, however I will be assuming nano usage. Add the following to the file:
PATH='paste output'
Then press ctrl+x to exit the editor, and ‘y’ to save the changes you’ve made. That has our WebScraper VM all ready for the Python script to come.
Check back in a couple of weeks when I discuss setting up the web server on the Visualiser!
Questions, comments? You can find me over on LinkedIn.
* All views are my own and not those of Oracle *