Cloud Web Scraper — The Code

Hugh Gallagher · Published in Analytics Vidhya · Feb 18, 2020 · 10 min read

Welcome back again (or possibly for the first time). Continuing the journey to set up a cloud-based web scraper, I’ll finally be getting around to the actual code that drives the scraping process, as well as setting up the necessary database tables and APIs we’ll use. (The APIs will be built on Oracle APEX, so if you’re not using Oracle Cloud that section will be a little redundant for you.)

In case you’ve stumbled across this and are wondering what’s come before: I’ve already covered setting up the Oracle Cloud environment, as well as configuring one of the two virtual machines (VMs) as a web server.

The Database

Diving straight in, let’s set up our database with all the tables we’ll need for this project! This is going to hold the data we scrape so that it can be reviewed, and tracked over time. Before we throw together all the SQL needed, head on over to your Data Warehouse page on Oracle Cloud. From there click on the database name, then move to the “Tools” tab, and click on the “Open APEX” button.

Image showing the dropdown menu on the Oracle Cloud home, with Data Warehouse highlighted

When this opens up you’ll be asked to create another (non-admin) user, and a workspace. In my example I’ve used “HUGH_INTERN” as both, so it’ll be necessary to replace that with your own username as appropriate. Once that’s done navigate back to the page with the “Tools” tab. This time we’ll click the “Open SQL Developer Web” option. Now we’ll set up the database.

Below you’ll find the SQL for each table. It is quite lengthy, so I’ll try to explain it all in brief. However, when you get to setting up the tables, it can all be run in one go.

-- table to hold each post information
CREATE TABLE HUGH_INTERN.POST_TABLE
(
URL VARCHAR2 (150) ,
DATE_OF_CREATION DATE NOT NULL ,
IDENTIFIER VARCHAR2 (60) NOT NULL
)
TABLESPACE DATA
LOGGING
;
-- create and add primary key
CREATE UNIQUE INDEX HUGH_INTERN.POST_TABLE_PK ON HUGH_INTERN.POST_TABLE
(
URL ASC
)
TABLESPACE DATA
LOGGING
;
ALTER TABLE HUGH_INTERN.POST_TABLE
ADD CONSTRAINT POST_TABLE_PK PRIMARY KEY ( URL )
USING INDEX HUGH_INTERN.POST_TABLE_PK ;
-- create table to hold daily scraped data for each post
CREATE TABLE HUGH_INTERN.STAT_TABLE
(
COLLECTION_DATE DATE ,
POST_REACTS NUMBER (*,0) ,
POST_VIEWS NUMBER (*,0) ,
POST_URL VARCHAR2 (160)
)
TABLESPACE DATA
LOGGING
;
-- create and add primary key
CREATE UNIQUE INDEX HUGH_INTERN.STAT_TABLE_PK ON HUGH_INTERN.STAT_TABLE
(
COLLECTION_DATE ASC ,
POST_URL ASC
)
TABLESPACE DATA
LOGGING
;
ALTER TABLE HUGH_INTERN.STAT_TABLE
ADD CONSTRAINT STAT_TABLE_PK PRIMARY KEY ( COLLECTION_DATE, POST_URL )
USING INDEX HUGH_INTERN.STAT_TABLE_PK ;
-- create table to hold 'API key'
CREATE TABLE HUGH_INTERN.API_KEYS
(
KEY_VALUE VARCHAR2 (64)
)
TABLESPACE DATA
LOGGING
;
-- create and add primary key
CREATE UNIQUE INDEX HUGH_INTERN.API_KEYS_PK ON HUGH_INTERN.API_KEYS
(
KEY_VALUE ASC
)
TABLESPACE DATA
LOGGING
;
ALTER TABLE HUGH_INTERN.API_KEYS
ADD CONSTRAINT API_KEYS_PK PRIMARY KEY ( KEY_VALUE )
USING INDEX HUGH_INTERN.API_KEYS_PK ;

The first table we’re creating is POST_TABLE in the HUGH_INTERN schema. This table holds three pieces of information: URL, DATE_OF_CREATION, and IDENTIFIER.

The URL is that of the post you want to scrape. If you also wish to set up a web server to display some information (specifically, in the same manner I have done), this URL needs to be obtained in a very specific way: copy the post’s embed URL (so that an iframe can be displayed later) and remove the ‘embed/’ piece of text from it.

Image showing how to find correct URL
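
To make that edit concrete, the snippet below shows the transformation (the embed URL here is a made-up example, not a real post):

# Hypothetical embed URL copied from a LinkedIn post's embed option
embed_url = "https://www.linkedin.com/embed/feed/update/urn:li:share:0000000000000000000"
# Remove the 'embed/' piece of text to get the plain post URL stored in POST_TABLE
post_url = embed_url.replace("embed/", "")
print(post_url)  # https://www.linkedin.com/feed/update/urn:li:share:0000000000000000000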

As you’ll see later, the DATE_OF_CREATION is generated through an API when you add a new post. Finally, the IDENTIFIER is simply a short string of text you come up with to identify the post: essentially a human-readable name. For example, the post I share with this article will likely be given the IDENTIFIER “Web Scraper The Code”.

The next table is the STAT_TABLE, which is accessed daily to log stats for each post. As such it requires the POST_URL (the same value as URL, described above) and the COLLECTION_DATE to uniquely identify each row. Tied to each row are the POST_REACTS and POST_VIEWS which, simply put, are the number of reactions to the post and the number of views of the post respectively.

Finally, there is the small API_KEYS table. You can decide for yourself whether you need this; it’s a very simplistic way to add a little control over who can add to the database. All additions will be made through an API, which is itself an added layer of security. Here I saved a 64-character hashed word to act as a key that must be provided in order to add to the database.
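
If you do want a key, one quick way to produce a 64-character value is to hash a word or phrase of your choosing. Here is a minimal sketch, assuming a SHA-256 hex digest is acceptable and with a placeholder passphrase:

import hashlib

# Placeholder passphrase -- replace with your own secret word or phrase
passphrase = "my secret key phrase"
# A SHA-256 hex digest is exactly 64 characters, matching the KEY_VALUE column
api_key = hashlib.sha256(passphrase.encode("utf-8")).hexdigest()
print(api_key)

Insert the resulting value into the API_KEYS table (through SQL Developer Web, for example) and keep a copy to hand, as it will need to be supplied with every addition later on.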

The APIs

Next we’ll head back on over to Oracle APEX to set up all our APIs. At the Oracle APEX homepage, open up the dropdown menu “SQL Workshop”, and select the “RESTful Services” option.

From here you’ll need to sign in to the workspace you created above, with the user created at the same time. Once you’ve logged in look for the section, around the centre of the screen, labelled “Modules”. Click the little plus symbol that should read “Add Module” if you hover over it.

Image showing the form to be filled in

You’ll be greeted by a form to be filled in. In my case I used almost the same name for both the ‘Module Name’, and the ‘Base Path’: “scrape” and “/scrape/” respectively. If you change the ‘Is Published’ slider it will change the availability of the API. The ‘Pagination Size’ refers to how many results the APIs will return, which can also be further refined in each API.

When you’re in the module menu, you’ll want to press the “Create Template” button to begin making your API. You’ll need to provide a ‘URI Template’ name; we’ll start with one called “posts”. This will hold our GET and POST methods for the web scraper to use.

Image of handler creation menu

Once you’ve created the template, click “Create Handler” from within the template menu. You’ll be greeted with the menu shown in the image above. In the ‘Method’ dropdown menu select GET, and for ‘Source Type’ select Collection Query. Then, in the ‘Source’ box below, add the following:

SELECT URL
FROM HUGH_INTERN.POST_TABLE
JOIN HUGH_INTERN.STAT_TABLE ON URL = POST_URL
WHERE COLLECTION_DATE = (SELECT MAX(COLLECTION_DATE)
                         FROM HUGH_INTERN.STAT_TABLE
                         WHERE URL = POST_URL
                         GROUP BY POST_URL)
ORDER BY COLLECTION_DATE ASC

This returns the list of URLs in your database, ordered from the post scraped longest ago to the one scraped most recently. The reason for this ordering is that the Python script will occasionally fail to scrape some pages due to LinkedIn timing out; since it is designed to run every night, this gives us an ‘eventually correct’ approach, which is not ideal but unfortunately cannot be helped.
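
For reference, this is roughly how the nightly Python script can pull that list down; a minimal sketch, assuming the requests library and using the same [Full URL Here] placeholder as the script itself:

import requests

# Replace with the Full URL copied from your APEX module (it should end in /scrape/)
MODULE_URL = "[Full URL Here]"

# Call the GET handler on the 'posts' template
response = requests.get(MODULE_URL + "posts")
response.raise_for_status()

# ORDS collection queries typically return JSON with the rows under 'items',
# with column names lower-cased
urls = [row["url"] for row in response.json()["items"]]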

Navigate back to the template to create another handler. This time we will create a POST method API, which will be used to log the collected stats for each post. When you select a POST method, the Source Type will change to PL/SQL, which is what we want. The Source for this is as follows:

BEGIN
  INSERT INTO HUGH_INTERN.STAT_TABLE (COLLECTION_DATE, POST_REACTS, POST_VIEWS, POST_URL)
  VALUES (CURRENT_DATE, :GETREACT, :GETVIEW, :GETURL);
  COMMIT;
END;

Each variable above preceded by a colon (:) is a bind variable, sent within the headers of the POST request. CURRENT_DATE simply passes in the local date and time at the moment the API is run, in the DATE format.
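
In the Python script, logging one night’s numbers for a single post then looks something like the following; a sketch, again assuming the requests library, with the bind variables passed as HTTP headers as described above (the counts and URL are placeholder values):

import requests

MODULE_URL = "[Full URL Here]"  # same placeholder as before

# :GETREACT, :GETVIEW and :GETURL are supplied as request headers
headers = {
    "GETREACT": "42",    # example: reactions scraped from the post
    "GETVIEW": "1234",   # example: views scraped from the post
    "GETURL": "https://www.linkedin.com/feed/update/urn:li:share:0000000000000000000",
}

response = requests.post(MODULE_URL + "posts", headers=headers)
response.raise_for_status()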

Next we’ll create another template to handle some of the web page visualisation. In this instance I have called it “visualise”. As before we’ll create a GET and POST method. I’ll provide the code, and then briefly discuss each.

--GET
SELECT URL
FROM HUGH_INTERN.POST_TABLE
WHERE DATE_OF_CREATION = (SELECT MAX(DATE_OF_CREATION) FROM HUGH_INTERN.POST_TABLE)

The above GET handler returns the most recently added post that you are tracking, simply so it can be visualised on the web page alongside the most viewed and most reacted-to posts.

--POST
DECLARE
  API_KEY_COUNT NUMBER;
  NEW_POST NUMBER;
BEGIN
  SELECT COUNT(1) INTO API_KEY_COUNT FROM HUGH_INTERN.API_KEYS WHERE KEY_VALUE = :API_KEY;
  SELECT COUNT(1) INTO NEW_POST FROM HUGH_INTERN.POST_TABLE WHERE URL = :GETURL;
  IF API_KEY_COUNT = 1 THEN
    IF NEW_POST = 0 THEN
      INSERT INTO HUGH_INTERN.POST_TABLE (DATE_OF_CREATION, URL, IDENTIFIER)
      VALUES (CURRENT_DATE, :GETURL, :GETIDENTIFIER);
      COMMIT;
    END IF;
  ELSE
    DBMS_OUTPUT.PUT_LINE('Invalid API Key');
  END IF;
END;

The above POST handler is used to insert new posts to be tracked into the database; it will be called from a form on the web page. The DECLARE block lets us create variables in the PL/SQL code. API_KEY_COUNT checks that our (rather rudimentary) API key actually exists in the API_KEYS table, by returning a count of one. In a similar fashion, NEW_POST checks that the post we are adding is not already in the database, by returning a count of zero. If both of these conditions are met, the new post is added to the database.
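
The web page’s form is what will normally call this handler, but it can be exercised the same way from Python; a sketch, with the key, URL, and identifier all placeholders:

import requests

MODULE_URL = "[Full URL Here]"  # same placeholder as before

headers = {
    "API_KEY": "[YOUR 64 CHARACTER KEY]",  # must match a row in API_KEYS
    "GETURL": "https://www.linkedin.com/feed/update/urn:li:share:0000000000000000000",
    "GETIDENTIFIER": "Web Scraper The Code",
}

response = requests.post(MODULE_URL + "visualise", headers=headers)
response.raise_for_status()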

Our last two templates are ‘maxViews’ and ‘maxReacts’. These hold the GET handlers that return the post with the most views and the post with the most reactions respectively. Create these templates as before, and create a GET handler in each.

As these are both highly similar GET methods, I will provide the code for both and then discuss the general workings while pointing out the differences.

--maxViews
SELECT URL
FROM HUGH_INTERN.POST_TABLE
JOIN HUGH_INTERN.STAT_TABLE ON URL=POST_URL
WHERE POST_VIEWS = (SELECT MAX(POST_VIEWS) FROM HUGH_INTERN.STAT_TABLE)

--maxReacts
SELECT URL
FROM HUGH_INTERN.POST_TABLE
JOIN HUGH_INTERN.STAT_TABLE ON URL=POST_URL
WHERE POST_REACTS = (SELECT MAX(POST_REACTS) FROM HUGH_INTERN.STAT_TABLE)

As you can see, both GET methods query your POST_TABLE joined to your STAT_TABLE to find which post has the most views or reactions (respectively) and return the URL of the post they find.

Python Code (The Web Scraping Bit)

The Python code is relatively lengthy (as is this article), so you can find it on my GitHub to peruse. I will ensure it is commented, but I will highlight some of its key components here (along with the relevant line numbers).

The key edits to get the python script working are as follows:

  • On line 11, replace [Full URL Here] with the full URL to your module. As below, click on the module name in the dropdown menu, then click the copy button beside the Full URL to copy it.
  • Edit the .env file that accompanies the Python file. Replace [YOUR PASSWORD HERE] with your LinkedIn password, and [YOUR EMAIL ADDRESS HERE] with the email address tied to your LinkedIn account (see the sketch after the image below).
Image showing how to copy full module url
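
To give a feel for how those two edits fit together, here is a minimal sketch of the kind of start-up code involved, assuming the python-dotenv package is used to read the .env file; the key names are placeholders, so check the actual script on GitHub for the real ones:

import os
from dotenv import load_dotenv  # assumes the python-dotenv package is installed

# Load LinkedIn credentials from the .env file sitting next to the script
load_dotenv()
linkedin_email = os.getenv("EMAIL")        # placeholder key names -- use whichever
linkedin_password = os.getenv("PASSWORD")  # keys the .env file actually defines

# The line 11 edit: the Full URL copied from your APEX module
MODULE_URL = "[Full URL Here]"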

Automation

So how are we going to automate the scraping process? Through the built-in cron daemon, by writing an entry to the crontab file.

On your Web Scraper VM (if you’re following along my tutorials), save the python and .env files in the same directory (we’ll return to the html file in the next section). Make note of the path to the python file. This can easily be done by running the following command:

readlink -f [PYTHON FILE NAME]

Copy the path and we’ll add it to the crontab. To do that use the following command:

sudo EDITOR=nano crontab -e

The ‘EDITOR=nano’ is not necessary, but nano is my text editor of choice, so I am being explicit to ensure that is the one crontab opens in. Add the following lines to the file that opens:

PATH = [$PATH]:[Path to gecko]
0 2 * * * /usr/bin/python3 [Path to Python] >> [Path to Log] 2>&1

The ‘0 2 * * *’ schedule runs the script at 02:00 every night. [Path to Python] has already been explained above. [Path to Log] is simply a file in which to save the output of the Python script: any print statements, exceptions encountered, and so on. You can decide on this yourself; you could simply save it to the same directory as the Python file. Remember the path needs to include the name of the log file as well; something along the lines of ‘webscraper.log’ would suffice.

Now for the PATH line: [$PATH] is to be replaced by your path variable, which you can obtain by running the following:

echo $PATH

Simply copy the output of this and use it to replace [$PATH]. The [Path to gecko] is to be replaced by the path to your geckodriver, which was installed as part of my tutorial on setting up the cloud environment. If you’ve followed that tutorial it should already be in your path variable, but adding it explicitly in the crontab ensures the Python script can find it.

Basic Web Page

Now, finally, we will look at setting up a very basic web page to view some of the results of our collection. The file for this is found on my GitHub along with the Python and .env files. Much like the Python file, I won’t place the code in this article, but will simply discuss the few edits you need to make.

On lines 7 and 63 replace [MODULE_URL HERE] with your module URL as before. Then simply save the file to the location for your server’s web pages. If using the default Apache setup, as set up here, it should be ‘/var/www/html/’. Save it as ‘index.html’ and your web page should be good to go.

It will display your most viewed post, most reacted-to post, most recent post, and a form to add posts to your database, laid out on a 2x2 grid.

I realise that this article has run a little longer than intended, but it brings my multi-part web scraper tutorial to a close. I hope you enjoyed it!

Questions, comments? You can find me over on LinkedIn.

* All views are my own and not that of Oracle *
