Cloud Web Scraper — The Web Server

Hugh Gallagher
Published in Analytics Vidhya
5 min read · Jan 19, 2020

This is part 2 of my web scraping series, but it can be taken independently too. If you’re new here: welcome! If you’re here after part 1: welcome back!

We’ll be setting up a web server on a virtual machine (VM) to visualise the data collected by our web scraper, though the same setup works for any kind of HTML-based website.

GIF of a person visualising a web page

This setup is specifically on Oracle Linux 7.7, which will only affect some naming conventions when installing packages (when compared to Ubuntu-like distros), such as using ‘yum’ instead of ‘apt-get’.

Server Setup

Having set up your VM, either to your own specifications or by following the setup process from my previous article, make sure it is currently running. Next, we will gain remote access to it through SSH. If you don’t know how, check out this short article I’ve written.

Now we want to install the package that runs our web server. This is very straightforward and only requires one command:

$ sudo yum install httpd

Once installed, it’s time to begin configuring the web server. To start, we need to find the location of the configuration file (httpd.conf). By default this should be ‘/etc/httpd/conf/’, but it could differ for you. To be sure, run this command (with a capital ‘V’):

$ httpd -V

The result will be the version of the server installed, as well as information on several defined variables. You want to look for the line:

 -D HTTPD_ROOT="[location of root here]"

Within the “[location of root here]” directory you should find the ‘conf’ directory, and within that the ‘httpd.conf’ file. There are a few key lines here to edit or add. Remember to prepend ‘sudo’ when opening the file for editing, e.g. ‘$ sudo nano [location of root here]/conf/httpd.conf’. If you’re using the ‘nano’ text editor, you can press ‘ctrl+w’ to open a search bar and find the necessary sections.
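If you would rather not pick through the ‘httpd -V’ output by eye, the lookup can be sketched in shell. This is only a sketch: ‘httpd_define’ is a helper name made up for illustration, and it assumes the output keeps the ‘-D NAME="value"’ layout shown above.

```shell
# Hypothetical helper: read `httpd -V` output on stdin and print the value
# of the named -D variable (for example HTTPD_ROOT).
httpd_define() {
    sed -n "s/^ *-D $1=\"\(.*\)\"\$/\1/p"
}

# Only query the server if httpd is actually installed on this machine.
if command -v httpd >/dev/null 2>&1; then
    root=$(httpd -V 2>/dev/null | httpd_define HTTPD_ROOT)
    # SERVER_CONFIG_FILE is usually given relative to HTTPD_ROOT (assumption).
    conf=$(httpd -V 2>/dev/null | httpd_define SERVER_CONFIG_FILE)
    echo "Configuration file: $root/$conf"
fi
```

On a default install this should point at the same ‘httpd.conf’ described above, but treat the printed path as a hint and confirm that the file exists before editing it.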

Under the block of commented text beginning “Listen: Allows you to bind Apache…”, add or edit the line to read ‘Listen 0.0.0.0:80’. If you’d prefer, you can listen on a port other than the default of 80. If you do, make sure nothing else is already listening on that port, and keep a note of the port you chose for later.

Next, scroll down to (or search for) the block of comments beginning “DocumentRoot: The directory out of which…”. Add or edit the first line following this to read ‘DocumentRoot "/var/www/html"’ (with plain straight quotes around the path). This is the directory we will place our HTML files into. Next, create this directory, if it does not already exist, with the following:

$ sudo mkdir -p /var/www/html
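To double-check the two edits, it can help to print only the active (uncommented) ‘Listen’ and ‘DocumentRoot’ lines. The following is a minimal sketch: ‘show_directives’ is a hypothetical helper, and the path assumes the default location mentioned earlier.

```shell
# Hypothetical helper: print the active (uncommented) Listen and
# DocumentRoot lines from the config file given as the first argument.
show_directives() {
    grep -E '^[[:space:]]*(Listen|DocumentRoot)[[:space:]]' "$1"
}

# Assumed default location; substitute the path you found with `httpd -V`.
conf=/etc/httpd/conf/httpd.conf
if [ -r "$conf" ]; then
    show_directives "$conf"
fi
```

You should see exactly one line for each directive. Running ‘sudo httpd -t’ afterwards asks the server to check the syntax of the whole file.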

For the sake of testing, add an “index.html” file to the html directory. If you don’t have one that you want to use, just copy this:

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
Hello, this is a test file
</body>
</html>

and add it to a new file as follows:

$ cd /var/www/html
$ sudo nano index.html
[right click the mouse to paste the copied text]
[press ctrl+x, then y when prompted to save, then Enter to confirm the file name]
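If copying and pasting through the terminal is awkward, the same file can be written without an editor. This is a sketch only: ‘test_page’ is a hypothetical helper that prints the test page from this article.

```shell
# Hypothetical helper: print the test page from this article to stdout.
test_page() {
    cat <<'EOF'
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
Hello, this is a test file
</body>
</html>
EOF
}
```

Because ‘/var/www/html’ is owned by root, pipe the output through ‘sudo tee’ instead of redirecting it directly: ‘test_page | sudo tee /var/www/html/index.html > /dev/null’.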

Now we have the web server set up, with a test file in place. Let’s check that the server is actually running. Run the command ‘service httpd status’ and look for the returned line beginning “Active:”. It will tell you that the service is either “inactive” or “active (running)”. If it is inactive, simply run ‘sudo service httpd start’, then run the status command again. If it was already running while you edited ‘httpd.conf’, run ‘sudo service httpd restart’ so that your changes are picked up.
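The status check can also be scripted. Here is a rough sketch, assuming the status output contains an “Active:” line as described above; ‘service_state’ is a made-up helper name.

```shell
# Hypothetical helper: read `service httpd status` output on stdin and
# print the first word after "Active:" (for example "active" or "inactive").
service_state() {
    awk '$1 == "Active:" { print $2; exit }'
}

# Only run the check on a machine where httpd is installed.
if command -v httpd >/dev/null 2>&1; then
    state=$(service httpd status 2>/dev/null | service_state)
    echo "httpd is currently: ${state:-unknown}"
fi
```

Anything other than “active” means the start command above still needs to be run.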

Changes to Security List

Now that we have the server in place, let’s head over to the Oracle cloud site (if you’re following on from my part 1 of this series) to change a few security settings so that the web pages on your server can be accessed by the outside world. This is necessary if, like me, you want to be able to view your web page(s) through your computer’s browser.

Image showing the dropdown menu on the Oracle cloud website

Once you’re back and logged in, open up the dropdown menu from the top left of the screen. Move down to ‘Networking’ and click on ‘Virtual Cloud Networks’. Unless you’ve made any changes yourself, this will bring you to a table with one network listed — something along the lines of ‘VirtualCloudNetwork-…’ followed by a string of numbers relating to its creation date and time. Click on this, then ‘Public Subnet’ listed on the next page, and finally on ‘Default Security List for [name of network here]’ on the page after that.

Image of add ingress rules menu
New Ingress Rule Settings

Now this is what we’re looking for: the ‘Ingress Rules’ table, and the ‘Add Ingress Rules’ button. Click that button now.

Our new rule is going to allow external access on port 80, or whichever port you chose above to listen on. This lets the server provide the web page we created to anyone who has the VM’s IP address. Just copy the settings provided in the image above/to the left (depending on your device), changing the port number as necessary. Then hit ‘Add Ingress Rule’ and your network is sorted.

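Finally, before we can access the web server, head back to your VM and run the following commands:

$ sudo firewall-cmd --permanent --add-service=http
$ sudo firewall-cmd --reload

The first command allows the VM to accept HTTP traffic, and ‘--permanent’ makes the rule persist between VM restarts. A permanent rule is only written to the saved configuration, so the second command reloads the firewall to apply it straight away. If you chose a port other than 80, use ‘--add-port=[your port]/tcp’ in place of ‘--add-service=http’.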
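If you changed the listening port earlier, the firewall rule has to match it. The sketch below picks the right ‘firewall-cmd’ option for a given port; ‘firewall_rule’ is a hypothetical helper, and it assumes the server speaks plain HTTP over TCP.

```shell
# Hypothetical helper: print the firewall-cmd option that opens the given
# port. Port 80 is covered by the predefined "http" service; any other
# port is opened explicitly as a TCP port.
firewall_rule() {
    if [ "$1" = "80" ]; then
        echo "--add-service=http"
    else
        echo "--add-port=$1/tcp"
    fi
}
```

For example, ‘sudo firewall-cmd --permanent "$(firewall_rule 8080)"’ followed by ‘sudo firewall-cmd --reload’ would open port 8080, and ‘sudo firewall-cmd --list-all’ shows what is currently allowed.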

Having covered the VM and database setup procedures, and setting up the web server, my next article will (finally) get to the web scraping! Stay tuned for that in a couple of weeks.

Questions, comments? You can find me over on LinkedIn.

* All views are my own and not those of Oracle *
