How to Create a Cronjob to Scrape the Latest Covid-19 Data Using PHP, cURL and DOMXPath

Hanief
Remote Worker Indonesia
4 min read · Sep 13, 2021

Good day friends!

Today we will explore how to create a simple application that will retrieve the latest COVID-19 data daily from the worldometers.info site.

In this article, we will use my old programming language pal, PHP, together with its DOMXPath class. The script will then run automatically through cron (the scheduler) every day at midnight (00:00) and noon (12:00).

Okay, the first step is to analyze the website the data will be taken from (https://www.worldometers.info/coronavirus) and how its data is structured. Let’s look at it together.

Based on the data shown on the website, we will create a simple application that retrieves the latest COVID-19 data for the country requested by the user, cleans it up, and saves it in the database. To finish, we will automate the PHP script with a cronjob so the code runs automatically at midnight and noon.

Okay, next let’s look at the HTML code and work out how to get the data we want from the site.

To get the XPath of an HTML element, you can work it out manually or use the Chrome Developer Tools feature called “Select an element in the page to inspect it”. To use it, press Ctrl + Shift + C and then click on the page (in this case I clicked on the USA row). Chrome Developer Tools will open and show the HTML elements of the website. Right-click on the “<tr>” element, select “Copy”, and then click one of the options. This is what I got using Chrome Developer Tools.

Copy > Copy selector

#main_table_countries_today > tbody:nth-child(2) > tr:nth-child(5)

Copy > Copy XPath

//*[@id="main_table_countries_today"]/tbody[1]/tr[5]

Copy > Copy full XPath

/html/body/div[3]/div[3]/div/div[6]/div[1]/div/table/tbody[1]/tr[5]

We will combine these with some manual analysis, so here is the XPath we will use:

//table[@id='main_table_countries_today']/tbody/tr

After we get each “tr” row, we will split it on its “td” tags like this (each number represents a column in the table):

  1. rank (covid rank cases)
  2. country name
  3. total cases
  4. new cases
  5. total deaths
  6. new deaths
  7. total recovered
  8. etc…

Ok, for the second step, let’s start coding.

First, we will create a function that fetches the website’s HTML using PHP and cURL.
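The original snippet is not shown here, so below is a minimal sketch of what such a function could look like. The name fetch_html, the timeout, and the user agent string are my own assumptions for illustration, not necessarily what the project uses.

```php
<?php
/**
 * Download a page and return its HTML, or null when the request fails.
 * Hypothetical helper: the name and the option values are illustrative.
 */
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 30,    // assumed limit, in seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; covid-scraper-example)',
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and non-200 responses as a failure.
    if ($body === false || $status !== 200) {
        return null;
    }
    return $body;
}
```

Returning null on failure lets the caller decide whether to retry or simply skip this run.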

Call the function with the URL https://www.worldometers.info/coronavirus as its parameter, and load the HTML we get back into a DOMDocument.
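Here is a rough sketch of the loading step. The helper name html_to_xpath is hypothetical, and to keep the example runnable offline it parses a tiny hard-coded page instead of the real download. Real-world HTML is rarely valid, so libxml warnings are silenced while parsing.

```php
<?php
/**
 * Parse an HTML string and return a DOMXPath object for querying it.
 * Hypothetical helper name, used only for this illustration.
 */
function html_to_xpath(string $html): DOMXPath
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($dom);
}

// Stand-in for the downloaded page (placeholder content, not real data).
$sample = '<html><body><table id="main_table_countries_today">'
        . '<tbody><tr><td>1</td></tr></tbody></table></body></html>';

$xpath = html_to_xpath($sample);
$rows  = $xpath->query("//table[@id='main_table_countries_today']/tbody/tr");
echo $rows->length, PHP_EOL; // number of rows found in the sample
```

In the real script, the string passed in would be the HTML returned by the cURL function.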

Next, let’s read the parameter the user passes in. Because I run the script from the command line, it comes from argv. In this code, the parameter is country_name.
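Something like the following sketch would do it. The helper country_from_args is a name I made up for illustration. It takes the argument list as a parameter, so it can be tried without actually passing CLI arguments.

```php
<?php
/**
 * Pick the country name out of a CLI argument list.
 * Hypothetical helper: $args[0] is the script path, $args[1] the first user argument.
 */
function country_from_args(array $args): ?string
{
    $name = trim($args[1] ?? '');
    return $name === '' ? null : $name;
}

// $argv is filled in by PHP when the script runs from the command line.
$country = country_from_args($argv ?? []);

echo $country === null
    ? 'Usage: php covid.php <country_name>' . PHP_EOL
    : "Requested country: {$country}" . PHP_EOL;
```

In a real script you would probably stop with a non-zero exit code when the country is missing, so cron can report the failure.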

Let’s get the data, put it into an array, and print it using the var_dump function.
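As a sketch of this step: the function below walks the table rows with DOMXPath, splits each row into its “td” cells, and returns the row whose country column matches. The function name, the constant, and the sample row (made-up placeholder values, not real statistics) are assumptions for illustration. The keys follow the var_dump output shown in this article.

```php
<?php
// Keys for the first seven columns, in table order (assumed from the article's output).
const COVID_COLUMNS = [
    'country_rank', 'country_name', 'total_cases', 'new_cases',
    'total_deaths', 'new_deaths', 'total_recovered',
];

/** Return the raw cell texts for one country, or null when it is not found. */
function find_country_row(DOMXPath $xpath, string $country): ?array
{
    $rows = $xpath->query("//table[@id='main_table_countries_today']/tbody/tr");
    foreach ($rows as $row) {
        $cells = [];
        foreach ($xpath->query('./td', $row) as $td) {
            $cells[] = trim($td->textContent);
        }
        // Column 2 (index 1) holds the country name; compare case-insensitively.
        if (count($cells) >= count(COVID_COLUMNS) && strcasecmp($cells[1], $country) === 0) {
            return array_combine(COVID_COLUMNS, array_slice($cells, 0, count(COVID_COLUMNS)));
        }
    }
    return null;
}

// Placeholder table with invented numbers, standing in for the real page.
$html = '<table id="main_table_countries_today"><tbody>'
      . '<tr><td>1</td><td>Exampleland</td><td>1,234</td><td>+56</td>'
      . '<td>78</td><td>+9</td><td>1,000</td></tr>'
      . '</tbody></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

var_dump(find_country_row(new DOMXPath($dom), 'exampleland'));
```

At this point every value is still the raw text of its cell.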

Let’s run it from the command line:

php /var/www/remoteworker/covid.php USA

This is the result we got:

array(7) {
["country_rank"]=>
string(1) "1"
["country_name"]=>
string(3) "USA"
["total_cases"]=>
int(41854465)
["new_cases"]=>
int(1103)
["total_deaths"]=>
int(678001)
["new_deaths"]=>
int(13)
["total_recovered"]=>
int(31871869)
}

Ok, looks good.

For the next step, let’s parse the data to make it ‘clean’ and then store it in the database.

Oops, I forgot: let’s create the database and the table for the data first.
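The article does not say which database engine is used, so this sketch assumes SQLite through PDO (in memory, so it runs anywhere). The table name covid_stats, the column types, and the fetched_at column are my own assumptions. For MySQL you would change the DSN and adjust the types.

```php
<?php
/** Create the table that stores one row per scrape (hypothetical schema). */
function create_covid_table(PDO $pdo): void
{
    $pdo->exec(
        'CREATE TABLE IF NOT EXISTS covid_stats (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            country_rank    INTEGER,
            country_name    TEXT    NOT NULL,
            total_cases     INTEGER NOT NULL DEFAULT 0,
            new_cases       INTEGER NOT NULL DEFAULT 0,
            total_deaths    INTEGER NOT NULL DEFAULT 0,
            new_deaths      INTEGER NOT NULL DEFAULT 0,
            total_recovered INTEGER NOT NULL DEFAULT 0,
            fetched_at      TEXT    NOT NULL
        )'
    );
}

// Assumption: an in-memory SQLite database, used only for this example.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
create_covid_table($pdo);
```

Keeping a timestamp column makes it possible to compare the midnight and noon runs later.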

Ok, let’s create a function to parse the data first, then put the data into the database.
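A minimal sketch of such a parser, assuming the raw cells look like "41,853,362" or "+1,103" (thousands separators, an optional plus sign, and sometimes an empty cell). The function names to_int and clean_row are hypothetical.

```php
<?php
/** Turn a formatted count into an integer; empty or non-numeric cells become 0. */
function to_int(string $raw): int
{
    // Keep the digits only: "+1,103" becomes "1103".
    $digits = preg_replace('/\D+/', '', $raw);
    return $digits === '' ? 0 : (int) $digits;
}

/** Clean one scraped row: numeric columns become ints, the name is trimmed. */
function clean_row(array $row): array
{
    $numeric = ['total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'total_recovered'];
    foreach ($numeric as $key) {
        $row[$key] = to_int((string) ($row[$key] ?? ''));
    }
    $row['country_name'] = trim((string) ($row['country_name'] ?? ''));
    return $row;
}

// The string formatting of this input row is assumed for illustration.
var_dump(clean_row([
    'country_rank'    => '1',
    'country_name'    => 'USA',
    'total_cases'     => '41,853,362',
    'new_cases'       => '',
    'total_deaths'    => '677,988',
    'new_deaths'      => '',
    'total_recovered' => '31,871,868',
]));
```

Because the counts can never be negative, stripping every non-digit character is enough here. It would not be safe for columns that may hold signed or decimal values.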

The data are cleaner than before.

array(7) {
'country_rank' =>
string(1) "1"
'country_name' =>
string(3) "USA"
'total_cases' =>
int(41853362)
'new_cases' =>
int(0)
'total_deaths' =>
int(677988)
'new_deaths' =>
int(0)
'total_recovered' =>
int(31871868)
}

Let’s put the data into the database.
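As a sketch of the insert, here is a prepared statement through PDO. It again assumes an in-memory SQLite database and a hypothetical covid_stats table and save_row function. The row values are the cleaned USA numbers shown above.

```php
<?php
/** Insert one cleaned row, stamped with the current time (hypothetical helper). */
function save_row(PDO $pdo, array $row): void
{
    $columns = [
        'country_rank', 'country_name', 'total_cases', 'new_cases',
        'total_deaths', 'new_deaths', 'total_recovered',
    ];
    $params = [];
    foreach ($columns as $column) {
        $params[$column] = $row[$column];
    }
    $params['fetched_at'] = date('Y-m-d H:i:s');

    // Column names come from the fixed list above; values are bound as parameters.
    $names = array_keys($params);
    $sql = sprintf(
        'INSERT INTO covid_stats (%s) VALUES (:%s)',
        implode(', ', $names),
        implode(', :', $names)
    );
    $pdo->prepare($sql)->execute($params);
}

// Assumption: in-memory SQLite with a table created just for this example.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec(
    'CREATE TABLE covid_stats (
        country_rank INTEGER, country_name TEXT, total_cases INTEGER, new_cases INTEGER,
        total_deaths INTEGER, new_deaths INTEGER, total_recovered INTEGER, fetched_at TEXT
    )'
);

save_row($db, [
    'country_rank' => '1', 'country_name' => 'USA', 'total_cases' => 41853362,
    'new_cases' => 0, 'total_deaths' => 677988, 'new_deaths' => 0,
    'total_recovered' => 31871868,
]);

print_r($db->query('SELECT * FROM covid_stats')->fetchAll(PDO::FETCH_ASSOC));
```

Binding the values as parameters keeps the scraped text from ever being interpreted as SQL.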

Here’s how our code looks.

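For the last step, let’s create the cronjob so the code runs automatically at midnight and noon. The entries below include a user field (root), which is the format used by /etc/crontab and files under /etc/cron.d; in a per-user crontab (crontab -e), leave that field out.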

0 0 * * * root /usr/bin/php /var/www/remoteworker/covid.php USA
0 12 * * * root /usr/bin/php /var/www/remoteworker/covid.php USA

I will post the results once the job has collected some data.

By the way, you can find this project’s repository on my GitHub here.

Thank you for taking the time to read. Join us again next time as we explore another interesting case! 😉
