Learn web scraping using PHP!

What is web scraping?
Web scraping, also termed data extraction, is a form of data mining: a technique for collecting large amounts of data or information from across the internet.

Scraping data from websites has been controversial, and not only for legal reasons: the terms of service of some websites do not allow certain kinds of data mining.

My aim is to teach you how to extract data from websites, not to steal data from someone's website.

What will you learn in this tutorial?

  • Data extraction from websites
  • Storing data in a MySQL database
  • PHP classes for downloading a document and extracting data from a DOMDocument
  • Automating the script

Prerequisites:

  • Basic knowledge of PHP
  • MySQL

Step 1: Creating a database and a table to store data

1.1 Create a database `web_scraping`.

1.2 Create a table `site_data`:

CREATE TABLE IF NOT EXISTS `site_data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(50) NOT NULL,
  `author` varchar(50) NOT NULL,
  `tags` varchar(50) NOT NULL,
  `recent_posts` varchar(250) NOT NULL,
  `entry_date` date NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `entry_date` (`entry_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Step 2: Create a Db.php file and insert the following code:

<?php
class Db {
    private $conn;
    public $username = "root";
    public $dbname = "web_scraping";
    public $password = "krd123";
    public $host = "localhost";

    public function getDbconnection() {
        $this->conn = null;
        try {
            $this->conn = new PDO("mysql:host=" . $this->host . ";dbname=" . $this->dbname, $this->username, $this->password);
        } catch (PDOException $e) {
            echo "Error occurred while connecting to db: " . $e->getMessage();
        }
        return $this->conn;
    }
}
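The DSN (connection string) the class builds for PDO can be previewed on its own. A minimal sketch, assuming the same host and database name as in Step 1.1:

```php
<?php
// The same values the Db class is configured with (assumed defaults).
$host = "localhost";
$dbname = "web_scraping";

// This is exactly the connection string passed to the PDO constructor.
$dsn = "mysql:host=" . $host . ";dbname=" . $dbname;
echo $dsn . "\n"; // prints "mysql:host=localhost;dbname=web_scraping"
```

If the connection fails, getDbconnection() catches the PDOException and returns null, so callers should check for that before using the handle.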

Step 3: Creating a WebScraping class. Insert this code into a WebScraping.class.php file:

<?php
class WebScraping
{
    // Declaring class variables
    public $url;
    public $source;
    public $pathObj;

    // Constructor, called on instantiation of the object
    function __construct($url) {
        // Setting the URL attribute
        $this->url = $url;
        // Downloading the page source with our cURL method
        $this->source = $this->getCurl($this->url);
        // Building an XPath object from the downloaded source
        $this->pathObj = $this->getXPathObj($this->source);
    }

    // Method for making a GET request using cURL
    public function getCurl($url) {
        // Initialising the cURL session
        $ch = curl_init();
        // Returning the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // Setting the URL
        curl_setopt($ch, CURLOPT_URL, $url);
        // Executing the cURL session
        $results = curl_exec($ch);
        // Closing the cURL session
        curl_close($ch);
        // Returning the results
        return $results;
    }

    // Method to get an XPath object
    public function getXPathObj($item) {
        // Instantiating a new DOMDocument object
        $xmlPageDom = new DOMDocument();
        // Loading the HTML from the downloaded page
        // (@ suppresses warnings about malformed HTML)
        @$xmlPageDom->loadHTML($item);
        // Instantiating a new DOMXPath object
        $xmlPageXPath = new DOMXPath($xmlPageDom);
        return $xmlPageXPath;
    }
}

Here we have created two methods, getCurl() and getXPathObj().

getCurl() makes a GET request using cURL and returns the HTML document as a string.

getXPathObj() instantiates a DOMDocument and parses $item (the HTML document) with the loadHTML function. The resulting $xmlPageDom is passed to DOMXPath, which lets us query for selected data within the HTML document.

If you are scraping an XML page, use the loadXML function instead of loadHTML.
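To see how such XPath queries work in isolation, here is a minimal sketch that parses an inline HTML string instead of a downloaded page, so it does not depend on any URL or network access:

```php
<?php
// A small HTML fragment standing in for a downloaded page.
$html = '<html><body>
    <h1 class="entry-title">Getting started with GitHub</h1>
    <ul class="post-tags"><li><a>git</a></li><li><a>github</a></li></ul>
</body></html>';

// The same parsing steps as getXPathObj().
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// A single value: the text content of the first matching node.
$title = $xpath->query('//h1[@class="entry-title"]')->item(0)->nodeValue;
echo $title . "\n"; // prints "Getting started with GitHub"

// Multiple values: iterate over the whole node list.
$tags = array();
foreach ($xpath->query('//ul[@class="post-tags"]/li/a') as $a) {
    $tags[] = $a->nodeValue;
}
echo implode(",", $tags) . "\n"; // prints "git,github"
```

The same two patterns, `item(0)->nodeValue` for one value and a foreach over the node list for many, are used in the scraper script below.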

Step 4: Retrieving data from the HTML document and inserting it into the database.

We will retrieve selected data, such as the post's author name, title, release date, tags, and the recent posts, all of which can be seen at the URL "https://beingjaydesaicom.wordpress.com/2016/09/15/getting-started-with-github/".

Insert this code into scrapper.php:

<?php
require_once("WebScraping.class.php");
require_once("Db.php");

// Initializing the database object
$database = new Db;
$db = $database->getDbconnection();

// Initializing the WebScraping object, which downloads the page and builds the XPath object
$PostsData = new WebScraping("https://beingjaydesaicom.wordpress.com/2016/09/15/getting-started-with-github/");

// query() evaluates an XPath expression against the HTML document;
// item(0)->nodeValue gives the text of the first matching node.

// Getting the post's title
$posts_title = $PostsData->pathObj->query('//h1[@class="entry-title"]')->item(0)->nodeValue;

// Getting the post's author
$posts_author = $PostsData->pathObj->query('//footer[@class="entry-footer"]/ul[@class="post-meta"]/li[@class="author vcard"]/a')->item(0)->nodeValue;

// Getting the post's release date
$posts_Releasedate = $PostsData->pathObj->query('//footer[@class="entry-footer"]/ul[@class="post-meta"]/li[@class="posted-on"]/time[@class="entry-date published"]')->item(0)->nodeValue;

// Getting the names of all the recent posts.
// Note: we haven't used item(0)->nodeValue here, as we need the data from every <a> tag.
$recent_posts = $PostsData->pathObj->query('//div[@class="widget-area"]/aside[@class="widget widget_recent_entries"]/ul/li/a');
$all_recent_posts = array();
if (!is_null($recent_posts)) {
    foreach ($recent_posts as $post) {
        $all_recent_posts[] = $post->nodeValue;
    }
}

// Getting all the tags of the post, again iterating over every <a> tag.
$tags = $PostsData->pathObj->query('//footer[@class="entry-footer"]/div[@class="meta-wrapper"]/ul[@class="post-tags"]/li/a');
$posts_tags = array();
if (!is_null($tags)) {
    foreach ($tags as $tag) {
        $posts_tags[] = $tag->nodeValue;
    }
}

// Using implode() to get comma-separated values of the tags and recent posts
// before inserting them into the database columns.
$inn_tags = implode(",", $posts_tags);
$ins_recent_posts = implode(",", $all_recent_posts);

// Converting the release date into Y-m-d format to save it in the MySQL date column
$entry_date = date("Y-m-d", strtotime($posts_Releasedate));

// Insert query
$insert_db_query = "INSERT INTO site_data SET
    title = :title, author = :author, recent_posts = :recent_posts, tags = :tags, entry_date = :release";

// Preparing the query
$exec = $db->prepare($insert_db_query);

// Setting the inputs and sanitizing them properly
$title = htmlspecialchars(strip_tags($posts_title));
$release = htmlspecialchars(strip_tags($entry_date));
$author = htmlspecialchars(strip_tags($posts_author));
$al_recent_posts = htmlspecialchars(strip_tags($ins_recent_posts));
$al_tags = htmlspecialchars(strip_tags($inn_tags));

// Binding the parameters
$exec->bindParam(":title", $title);
$exec->bindParam(":release", $release);
$exec->bindParam(":author", $author);
$exec->bindParam(":recent_posts", $al_recent_posts);
$exec->bindParam(":tags", $al_tags);

if ($exec->execute()) {
    echo "Data inserted into db";
} else {
    echo "<pre>";
    print_r($exec->errorInfo());
    echo "</pre>";
}
?>
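The implode() and date conversion steps from the script can be sketched in isolation. A minimal example, with hard-coded values standing in for the scraped node values:

```php
<?php
// Hypothetical values standing in for scraped data.
$posts_tags = array("git", "github", "version control");
$posts_Releasedate = "September 15, 2016";

// Joining the tag list into one comma-separated string,
// so it fits in a single varchar column.
$inn_tags = implode(",", $posts_tags);
echo $inn_tags . "\n"; // prints "git,github,version control"

// strtotime() parses the human-readable date; date() reformats it
// as Y-m-d, which matches MySQL's DATE column format.
$entry_date = date("Y-m-d", strtotime($posts_Releasedate));
echo $entry_date . "\n"; // prints "2016-09-15"
```

Storing lists as comma-separated strings keeps the schema simple for a tutorial; in a production scraper you would normally use a separate tags table with a foreign key instead.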

Step 5: Automating the script (for Linux users)

You will often want to collect large amounts of information on a daily basis. Suppose you run an e-commerce website and want to extract product information every day: a cron job is very helpful here, as it runs without you having to execute the script manually every single day.

cron comes preinstalled on most Linux distributions; if it is missing, install it through your distribution's package manager.

5.1 Open the terminal.

5.2 Run the command "crontab -e".

5.3 If it asks you which editor you want to use, pick one (option 2 is a reasonable default).

5.4 Add the following line at the bottom of the file:

0 18 * * * /usr/bin/php -f /var/www/html/web-scraping/scrapper.php >> /var/www/html/web-scraping/log.txt

This will execute your script every day at 6 pm, and log.txt will collect the script's output.

And you are done!!

You can download the complete code from here.

You have now accomplished the following goals of the tutorial:

  • Retrieving dynamic data from a web page
  • Inserting data into a database


Please share if you like this tutorial!!

Cheers!!