How to crawl a website for fun and profit

What is crawling and why crawl?

Crawling is what search engine bots do: they visit websites, searching for content to index, mine, and so on. It is necessary if you want to build a search engine, but it can also be exploited for small scale tasks. The most obvious and useful one is automatic SEO, or Search Engine Optimization. If you don't know what SEO is, it is simply a set of techniques that webmasters apply to their websites to optimize them for search engines, so they get a better rank. Another common task is generating a sitemap for a particular website. In this article I will show how to crawl a website using the PHP language and some helpful packages.

How to crawl a website?

Instead of reinventing the wheel and trying to write crawling code from scratch, we're going to use some battle tested PHP libraries.

Goutte

Goutte is a PHP library for crawling and scraping the web. It is a wrapper around Guzzle and some powerful Symfony components:

BrowserKit, DomCrawler and CssSelector

Those components make tasks such as working with the DOM, simulating web browsers, and selecting HTML elements easier and more fun.

So, first things first: download Goutte (the goutte.phar archive) from GitHub,

or install it using Composer (the PHP package manager):

composer require fabpot/goutte

If you downloaded the .phar archive instead, require it directly in your script; with Composer, require the generated autoloader:

require_once 'goutte.phar';
// or, with Composer:
// require 'vendor/autoload.php';

use Goutte\Client;

$url_to_crawl = 'http://www.techiediaries.com';

$client = new Client();

$crawler = $client->request('GET', $url_to_crawl);

This snippet of code requires the Goutte library, then it creates a Client instance and makes a GET request to the url we want to crawl.

If everything is OK, the request() method should return an object of type Symfony\Component\DomCrawler\Crawler.

Now, before continuing, you should check the response status code:

$status_code = $client->getResponse()->getStatus();

if ($status_code == 200) {
    // continue
}

Now you can extract different kinds of information from your document; Goutte supports both XPath and CSS selectors.
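For instance, here is how you could pull the page title and the href attribute of every link, first with CSS selectors and then with the equivalent XPath expression (the selectors are just for illustration):

$title = $crawler->filter('title')->text();

// collect the href attribute of every link on the page
$links = $crawler->filter('a')->each(function ($node) {
    return $node->attr('href');
});

// the same extraction, this time with an XPath expression
$links = $crawler->filterXPath('//a')->each(function ($node) {
    return $node->attr('href');
});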

We need to follow some sort of algorithm to crawl our website. There are many good algorithms; the one we will implement is called Depth limited search, or simply DLS.

Step 1: seed the crawler with the initial url and set a depth limit to avoid following links too deep.

Step 2: make sure the url is not broken, retrieve the initial document, and record the visited url in some data structure (an associative array).

Step 3: extract all links from the retrieved document, then visit every unvisited link, making sure the link is not external and the depth limit is not reached.

Step 4: for each found link, repeat steps 2 to 4.

The algorithm stops when the depth limit is reached.

Now we should be able to easily implement our crawler.

abstract class AbstractCrawler
{
    abstract protected function loop($url, $depth);
}

class Crawler extends AbstractCrawler
{
    protected $baseUrl;
    protected $maxDepth;
    protected $response;
    protected $links;
    protected $client;

    public function __construct($url, $depth = 5)
    {
        $this->baseUrl = $url;
        $this->maxDepth = $depth;
        $this->response = array();
        $this->links = array();
        $this->client = new Client();
    }

    public function crawl()
    {
        if ($this->baseUrl === null) {
            return;
        }
        $url = $this->baseUrl;
        $depth = $this->maxDepth;
        // step 1: seed the crawler with the initial url
        $hash = $this->getPathFromUrl($url);
        $this->response[$hash] = array();
        $this->links[$hash] = array('external_link' => false);
        $this->loop($url, $depth);
    }

    protected function loop($url, $depth)
    {
        $client = $this->client;
        if ($client === null) {
            return;
        }
        try {
            // step 2: retrieve the document and record the status code
            $client->followRedirects();
            $crawler = $client->request('GET', $url);
            $statusCode = $client->getResponse()->getStatus();
            $hash = $this->getPathFromUrl($url);
            $this->response[$hash]['status_code'] = $statusCode;
            if ($statusCode === 200) {
                $content_type = $client->getResponse()->getHeader('Content-Type');
                $childLinks = array();
                // only parse html documents
                if (strpos($content_type, 'text/html') !== false) {
                    $this->extractTitleInfo($crawler, $hash);
                    // step 3: only extract links from internal pages
                    if (isset($this->links[$hash]['external_link']) === true
                        && $this->links[$hash]['external_link'] === false) {
                        $childLinks = $this->extractLinksInfo($crawler, $hash);
                    }
                }
                $this->links[$hash]['visited'] = true;
                // step 4: repeat for each found link, decreasing the depth
                $this->traverseChildren($childLinks, $depth - 1);
            }
        } catch (Exception $e) {
            // ignore unreachable urls and keep crawling
        }
    }

    protected function traverseChildren($links, $depth)
    {
        // stop when the depth limit is reached
        if ($depth === 0) {
            return;
        }
        foreach ($links as $url => $info) {
            $hash = $this->getPathFromUrl($url);
            if (isset($this->links[$hash]) === false) {
                // first time we meet this link: store its info
                $this->links[$hash] = $info;
            } else {
                // merge the info collected for an already known link
                $this->links[$hash] = array_merge($this->links[$hash], $info);
            }
            if (isset($this->links[$hash]['visited']) === false) {
                $this->links[$hash]['visited'] = false;
            }
            // visit every unvisited internal link
            if ($this->links[$hash]['visited'] === false
                && $this->links[$hash]['external_link'] === false) {
                $this->loop($url, $depth);
            }
        }
    }
}
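The class above calls three helper methods, getPathFromUrl(), extractTitleInfo() and extractLinksInfo(), whose bodies are not shown in the snippets. Here is a minimal sketch of how they could look inside the Crawler class; the url normalization and the external link detection logic are my own assumptions, so adapt them to your needs:

protected function getPathFromUrl($url)
{
    // assumption: normalize a url to a key by stripping the base url,
    // so internal pages are stored under their relative path
    if (strpos($url, $this->baseUrl) === 0) {
        return str_replace($this->baseUrl, '', $url);
    }
    return $url;
}

protected function extractTitleInfo($crawler, $hash)
{
    // grab the <title> of the page, if there is one
    if ($crawler->filter('title')->count() > 0) {
        $this->response[$hash]['title'] = trim($crawler->filter('title')->text());
    }
}

protected function extractLinksInfo($crawler, $hash)
{
    $childLinks = array();
    $crawler->filter('a')->each(function ($node) use (&$childLinks) {
        $href = $node->attr('href');
        if ($href === null || strpos($href, '#') === 0) {
            return; // skip empty hrefs and page anchors
        }
        // assumption: a link is external when it is absolute
        // and does not start with the url we are crawling
        $isExternal = (strpos($href, 'http') === 0)
            && (strpos($href, $this->baseUrl) !== 0);
        $childLinks[$href] = array('external_link' => $isExternal);
    });
    return $childLinks;
}

With the helpers in place, crawling a website is just a matter of instantiating the class and reading back the collected data:

$crawler = new Crawler('http://www.techiediaries.com', 3);
$crawler->crawl();
print_r($crawler->getResponse()); // assumes you add a simple getter for $response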
