Crawling a Website Using a Laravel Dusk Spider

Tushar Gugnani
3 min read · Feb 27, 2019

I have created a simple web spider using Laravel Dusk. It follows every link on a website, collects each page's title, content, and status code, and stores them in a local database.

What is Laravel Dusk?

Laravel Dusk is an end-to-end browser automation tool provided by Laravel. This official package can visit your web application, or any other website, in a real browser, much as an actual user would.

Although the primary purpose of Dusk is automated testing, it can also be used for web scraping.

What’s the use of this spider tool?

As an example to get started with scraping, I have created this simple tool that goes through all the pages of a website. For example, you can use it to:

  • Check for broken links on your website.
  • Check the title and meta description of all pages in one go (useful for SEO).
  • Get all the links, titles, and content of a competitor's website.

The possible uses are numerous.

Installation of Laravel Dusk

Installing the Dusk package is fairly simple. Add the Dusk Composer dependency to your Laravel project.

composer require --dev laravel/dusk

Once the dependency is installed, run the Dusk install command, which generates the default scaffolding in your project.

php artisan dusk:install

Preparing Migration and Database Table

Make sure your project is connected to a database. I am using a MySQL database for this project.

To store the crawled data into the database we just need a single table named pages. Let’s generate a model and a migration file for the pages table.

php artisan make:model Page -m

Let’s modify the migration file to include the required columns.

Apart from the self-explanatory columns, status stores the HTTP status code returned for the page URL, and isCrawled tracks whether the page has been crawled yet.
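As a rough sketch, the migration could look something like the following. The title, content, status, and isCrawled columns come from the description above; the url column, the column types, and the nullable/default choices are my assumptions, so adjust them to your needs.

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

class CreatePagesTable extends Migration
{
    public function up()
    {
        Schema::create('pages', function (Blueprint $table) {
            $table->increments('id');
            $table->text('url');                           // assumed column: the page address
            $table->string('title')->nullable();           // filled in once the page is crawled
            $table->longText('content')->nullable();       // page content can be large
            $table->integer('status')->nullable();         // HTTP status code, e.g. 200 or 404
            $table->boolean('isCrawled')->default(false);  // flipped to true after crawling
            $table->timestamps();
        });
    }

    public function down()
    {
        Schema::dropIfExists('pages');
    }
}
```

Run php artisan migrate afterwards to create the table.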

Dusk Spider Test

Let’s start writing the spider by generating a new Dusk test.

php artisan dusk:make duskSpiderTest

This content goes in the duskSpiderTest file.

Let’s walk through the code briefly:

  • Set $startUrl and $domain to match the website you are trying to crawl.
  • The setUp method refreshes the database on each test run.
  • Crawling starts inside the urlSpider test method, which then calls the getLinks method.
  • getLinks recursively processes a URL: it fetches all the links on the current page and adds them to the database table.
  • isValidUrl and trimUrl are helper methods used to check whether a link is valid.
  • Since Dusk does not return HTTP status codes, we use PHP's get_headers function inside the getHttpStatus method to fetch them.
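To make the helper logic concrete, here is a minimal, framework-free sketch of how these pieces could fit together. The names trimUrl, isValidUrl, and getHttpStatus come from the list above, but their bodies are my assumptions rather than the exact code of the test. statusFromHeaders, crawl, and the $fetchLinks callback are hypothetical names introduced only for illustration, and crawl uses an explicit queue and an in-memory list instead of recursion and the pages table.

```php
<?php

// Assumed behaviour: drop the #fragment and trailing slash so a page is not stored twice.
function trimUrl(string $url): string
{
    $withoutFragment = explode('#', trim($url), 2)[0];
    return rtrim($withoutFragment, '/');
}

// Assumed behaviour: accept only absolute http(s) links that belong to the crawled domain.
function isValidUrl(string $url, string $domain): bool
{
    if ($url === '' || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    return in_array($scheme, ['http', 'https'], true) && $host === strtolower($domain);
}

// Hypothetical helper: pull the status code out of the header lines returned by get_headers.
// After redirects there can be several status lines; the last one is the final response.
function statusFromHeaders(array $headers): int
{
    $status = 0;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $match)) {
            $status = (int) $match[1];
        }
    }
    return $status;
}

// Returns 0 when the request fails entirely (for example, a DNS error).
function getHttpStatus(string $url): int
{
    $headers = @get_headers($url);
    return $headers === false ? 0 : statusFromHeaders($headers);
}

// Hypothetical crawl loop. $fetchLinks takes a URL and returns the hrefs found on that page;
// in the Dusk test this is where the browser would visit the page and collect its links.
function crawl(string $startUrl, string $domain, callable $fetchLinks): array
{
    $queue = [trimUrl($startUrl)];
    $seen = [];
    while ($queue) {
        $url = array_shift($queue);
        if (isset($seen[$url])) {
            continue; // already crawled
        }
        $seen[$url] = true;
        foreach ($fetchLinks($url) as $link) {
            $link = trimUrl($link);
            if (isValidUrl($link, $domain) && !isset($seen[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($seen);
}
```

In the real test, each crawled URL would also be saved to the pages table along with its title, content, and the result of getHttpStatus, with isCrawled marking the pages that are already done.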

That’s about it!

You can run the Dusk test from the CLI:

php artisan dusk --filter urlSpiderTest

If you want to watch the test run in the browser, comment out the headless option in the DuskTestCase.php class.
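For reference, the driver method in the scaffolded tests/DuskTestCase.php looks roughly like the sketch below, with the --headless argument commented out. The exact arguments in your file depend on the Dusk version that generated it, so treat this as an approximation rather than a file to copy over yours.

```php
<?php

namespace Tests;

use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Laravel\Dusk\TestCase as BaseTestCase;

abstract class DuskTestCase extends BaseTestCase
{
    use CreatesApplication;

    // Start ChromeDriver before the test suite runs.
    public static function prepare()
    {
        static::startChromeDriver();
    }

    protected function driver()
    {
        $options = (new ChromeOptions)->addArguments([
            '--disable-gpu',
            // '--headless',   // commented out so a visible Chrome window opens
        ]);

        return RemoteWebDriver::create(
            'http://localhost:9515',
            DesiredCapabilities::chrome()->setCapability(
                ChromeOptions::CAPABILITY, $options
            )
        );
    }
}
```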

Here is a short video of the spider crawling my blog in a local environment.

Displaying Results

Details of the crawled pages are available in the pages table in our database.

You can use Eloquent pagination to display the data in a table.
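A minimal sketch of that idea is a route that passes a paginated result to a view. The /pages route, the pages.index view name, and the page size of 25 are hypothetical, and the App\Page namespace assumes the default model location of a Laravel 5.x project.

```php
<?php

// routes/web.php

use App\Page;
use Illuminate\Support\Facades\Route;

Route::get('/pages', function () {
    // paginate() reads the current page number from the ?page= query string.
    $pages = Page::orderBy('id')->paginate(25);

    // 'pages.index' is a hypothetical Blade view that loops over $pages.
    return view('pages.index', ['pages' => $pages]);
});
```

Inside the Blade view, loop over $pages to render the table rows and call {{ $pages->links() }} to render the pagination links.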

Crawled Results via Dusk Spider.

The code is available at

If you want to learn more about Laravel Dusk, you can check out the detailed Udemy course, which covers automation testing, web scraping, and creating browser bots.

Feel free to ask me your questions/bug reports in the comment section 🙂
