I have created a simple web spider using Laravel Dusk. The spider goes through all the links on a website, gets each page’s title, content and status code, and stores them in a local database.
What is Laravel Dusk?
Laravel Dusk is an end-to-end browser automation tool provided by Laravel. This official package can visit your web application, or any other website, in a real browser, much as an actual user would.
Although the primary purpose of Dusk is automated testing, it can also be used for web scraping.
What’s the use of this spider tool?
As an example to get started with scraping, I have created this simple tool that goes through all the pages of a website. You can use it to:
- Check for broken links on your website.
- Check the title and meta description of all pages in one go. (Useful for SEO)
- Get all the links, titles and content of a competitor’s website.
The possible uses are many.
Installation of Laravel Dusk
Installing the Dusk package is fairly simple. Add the Dusk Composer dependency to your Laravel project.
composer require --dev laravel/dusk
Once the dependencies are installed, you can go ahead and install Dusk, which will generate the default scaffolding in your project.
php artisan dusk:install
Preparing Migration and Database Table
Make sure your project is connected to a database. I am using a MySQL database for this project.
To store the crawled data into the database we just need a single table named pages. Let’s generate a model and a migration file for the pages table.
php artisan make:model Page -m
Let’s modify the migration file to include the required columns so that it looks like this
Apart from the obvious column names, status is used to store the HTTP status code returned by the page URL, and isCrawled is used to track whether the page has been crawled or not.
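As a rough sketch, a migration along these lines would cover the columns described above. Only status and isCrawled are named in the text; the url, title and content columns, their types, and the class name CreatePagesTable are my assumptions, so adjust them to match your own project.

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Hypothetical migration for the pages table; column names other than
// status and isCrawled are assumed, not taken from the original listing.
class CreatePagesTable extends Migration
{
    public function up()
    {
        Schema::create('pages', function (Blueprint $table) {
            $table->bigIncrements('id');
            $table->string('url');                               // assumed: address of the page
            $table->string('title')->nullable();                 // assumed: <title> of the page
            $table->longText('content')->nullable();             // assumed: page body text
            $table->unsignedSmallInteger('status')->nullable();  // HTTP status code, e.g. 200 or 404
            $table->boolean('isCrawled')->default(false);        // false until the spider has visited the page
            $table->timestamps();
        });
    }

    public function down()
    {
        Schema::dropIfExists('pages');
    }
}
```

After editing the migration, run php artisan migrate to create the table.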
Dusk Spider Test
Let’s start writing the spider test by generating a new Dusk test.
php artisan dusk:make duskSpiderTest
This content goes in the duskSpiderTest file
Let’s understand the code in brief:
- Specify the $startUrl and $domain for the website you are trying to crawl.
- The setUp method is used to refresh the database on each test run.
- I start the crawling inside the urlSpider test method, which then calls the getLinks method.
- getLinks recursively processes the URL, fetches all the links on the current page and adds them to the database table.
- isValidUrl and trimUrl are helper methods used to check whether a link is valid.
- Since Dusk does not return HTTP status codes, we make use of PHP’s get_headers function to fetch them inside the getHttpStatus method.
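To make the helper methods more concrete, here is a minimal standalone sketch of how they might look in plain PHP. The function names mirror the ones above, but the bodies are my assumptions rather than the exact code from the test class, and parseStatusCode is a hypothetical extra helper I added so the status-line parsing can be checked without a network call.

```php
<?php

declare(strict_types=1);

// Hypothetical standalone versions of the helpers described above.

/** Strip the #fragment and any trailing slash so duplicate links collapse to one URL. */
function trimUrl(string $url): string
{
    $withoutFragment = explode('#', trim($url), 2)[0];

    return rtrim($withoutFragment, '/');
}

/** Accept only absolute http(s) links whose host is the domain being crawled. */
function isValidUrl(string $url, string $domain): bool
{
    $parts = parse_url($url);

    // Relative links, mailto: and javascript: links have no scheme or no host.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    return in_array(strtolower($parts['scheme']), ['http', 'https'], true)
        && strtolower($parts['host']) === strtolower($domain);
}

/** Extract the numeric code from a status line such as "HTTP/1.1 404 Not Found". */
function parseStatusCode(string $statusLine): ?int
{
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $matches) === 1
        ? (int) $matches[1]
        : null;
}

/** Fetch the status code with get_headers(); returns null if the request fails. */
function getHttpStatus(string $url): ?int
{
    $headers = @get_headers($url);

    if ($headers === false) {
        return null;
    }

    // Element 0 is the status line of the first response.
    return parseStatusCode($headers[0]);
}
```

Note that get_headers follows redirects by default, and element 0 holds the status line of the first response, so a redirected page would be recorded with its 301 or 302 code rather than the final 200.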
That’s about it!
You can run the Dusk test from the CLI
php artisan dusk --filter urlSpiderTest
If you want to see the test run in the browser, comment out the --headless option in the DuskTestCase.php class
Here is a little video of the spider crawling my blog in my local environment
Details of the crawled pages are available in the pages table in our database.
You can make use of Eloquent pagination to display the data in a table.
The code is available on GitHub in the tushargugnani/laravel-dusk-spider repository.
If you are looking to learn more about Laravel Dusk, you can check out the tutorial series I have written on my blog
Feel free to ask me your questions/bug reports in the comment section 🙂