Building a PHP Package

Or well, a web scraper delivered as a package!

I’ll start with what web scraping is. You see, most websites do not offer a way to save a copy of the data they display to your computer. The only option then is to manually copy and paste the data from your browser into a local file on your computer — a very tedious job which can take many hours, or sometimes days, to complete.

Web Scraping is the technique of automating that process.

Now, let’s jump right into how to build a simple web scraper as a PHP package:

Step 1:

Create a directory, let’s call it webscraper

Step 2:

cd into the webscraper directory and create 2 subdirectories, namely src and tests. The src subdirectory is where all our logic will live, and tests, well, the tests!
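These first two steps can be sketched in the terminal like this (the directory name is just the one we chose above):

```shell
# Create the package skeleton and move into it
mkdir -p webscraper/src webscraper/tests
cd webscraper

# Confirm the layout
ls
```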

Step 3:

Initialise git by running git init. This will create a .git directory in the root of your project.

Step 4:

Initialise composer by running composer init. The prompts are pretty minimal, just follow through. This will create a composer.json file in the root of your project folder. I’d recommend your composer.json look something like this:

{
    "name": "yourpackagename/scrapa",
    "description": "A simple php web scraper package",
    "license": "MIT",
    "keywords": ["your own comma-separated keywords"],
    "authors": [
        {"name": "yourgithubusername", "email": "yourgithubemail"}
    ],
    "require": {},
    "require-dev": {
        "phpunit/phpunit": "^4.8"
    },
    "autoload": {
        "psr-4": {"YourNameSpace\\": "src/"},
        "classmap": ["src/"]
    },
    "autoload-dev": {
        "psr-4": {
            "YourNameSpace\\Test\\": "tests/"
        }
    }
}
Step 5:

Now run:

composer update
Step 6:

Remember to .gitignore the vendor directory, the build directory and the composer.lock file. You can add the following to your .gitignore file:

/vendor/
/build/
composer.lock
Step 7:

Almost there. Create a .travis.yml file and a phpunit.xml file, both in the root of your project.
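Here is a minimal sketch of what these two files might contain. The PHP versions and test suite name are assumptions, so adjust them to your own setup:

```yaml
# .travis.yml
language: php
php:
  - 5.6
  - 7.0
install:
  - composer install --prefer-dist --no-interaction
script:
  - vendor/bin/phpunit
```

```xml
<!-- phpunit.xml -->
<phpunit bootstrap="vendor/autoload.php" colors="true">
    <testsuites>
        <testsuite name="Scrapa Test Suite">
            <directory>tests/</directory>
        </testsuite>
    </testsuites>
</phpunit>
```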

Now head on to GitHub and create a repo (don’t add a readme, simply give it a name and hit create). Copy the SSH link and run this in your terminal:

git remote add origin <your-ssh-link>   # hook the repo up to the remote
git status                              # check status
git add .                               # add everything to the staging area
git commit -m "Initial commit"          # commit code
git push origin master                  # push to the remote master branch

All your code will be in the src folder (remember we mapped this folder in the composer.json file). Create two files in the src folder named XPathObject.php and Scrapa.php.

XPathObject.php: This class uses cURL to request and download a webpage. The downloaded webpage is converted into a DOM object you can query with XPath.

Scrapa.php: This class uses the XPath object returned so you can query it to get to the DOM element(s) you’re interested in. It has two methods: toStringScrapDOM and toArrayScrapDOM.
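The full classes live in the repo linked below; here is a minimal sketch of the same idea. The class and method names match the description above, but the internals are simplified, and the extra fromHtml helper and the optional $html parameter are additions of this sketch (for testing offline), not part of the original package:

```php
<?php
// Sketch of the two classes described above; internals simplified.

class XPathObject
{
    // Downloads a page with cURL and returns a DOMXPath over its DOM.
    public static function get($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $html = curl_exec($ch);
        curl_close($ch);

        return self::fromHtml($html);
    }

    // Builds a DOMXPath from a raw HTML string (handy for offline tests).
    public static function fromHtml($html)
    {
        $dom = new DOMDocument();
        libxml_use_internal_errors(true); // real-world HTML is rarely valid XML
        $dom->loadHTML($html);
        libxml_clear_errors();

        return new DOMXPath($dom);
    }
}

class Scrapa
{
    public $url;
    private $query;

    public function __construct($url, $query)
    {
        $this->url = $url;
        $this->query = $query;
    }

    // Runs the XPath query and returns the matched values as an array.
    public function toArrayScrapDOM($html = null)
    {
        $xpath = $html === null ? XPathObject::get($this->url)
                                : XPathObject::fromHtml($html);
        $result = [];
        foreach ($xpath->query($this->query) as $node) {
            $result[] = trim($node->nodeValue);
        }

        return $result;
    }

    // Same matches, imploded into one newline-separated string.
    public function toStringScrapDOM($html = null)
    {
        return implode(PHP_EOL, $this->toArrayScrapDOM($html));
    }
}
```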

Take a look at the whole code on my repo https://github.com/andela-fokosun/webscrapa.

Let’s be practical:

So I want to scrape a webpage, a YouTube channel’s About page for instance, and I am interested in getting the usernames of all the channel’s social media accounts. Here’s what comes to mind first: head on to the page and inspect the elements (https://www.youtube.com/user/RihannaVEVO/about).

A reasonable approach might be to reach all the hrefs of the a elements with class="about-channel-link" under the ul with class="about-custom-links", right?

Looks reasonable to me.

So the Scrapa class creates an instance of the XPathObject, via the constructor, with the URL you want to scrape.

cURL helps us reach the webpage, and to get the elements we want, we query the XPath object it gives back.

Like in the example we have in the inspector, to reach the elements we are interested in:

$url = 'https://www.youtube.com/JustinBieber/about';
$query = '//ul[@class="about-custom-links"]//a[@class="about-channel-link"]/@href';

$scrap = new Scrapa($url, $query);

Finally, toStringScrapDOM uses PHP’s implode function to join whatever is returned into a single string, while toArrayScrapDOM returns an array.
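If implode is new to you, here is a quick sketch of the difference between the two return shapes (the hrefs are made up for illustration):

```php
<?php
// Hypothetical hrefs, as toArrayScrapDOM might return them
$hrefs = [
    'https://twitter.com/example',
    'https://instagram.com/example',
];

// toArrayScrapDOM shape: the array itself
var_export($hrefs);

// toStringScrapDOM shape: one newline-separated string
echo PHP_EOL . implode(PHP_EOL, $hrefs) . PHP_EOL;
```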

Here we go, we have our imperfect web scraper almost ready to be delivered as a package!

In the spirit of TDD, let’s write a test (it’s supposed to be the first thing you do; sorry it’s coming only now, but let’s do it anyway).

Head on to your tests directory, create a ScrapaTest.php file and dump this in there:

use YourNameSpace\Scrapa;

class ScrapaTest extends PHPUnit_Framework_TestCase
{
    public function testEquals()
    {
        $url = 'https://www.youtube.com/user/RihannaVEVO/about';
        $webpage = new Scrapa($url);
        $this->assertEquals($url, $webpage->url);
    }
}

Now run vendor/bin/phpunit.

Commit your changes to your repo, ensure your build passes, then head on to Packagist and submit your package (follow the prompts).

How about checking that your package installs correctly when someone requires it in their app? Since the package has no tagged release yet, they are most likely going to hit a Composer error about not finding a version that matches their minimum stability.

To get this fixed, head on to your GitHub repo, draft a new release named 0.0.1, and update your package on Packagist if you have not set it up for automatic updates. Then:

composer require "yourpackagename/scrapa:0.0.1"

At last, our simple web scraper, finally delivered as a package!