Create a web scraper with FreePascal & Lazarus
A friend of mine stumbled on the https://books.goalkicker.com/ website. He was impressed, and knowing I’m always on the lookout for good reference material, he suggested I check it out as well.
I did check it out, and the material looks to be both good and plentiful!
I like having copies of this kind of thing saved locally so I can reference it when I’m not online.
I wasn’t thrilled at the idea of manually downloading 48 PDFs and then digging around for new ones when they’re released, though.
Being both lazy and a programmer (a dangerous combination!), I decided it would be handy to have a program do the job for me, so I wrote a scraper to grab the PDFs automagically and figured it would be a neat little project to share with you all.
I’m a big fan of FreePascal and Lazarus, and everything we need is included with the Lazarus install, so it’s a great fit for this project.
If you don’t have FPC/Laz, go grab the latest FPCUpDeluxe installer. It’s what I use to set up FPC and Lazarus, as well as cross-compilation for Windows, Mac, and Linux.
What IS a web scraper?
A web scraper is a piece of software that goes to a website, looks for certain things, and performs some action.
Usually the action is copying some text or grabbing images, but in this case we’re going to download PDF files that the nice people at goalkicker.com put together.
On the main page are 48 images you can click. Each represents one manual, and clicking it takes you to a page with more information and a Download link.
Let’s jump right in
The flow of actions is pretty straightforward:
- Check if the target folder exists and create it if needed
- Download the base page
- Find all the individual book pages
- Find the filename on those pages
- Download the files
Here’s what you’ll end up with. A simple form with a button and a memo to output text.
You can also find it here https://github.com/MFernstrom/goalkicker-scraper-first-version
Here’s the program in its entirety. Take a quick look, and then we’ll go over the details.
unit main;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs, StdCtrls,
  fphttpclient, regexpr;

type

  { TForm1 }

  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    procedure Button1Click(Sender: TObject);
    procedure downloadBook(bookname: String);
  private
  public
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

const
  baseUrl = 'https://books.goalkicker.com/';

var
  targetDirectory: AnsiString;

procedure TForm1.Button1Click(Sender: TObject);
var
  page, bookname: AnsiString;
  re: TRegExpr;
begin
  targetDirectory := GetUserDir + 'downloads' + DirectorySeparator + 'GoalKickerBooks' + DirectorySeparator;

  if not DirectoryExists(targetDirectory) then
    CreateDir(targetDirectory);

  // Grab the base page
  page := TFPHTTPClient.SimpleGet(baseUrl);

  // Find all book urls
  re := TRegExpr.Create('<a href="([\w]+)/"');
  try
    if re.Exec(page) then begin
      bookname := re.Match[1];
      downloadBook(bookname);

      while re.ExecNext do begin
        bookname := re.Match[1];
        downloadBook(bookname);
        Application.ProcessMessages;
      end;
    end;

    Memo1.Append('');
    Memo1.Append('All books downloaded');
  finally
    re.Free;
  end;
end;

procedure TForm1.downloadBook(bookname: String);
var
  page: AnsiString;
  re: TRegExpr;
begin
  // Get page
  page := TFPHTTPClient.SimpleGet(baseUrl + bookname + '/index.html');

  // Grab PDF url
  re := TRegExpr.Create('location.href=''([\w]+\.pdf)''');
  try
    if re.Exec(page) then begin
      Memo1.Append('Downloading ' + baseUrl + bookname + '/' + re.Match[1]);
      TFPHTTPClient.SimpleGet(baseUrl + bookname + '/' + re.Match[1], targetDirectory + re.Match[1]);
    end;
  finally
    re.Free;
  end;
end;

end.
Create a new GUI project in Lazarus to have the same blank canvas I’m starting with.
Add a button and a memo to the form. We’re keeping this version really simple without any counters or progress bars. We’ll make a fancier version in a future post with more bells and whistles.
I set up the memo with anchors to auto-resize with the window, and auto scrollbars.
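If you prefer setting that up in code instead of the Object Inspector, a minimal sketch might look like the following. The OnCreate handler and the exact anchor set are my assumptions about the layout.
// Hypothetical OnCreate handler; the same settings can be made
// in the Object Inspector instead.
procedure TForm1.FormCreate(Sender: TObject);
begin
  // Pin the memo to all four edges so it resizes with the window.
  Memo1.Anchors := [akTop, akLeft, akRight, akBottom];
  // Show scrollbars only when the text doesn't fit.
  Memo1.ScrollBars := ssAutoBoth;
end;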
We know we need to grab pages, download files, and look for patterns in text, so let’s add fphttpclient and regexpr to our uses section like so
uses
Classes, SysUtils, Forms, Controls, Graphics, Dialogs, StdCtrls,
fphttpclient, regexpr;
Let’s add a constant for the base url since that’s not going to change.
const
baseUrl = 'https://books.goalkicker.com/';
I also have a global variable, targetDirectory: AnsiString, because we’ll use it from both procedures to keep things simple.
Double-click the button to create a Click procedure, and add a var section with page, bookname, and re
var
page, bookname: AnsiString;
re: TRegExpr;
In your button’s Click procedure, let’s start with the path.
targetDirectory := GetUserDir + 'downloads' + DirectorySeparator + 'GoalKickerBooks' + DirectorySeparator;

if not DirectoryExists(targetDirectory) then
  CreateDir(targetDirectory);
Nothing mind-blowing going on here. We’re checking if the directory exists, and creating it if it doesn’t. (We are assuming there’s a downloads directory in the user’s home directory.)
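If you’d rather not bake in that assumption, SysUtils has ForceDirectories, which creates every missing folder in the path. A small sketch:
// ForceDirectories (from SysUtils) creates the whole chain of missing
// directories, so this works even if 'downloads' itself is absent.
// It returns False if the path could not be created.
if not ForceDirectories(targetDirectory) then
  Memo1.Append('Could not create ' + targetDirectory);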
We’ll use SimpleGet() from fphttpclient to grab the page, so add this line:
page := TFPHTTPClient.SimpleGet(baseUrl);
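One caveat worth mentioning: the site is served over HTTPS, so TFPHTTPClient needs an SSL backend. On recent FPC versions, if SimpleGet raises an error about missing HTTPS support, adding the opensslsockets unit to the uses clause (and having the OpenSSL libraries installed on your system) is usually the fix:
uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs, StdCtrls,
  fphttpclient, opensslsockets, regexpr;  // opensslsockets registers the SSL handler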
Looking at the page structure (via the developer tools in a browser), we can see that the links for the individual pages follow a simple pattern. They look like this:
<a href="DotNETFrameworkBook/" target="_blank">
Using rTest, a regex tool I made for FreePascal, it was easy to create a regex pattern to match these links.
I just copied the page source code into the editor, and started creating the regex pattern.
I ended up with a very simple pattern, <a href="([\w]+)/", which results in exactly 48 matches on that page. It’s very important that we only match the book urls, otherwise we’ll start spidering through pages we didn’t intend to!
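If you’d like to sanity-check the pattern without rTest, a throwaway console program run against one sample anchor tag does the trick. A quick sketch:
program PatternTest;

{$mode objfpc}{$H+}

uses
  regexpr;

var
  re: TRegExpr;
begin
  re := TRegExpr.Create('<a href="([\w]+)/"');
  try
    // Match[1] holds the capture group: the book's folder name.
    if re.Exec('<a href="DotNETFrameworkBook/" target="_blank">') then
      WriteLn(re.Match[1]);  // prints DotNETFrameworkBook
  finally
    re.Free;
  end;
end.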
With the pattern in hand, we set up the TRegExpr
re := TRegExpr.Create('<a href="([\w]+)/"');
Next up we need to loop through all the matches and download each book.
Add the following section
try
  if re.Exec(page) then begin
    bookname := re.Match[1];
    downloadBook(bookname);

    while re.ExecNext do begin
      bookname := re.Match[1];
      downloadBook(bookname);
      Application.ProcessMessages;
    end;
  end;

  Memo1.Append('');
  Memo1.Append('All books downloaded');
finally
  re.Free;
end;
If there are any matches, we grab the first one and call the downloadBook procedure, which we’ll create shortly.
Then we say “For all additional matches, do the same thing”. The Application.ProcessMessages call gives the GUI a chance to repaint between downloads so the window doesn’t appear frozen.
When we’re done, we append a line to the memo to let the user know.
For the last pieces of the puzzle we need to look for the filename and download the file.
In your form declaration, add this procedure under Button1Click
procedure downloadBook(bookname: String);
Place your cursor on downloadBook and press Ctrl+Shift+C to auto-create the procedure.
Scroll down to your new shiny procedure to fill it in.
Start with a var block
var
page: AnsiString;
re: TRegExpr;
And then let’s grab the page for the specific book
page := TFPHTTPClient.SimpleGet(baseUrl + bookname + '/index.html');
Same technique as before, we’re just downloading the content and sticking it in the page variable.
We’re using another regex in this procedure, so let’s come up with a pattern to match.
Again using rTest (you don’t have to use it, but I find it handy), just copy the whole page source code into the editor.
Looking at the source, we see that the button that triggers a download looks like this:
<button class="download" onclick="location.href='DotNETFrameworkNotesForProfessionals.pdf'">Download PDF Book</button>
So playing around with rTest we come up with this simple pattern: location.href='([\w]+\.pdf)' (the backslash escapes the literal dot before pdf).
Remember that we have to escape those ' characters in the Pascal string, though; a single quote inside a string literal is written as two single quotes. The final line looks like this:
re := TRegExpr.Create('location.href=''([\w]+\.pdf)''');
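If the doubled quotes look odd, here’s the rule in isolation. A tiny demo:
program QuoteDemo;

{$mode objfpc}{$H+}

begin
  // '' inside a string literal produces a single ' character.
  WriteLn('It''s escaped');                // prints: It's escaped
  WriteLn('location.href=''book.pdf''');  // prints: location.href='book.pdf'
end.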
We’re almost there!
Now let’s look for a match and download the file.
try
  if re.Exec(page) then begin
    Memo1.Append('Downloading ' + baseUrl + bookname + '/' + re.Match[1]);
    TFPHTTPClient.SimpleGet(baseUrl + bookname + '/' + re.Match[1], targetDirectory + re.Match[1]);
  end;
finally
  re.Free;
end;
It looks a bit messy on Medium, but it’s the same thing as before, except we now pass a second argument: the path where the content will be saved as a file.
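We’re saving real error checking for the follow-up post, but if you want a little resilience right away, one option is wrapping each downloadBook call in try..except so one bad page doesn’t abort the whole run. A sketch of the loop in Button1Click under that approach:
// One failed book is logged and skipped instead of stopping the rest.
while re.ExecNext do begin
  try
    downloadBook(re.Match[1]);
  except
    on E: Exception do
      Memo1.Append('Skipped ' + re.Match[1] + ': ' + E.Message);
  end;
  Application.ProcessMessages;
end;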
And there you have it, a dirt simple scraper and downloader.
In a future post we’ll take this and fancy it up with error checking, better regex patterns, and progress bars, and make sure it works on the three big platforms: Windows, macOS, and Linux.
If you found this useful, give it a Clap, and keep an eye out for the follow-up post!