Deeper Swift Web Crawler Script Explanation
The script can be separated into four steps:
- Visit a web page
- Search for the word we’re interested in
- Collect all the links
- Repeat
1. Visiting a Web Page
In this case I’m using Foundation’s URLSession: what we’re doing here is defining a URLSession task, passing it the URL of the web page we want to visit.
After the task definition, we start the task (a.k.a. we request the download of the web page) by calling .resume().
Once downloaded, the task invokes its completionHandler where we verify that no errors have occurred and where we finally start parsing the page.
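A minimal sketch of this step, assuming a hypothetical `visit(page:onDocument:)` helper (the closure parameter stands in for the parsing that the script does in its completion handler):

```swift
import Foundation

// Hypothetical helper: download a page with a URLSession data task and
// hand the HTML to a closure once the completion handler fires.
func visit(page url: URL, onDocument: @escaping (String) -> Void) {
    let task = URLSession.shared.dataTask(with: url) { data, _, error in
        // Verify that no error occurred and the body decodes as text.
        guard error == nil,
              let data = data,
              let document = String(data: data, encoding: .utf8) else {
            print("Could not download \(url)")
            return
        }
        onDocument(document)
    }
    task.resume() // the task does nothing until we call resume()
}
```

Note that the data task runs asynchronously: the closure is invoked later, on a background queue, which is why a script using URLSession usually has to keep its run loop (or a semaphore) alive until the download finishes.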
2. Searching for the Word in the Document
This is a small trick: since we can treat the whole web page as a String, we use Foundation’s contains(_:) to check whether the word is present.
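The whole step is essentially a one-liner (the sample document and word below are mine, not from the script):

```swift
import Foundation

// Treat the downloaded page as a plain String and let Foundation's
// contains(_:) report whether the word occurs anywhere in it.
let document = "<html><body>Hello, crawler!</body></html>"
let word = "crawler"

if document.contains(word) {
    print("Found '\(word)'!")
}
```

Keep in mind that contains(_:) is case-sensitive; lowercasing both the document and the word first gives a case-insensitive match.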
3. Collecting URLs
One more trick. There are better ways to analyze a webpage but I wanted to keep the script simple and 100% independent from any third-party libraries.
What we’re doing here is using NSRegularExpression to find all the URLs in the document. Please note that my regular expression fails to detect relative URL paths and URLs that don’t start with http: feel free to submit a PR!
Once we have all the URLs, we return the whole collection (which is later added to the set of web pages to visit).
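This step can be sketched as follows. The pattern below is my own simplified stand-in, not the script’s exact regular expression, but it has the same limitation: it only catches absolute http/https URLs.

```swift
import Foundation

// Collect every absolute http(s) URL in the document with
// NSRegularExpression. Relative paths and other schemes are missed,
// just like in the script.
func collectLinks(in document: String) -> [URL] {
    guard let regex = try? NSRegularExpression(
        pattern: "https?://[^\"'<> ]+") else { return [] }
    let range = NSRange(document.startIndex..., in: document)
    return regex.matches(in: document, range: range).compactMap { match in
        Range(match.range, in: document)
            .flatMap { URL(string: String(document[$0])) }
    }
}

let html = "<a href=\"https://example.com/a\">a</a> <a href='http://example.org'>b</a>"
print(collectLinks(in: html).map(\.absoluteString))
// → ["https://example.com/a", "http://example.org"]
```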
All of the steps above are repeated every time we call crawl().
This function first checks whether we have already visited enough pages (visitedPages.count <= maximumPagesToVisit) and whether we have any more web pages to visit (guard let pageToVisit = pagesToVisit.popFirst()).
If we pass both checks, the function then checks whether we have already visited this new pageToVisit: if we haven’t, we jump to step 1; otherwise, we call crawl() again.
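The control flow above can be sketched like this. The globals mirror the names mentioned in the text, and `visit(page:)` is a stub standing in for steps 1–3 (download, search, collect links):

```swift
import Foundation

// Hypothetical state mirroring the script's variable names.
var visitedPages: Set<URL> = []
var pagesToVisit: Set<URL> = [URL(string: "https://example.com")!]
let maximumPagesToVisit = 10

// Stub for steps 1–3: in the real script this downloads the page,
// searches for the word, collects links, and then calls crawl() again.
func visit(page url: URL) {
    crawl()
}

func crawl() {
    // Stop once we've visited enough pages…
    guard visitedPages.count <= maximumPagesToVisit else {
        print("Reached the maximum number of pages to visit")
        return
    }
    // …or once we've run out of pages to visit.
    guard let pageToVisit = pagesToVisit.popFirst() else {
        print("No more pages to visit")
        return
    }
    if visitedPages.contains(pageToVisit) {
        crawl() // already seen: skip it and try the next one
    } else {
        visitedPages.insert(pageToVisit)
        visit(page: pageToVisit) // jump back to step 1
    }
}
```

Using Sets for visitedPages and pagesToVisit keeps duplicate URLs out automatically, which is why popFirst() and contains(_:) are enough to drive the loop.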
That’s all! Happy scripting! 😊