In this article I will explain how to use the RAW Query Language (RQL) to query sitemap files and crawl them in order to fetch the URLs of the most recently updated pages on any website.
For the examples below, we will use the RAW Labs platform, which can be accessed at https://just-ask.raw-labs.com
Tutorials and documentation about RQL are available at https://docs.raw-labs.com, and the platform is described at https://www.raw-labs.com/platform. The RQL statements in this document were written using the RQL Jupyter notebook client.
What are sitemaps?
Sitemaps are files provided by webmasters to list the pages that search engines should crawl. They follow a simple protocol based on XML tags: basically a list of URLs, each with its date of last modification.
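For example, a minimal sitemap file looks like this (the example.com URL and the date below are purely illustrative, not taken from a real site):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/some-page.html</loc>
    <lastmod>2021-07-13</lastmod>
  </url>
</urlset>
```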
You can access a domain's sitemap by appending /sitemap.xml to the domain name, like http://www.nasa.gov/sitemap.xml
Sitemaps can be grouped in index files. Here is the beginning of the one from the NASA website, captured on the 13th of July 2021:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/sitemap.xsl"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.nasa.gov/sitemap-1.xml</loc>
    <lastmod>...</lastmod>
  </sitemap>
  ...
</sitemapindex>
You can read more about the sitemap protocol at https://www.sitemaps.org.
Playing with RQL and sitemaps on the nasa.gov website
First, let's look at the structure of the sitemap.xml file. The following RQL statement extracts all the sitemap records from the index file:
select * from read_xml("https://www.nasa.gov/sitemap.xml").sitemap
Each of these index files contains several thousand URLs. For instance, if you open https://www.nasa.gov/sitemap-1.xml, there are 12,609 URLs in this sitemap at the time of writing (13th of July 2021).
We want to look into each of these files and find the 10 most recently updated pages of the whole website.
When RQL executes something like READ_XML(), it infers the structure of the XML file. This inference works for constant file names, but not for variables: if we want to loop over every sitemap file, we have to describe the structure in advance with a typealias. We are only interested in the 'loc' and 'lastmod' fields of each record, which translates into the following RQL statements.
// retrieve all index files from the main file
updated_urls := select * from read_xml("https://www.nasa.gov/sitemap.xml").sitemap;

// create a typealias so that read_xml knows about the structure
typealias loclastmod := record(
    url: collection( record(
        loc: string,
        lastmod: string nullable)));

// read the content of every sitemap file
select url from updated_urls uu,
       read_xml[loclastmod](uu.loc).url url
order by url.lastmod desc
limit 10
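For readers who want to check the crawl-and-sort logic outside the RAW platform, the same idea can be sketched in plain Python with the standard library. The helper name most_recent_pages is our own, not part of RQL, and the network fetch of each sub-sitemap (e.g. with urllib.request) is left out so the sketch stays self-contained:

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def most_recent_pages(sitemap_docs, limit=10):
    """Collect (loc, lastmod) pairs from the <url> entries of several
    sitemap XML documents and return the `limit` most recent ones."""
    pages = []
    for doc in sitemap_docs:
        root = ET.fromstring(doc)
        for url in root.iter(SITEMAP_NS + "url"):
            loc = url.findtext(SITEMAP_NS + "loc")
            lastmod = url.findtext(SITEMAP_NS + "lastmod")
            if loc is not None and lastmod is not None:
                pages.append((loc, lastmod))
    # ISO-8601 dates compare correctly as strings, so a plain
    # reverse sort gives newest-first ordering.
    pages.sort(key=lambda page: page[1], reverse=True)
    return pages[:limit]
```

In a real run you would first download the index file, then fetch each `<loc>` it lists and pass the downloaded documents to this function.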
Making it a generic web service
The previous statement was built around the nasa.gov website, but we can generalize it to make it work with any website respecting the same sitemap rules. To achieve this, we can make a package:
%%package sitemap

// typealias for the index file
typealias indexfile := record(
    `sitemap`: collection( record(
        `loc`: string, `lastmod`: string )));

// retrieve all index files
updated_urls(website: string) :=
    select * from read_xml[indexfile](website+"/sitemap.xml").sitemap;

// typealias for the sitemap files so that read_xml knows their structure
typealias loclastmod := record(
    url: collection( record(
        loc: string, lastmod: string nullable)));

// read the content and keep the 10 most recent pages
recentpages(website: string) :=
    select url from updated_urls(website) uu,
           read_xml[loclastmod](uu.loc).url url
    order by url.lastmod desc limit 10
We can call this package with another website that follows the sitemaps.org structure and get its latest updates:
%%query
// using the recentpages function from the sitemap package
from sitemap import recentpages
recentpages("https://www.virgingalactic.com")
One of the great benefits of creating an RQL package is that it is automatically available as a REST endpoint.
curl -X GET "https://just-ask.raw-labs.com/api/executor/1/public/query-package/sitemap/recentpages?website=https://www.virgingalactic.com" --header "Authorization: Bearer ....." --header "X-RAW-output-format: json"
In only 4 lines of code, we built a crawling and filtering package that traverses nested XML files and is callable as an API endpoint.
Quite powerful, don’t you think?