Discover the most recently updated pages on a website

Introduction

In this post I will explain how to use the RAW Query Language (RQL) to query and crawl sitemap files in order to fetch the URLs of the most recently updated pages on any website.

What are sitemaps?

Sitemaps are files provided by webmasters to tell search engines which pages of a site are available for crawling. The format is a simple XML-based protocol: essentially a list of URLs, each with its date of last modification. Large sites usually publish a sitemap index that points to several child sitemap files, as in this example from nasa.gov:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/sitemap.xsl"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.nasa.gov/sitemap-1.xml</loc>
    <lastmod>2021-07-13T11:30Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.nasa.gov/sitemap-2.xml</loc>
    <lastmod>2021-07-13T11:30Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.nasa.gov/sitemap-3.xml</loc>
    <lastmod>2021-07-13T11:30Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.nasa.gov/sitemap-4.xml</loc>
    <lastmod>2021-07-13T11:30Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.nasa.gov/sitemap-5.xml</loc>
    <lastmod>2021-07-13T11:30Z</lastmod>
  </sitemap>
</sitemapindex>
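
Each child sitemap file then lists the individual page URLs inside a urlset element. The snippet below is an illustrative example following the sitemaps.org protocol; the page URLs and dates are placeholders, not NASA's actual content:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.nasa.gov/feature/some-article</loc>
    <lastmod>2021-07-12T09:00Z</lastmod>
  </url>
  <url>
    <loc>http://www.nasa.gov/image-feature/another-article</loc>
    <lastmod>2021-07-11T16:45Z</lastmod>
  </url>
</urlset>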

Playing with RQL and sitemaps on the nasa.gov website

First, we can learn about the structure of the sitemap.xml index file with the following RQL statement:

%%query
describe("https://www.nasa.gov/sitemap.xml")

The index file contains a collection of sitemap records, each with a loc and a lastmod field. We can list the child sitemap files it references:

%%query
select * from read_xml("https://www.nasa.gov/sitemap.xml").sitemap

In the same way, we can inspect the structure of one of the child sitemap files:

%%query
describe("https://www.nasa.gov/sitemap-1.xml")

Putting it all together, the following query reads the index, follows every child sitemap, and returns the ten most recently updated URLs:

%%query
// retrieve all sitemap files from the main index file
updated_urls := select * from read_xml("https://www.nasa.gov/sitemap.xml").sitemap;
// create a typealias so that read_xml knows about the structure
typealias loclastmod := record(
    url: collection( record(
        loc: string,
        lastmod: string nullable)));
// read the content
select url from updated_urls uu,
       read_xml[loclastmod](uu.loc).url url
order by url.lastmod desc
limit 10
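
The result is a collection of records, each holding a page URL and its last-modification date. With JSON output (as used later in this post), a hypothetical result would have roughly this shape; the values shown are placeholders, not an actual query result:

[
  {"loc": "http://www.nasa.gov/...", "lastmod": "2021-07-13T11:28Z"},
  {"loc": "http://www.nasa.gov/...", "lastmod": "2021-07-13T10:02Z"},
  ...
]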

Making it a generic web service

The previous statement was built around the nasa.gov website, but we can generalize it to work with any website that follows the same sitemap conventions. To achieve this, we can create a package:

%%package sitemap
// creating a typealias for the index file
typealias indexfile := record(
    `sitemap`: collection( record(
        `loc`: string,
        `lastmod`: string )));
// retrieving all sitemap files listed in the index
updated_urls(website: string) :=
    select * from read_xml[indexfile](website+"/sitemap.xml").sitemap;
// creating a typealias for the sitemap so that read_xml
// knows about the structure
typealias loclastmod := record(
    url: collection( record(
        loc: string,
        lastmod: string nullable)));
// reading the content
recentpages(website: string) :=
    select url from updated_urls(website) uu,
           read_xml[loclastmod](uu.loc).url url
    order by url.lastmod desc limit 10
%%query
// using the recentpages function from the sitemap package
from sitemap import recentpages
recentpages("https://www.virgingalactic.com/")

Once the package is deployed, the recentpages function is also exposed as a web service endpoint, which any HTTP client can call:

curl -X GET https://just-ask.raw-labs.com/api/executor/1/public/query-package/sitemap/recentpages?website=https://www.virgingalactic.com --header "Authorization: Bearer ....." --header "X-RAW-output-format: json"
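
Because the website is just a query parameter, the same endpoint works for any site that publishes a sitemap index at /sitemap.xml. For example, a hypothetical call for nasa.gov would be:

curl -X GET https://just-ask.raw-labs.com/api/executor/1/public/query-package/sitemap/recentpages?website=https://www.nasa.gov --header "Authorization: Bearer ....." --header "X-RAW-output-format: json"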

Conclusion

In only four RQL statements, we were able to build a crawling and filtering package that follows nested XML sitemap files and is callable as an API endpoint.