One click to download all the web pages you may want
--
The Common Crawl corpus contains petabytes of data and billions of web pages. AWS Athena lets us query the Common Crawl indexes to find interesting pages for whatever use case you may have. Here I explain how to do the next step: download the content of those pages.
Athena queries write their results to an S3 bucket of your choice, so a logical next step is to subscribe an AWS Lambda function to changes in that bucket. When the results are ready, we can start downloading the listed web pages. In Terraform, the subscription looks roughly like this:
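# Sketch: bucket, resource, and function names are placeholders for your own setup.
resource "aws_s3_bucket_notification" "athena_results" {
  bucket = aws_s3_bucket.athena_results.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.downloader.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".csv"
  }
}

# S3 needs explicit permission to invoke the function.
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowExecutionFromS3"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.downloader.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.athena_results.arn
}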
The S3 notification above will generate Lambda events of the following format:
{
  "Records": [
    {
      "s3": {
        "bucket": { "name": "bucket-name" },
        "object": { "key": "object-key" }
      }
    }
  ]
}
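Using the aws-lambda-go library, a handler can pull the bucket and key out of each record. The sketch below just logs the location of the new result file and leaves the actual download for later:

package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// handler is invoked with the S3 notification shown above and extracts
// the location of the freshly written Athena result file.
func handler(ctx context.Context, evt events.S3Event) error {
	for _, record := range evt.Records {
		bucket := record.S3.Bucket.Name
		key := record.S3.Object.Key
		fmt.Printf("new Athena result: s3://%s/%s\n", bucket, key)
		// Next step: read the CSV and fetch each listed WARC range.
	}
	return nil
}

func main() {
	lambda.Start(handler)
}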
If we were to select two random articles from Wikipedia with a query along the lines of
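-- Sketch using Common Crawl's public columnar index (the "ccindex" table
-- registered in Athena); the crawl label is just an example, and LIMIT 2
-- stands in for a proper random sample.
SELECT warc_filename,
       warc_record_offset,
       warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-10'
  AND subset = 'warc'
  AND url_host_name = 'en.wikipedia.org'
  AND url_path LIKE '/wiki/%'
  AND fetch_status = 200
LIMIT 2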
Athena would save the results in CSV format:
"warc_filename","warc_record_offset","warc_record_length"
"20210301-602.warc.gz","1843978","19366"
"20210301-602.warc.gz","1376317","85678"
We could then open this generated file and download the contents of each web page listed in it. That's possible thanks to the S3 API, which lets us download just a range of bytes from those gigabyte-sized WARC files, while each of the pages above is under 100 KB. Here's roughly what the beginning of a WARC record looks like:
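WARC/1.0
WARC-Type: response
WARC-Date: 2021-03-01T00:00:00Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
WARC-Target-URI: https://en.wikipedia.org/wiki/Example
Content-Type: application/http; msgtype=response
Content-Length: 19366

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8

<!DOCTYPE html>

(The field values above are placeholders; real records carry a few more headers, such as content digests.)

The ranged download itself is a single GetObject call. Here is a sketch in Go against the public commoncrawl bucket, hard-coding one row from the result file above; in the real flow, the Lambda function would loop over every row of the CSV. Note that actual warc_filename values are full paths like crawl-data/CC-MAIN-.../...warc.gz, and the shortened name is kept only to match the example:

package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// fetchRecord downloads a single WARC record via an HTTP range request.
// Each record in a Common Crawl WARC file is an independent gzip member,
// so the bytes we get back can be decompressed on their own.
func fetchRecord(ctx context.Context, client *s3.Client, key string, offset, length int64) ([]byte, error) {
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String("commoncrawl"),
		Key:    aws.String(key),
		// Byte ranges are inclusive on both ends, hence the -1.
		Range: aws.String(fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()

	gz, err := gzip.NewReader(out.Body)
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	return io.ReadAll(gz)
}

func main() {
	ctx := context.Background()
	// The Common Crawl bucket lives in us-east-1.
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// One row from the Athena result above.
	record, err := fetchRecord(ctx, client, "20210301-602.warc.gz", 1843978, 19366)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("downloaded %d bytes of WARC data\n", len(record))
}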
To make the whole process smoother, we could set up one of the two following flows: either upload a SQL query file to S3 and let one Lambda function run it on Athena while a second one downloads the listed pages once the results are written, or upload an already-generated result file and let the download function pick it up directly.
I implemented both flows in Terraform and Go and published them on GitHub. It just takes cloning the project and running the setup script to create the necessary infrastructure in your AWS account. Then you're ready to download as many pages as you want simply by uploading new queries or new result files to Amazon S3.