Common Crawl is a non-profit organization founded in 2007.
Common Crawl is an open repository of web crawl data that can be accessed and analyzed by researchers, data scientists, and developers. The repository includes raw web page data, metadata, and text extracts. The primary goals of Common Crawl are to promote open access to web data, encourage innovation, and support the development of new technologies and research initiatives.
The Common Crawl corpus contains over 250 billion web pages spanning 17 years.
How is the data stored?
The data is shared as a public dataset on Amazon S3. Common Crawl provides a monthly data dump in WARC format. Each crawl consists of about 3 billion web pages with a compressed size of around 100 TB.
In addition to the WARC files, Common Crawl provides index files as well as columnar index files so that users can easily search, filter, and download the data.
Each crawl index is spread over 300 files totalling about 250 GB of data.
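To get a feel for how a crawl is laid out, we can pull the per-crawl path listings straight from the public bucket. This is a minimal sketch, assuming the usual warc.paths.gz and cc-index.paths.gz listing files are published for the crawl we pick:
import gzip

import requests

# Crawl label to inspect (any published crawl works)
crawl = "CC-MAIN-2020-05"

# Each crawl ships small listing files with the relative paths of its WARC and index files
for listing in ("warc.paths.gz", "cc-index.paths.gz"):
    resp = requests.get(f"https://data.commoncrawl.org/crawl-data/{crawl}/{listing}")
    resp.raise_for_status()
    paths = gzip.decompress(resp.content).decode("utf-8").splitlines()
    print(f"{listing}: {len(paths)} paths, first entry: {paths[0]}")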
How to get data for a particular domain?
We have a couple of options for accessing this data.
- Using simple Python code, for limited use and small data sizes.
- Using AWS Athena, for large-scale data extraction.
We will explore both briefly.
- Using Python
import json

import pandas as pd
import requests

# URL you want to search for
url = "www.bbc.com"
# Indexes you want to search
index_list = ["2020-05"]
# Store the records
record_list = []

for index in index_list:
    print("[*] Trying index %s" % index)
    cc_url = "http://index.commoncrawl.org/CC-MAIN-%s-index?" % index
    cc_url += "url=%s&matchType=domain&output=json" % url
    response = requests.get(cc_url)
    if response.status_code == 200:
        # Each line of the response is one JSON index record
        records = response.content.splitlines()
        for record in records:
            record_list.append(json.loads(record))
        print("[*] Added %d results." % len(records))

print("[*] Found a total of %d hits." % len(record_list))
df = pd.DataFrame.from_dict(record_list)
print(df)
We will get a DataFrame of index records.
We can use the filename, offset, and length of each record to extract the data for each URL.
import io

import requests
import warcio
from bs4 import BeautifulSoup

# Filename, offset and length taken from one of the index records above
warc_filename = "crawl-data/CC-MAIN-2020-05/segments/1579251778272.69/warc/CC-MAIN-20200128122813-20200128152813-00252.warc.gz"
warc_record_offset = 766472050
warc_record_length = 23936

# Fetch only this record's bytes with an HTTP Range request
response = requests.get(f'https://data.commoncrawl.org/{warc_filename}',
                        headers={'Range': f'bytes={warc_record_offset}-{warc_record_offset + warc_record_length - 1}'})

# Parse the WARC record and print the HTML payload
with io.BytesIO(response.content) as stream:
    for record in warcio.ArchiveIterator(stream):
        html = record.content_stream().read()
        print(BeautifulSoup(html, 'html.parser').prettify())
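Putting the two steps together, a small sketch like the one below (assuming the index records expose the filename, offset and length fields as shown above) loops over the DataFrame and fetches the HTML for each hit:
# Hypothetical helper: fetch and parse the page behind one index record
def fetch_page(record):
    start = int(record['offset'])
    end = start + int(record['length']) - 1
    resp = requests.get(f"https://data.commoncrawl.org/{record['filename']}",
                        headers={'Range': f'bytes={start}-{end}'})
    with io.BytesIO(resp.content) as stream:
        for rec in warcio.ArchiveIterator(stream):
            return BeautifulSoup(rec.content_stream().read(), 'html.parser')

# Parse the first few hits from the DataFrame built earlier
for _, row in df.head(3).iterrows():
    soup = fetch_page(row)
    print(row['url'], soup.title.string if soup.title else '')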
- Using AWS Athena
To use Athena, we will need an S3 bucket to store the results returned by running the Athena query.
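As a rough sketch, and assuming the Common Crawl columnar index has already been registered in Athena as a ccindex table (the table, database, and column names below follow Common Crawl's documented setup but should be treated as assumptions, and the output bucket is illustrative), a query for a single domain could be submitted from Python with boto3:
import boto3

athena = boto3.client('athena', region_name='us-east-1')

# Illustrative query against the columnar index table (assumed to be ccindex.ccindex)
query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-05'
  AND subset = 'warc'
  AND url_host_registered_domain = 'bbc.com'
"""

# Results are written to an S3 bucket you control (replace with your own bucket)
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'ccindex'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/common-crawl/'}
)
print("Started query:", response['QueryExecutionId'])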
The following documentation provides an in-depth understanding of how to use Athena.
I hope this blog has been useful to you. Cheers!!