Common Crawl is a non-profit organization founded in 2007.
Common Crawl is an open repository of web crawl data that can be accessed and analyzed by researchers, data scientists, and developers. The repository includes raw web page data, metadata, and text extracts. The primary goals of Common Crawl are to promote open access to web data, encourage innovation, and support the development of new technologies and research initiatives.
The Common Crawl corpus contains over 250 billion web pages spanning 17 years.
How is the data stored?
The data is shared as a public dataset on Amazon S3. Common Crawl provides a monthly data dump in WARC format. Each crawl consists of about 3 billion web pages with a compressed size of around 100 TB.
In addition to the WARC files, Common Crawl provides index files as well as columnar index files so that users can easily search, filter, and download the data.
Each crawl index is spread over 300 files totalling about 250 GB of data.
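To get a feel for how a crawl is laid out, we can pull the per-crawl path listings straight from the public bucket. This is a minimal sketch, assuming the usual warc.paths.gz and cc-index.paths.gz listing files are published for the crawl we pick:
import gzip

import requests

# Crawl label to inspect (any published crawl works)
crawl = "CC-MAIN-2020-05"

# Each crawl ships small listing files with the relative paths of its WARC and index files
for listing in ("warc.paths.gz", "cc-index.paths.gz"):
    resp = requests.get(f"https://data.commoncrawl.org/crawl-data/{crawl}/{listing}")
    resp.raise_for_status()
    paths = gzip.decompress(resp.content).decode("utf-8").splitlines()
    print(f"{listing}: {len(paths)} paths, first entry: {paths[0]}")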
How to get data for a particular domain?
We have a couple of options for accessing this data.
- Using simple Python code, for limited use and small data sizes.
- Using AWS Athena, for large-scale data extraction.
We will explore both briefly.
- Using Python
import json

import pandas as pd
import requests

# URL you want to search for
url = "www.bbc.com"
# Indexes you want to search
index_list = ["2020-05"]
# Store the records
record_list = []

for index in index_list:
    print("[*] Trying index %s" % index)
    cc_url = "http://index.commoncrawl.org/CC-MAIN-%s-index?" % index
    cc_url += "url=%s&matchType=domain&output=json" % url
    response = requests.get(cc_url)
    if response.status_code == 200:
        # Each line of the response is one JSON index record
        records = response.content.splitlines()
        for record in records:
            record_list.append(json.loads(record))
        print("[*] Added %d results." % len(records))

print("[*] Found a total of %d hits." % len(record_list))
df = pd.DataFrame.from_dict(record_list)
print(df)
We will get a DataFrame of index records.
We can use the filename, offset, and length of each record to extract the data for each URL.
import io

import requests
import warcio
from bs4 import BeautifulSoup

# Filename, offset and length taken from one of the index records above
warc_filename = "crawl-data/CC-MAIN-2020-05/segments/1579251778272.69/warc/CC-MAIN-20200128122813-20200128152813-00252.warc.gz"
warc_record_offset = 766472050
warc_record_length = 23936

# Fetch only this record's bytes with an HTTP Range request
response = requests.get(f'https://data.commoncrawl.org/{warc_filename}',
                        headers={'Range': f'bytes={warc_record_offset}-{warc_record_offset + warc_record_length - 1}'})

# Parse the WARC record and print the HTML payload
with io.BytesIO(response.content) as stream:
    for record in warcio.ArchiveIterator(stream):
        html = record.content_stream().read()
        print(BeautifulSoup(html, 'html.parser').prettify())
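Putting the two steps together, a small sketch like the one below (assuming the index records expose the filename, offset and length fields as shown above) loops over the DataFrame and fetches the HTML for each hit:
# Hypothetical helper: fetch and parse the page behind one index record
def fetch_page(record):
    start = int(record['offset'])
    end = start + int(record['length']) - 1
    resp = requests.get(f"https://data.commoncrawl.org/{record['filename']}",
                        headers={'Range': f'bytes={start}-{end}'})
    with io.BytesIO(resp.content) as stream:
        for rec in warcio.ArchiveIterator(stream):
            return BeautifulSoup(rec.content_stream().read(), 'html.parser')

# Parse the first few hits from the DataFrame built earlier
for _, row in df.head(3).iterrows():
    soup = fetch_page(row)
    print(row['url'], soup.title.string if soup.title else '')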
- Using AWS Athena
To use Athena, we will need an S3 bucket to store the results returned by running the Athena query.
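As a rough sketch, and assuming the Common Crawl columnar index has already been registered in Athena as a ccindex table (the table, database, and column names below follow Common Crawl's documented setup but should be treated as assumptions, and the output bucket is illustrative), a query for a single domain could be submitted from Python with boto3:
import boto3

athena = boto3.client('athena', region_name='us-east-1')

# Illustrative query against the columnar index table (assumed to be ccindex.ccindex)
query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-05'
  AND subset = 'warc'
  AND url_host_registered_domain = 'bbc.com'
"""

# Results are written to an S3 bucket you control (replace with your own bucket)
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'ccindex'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/common-crawl/'}
)
print("Started query:", response['QueryExecutionId'])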
The following documentation provides an in-depth understanding of how to use Athena.
I hope this blog has been useful to you. Cheers!!