Mining Common Crawl with PHP

Paulius Rimavičius
May 26, 2016 · 5 min read

A month ago I used the Common Crawl dataset to test one of my business ideas. Common Crawl is a public 40TB dataset of raw web crawl data. For my business idea to work, Common Crawl should cover at least 80% of the web. I searched but could not find the exact number, so I decided to investigate it myself.

The Idea

To get a more precise number for Common Crawl's web coverage, I decided to extract all the WordPress themes found in Common Crawl and compare the counts with sales on ThemeForest (the biggest WordPress theme marketplace).

The challenge

Common Crawl is big. The dataset I used contains 1.73B URLs spread across 34,900 WARC files (one WARC file usually contains ~800MB of gzipped data). So the challenge was to get through all this data. I am not a data scientist; my background is in web development. I had read that I would need Hadoop or Spark to get the job done, but for this small experiment I decided to go with PHP. I expected to find some source code on GitHub that I could quickly adapt to my needs. I was wrong: it seems nobody does this kind of work in PHP. That did not stop me.

First steps

I downloaded the “All WARC files” list and extracted it. It contained 34,900 links to WARC files. Next I downloaded the first WARC file. It was also gzipped, so I extracted it. The extracted file contains ~4GB of crawl data: raw request and response headers plus the response HTML. That is exactly what I needed. I wrote a small PHP script to extract all WordPress themes and domains from the WARC file and write the result to out.txt.

$filename = 'warc';
$handle = fopen($filename, 'r');
$line_num = 0;
$targetUrl = '';
$lastLineWritten = '';

if ($handle) {
    while (($line = fgets($handle)) !== false) {
        $line_num++;

        // Remember the URL of the WARC record we are currently inside
        if (substr($line, 0, 16) == 'WARC-Target-URI:') {
            $targetUrl = trim(substr($line, 17));
        }

        // Look for a WordPress theme path anywhere in the line
        $posTheme = stripos($line, 'wp-content/themes/');
        if ($posTheme !== false) {
            $posTheme += 18; // skip past "wp-content/themes/"
            $posThemeEnd = strpos($line, '/', $posTheme);
            if ($posThemeEnd !== false) {
                $t = substr($line, $posTheme, $posThemeEnd - $posTheme);
                $lineToWrite = $targetUrl . ' ' . $t . "\n";
                // Skip consecutive duplicates to keep the output file small
                if ($lineToWrite != $lastLineWritten) {
                    file_put_contents('out.txt', $lineToWrite, FILE_APPEND | LOCK_EX);
                    $lastLineWritten = $lineToWrite;
                }
            }
        }

        // Progress indicator
        if ($line_num % 1000000 === 0) {
            echo $line_num . "\n";
        }
    }
    fclose($handle);
} else {
    echo "error opening the file\n";
}

I ran the script and it took ~1 minute to complete. I was impressed: processing a 4GB file in just one minute was fast.

I now had a pretty simple script that did the job for one segment. Next I had to scale it to run on all 34,900 segments. Downloading a WARC file, extracting it and running the script took ~5 minutes in total. Doing that on my local machine would take 34,900 × 5 minutes = 174,500 minutes ≈ 2,908 hours ≈ 121 days. Too long.

Scaling Up

I decided to use the Amazon EC2 cloud to speed things up. The cheapest option is to use spot instances. An m3.medium instance (the one that best fit my needs) costs, at the time of writing, a little less than ~$0.01 per hour as a spot instance. So my assumption was that I would spend roughly 121 × 24 × $0.01 ≈ $29.

I changed the script a little to download and extract the WARC files automatically. I also added a couple of lines to store the results in an S3 bucket.
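To give an idea of what that wrapper looked like, here is a minimal sketch (not the exact script I ran): it assumes the segment path is passed as a command-line argument, that wget and gunzip are available on the instance, and that the AWS SDK for PHP is installed via Composer. The bucket name and result key are placeholders.

// Rough sketch of the per-segment wrapper (assumptions, not the exact script
// from this post): the WARC path comes in as a CLI argument, wget and gunzip
// are available, and the AWS SDK for PHP is installed via Composer.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$segmentPath = trim($argv[1]); // e.g. a path taken from the warc.paths list
$warcUrl = 'https://commoncrawl.s3.amazonaws.com/' . $segmentPath;

// Download and extract the segment; shelling out keeps the PHP side simple
shell_exec('wget -q -O warc.gz ' . escapeshellarg($warcUrl));
shell_exec('gunzip -f warc.gz'); // leaves the file "warc" that the parser reads

// ... run the parsing loop from the previous script here, producing out.txt ...

// Upload the result; bucket and key names below are placeholders
$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',
]);
$s3->putObject([
    'Bucket'     => 'my-commoncrawl-results',
    'Key'        => 'results/' . basename($segmentPath) . '.txt',
    'SourceFile' => 'out.txt',
]);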

I created an EC2 Ubuntu instance, installed PHP, uploaded the script and tested it.

The best way to create EC2 spot instances is to launch them from an image: that way you can start multiple spot instances doing the same job with one click. I created an image (it takes about an hour to create an image on EC2, which is way too long) and launched 5 spot instances from it. My initial strategy was to use S3 itself to track which segments were already finished and which were not, but every approach I tried for that failed. In the end I had to implement an external semaphore service to lock segments for a given script for some time. With the external semaphore service I managed to run 5 simultaneous EC2 instances that worked exactly as I expected, except that instead of 5 minutes it took up to 10 minutes to process a segment (m3.medium is a really limited instance).
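For reference, here is one way such a semaphore/coordinator could be sketched; this is an assumption about the general idea, not my actual implementation. A tiny PHP endpoint backed by a database hands out the next unprocessed WARC path and marks it as locked; a lock older than 30 minutes is considered stale and can be handed out again.

// claim.php, a hypothetical coordinator endpoint. It assumes a SQLite table
// segments(path TEXT, status TEXT DEFAULT 'todo', locked_at INTEGER).
// Each worker calls this endpoint, receives one WARC path, processes it,
// and then reports completion (not shown here).
$db = new PDO('sqlite:/var/data/segments.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->beginTransaction();

// Pick one unclaimed segment, or one whose lock is older than 30 minutes
$row = $db->query(
    "SELECT rowid, path FROM segments
     WHERE status = 'todo'
        OR (status = 'locked' AND locked_at < strftime('%s','now') - 1800)
     LIMIT 1"
)->fetch(PDO::FETCH_ASSOC);

if ($row) {
    $stmt = $db->prepare(
        "UPDATE segments SET status = 'locked', locked_at = strftime('%s','now') WHERE rowid = ?"
    );
    $stmt->execute([$row['rowid']]);
    echo $row['path']; // the worker downloads and processes this path
}

$db->commit();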

Scaling up Part 2

Now I wanted to start 100 EC2 spot instances. I tried, but hit a 20-spot-instance limit I didn't know anything about. It turns out you need to contact AWS support to get the limit raised, which took a few days: my initial request to raise the limit to 1000 was denied, but a second request for 300 was approved. It then took another half a day for the new limit to become active.

Finally I launched 100 EC2 m3.medium servers! Everything was going great until it wasn't. After around 4 hours of work the servers got slow, about 20 times slower than the initial speed. I killed the instances, waited 30 minutes and started over. After another 4 hours they got slow again. I don't know whether that was some kind of throttling or I was just unlucky, but it looked like a pattern.

In the end I used all 300 servers to get the job done. I had to restart manually at least 10 times, and the whole thing took a couple of days longer than expected.

Getting the results from S3

I stored the script's results in S3. That part was very easy: the AWS SDK for PHP is really simple to use. The only problem was downloading the results back. I would suggest merging the results into one file and zipping it, or not downloading the files from AWS at all, as it took half a day to download 18GB of result files once AWS bandwidth limits kicked in.
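As an example of the download side, here is a hedged sketch of pulling the per-segment result files back from S3 and merging them into a single local file; the bucket and prefix are again placeholders.

// Sketch: list every result object and append its body to one merged file.
// Assumes the same Composer setup as before; bucket/prefix are placeholders.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);
$out = fopen('merged-results.txt', 'w');

// The paginator transparently handles the 1000-keys-per-request limit
foreach ($s3->getPaginator('ListObjectsV2', [
    'Bucket' => 'my-commoncrawl-results',
    'Prefix' => 'results/',
]) as $page) {
    foreach ($page['Contents'] ?? [] as $object) {
        $result = $s3->getObject([
            'Bucket' => 'my-commoncrawl-results',
            'Key'    => $object['Key'],
        ]);
        fwrite($out, (string) $result['Body']);
    }
}
fclose($out);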

The Results

Finally I imported the results into a database (a sketch of the aggregation step follows the list below). There are 4,573,387 different domains using WordPress in the Common Crawl data. This is the list of the most-used themes:

pub 651165 
h4 415293
genesis 69626
twentyeleven 67447
twentytwelve 53274
twentyten 52642
divi 45945
avada 44101
twentyfourteen 34825
twentyfifteen 28499
enfold 24288
twentythirteen 22825
responsive 19185
canvas 19157
default 18907
x 12937
atahualpa 11723
salient 11477
graphene 11445
u-design 10877
twentysixteen 10393
dt-the7 9687
avada-child-theme 9520
premium 9412
headway 9307
suffusion 9264
cherryframework 9239
customizr 9153
sahifa 8996
vantage 8842
hueman 8132
jupiter 7866
bridge 7453
thesis_18 6928
thematic 6840
prophoto5 6726
betheme 6443
divi-child 6396
pinboard 6249
chameleon 6089
weaver-ii 6074
ipinpro 5938
newspaper 5685
mantra 5555
pagelines 5422
sparkling 4954
builder 4880
thesis 4703
karma 4631
enfold-child 4559
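For completeness, this is roughly how the aggregation above can be produced once the results are in a database; the table layout, connection details and even the choice of MySQL are assumptions for illustration, not the exact setup I used.

// Hypothetical aggregation: count distinct domains per theme from a table
// results(domain VARCHAR, theme VARCHAR) and print the top 50.
$pdo = new PDO('mysql:host=localhost;dbname=commoncrawl', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->query(
    'SELECT theme, COUNT(DISTINCT domain) AS domains
     FROM results
     GROUP BY theme
     ORDER BY domains DESC
     LIMIT 50'
);

foreach ($stmt as $row) {
    echo $row['theme'] . ' ' . $row['domains'] . "\n";
}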

By comparing sales numbers on themeforest.net with the number of installations of the same themes in the Common Crawl data, I came to the conclusion that Common Crawl covers approximately 27% of the domains on the web. That was not enough for my business idea, but at least I tried :)

The final bill for analyzing all the data in Common Crawl was $136, much higher than the expected $29.

P.S. At the beginning I had a problem viewing the extracted data file on Windows. Vim works fine on Linux, but all the Windows editors I knew were unable to open a 4GB+ text file. Eventually I was saved by Glogg, which can open large files easily.

P.S.2. Your comments are welcome. If you have any Big Data business ideas, I am open to solving them together. Contact me on LinkedIn or Facebook.
