Scraping files & images using scrapy, scrapinghub and Google Cloud Storage

Aaron Cowper
2 min read · May 18, 2018


Recently I was looking for a simple solution for processing files and images captured during our web scrapes - primarily PDFs and product image files.

We use Scrapy Cloud for all of our automated web scrapes (highly recommended), and it recently added support for Amazon S3 and Google Cloud Storage.

We went with Google Cloud because much of the rest of our stack is with Google, and they offer a $300 free trial credit over 12 months.

Scrapinghub has an article on downloading and processing images, and the Scrapy docs cover it as well, but it took me a while to figure out how to authenticate with Google Cloud from a scrape deployed to Scrapinghub. Neither source covers that part adequately (they just point you at the generic Google Cloud authentication docs), so I decided to publish my solution here.

Google Cloud authentication relies on a JSON file containing the service account key. I found the easiest way to authenticate was to build simple subclasses of the generic FilesPipeline and GCSFilesStore classes defined in the scrapy library.

First, if you have not yet created a Google Cloud Storage bucket and service account, see the documentation here to set them up and download the credentials JSON file. Once downloaded, open the JSON file and copy its contents into the CREDENTIALS variable as shown below.

Then, within pipelines.py in your main project folder, add the following:

from google.cloud import storage
from scrapy.pipelines.files import FilesPipeline, GCSFilesStore


class GCSFilesStoreJSON(GCSFilesStore):
    # Paste the contents of your downloaded service account JSON file here
    CREDENTIALS = {
        "type": "service_account",
        "project_id": "COPY FROM CREDENTIALS FILE",
        "private_key_id": "COPY FROM CREDENTIALS FILE",
        "private_key": "COPY FROM CREDENTIALS FILE",
        "client_email": "COPY FROM CREDENTIALS FILE",
        "client_id": "COPY FROM CREDENTIALS FILE",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://accounts.google.com/o/oauth2/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "COPY FROM CREDENTIALS FILE"
    }

    def __init__(self, uri):
        # Authenticate from the embedded key rather than relying on the
        # GOOGLE_APPLICATION_CREDENTIALS environment variable
        client = storage.Client.from_service_account_info(self.CREDENTIALS)
        # uri looks like 'gs://some_bucket_name/optional/prefix/'
        bucket, prefix = uri[5:].split('/', 1)
        self.bucket = client.bucket(bucket)
        self.prefix = prefix


class GCSFilePipeline(FilesPipeline):
    # Route gs:// URIs to the JSON-authenticated store defined above
    STORE_SCHEMES = dict(FilesPipeline.STORE_SCHEMES, gs=GCSFilesStoreJSON)

    def __init__(self, store_uri, download_func=None, settings=None):
        super(GCSFilePipeline, self).__init__(store_uri, download_func, settings)
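
Hard-coding the private key in pipelines.py works, but you may prefer to keep it out of version control. As a minimal sketch (not part of the original setup; GCS_CREDENTIALS_JSON is an arbitrary name chosen here), the same store could read the JSON from an environment variable at runtime:

import json
import os

from google.cloud import storage
from scrapy.pipelines.files import GCSFilesStore


class GCSFilesStoreFromEnv(GCSFilesStore):
    def __init__(self, uri):
        # Expects the full service account JSON in an env var (hypothetical name)
        credentials_info = json.loads(os.environ['GCS_CREDENTIALS_JSON'])
        client = storage.Client.from_service_account_info(credentials_info)
        bucket, prefix = uri[5:].split('/', 1)
        self.bucket = client.bucket(bucket)
        self.prefix = prefix

Point STORE_SCHEMES at this class instead of GCSFilesStoreJSON and the rest of the setup stays the same.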

Next, enable your custom item pipeline in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.GCSFilePipeline': 1,
}
FILES_STORE = 'gs://some_bucket_name/'
IMAGES_STORE = 'gs://some_bucket_name/'
GCS_PROJECT_ID = 'some_project_id'
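
Two optional settings are also worth knowing about (not part of the original setup, and only available in recent Scrapy versions): FILES_EXPIRES controls how long a previously downloaded file is considered fresh before it is re-fetched, and FILES_STORE_GCS_ACL sets the predefined ACL applied to objects uploaded to the bucket.

# Optional extras in settings.py (illustrative values)
FILES_EXPIRES = 90                   # days before an already-stored file is re-downloaded
FILES_STORE_GCS_ACL = 'publicRead'   # predefined ACL for objects uploaded to GCS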

Then all you need to do is save URLs to a file_urls field within your scrape and the contents will automatically be uploaded to the specified bucket, e.g. (this downloads the Google logo file):

import scrapy


class GoogleSpider(scrapy.Spider):
    name = 'google_logo'
    allowed_domains = ['google.com.au']
    start_urls = ['https://www.google.com.au']

    def parse(self, response):
        item = {}
        # Build an absolute URL for the homepage logo; the files pipeline
        # downloads everything listed in file_urls and uploads it to GCS
        item['file_urls'] = ["{}{}".format(
            "https://www.google.com.au",
            response.xpath("//img[@id='hplogo']/@src").extract_first())]
        yield item
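
Once the pipeline has run, Scrapy adds a files field to the yielded item recording what was stored. The values below are placeholders, but the structure (a list of dicts with url, path and checksum keys, where path is the object name under FILES_STORE) is what the standard FilesPipeline produces:

# Illustrative result item (placeholder values)
{
    'file_urls': ['https://www.google.com.au/<path to logo>'],
    'files': [{
        'url': 'https://www.google.com.au/<path to logo>',
        'path': 'full/<sha1 hash of the url>.png',   # object key inside the bucket
        'checksum': '<md5 of the downloaded file>',
    }],
}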

Note: For image (as opposed to file) processing, replace ‘File’ with ‘Image’ everywhere above and it should work perfectly. The File and Image pipelines are very similar, but (per the Scrapy docs) the Images Pipeline has a few extra features for processing images (a sketch of the image variant follows the list below):

  • Convert all downloaded images to a common format (JPG) and mode (RGB)
  • Thumbnail generation
  • Check images width/height to make sure they meet a minimum constraint
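
As a concrete sketch of that image variant (assuming the GCSFilesStoreJSON store from above, which ImagesPipeline can reuse because it inherits from FilesPipeline; ImagesPipeline also requires Pillow), the pipeline subclass and the optional thumbnail / minimum-size settings would look roughly like this, with the spider populating image_urls instead of file_urls:

# pipelines.py
from scrapy.pipelines.images import ImagesPipeline


class GCSImagePipeline(ImagesPipeline):
    # Route gs:// URIs to the JSON-authenticated store, as with files
    STORE_SCHEMES = dict(ImagesPipeline.STORE_SCHEMES, gs=GCSFilesStoreJSON)


# settings.py (illustrative values)
ITEM_PIPELINES = {
    'myproject.pipelines.GCSImagePipeline': 1,
}
IMAGES_STORE = 'gs://some_bucket_name/'
IMAGES_THUMBS = {'small': (50, 50), 'big': (270, 270)}  # thumbnail generation
IMAGES_MIN_WIDTH = 110                                  # skip images smaller than this
IMAGES_MIN_HEIGHT = 110

Downloaded results are recorded in an images field on the item, analogous to the files field above.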
