Scrape an ecommerce dataset with Scrapy, step-by-step

A functional, real-world scraper, with various learnings along the way.

Britton Upchurch
16 min read · Feb 7, 2019

DIY datasets? You’ll probably need to scrape.

Scraping is something a lot of data scientists seem to need at some point, but it isn’t taught in machine learning courses. Like many programming skills, scraping is just one of those things you’re expected to pick up on your own.

So I hope this post can help those new to scraping and offer encouragement to anyone feeling intimidated. As a beginning coder, scraping seemed overwhelming and messy; I also felt that most tutorials only built little toy scrapers that weren’t really helpful.

So in the spirit of writing the blog post I wish I found when I started, I’m going to lay out step-by-step (in great detail) how I scrape ecommerce datasets in csv format with Scrapy, from the very beginning to the final .csv output.

If you’re new to coding (like me), you may have a few more hurdles (for example, what in the world is XPath?). The only real background you need is Python, since Scrapy is just a few Python files that you customize for the site you’re scraping.

Disclaimer: I am on a mac with Anaconda, and some of these steps won’t translate exactly to different setups. I assume Scrapy is fairly universal, but something to keep in mind if you hit problems. For system-specific instructions, see Scrapy’s official documentation.

If it’s online, it’s scrape-able.

Scraping is simply pulling down and searching through the code of a website.

It’s easy to forget that every website is made of raw HTML, some CSS, maybe some JSON or other stuff. And not only that: the code is right there for you to dig through!

The code is right there for the searching.

If you’re in Chrome, right click any page you like and hit ‘Inspect’: Chrome will show you the code that it’s executing to build that page. So if you want to, say, gather every price for every product on Walmart.com, the info is just sitting there buried in the source code.

This was a weird realization for me: I felt that there must be some data protections that I had to get around, or that corporate websites wouldn’t let me just…take things. But there it is: most everything that you find online is, by virtue of how the internet works, scrape-able.

So how does scraping work?

Just like your browser gets site code from a server, your scraper sends a request to a server and receives a response made up of the HTML, CSS, etc. that make up that page. Then it digs through the response code to find your info.

Simple, right?

The tricky part is showing your scraper exactly where to find your information in that huge pile of code it receives. That’s XPath, and I’ll talk about that in a bit.

What I’m scraping

I want all the text and numerical data from every product page on an ecommerce site (in this case, I’m scraping Harvey Norman, an Australian retailer).

That means title, category, price, description, features, dimensions– basically everything except the picture of the product and reviews. I want this all in a .csv.

Install Scrapy in a virtual environment: what’s that?

Scrapy strongly recommends installing into a dedicated virtual environment.

I had seen virtualenvs mentioned before, but up to this point never knew what they were, how they worked, or why they’re useful. If that’s you as well, here’s the basic idea:

Every software package comes with slightly different version dependencies (which version of python, pytorch, anaconda, whatever). If, like me, you simply conda or pip install globally and just pray that nothing breaks, your wires are bound to cross before long.

So how does a virtualenv keep your wires separate and organized? It creates a little container of software dependencies that only apply within that one directory! So you can create an env, download the package you want with the configurations for that package, and nothing external will change (nor will any external versions clash with your env’s package).

Great, right?

For me, dependency problems would always pop up at the worst time, right when I’m excited to use some new package, and virtualenvs totally solve that. If you want to learn more, go here.

How do I create a virtualenv for Scrapy?

If you use Anaconda, you’ve already got it installed. Otherwise check out the virtualenv installation docs.

To create your new virtualenv (this will create a folder in Anaconda’s ‘envs’ directory):

conda create --name <WhateverName>

To navigate into your new env directory and activate the environment:

cd anaconda3/envs/<WhateverName>
source activate <WhateverName>

To see all of your existing envs (and which one is currently active):

conda env list
My env list, before activating the Scrapy environment.

Finally you can install Scrapy within your new, activated environment:

conda install -c conda-forge scrapy

We’re ready to start a Scrapy project

Make sure your env is activated, and that you’re in your ‘scrapy’ working directory, then type in your terminal:

scrapy startproject HarveyNorman

This will configure a scraping project template, and create a directory within your scrapy folder called ‘HarveyNorman’ (or whatever you name yours). Inside the ‘HarveyNorman’ directory are:

  • an items.py file
  • some other files you don’t need to worry about
  • a directory called ‘spiders’ where your spider files will live

What is a spider? It’s the little program that actually crawls around the site code and pulls out the data you want. Your spider is just a .py file that defines your own spider subclass. You simply fill in that .py file and save it in the ‘spiders’ directory.

Items.py (this is the easy part)

Items.py has to do with how elements are processed and downloaded. All you need to do in this file is open it up and create a ‘field’ for each element you want: <element> = scrapy.Field()

Here’s mine:
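In outline, it’s just the item class with a field for each element I listed earlier (treat the exact field names below as a sketch; yours can be whatever you like):

import scrapy


class HarveyNormanItem(scrapy.Item):
    title = scrapy.Field()
    category = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    features = scrapy.Field()
    dimensions = scrapy.Field()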

Next, I’ll show you step-by-step how to make your spider.

Starter spider.py file

Scrapy’s startproject creates a tiny template, but I found it not terribly helpful. So here’s an ecomm spider.py template that I will start with, based on some previous ecomm scraping projects.
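In outline, the template is a spider class with an empty declare_xpath and stub parse methods, roughly like this (a sketch only; the class, method and attribute names follow the ones I’ll use throughout this post):

import scrapy


class HarveyNormanSpider(scrapy.Spider):
    name = 'HarveyNorman'
    allowed_domains = ['harveynorman.com.au']
    start_urls = ['https://www.harveynorman.com.au/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.declare_xpath()

    # All the XPaths the spider will need, in one place (empty strings for now)
    def declare_xpath(self):
        self.getAllCategoriesXpath = ""
        self.getAllSubCategoriesXpath = ""
        self.getAllItemsXpath = ""
        self.TitleXpath = ""
        self.CategoryXpath = ""
        self.PriceXpath = ""
        self.DescriptionXpath = ""
        self.FeaturesXpath = ""
        self.DimensionsXpath = ""

    # Main page -> category pages
    def parse(self, response):
        pass

    # Category page -> subcategory pages
    def parse_category(self, response):
        pass

    # Subcategory page -> product pages (and 'next page' buttons)
    def parse_subcategory(self, response):
        pass

    # Scrape the fields from a single product page
    def parse_main_item(self, response):
        pass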

Don’t worry if this doesn’t make a lot of sense yet, but do note these parts:

  • declare_xpath, which describes the XPath for each item I want to scrape (these are all empty strings currently, and I’ll describe XPath later on)
  • below that, some parse functions. These will actually show the spider how to get into each product category, then each product page.
  • finally, a parse_main_item function that crawls the product page itself, collecting each field based on the xpaths I will declare.

Before we make our spider, a brief detour into XPath

Clearly, XPath plays a role in how my spider is going to work. But what is XPath? I didn’t know before this project, so here’s a quick overview:

XPath is a syntax for defining paths into a document. It comes from the XML world, and it has implementations across lots of languages (Python, Java, PHP and others). In our case, we use it to direct our scraper to a specific part of the HTML. Remember how I said the tricky part of scraping is showing the scraper how to find your data in all that code? XPath is the trick: it’s the roadmap that shows the scraper how to find the one little part you want in the forest of an HTML document.

Note: you can use CSS instead of XPath to extract data, but small site changes can break CSS easily (scrapers might call it ‘brittle’) and XPath is easy once you sit with it for a half hour.

A short, incomplete explanation of XPath syntax

There are lots of great overviews on XPath, but the basic idea is that your XPath begins at the top level of the HTML (think <head>, <body>, etc.) and each forward slash points one additional level down into the HTML structure.

Let’s say this is a site’s HTML:

<div>
  <book>
    <title lang="en">Game of Thrones VII: I finally wrote it</title>
    <price>49.99</price>
  </book>
</div>

The XPath for the price would be //div/book/price.

In actual scraping the XPaths can get terribly long and complicated, but you get the idea.
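If you want to see that for yourself, you can run the toy example through Scrapy’s Selector in a Python session (just an illustration, separate from the scraper we’re building):

from scrapy import Selector

html = """
<div>
  <book>
    <title lang="en">Game of Thrones VII: I finally wrote it</title>
    <price>49.99</price>
  </book>
</div>
"""

sel = Selector(text=html)
# //div/book/price walks down to the price element; /text() pulls out just its contents
print(sel.xpath('//div/book/price/text()').extract_first())  # prints 49.99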

How can I get my own XPaths?

Remember how I used Chrome’s inspector to see a site’s HTML? The inspector will also give you XPaths for specific elements!

Note: I’m sure all browsers have code inspectors, but I happen to use Chrome.

Word. Thanks Chrome Inspector.

Go to the page you’re interested in scraping in Chrome, right click on an item you want to extract and click ‘inspect’.

See the highlighted line? That’s your item’s HTML element. Right click that line, and you can copy the XPath for that element directly to your clipboard.

In the spirit of honesty: this copied XPath often doesn’t quite work and I’ll have to fiddle with it before Scrapy pulls the info correctly. That’s why learning a bit about XPath really helps. In any case, if it doesn’t work right off the bat you usually have a decent starting point.

How do I know if my XPath works? The Scrapy Shell!

When I first started scraping, I would test my XPaths by literally running the spider to see if it extracted the right thing (or anything, really).

There is a much, much, muchmuchmuch easier way to test XPaths: the Scrapy Shell.

The shell is a command line debugger that lets you put in any XPath, and instantly see what is returned. So when you pull an XPath off of Chrome’s Inspector, you can just pop it in the Scrapy Shell to check that it does what you want. If it doesn’t? The Scrapy Shell is the easiest place to fiddle around and fix it.

I’ll do one to show you what I mean.

In my terminal (within my activated Scrapy env), I open it like so:

scrapy shell 'https://www.harveynorman.com.au/' --nolog

My shell is now pointed at Harvey Norman’s main landing page.

A shell session only contains the one page that you used to activate it. That is, in this case the shell is examining Harvey Norman’s main page. But if I wanted to test things deeper in (say, on Harvey Norman’s product pages), this shell session wouldn’t have that response. I would need to open a new shell session pointed at one of Harvey Norman’s product pages.

Combining the Inspector and the Scrapy shell

Now that my shell is activated, I’ll grab my first XPath to test.

In the case of Harvey Norman’s main page, I want to scrape each product category (from there I can branch off into subcategories, then products).

So I’ll right click any category, hit ‘inspect’ and copy the XPath right from the inspector.

This is the XPath I copied out: '//*[@id="navMdList"]/ul/li[9]'

Back in the Scrapy Shell, I’ll test it like this:

response.xpath('//*[@id="navMdList"]/ul/li[9]').extract_first()

response.xpath runs the provided XPath against the page and prints whatever it selects, and extract_first() only displays the first element that is extracted, which makes things a lot easier to read if you’re scraping 45 objects on a page.

What am I hoping to see here? Eventually, a list of the category href links that I want my scraper to follow. For now, I’m hoping to just get any HTML elements, because that would show that my XPath does, in fact, work.

So what did that XPath return?

Nothing. Hmmm.

A few hours of XPath fiddling later…

This XPath turned out to be a tough nut to crack. In the end it took me a long time to nail down an XPath that leads to the different categories. This wasn’t the best example of finding XPaths with Chrome’s Inspector, but it does illustrate the sometimes difficult nature of scraping: sometimes everything goes smoothly, and sometimes each step is really difficult for some reason. And to be fair, though this one was hard, most XPaths are pretty straightforward.

In my experience XPaths are the hardest part of scraping, partly because my approach is super hacky (I simply use trial and error and google until something works). If you know a better, more scientific way to do it, by all means write to me :)

Anyway– here’s the XPath I ended up with:

"//div[@id='wrapper']/div[1]/div[1]/div[1]/div[1]/div[@class='col-md-3']/ul/li/a/@href"

And in the shell, this one returns what I’m after: the category href links. That @href at the end of the XPath? It singles out the href attribute (there are other helpful XPath shortcuts as well, like text(), which extracts just the text).

If you’re going to be doing a fair bit of scraping, learning XPath is a great time investment. I’m gonna move on, but you can find more details about the Scrapy shell here.

Back to the spider.py file

Descriptive variable names will help out later.

Now that I have my XPath, I’ll drop it in declare_xpath in the spider.py file I created earlier. Since this XPath is the first ‘move’ of my spider (from the main page, moving down into each product category), I’m putting it on top of my XPaths list.

That XPath was just one step. Now I’ll repeat this process for each ‘level’ of the site. Next is an XPath to get all subcategory pages from each category page. So I’ll navigate in my browser to the category page, copy the XPath from the inspector for one of the subcategories, and verify the XPath in a Scrapy shell (you have to start a new shell session with a category page of the website– exit() will close your old session). This XPath worked right off the bat.

Does each category page need its own XPath to get to its own subcategories? Nah. Within an ecomm site, categories are mostly built the same way, just with swapped-out content. So one XPath will work for most category pages, and if I miss one or two I’m not gonna sweat it.

I’ll skip all the details of finding the XPaths for each level and get straight to the product page scraping. So at this point, my declare_xpath looks like this:
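In outline, it’s just the three navigation XPaths (I’m only spelling out the category one, since that’s the one I walked through above; the other two are placeholders):

def declare_xpath(self):
    # Main page -> category pages (the XPath I worked out above)
    self.getAllCategoriesXpath = "//div[@id='wrapper']/div[1]/div[1]/div[1]/div[1]/div[@class='col-md-3']/ul/li/a/@href"
    # Category pages -> subcategory pages (found the same way; details skipped)
    self.getAllSubCategoriesXpath = "..."
    # Subcategory pages -> product pages (found the same way; details skipped)
    self.getAllItemsXpath = "..."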

You can see that I went three levels in to get to the product pages: First to categories, then to subcategories, and finally to product pages (or ‘items’).

Product page XPaths

Now that the scraper has XPaths to get through the site and find each product, I’m going to show it where on each product page to find the elements I’ve been after this whole time: title, category, description, all that stuff.

This works almost exactly like the XPaths above, except that you aren’t looking for an href attribute to follow, but rather some text or numbers within an HTML element.

I’m going to detail scraping the product title, but this method is the same for any text or number element.

First, I’ll right click the element on the page, ‘inspect’ it, and copy the XPath.

Then I open a Scrapy shell session and test the XPath. If you see the product name in your shell response, you’re on the right track. At this point I’m typically adding /text() to the end of my XPaths, to extract only the words or numbers of that element.

You can see below how my initial XPath pulled a span element that contained the name I want, then I just tried the same XPath again, but with text() on the end. That’ll do just fine.
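The exchange goes roughly like this (the XPath here is illustrative, not the site’s real one):

# First try: returns the whole <span> element
response.xpath('//h1[@class="product-title"]/span').extract_first()

# Second try, with /text() on the end: returns just the title string
response.xpath('//h1[@class="product-title"]/span/text()').extract_first()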

So I repeat that for each element on the product page I want to capture, put each one into declare_xpath, and end up with something like this:
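In other words, declare_xpath grows by one line per product-page field, each ending in /text() (the actual XPath strings are placeholders below):

    # Added inside declare_xpath: one XPath per product-page field
    self.TitleXpath = ".../text()"
    self.CategoryXpath = ".../text()"
    self.PriceXpath = ".../text()"
    self.DescriptionXpath = ".../text()"
    self.FeaturesXpath = ".../text()"
    self.DimensionsXpath = ".../text()"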

That’s it for XPaths! Well, for now. I’ll usually have to tweak some of them later on, once I’m actually scraping.

Now I’ll move on to the parse functions that actually walk the spider through the site, using the XPaths as its guide.

Parsing methods

What are these parse methods? They tell the spider how to follow the various XPaths I defined in order to crawl through the site and retrieve the target elements.

That is to say: should the spider use this XPath to open another page? Is this XPath an element it wants to extract and save? What does it do about products hidden behind ‘next page’ buttons? I’ll walk through it all step-by-step.

Getting from the main page to product pages

At the heart of it, navigating your spider to the product pages is just connecting the dots between the XPaths. When your spider is called, it executes the parse method and assumes all other parsing will be called by that method (so don’t rename parse, or your spider won’t work).

I’ll start by showing the spider how to get from the main page to each category page:
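Roughly, that first method looks like this (a sketch matching the description that follows):

def parse(self, response):
    # For every category link on the main page, open it with parse_category
    for href in response.xpath(self.getAllCategoriesXpath):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_category, dont_filter=True)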

This tells the spider, ‘take the links that are returned from getAllCategoriesXpath, and call parse_category on each link.’

(I’ll write parse_category next)

What is that urljoin line doing? response.xpath returns a selector element that you can’t pass along directly as a URL; urljoin(href.extract()) pulls out the link and formats it into a full URL the spider can actually follow.

You’ll also notice dont_filter=True: this is usually unnecessary, but it has to do with Scrapy’s default filtering of duplicate requests. Later on when I actually tried to scrape, the spider refused to follow links because they registered as duplicates, and once I set dont_filter to True everything worked fine. So I put this in every step. You probably don’t need it, but if you find that your spider opens and then immediately closes without following any links, try this.

Now I’ll write parse_category:
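Same pattern, one level further down (again, a sketch):

def parse_category(self, response):
    # For every subcategory link on this category page, open it with parse_subcategory
    for href in response.xpath(self.getAllSubCategoriesXpath):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_subcategory, dont_filter=True)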

Similarly, this method says, ‘on this category page, getAllSubCategoriesXpath is going to return a bunch of links; for each of them, call parse_subcategory’ (which I’ll write next).

‘Next page’ buttons

Now, parse_subcategory gets interesting: these subcategory pages list products in pages, so many products are hiding behind ‘next page’ buttons. How do I tell the scraper to follow those buttons and scrape the rest of the items?

On any subcategory page, I’ll inspect the ‘next page’ button, copy its XPath, verify it in the Scrapy shell, and then put it into my parsing method like this:
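Roughly like this (I’m calling the next-page XPath NextPageXpath here; whatever you name it, declare it in declare_xpath with the others):

def parse_subcategory(self, response):
    # Open every product link on this subcategory page with parse_main_item
    for href in response.xpath(self.getAllItemsXpath):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_main_item, dont_filter=True)

    # If there's a 'next page' button, follow its link and parse that page too
    next_page = response.xpath(self.NextPageXpath).extract_first()
    if next_page is not None:
        url = response.urljoin(next_page)
        yield scrapy.Request(url, callback=self.parse_subcategory, dont_filter=True)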

So the first section will look similar to my previous steps: on each subcategory page the spider will get the product href links from getAllItemsXpath and on each link call parse_main_item (which I’ll write next).

Then the function continues by checking the next_page XPath, to see if there is a next page to follow (that is, if it’s not None). If so, it follows that link and calls parse_subcategory again!

Note: it’s not part of this site so I won’t tackle it here, but infinite scrolling is also a common hiccup at this stage. I found this blog post really helpful for that.

Before parsing the product pages: a few helper functions

Put these imports at the beginning of your spider.py:

import scrapy
from bs4 import BeautifulSoup
import re

And then put these methods at the bottom, after parse_main_item:
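They’re three small text-cleaning helpers, roughly like this (a reconstruction based on how they get used further down; the exact bodies may differ, but this is the idea):

def listToStr(self, my_list):
    # .extract() returns a list of strings; join them into one string
    return ' '.join(map(str, my_list))

def parseText(self, text):
    # Strip out HTML tags with BeautifulSoup and collapse runs of whitespace
    soup = BeautifulSoup(text, 'html.parser')
    return re.sub(r'\s+', ' ', soup.get_text())

def cleanText(self, text):
    # Tidy up stray newlines, tabs and leading/trailing spaces
    return text.replace('\n', ' ').replace('\t', ' ').strip()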

I’m going to use these when parsing elements to clean up and parse all the raw text that I scrape.

How do I parse the product pages?

Remember that Items.py file from when you created your spider? That item class you defined is what you’ll use to save out elements that you extract in parse_main_item . So first, define your item object:

item = HarveyNormanItem()

Then for each element you want to scrape, do this:

<element> = response.xpath(self.<CorrespondingXPath>).extract()
<element> = self.cleanText(self.parseText(self.listToStr(<element>)))

That will extract the element as well as format and clean it, making it easy to work with later. If an element has multiple items– for instance, feature lists or product specs– they’ll be easier to work with later if you separate each list item:

<element> = response.xpath(self.<CorrespondingXPath>).extract()
<element> = ','.join(map(str, <element>))
<element> = self.cleanText(self.parseText(<element>))

Once you’ve got all the elements you want declared, attach them to the item with item['<element>'] = <element>

Then return item. That’s it!
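Put together, parse_main_item ends up roughly like this (a sketch: the field names match the items.py sketch earlier, and you’ll need to import your item class at the top of spider.py, e.g. from HarveyNorman.items import HarveyNormanItem):

def parse_main_item(self, response):
    item = HarveyNormanItem()

    # Single-value fields: extract, then clean
    title = response.xpath(self.TitleXpath).extract()
    title = self.cleanText(self.parseText(self.listToStr(title)))

    price = response.xpath(self.PriceXpath).extract()
    price = self.cleanText(self.parseText(self.listToStr(price)))

    # Multi-item fields (feature lists, specs): join with commas first
    features = response.xpath(self.FeaturesXpath).extract()
    features = ','.join(map(str, features))
    features = self.cleanText(self.parseText(features))

    item['title'] = title
    item['price'] = price
    item['features'] = features
    return item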

What if I want to extract the image too?

Honestly, I haven’t gotten it to work (yet) but Scrapy has a way to scrape images alongside everything else using their ImagePipeline. Learn more here and here, and if you get stuck, this thread or this blog post both deal with ImagePipeline. Once I can get it to work I’ll update this post with what I’ve learned.

Putting it all together

Here’s my spider.py file in full: it’s just all of the pieces above (declare_xpath, the parse methods, and the helper functions) assembled into one class.

Scraping and troubleshooting

To run the spider and output into a .csv file, type in the terminal:

scrapy crawl <yourSpiderName> -t csv -o <outputName>.csv

In my experience it almost never works without a bit of debugging. The progress log and error tracebacks from the terminal are very helpful, and it will help that you know how the different parts of the spider fit together.

My approach to debugging is simply to run the spider and watch the log to see what isn’t working properly, then hit ‘Ctrl+C’ twice to stop the spider. Then I’ll debug and try it again.

A problem I often have is XPaths that either aren’t extracting properly, or are not linking the parsing methods correctly. These can be tricky to spot because sometimes your spider will work, but when you look closely there may be one or two items pulling empty data. Here’s an example:

See those empty fields? That’s a sign that my XPaths for descriptions and features have some issues. In this case I’ll grab this particular page (it’s directly above in the spider log), open it in the Scrapy shell as well as in my browser, and test the XPaths to see what’s wrong.

Once everything seems to be working, I’ll keep an eye on the scraping log for a few minutes just to make sure everything is ship shape before I walk away and let it run.

Final Output

After about 20 minutes the spider finishes up, and what have I got? A nice .csv file with all my fields nicely populated! I always have a blank field here and there, but in my experience that’s just scraping for you.

A bit of my final .csv

That’s that!

Creating a dataset can be a lot of work, but I have found it exciting to break away from the same old datasets that everyone else has already squeezed dry.

I hope this walkthrough helps another data science student on their learning journey. There is still a lot between this scraped data and a successful model: cleaning, formatting and combining scraped data, for example, is a huge task. But this is the first step.

If any of you have questions, thoughts, or scraping advice for me, please comment here or find me on Twitter. And if you got this far, thanks very much for reading.
