Data scraping with Scrapy and Excel

Tommy Zhang
4 min read · Jan 28, 2022


Web scraping has become a really common way to extract information from websites and turn it into structured data. Today, I am going to show you how to use Scrapy, a fast, high-level web crawling and scraping framework, to scrape the Best Buy GPU product page.

The website we are going to scrape from

The finished scraped data in a CSV

First, we have to install the necessary package:

pip install Scrapy

Once the package is installed, we can start building our first Spider.

According to the official Scrapy documentation, Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

In short, we mainly use spiders to extract the data we want from a website's HTML.

I provide my Spider code below.
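Here is a minimal sketch of it. The start URL, User-Agent header, next-page selector, and output filename are placeholders that may need adjusting to the live Best Buy markup:

import scrapy
from scrapy.crawler import CrawlerProcess

class GPUSpider(scrapy.Spider):
    name = "gpu"

    # Browser-like headers so Best Buy serves the normal listing page
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

    def start_requests(self):
        # Placeholder search URL for the GPU listings
        url = "https://www.bestbuy.com/site/searchpage.jsp?st=gpu"
        yield scrapy.Request(url=url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # Each product listing is an <li> tag with the class sku-item
        for gpu in response.css("li.sku-item"):
            yield {
                'product_title': gpu.css('div.sku-title a::text').get(),
                'product_price': gpu.css('div.priceView-customer-price span::text').get(),
                'availability': gpu.xpath(".//button[contains(@class, 'add-to-cart-button')]/@data-button-state").get(),
            }

        # Follow the next-page link until there are no pages left
        next_page = response.css('a.sku-list-page-next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, headers=self.headers, callback=self.parse)

if __name__ == "__main__":
    # Running the file directly writes the scraped items to a CSV
    process = CrawlerProcess(settings={"FEEDS": {"gpus.csv": {"format": "csv"}}})
    process.crawl(GPUSpider)
    process.start()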

Now, let's break down the code:

First, we create a Python class called GPUSpider. It inherits from the parent class scrapy.Spider.

class GPUSpider(scrapy.Spider):

Inside the GPUSpider class, there is a method called start_requests. This method specifies how we are going to send HTTP requests to the Best Buy website. I provide the starting URL and the headers necessary for the GET request.
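A sketch of that method (the search URL and the User-Agent carried by self.headers are illustrative):

def start_requests(self):
    # Placeholder search URL; self.headers carries a browser-like User-Agent
    url = "https://www.bestbuy.com/site/searchpage.jsp?st=gpu"
    yield scrapy.Request(url=url, headers=self.headers, callback=self.parse)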

The next method is called parse. In this method, we specify which elements we want to extract. In my case, I want to extract the price, title, and availability of each item, as shown below.

HTML elements we want to extract

By inspecting the HTML (press F12 or CTRL+SHIFT+J to launch the Chrome DevTools), I saw that each listing is a list item (li) tag with a class called sku-item. The title of the product is a div tag with the class name sku-title. The price is a div tag with the class name priceView-customer-price. The add-to-cart button is a little trickier: I select its attribute called data-button-state, which tells us whether the item is available for purchase or sold out.

# Select the div tag with the CSS class sku-title
'product_title': gpu.css('div.sku-title a::text').get(),
# Select the div tag with the CSS class priceView-customer-price
'product_price': gpu.css('div.priceView-customer-price span::text').get(),
# Select the data-button-state attribute of the add-to-cart button;
# the leading "." keeps the XPath relative to the current listing
'availability': gpu.xpath(".//button[contains(@class, 'add-to-cart-button')]/@data-button-state").get()

Finally, for this project, I not only want to scrape the first page, but also the following pages that contain GPUs. In Scrapy, the response.follow method lets us request the next URL. In our case, that is the href of the next-page button.
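A sketch of that step at the end of parse (the next-button selector is an assumption about Best Buy's markup):

next_page = response.css('a.sku-list-page-next::attr(href)').get()
if next_page is not None:
    yield response.follow(next_page, headers=self.headers, callback=self.parse)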

Now we are finished building the spider. Let's run the file:

python .\gpu_crawler.py
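This works because the sketch above ends with a small CrawlerProcess block, which lets the spider run as a plain Python script; scrapy runspider gpu_crawler.py -o gpus.csv would produce the same CSV.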

Within 18 seconds, we scraped 12 pages of product listings. Wow!

Optional:

I also developed an Excel macro to run the Python script from an Excel workbook:

Sub start_scraping()

'Declaring shell object
Dim oShell As Object

Dim oExec As Object
Dim sCmd As String

'Command to run the crawler. The workbook is in the same directory as gpu_crawler.py
sCmd = "python " & Chr(34) & ThisWorkbook.Path & "\gpu_crawler.py" & Chr(34)

Set oShell = CreateObject("WScript.Shell")

'run the shell
Set oExec = oShell.Exec(sCmd)
End Sub
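One design note: oShell.Exec returns immediately instead of waiting for the crawl to finish. If you want the macro to block until the CSV is written, oShell.Run sCmd, 1, True waits for the command to return.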

I use the pywin32 package to open the CSV file when the scraping finishes:

import os
import win32com.client as win32

# wb_path is the path to the scraped CSV workbook
if os.path.exists(wb_path):
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    excel.Workbooks.Open(os.path.abspath(wb_path), ReadOnly=1)
    excel.Visible = 1
    excel.WindowState = win32.constants.xlMaximized

That's it. Thanks for reading!
