Building a Newegg Web Scraper (Part 2)
Bringing everything together
Adding the imports
The code above imports the urlopen function from urllib.request, the BeautifulSoup function from the bs4 (Beautiful Soup) library, and the csv module. You will see how each of these is used as we continue with the code.
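A sketch of that import block, assuming the request and soup aliases that the rest of the article uses:

```python
# Import the tools used throughout the scraper.
# urlopen is aliased to "request" and BeautifulSoup to "soup",
# matching the shorthand names used later in this article.
from urllib.request import urlopen as request
from bs4 import BeautifulSoup as soup
import csv
```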
Getting the HTML and performing parsing
In Python, you can import a function under a different alias. Therefore, request(url) and soup(page_html, 'html.parser') are actually urlopen(url) and BeautifulSoup(page_html, 'html.parser'), respectively.
On lines 3 through 5, we are opening the URL, grabbing the HTML, and closing out the connection. Line 7 turns the HTML code from our URL into a parsed document. Line 8 calls find_all() to retrieve the HTML code directly associated with the graphics cards that we want to scrape. In this case, find_all() works by finding every div element whose class attribute has the value "item-cell". The function returns a list of all the individual div elements that matched those attribute specifications.
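A sketch of those steps, using the aliases from the imports section. To keep it self-contained, it parses a small hard-coded HTML sample; the commented-out lines show the live-fetch version (the exact Newegg URL is an assumption):

```python
from urllib.request import urlopen as request  # used for the live page
from bs4 import BeautifulSoup as soup

# Against the live site (lines 3 through 5 in the article) you would do:
#   client = request("https://www.newegg.com/p/pl?d=graphics+cards")  # open the URL
#   page_html = client.read()                                         # grab the HTML
#   client.close()                                                    # close the connection
# Here we use a tiny hard-coded sample so the sketch runs offline.
page_html = """
<div class="item-cell">card one</div>
<div class="item-cell">card two</div>
<div class="other">not a card</div>
"""

# Line 7: turn the HTML into a parsed document.
page_soup = soup(page_html, "html.parser")

# Line 8: collect every div whose class is "item-cell" --
# each one wraps a single graphics-card listing.
containers = page_soup.find_all("div", {"class": "item-cell"})
print(len(containers))  # 2 -- one entry per matching div
```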
In the GIF above, you can see that navigating to the div element with class="item-cell" highlights the entire MSI graphics card. We need to find all of these specific div elements because each one contains an individual graphics card.
Note: I right-clicked and selected "Inspect" while the cursor was to the right of the "View Details" button (the video recording did not capture that action). This is how I brought up the HTML source for the website.
Setting up the CSV to write to
In the code above, we are setting up a CSV file to store our scraped data. The writerow() call creates our first row, consisting of four columns; this row indicates what information is stored in each column. Later in the code, we will call writerow() again for each graphics card to add the scraped data to the respective columns.
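A sketch of that setup (the products.csv filename is an assumption):

```python
import csv

# Open (and create) the output file; newline="" prevents the csv module
# from producing blank rows between records on Windows.
out_file = open("products.csv", "w", newline="", encoding="utf-8")
file_writer = csv.writer(out_file)

# The first row labels the four columns that every later
# writerow() call will fill with scraped values.
file_writer.writerow(["Brand", "Product Name", "Price", "Shipping"])
```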
Ad-detection, data retrieval, and exception handling
Note: The for loop on line 1 would be written underneath the line file_writer.writerow(['Brand', 'Product Name', 'Price', 'Shipping']) from the "Setting up the CSV to write to" section.
Line 2 contains the check for the infamous ad detection. When searching for graphics cards on the Newegg website, you will notice that there is an advertisement inserted amongst all the other graphics cards. The graphics cards on the Newegg website are marked up as div elements with the attribute class="item-cell". However, the advertisement is marked up with this same attribute, which means that the first call to find_all() includes the advertisement in the returned list. Therefore, we have to check for a different indicator (in this case "txt-ads-link") to ensure that our code only executes on an actual graphics card.
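A sketch of that filter on a hand-written sample (the sample markup is mine; the "txt-ads-link" class is the ad indicator described above):

```python
from bs4 import BeautifulSoup as soup

# Hypothetical sample: one real product cell and one sponsored-ad cell.
# Both share class="item-cell", so find_all() returns both of them.
sample = """
<div class="item-cell"><a class="item-title">Real GPU</a></div>
<div class="item-cell"><a class="txt-ads-link">Sponsored ad</a></div>
"""
containers = soup(sample, "html.parser").find_all("div", {"class": "item-cell"})

products = []
for container in containers:
    # Skip any cell that contains the ad indicator class.
    if container.find("a", {"class": "txt-ads-link"}) is not None:
        continue  # this cell is an advertisement, not a graphics card
    products.append(container.find("a", {"class": "item-title"}).text)

print(products)  # ['Real GPU'] -- the ad cell was filtered out
```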
Lines 4 through 8 are retrieving the brand name and product name.
Note: Variables that end in "tag" (brand_tag, title_tag, etc.) store the specific HTML tag that contains the data in question (brand name, product name, etc.).
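A sketch of that retrieval on a stripped-down item-cell. The class names ("item-brand", "item-title") and the brand living in the img tag's title attribute are assumptions modeled on Newegg's markup at the time of writing:

```python
from bs4 import BeautifulSoup as soup

# A stripped-down, hand-written item-cell for demonstration.
cell = soup("""
<div class="item-cell">
  <a class="item-brand"><img src="msi.png" title="MSI"></a>
  <a class="item-title">MSI GeForce RTX Gaming Card</a>
</div>
""", "html.parser")

# *_tag variables hold the HTML tag that contains the data in question.
brand_tag = cell.find("a", {"class": "item-brand"})
brand = brand_tag.img["title"]        # brand name sits in the img's title attribute

title_tag = cell.find("a", {"class": "item-title"})
product_name = title_tag.text         # product name is the link's text
```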
Lines 10 through 28 are retrieving pricing and shipping information.
Note: Line 10 accounts for graphics cards that are out of stock. If a graphics card is out of stock, then we have to follow a different procedure to obtain its pricing and shipping information.
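A sketch of that two-path logic. The class names ("item-promo", "price-current", "price-ship") and the use of a promo element to detect out-of-stock cards are assumptions modeled on the site's markup:

```python
from bs4 import BeautifulSoup as soup

def price_and_shipping(cell):
    """Return (price, shipping) for one item-cell; class names are assumptions."""
    # An out-of-stock card has no normal price block, so handle it separately.
    if cell.find("p", {"class": "item-promo"}) is not None:
        return "OUT OF STOCK", "N/A"
    price_tag = cell.find("li", {"class": "price-current"})
    ship_tag = cell.find("li", {"class": "price-ship"})
    return price_tag.text.strip(), ship_tag.text.strip()

# Hand-written samples: one in-stock cell, one out-of-stock cell.
in_stock = soup('<div><li class="price-current">$499.99</li>'
                '<li class="price-ship">Free Shipping</li></div>', "html.parser")
sold_out = soup('<div><p class="item-promo">OUT OF STOCK</p></div>', "html.parser")
```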
The exception handling is a key component of this web scraper. I have set it up so that if an error occurs, the name of the error, the error message, and the cell (graphics card) that caused the error are printed to the console/terminal. Web scraper code is extremely prone to bugs and errors because websites (especially Newegg) are constantly being updated; code that was working perfectly fine before will hit a roadblock when the slightest change is introduced (due to the change not being accounted for). Although errors will occur at some point, the exception handling allows the program to deal with the error and continue scraping the graphics cards that have no underlying issues. The finally statement executes regardless of whether an exception is thrown, ensuring that data is written to the CSV file.
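A sketch of that try/except/finally pattern. The dicts below are stand-ins for parsed item-cells (the second is deliberately missing fields so it raises a KeyError), and rows stands in for the CSV writer:

```python
# Each cell is processed inside try/except/finally so one broken
# listing cannot stop the whole scrape.
cells = [
    {"brand": "MSI", "name": "RTX card", "price": "$499.99", "ship": "Free"},
    {"name": "broken listing"},  # missing keys will raise KeyError
]

rows = []
for cell in cells:
    brand = name = price = ship = "N/A"
    try:
        name = cell["name"]
        brand = cell["brand"]   # raises KeyError for the broken listing
        price = cell["price"]
        ship = cell["ship"]
    except Exception as error:
        # Print the error's name, its message, and the offending cell.
        print(type(error).__name__, error, cell)
    finally:
        # Runs whether or not an exception was thrown; in the real scraper
        # this is where file_writer.writerow([...]) goes.
        rows.append([brand, name, price, ship])
```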
I will be fixing any bugs and maintaining/updating the web scraper code on my GitHub (which will be linked to in the references below). Each time you run the code, there seems to be some new nuance that Newegg has added to their site. However, I will keep trying to solve the recurrent issues that come up along the way.
References
Data Science Dojo. (2017, January 6). Intro to Web Scraping with Python and Beautiful Soup [YouTube video]. Retrieved from https://www.youtube.com/watch?v=XQgXKtPSzUI&t=205s
My GitHub repo for the web scraper code.