Published in CodeX

Building a Newegg Web Scraper (Part 1)

Downloading the right tools for the job

What’s a web scraper?

We can write programs that read many kinds of files, such as text (.txt), comma-separated values (.csv), and images (.jpg, .png, .bmp), to name a few. A web scraper is a program that reads the HTML code of a website and extracts information from it. HTML parsing libraries make it possible to work with that source code in an elegant way.
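To make "reading the HTML code" more concrete, here is a minimal sketch that uses only the html.parser module from Python's standard library. The HTML snippet, the "item-title" class, and the TitleCollector name are all made up for illustration. The scraper in this series uses Beautiful Soup instead, which hides most of this bookkeeping.

```python
from html.parser import HTMLParser

# A made-up HTML snippet standing in for a page downloaded from a website.
SAMPLE_HTML = """
<html>
  <body>
    <a class="item-title" href="/card-1">Example Card One</a>
    <a class="item-title" href="/card-2">Example Card Two</a>
    <a class="nav" href="/home">Home</a>
  </body>
</html>
"""


class TitleCollector(HTMLParser):
    """Collects the text of every <a class="item-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a" and dict(attrs).get("class") == "item-title":
            self._inside_title = True

    def handle_data(self, data):
        # Only keep text that appears inside a matching link.
        if self._inside_title:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside_title = False


def collect_titles(html):
    """Feed the HTML through the parser and return the collected titles."""
    parser = TitleCollector()
    parser.feed(html)
    return parser.titles


if __name__ == "__main__":
    print(collect_titles(SAMPLE_HTML))
```

The parser walks the markup tag by tag, and the scraper decides which pieces are worth keeping. Here only the two product links are collected, and the navigation link is ignored.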

Description of the scraper I built

The web scraper I built is based on a tutorial video from YouTube (the video is referenced at the bottom of the article). The video focuses on building a web scraper that collects information from the Newegg website. For each graphics card, it retrieves the brand name, product name, and shipping cost. I wanted to add the ability to scrape the price of each graphics card as well. This small change led to substantial modifications to the original code. Retrieving prices forced me to account for items that are out of stock, since those listings do not show a price. I also had to add code to account for advertisements. Without a check for advertisements, the scraper would miss some of the information it is meant to collect. I will provide more details about these additions in the next part of this series.
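As a rough illustration of why these checks matter, here is a small sketch that works on listings already turned into dictionaries. It is not the code from the tutorial or from the next part of this series. The field names, the sample data, and the rules for spotting an advertisement or an out-of-stock item are all assumptions made for this example.

```python
# Hypothetical listings, as a scraper might hold them after parsing a results
# page. The keys and values are invented for this illustration.
LISTINGS = [
    {"brand": "BrandA", "name": "Card One", "price": "299.99", "shipping": "Free Shipping"},
    {"brand": "BrandB", "name": "Card Two", "price": None, "shipping": "Free Shipping"},
    {"brand": None, "name": "Sponsored banner", "price": None, "shipping": None},
]


def summarize(listing):
    """Return a one-line summary, or None if the entry looks like an ad."""
    # Assumption: an advertisement lacks the brand field a real product has.
    if listing.get("brand") is None:
        return None
    # Assumption: an out-of-stock item has no price to read.
    price = listing.get("price") or "Out of stock"
    return f'{listing["brand"]} | {listing["name"]} | {price} | {listing["shipping"]}'


def summarize_all(listings):
    """Summarize every real product and silently skip the ads."""
    summaries = (summarize(item) for item in listings)
    return [line for line in summaries if line is not None]


if __name__ == "__main__":
    for line in summarize_all(LISTINGS):
        print(line)
```

Without the two checks, the second entry would have no price to report and the third would have no product fields at all, so a scraper that expects every field to exist would stop at the first such entry.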

Tools needed for this build

  • Python
  • Pip
  • Beautiful Soup

Python is a high-level programming language, which means it hides low-level details such as memory management behind a readable, English-like syntax. The language you choose depends on what you want to build: some languages are designed for specific tasks, while others suit a wide variety of projects. Python has a seemingly never-ending number of web scraping libraries to choose from, which makes it a popular choice for building a web scraper.

Pip is Python's package manager. Python ships with a standard library, a collection of modules that comes with every installation. Some of its features are available by default (built-in functions such as print), and the rest become available once you bring them into your project with an import statement. Libraries that are not part of the standard library need to be downloaded separately before you can use them in your project. Pip lets us download and install these external libraries.
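The difference between the standard library and an external library can be checked from Python itself. The sketch below is only an illustration, and is_installed is a helper name invented here. It relies on importlib.util.find_spec, which reports whether Python can locate a module.

```python
import importlib.util
import json  # standard library: importable without installing anything


def is_installed(module_name):
    """Return True if Python can find the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # json ships with Python, so it can be used right away.
    print(json.dumps({"standard_library": True}))
    print("json found:", is_installed("json"))
    # bs4 is the import name of Beautiful Soup. It is an external library,
    # so it is only found after it has been installed with pip.
    print("bs4 found:", is_installed("bs4"))
```

If the last line prints False, the library has not been installed for the Python interpreter that ran the script.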

Beautiful Soup is a Python library for parsing HTML, which makes it a common choice for web scraping. Many libraries can be used for scraping, but a wide variety of tutorials and other sources use Beautiful Soup. If you would rather use another tool, such as Selenium (browser automation) or Scrapy (a scraping framework), you are more than welcome to.

Downloading Python

Get the latest version of Python. If you are on a Windows machine, make sure the folder that contains the Python executable is on your PATH environment variable. The Windows installer offers an option to add Python to PATH for you. If this is not done, commands such as python and pip will not be recognized at the command prompt.
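One way to see whether the Python commands are on your PATH is to ask Python to look them up. This is a small sketch, not an official check. find_on_path is a name made up for this example, and the three command names are just the common ones. If nothing works from the command prompt yet, you can run the script from IDLE, which is installed alongside Python.

```python
import shutil
import sys


def find_on_path(command_names=("python", "python3", "py")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in command_names}


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0], "from", sys.executable)
    for name, location in find_on_path().items():
        print(f"{name}: {location or 'not found on PATH'}")
```

A command reported as "not found on PATH" is one that the command prompt will not recognize until its folder is added to the PATH variable.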

Downloading Pip

Recent Python installers already include pip, so you may be able to skip this step. If pip is missing, save get-pip.py to your computer. In the terminal, change to the directory that contains get-pip.py, then enter this command:

py get-pip.py

The command above runs the file and installs pip on your computer. If “py” does not work for you, write “python” instead. “py” is the Python launcher on Windows, and it is what runs Python files on my system. Use whichever command works on yours.
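To confirm that pip is available, you can ask the Python interpreter to run it as a module. The sketch below wraps the standard "python -m pip --version" command in a helper. The name pip_version is made up here, and the function simply returns None when pip is not installed for that interpreter.

```python
import subprocess
import sys


def pip_version():
    """Return pip's version line for this interpreter, or None if pip is missing."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "--version"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code means the interpreter could not run pip.
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = pip_version()
    print(version if version else "pip is not installed for this Python")
```

Using sys.executable ties the check to the same Python that runs the script, which helps when more than one version of Python is installed.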

Downloading Beautiful Soup

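Run this command in your terminal to install Beautiful Soup. The package is published under the name beautifulsoup4 (bs4 is the name you import in your code):

pip install beautifulsoup4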

You can also check out the Beautiful Soup documentation for further installation information. When you are on the site, scroll down to “Installing Beautiful Soup.”
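Once the installation finishes, a quick way to check that it worked is to parse a tiny piece of HTML. This is a minimal sketch: the sample HTML and the function name are made up, and the function returns None instead of raising an error when Beautiful Soup cannot be imported.

```python
# A made-up page with a single heading, used only to test the installation.
SAMPLE_HTML = "<html><body><h1>Graphics Cards</h1></body></html>"


def heading_via_beautiful_soup(html):
    """Return the first <h1> text, or None if Beautiful Soup is not installed."""
    try:
        from bs4 import BeautifulSoup
    except ImportError:
        return None
    # "html.parser" is the parser bundled with Python, so nothing else
    # needs to be installed for this check.
    soup = BeautifulSoup(html, "html.parser")
    return soup.h1.get_text()


if __name__ == "__main__":
    heading = heading_via_beautiful_soup(SAMPLE_HTML)
    if heading is None:
        print("Beautiful Soup is not installed for this Python")
    else:
        print("Beautiful Soup works, heading found:", heading)
```

If the script reports that Beautiful Soup is not installed even though pip finished without errors, pip may have installed it for a different Python than the one running the script.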

References

Data Science Dojo. (2017, January 6). Intro to Web Scraping with Python and Beautiful Soup. [YouTube video]. Data Science Dojo. Retrieved from https://www.youtube.com/watch?v=XQgXKtPSzUI&t=205s
