Web Scraping in 15 Steps

--

Introduction

To my fellow UCSD students: I am sure you have tried all the UCSD dining halls, and I assume you have an informal ranking of their food. Maybe you like the pizza from Oceanview Terrace, or maybe you don't like the burgers at 64 Degrees. Personally, I have my own ranking of all six dining halls, and I have always wanted to know what others think of them as well. Filled with curiosity, I scraped the web to acquire the reviews of each dining hall on campus, extracting every Yelp review for each one. After scraping the information, I will build a model that helps visualize the data. But first things first: let's begin with this web scraping tutorial.

To start off, what is web scraping?

Web scraping is a method of extracting data from the Internet. In our case, we are trying to grab all the user reviews of UCSD dining halls from Yelp and turn that data into spreadsheet form. One way to do this is to go through every review and paste it into a spreadsheet by hand, but that takes a long time when the dataset is large. In this tutorial, we will scrape around 200 reviews from Yelp, so copying and pasting every review would take a huge amount of time.

What is required for web scraping?

● Basic knowledge of Python

You’ll only need FOR LOOPS, I promise.

● A code editor such as Sublime Text or Atom

● A little understanding of HTML

Although this project is mainly in Python, the web pages we grab information from are written in HTML. Don't worry: we don't need to know much HTML; we just need to locate the right information on the page to scrape.

● Before we begin, we have to install a package. A package is an "add-on" in programming; packages help our code perform specific tasks. Our package is called BeautifulSoup, and it parses HTML documents, which means it understands the commands we'll use to grab information from a web page. If you want to know how BeautifulSoup works, see its documentation. To install BeautifulSoup, type this in your command line:

pip install beautifulsoup4

After the installation is complete, you may now open your text editor.

The Web

1. Importing packages

With BeautifulSoup installed, we are going to import it into our project. Here is the code:
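from bs4 import BeautifulSoup as bs
import urllib.request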

Here, we are importing BeautifulSoup from bs4 (BeautifulSoup 4, its official name) and naming it bs for simplicity. Notice that we also import something called urllib.request on the second line. This is the module we need to open URLs from our script.

2. Make a Spreadsheet

Before we start writing the actual scraping code, we need to set up the spreadsheet file so that we have a place to store our data. Besides the user reviews themselves, we are also storing the reviewer's name, the date of the review, the star rating, and the dining hall the user ate at.
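Here is a sketch of that setup (the filename reviews.csv and the column order are assumptions for illustration):

# Create the CSV file and write the header row.
file = open("reviews.csv", "w", encoding="utf-8")
file.write("name,date,location,stars,review\n")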

3. Make a list

In this tutorial, we are scraping all six of the dining halls at UCSD: Cafe Ventanas, Oceanview Terrace, Pines, 64 Degrees, Canyon Vista, and Foodworx. We are going to make one large list that contains all the cafeteria names, like this:
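# Each inner list holds three items: the URL slug (with "-"), the
# search-query name (with "+"), and the official name. The slugs for
# Oceanview Terrace, Pines, and 64 Degrees appear in the URLs shown
# later; the other three are assumed to follow the same pattern.
all_url = [
    ["cafe-ventanas-la-jolla", "cafe+ventanas", "Cafe Ventanas"],
    ["oceanview-terrace-la-jolla", "oceanview+terrace", "Oceanview Terrace"],
    ["pines-la-jolla", "pines", "Pines"],
    ["64-degrees-san-diego", "64+degrees", "64 Degrees"],
    ["canyon-vista-la-jolla", "canyon+vista", "Canyon Vista"],
    ["foodworx-la-jolla", "foodworx", "Foodworx"],
]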

As you can see above, there is one list containing six smaller lists with three items each. This arrangement makes our task easier when we iterate through the list all_url to obtain each item. The first element contains the cafeteria name with "-" between the words, the second contains the name with "+" between them, and the last element is the official name of each cafeteria. We are going to need this list in the next step.

4. Hypertext Markup Language (HTML)

Before scraping the web, we need to work out a method that extracts all the reviews from Yelp without missing or extra information. Before writing any code, I looked at every cafeteria's information on Yelp. The review page for Oceanview Terrace has 54 reviews in total. Yelp shows 20 reviews per page and lets users click through to the next page to see more, so the 54 Oceanview Terrace reviews are split across 3 pages. Take a closer look at the URL of each page:

Notice the format here; do you see a pattern? This is how Yelp structures its web pages:

The URL ends with an integer (X) that starts at 0 and increases by 20 on every page. (The first page's URL omits start=0, which Yelp drops for simplicity; the link below still gets you to the page correctly.)

https://www.yelp.com/biz/oceanview-terrace-la-jolla?osq=oceanview%20terrace&start=0

With this pattern in hand, let's check the links of the other restaurants.

Pines URLs, with 32 reviews:

https://www.yelp.com/biz/pines-la-jolla?osq=pines&start=0

https://www.yelp.com/biz/pines-la-jolla?osq=pines&start=20

64 Degrees URLs, with 24 reviews:

https://www.yelp.com/biz/64-degrees-san-diego?osq=64%20degrees

https://www.yelp.com/biz/64-degrees-san-diego?osq=64%20degrees&start=20

And so on…

Our pattern matches the other restaurants' reviews as well! But wait, there is more. Notice again how Yelp structures its URLs:

https://www.yelp.com/biz/ + restaurant name-location + “?osq=” + restaurant_name + “&start=” + an integer

5. Do you know For loops?

With this pattern established, we can write a nested for loop that iterates over every restaurant and all of its review pages:
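A sketch of those loops (the upper bound of 60 is an assumption that covers the three pages of the largest dining hall; pages past the last one simply come back empty):

for url in all_url:
    for start in range(0, 60, 20):  # start=0, 20, 40
        page_url = ("https://www.yelp.com/biz/" + url[0]
                    + "?osq=" + url[1] + "&start=" + str(start))
        html = urllib.request.urlopen(page_url)  # open the review page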

Here, our all_url list is used because:

The first element (url[0]) replaces the "restaurant name-location" in our formula.

The second element (url[1]) replaces the "restaurant_name" in our formula.

To summarize, these two for loops let us open every cafeteria's review pages.

After we open each page, we parse its contents with BeautifulSoup.
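Inside the inner loop, that looks like:

soup = bs(html, "html.parser")  # parse the HTML of the current review page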

The Scrape

6. What should we parse?

Now we know how to parse the websites, but we don't want all of the information. In this tutorial, we only want the name of the reviewer, the date of the review, the rating, the review content, and the restaurant being reviewed.

For the restaurant name, we can take the third element of the current all_url entry (url[2]) and assign it to a variable.
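For example, inside the outer loop (the variable name location is an assumption, chosen to match the CSV column used later):

location = url[2]  # official name of the current dining hall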

7. Inspection

For the next step, we find the code that stores all the information in the browser. That code contains every word on the page. Let's start by finding the name of a reviewer: move your mouse pointer directly over the name, right-click it, and select "Inspect".

8. What information can we extract from Inspecting?

After "Inspect" is clicked, the code pops up on the right side of the page, and the blue shaded area refers to the name of the reviewer. In order to get the name, date, and content, we need to back up a bit: as you move your pointer up through the Elements panel, you will see the code that leads you to a user's name, date, and content.

The code for the whole layout

9. How to FindAll()

You may be wondering what the purpose of extracting the layout code is. It turns out that every user review shares one thing in common: the code for the layout is the same! Since it is the same for every review, we'll write the code inside the two for loops from above using soup.findAll(). findAll() has many parameters, but we only need to define the first two: the name and the attribute. In the case of our layout, the name is "li" (in HTML, <li> represents an item in a list), and the attribute is whatever comes after it (lemon…2oFDT).

The code will look like this:
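# "lemon…2oFDT" stands in for the full generated class name from the
# Inspect panel; Yelp's class names are long and change over time.
items = soup.findAll("li", {"class": "lemon…2oFDT"})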

Now the variable items contains all the information we want. However, this variable is a list, so to get the information for a specific review, we use another for loop inside.

10. Scrape it off!

Now we need to find a path that lets us extract the name, date, rating, and content. If we inspect the code for each of them, we'll see something like this:

The code, from left to right, refers to: name, date, content

11. FindAll() again

Take a look at the code: the name, date, and content sit at the end of each attribute, as "text" in the HTML. How do we extract them? We'll use code similar to the findAll() call above; only this time, you'll add [0].text to get the words.
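A sketch of those lines, inside a loop over items (the tag names and class names here are placeholders; substitute the ones from your own Inspect panel):

for item in items:
    name = item.findAll("a", {"class": "lemon…name"})[0].text
    date = item.findAll("span", {"class": "lemon…date"})[0].text
    review = item.findAll("p", {"class": "lemon…comment"})[0].text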

Notice that I did not show the code for extracting the ratings. That's because there is an exception: the rating's code does not have any text after the attributes.

Since there is no text after the attribute, we will instead extract the string inside "aria-label" to get something like "4 star rating" into our CSV file. To do that, we still use findAll(), but this time on the line with the white background. Since the blue portion of the code sits inside the white one, the idea is to find the span first, then locate the div inside that span. The div carries an attribute named "aria-label", so, following BeautifulSoup's structure, we use div["aria-label"] to read the rating.
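A sketch of that lookup, again with placeholder class names:

stars_span = item.findAll("span", {"class": "lemon…stars"})[0]
stars = stars_span.findAll("div")[0]["aria-label"]  # e.g. "4 star rating"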

12. Print print print

As you write the code, use print statements whenever you want to test it. In my case, I put all the print statements at the end so I can get a glimpse of the whole output.
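For example:

print(name, date, location, stars)
print(review)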

13. Export

Now that we have everything extracted from Yelp, we export it to the CSV file on each iteration. We use file.write() to write the name, date, location, stars, and review to the CSV, with "\n" (which starts a new line) at the end to match the format. Close the file when everything is finished, and remember to close it outside of the outermost loop.

Note: inside every review's content, I replace all commas with "|". The reason is that Excel starts a new column whenever it sees a comma in the content, so replacing commas with another symbol keeps each review inside a single column. Although the "|" still appears in the spreadsheet, it is not a major issue because we will not be treating it as an important data value.
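Putting the export together, a sketch (variable names follow the earlier steps):

review = review.replace(",", "|")  # keep Excel from splitting the review
file.write(name + "," + date + "," + location + ","
           + stars + "," + review + "\n")

file.close()  # outside the outermost loop, once everything is written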

14. Test it!

After executing the code, it will print the output and produce a spreadsheet containing all the data we wanted.

The Code

Here is the full version of the code, assembled from the steps above. The lemon… class names are placeholders for Yelp's full generated class names; substitute the ones from your own Inspect panel, and treat the filename and variable names as assumptions:
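from bs4 import BeautifulSoup as bs
import urllib.request

# Step 2: create the CSV file and write the header row.
file = open("reviews.csv", "w", encoding="utf-8")
file.write("name,date,location,stars,review\n")

# Step 3: [URL slug, search-query name, official name] per dining hall.
all_url = [
    ["cafe-ventanas-la-jolla", "cafe+ventanas", "Cafe Ventanas"],
    ["oceanview-terrace-la-jolla", "oceanview+terrace", "Oceanview Terrace"],
    ["pines-la-jolla", "pines", "Pines"],
    ["64-degrees-san-diego", "64+degrees", "64 Degrees"],
    ["canyon-vista-la-jolla", "canyon+vista", "Canyon Vista"],
    ["foodworx-la-jolla", "foodworx", "Foodworx"],
]

# Step 5: open every review page of every dining hall.
for url in all_url:
    location = url[2]
    for start in range(0, 60, 20):
        page_url = ("https://www.yelp.com/biz/" + url[0]
                    + "?osq=" + url[1] + "&start=" + str(start))
        html = urllib.request.urlopen(page_url)
        soup = bs(html, "html.parser")

        # Step 9: one <li> block per review (placeholder class name).
        items = soup.findAll("li", {"class": "lemon…2oFDT"})

        # Steps 10-13: pull out each field and write one CSV row.
        for item in items:
            name = item.findAll("a", {"class": "lemon…name"})[0].text
            date = item.findAll("span", {"class": "lemon…date"})[0].text
            review = item.findAll("p", {"class": "lemon…comment"})[0].text
            stars_span = item.findAll("span", {"class": "lemon…stars"})[0]
            stars = stars_span.findAll("div")[0]["aria-label"]
            review = review.replace(",", "|")  # keep commas out of the CSV
            print(name, date, location, stars)
            file.write(name + "," + date + "," + location + ","
                       + stars + "," + review + "\n")

file.close()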

The Conclusion

In this tutorial, we learned how to extract useful information from the Internet. Web scraping is simple and efficient, and scraping public data is generally allowed (though you should always check a site's terms of service first). After this tutorial, you will know how to extract other kinds of information from the Internet as well: for example, you can use web scraping to collect product prices, news reports, and so on. We just finished extracting the user reviews of UCSD dining halls; now what? In the next tutorial, we will talk about how to use all this data and build a visualization that makes it useful and meaningful.
