Recently, I had to gather some textual data for my Natural Language Processing project, which rates restaurants based on their food, ambience, service and other important factors. You might think, “Oh, it’s like all the other NLP projects where sentiment analysis is performed on the comments and a score is generated.” Yes, it’s the same, but there is a catch: I won’t be using comments from the general public but the reviews and ratings of food bloggers.
Now I needed a dataset that met my requirements, so I started searching the internet, which has a glut of information, but I wasn’t able to find any dataset that fit the problem at hand. There are many datasets available on the internet for various purposes, like:
- Reddit dataset: consists of every publicly available Reddit comment
- IMDb movie dataset: textual reviews of movies
- Film Corpus: consists of written screenplays
One can use these datasets any way they want: to develop a chatbot, for sarcasm detection using NLP (for this the Reddit dataset is a gem), or for any other application of choice. Many other online platforms, like Kaggle, Data.world and data.gov, also offer various datasets.
After all the futile efforts to find a relevant dataset, I decided to scrape some food blogs, as I had read about web scrapers and spiders earlier and had some theoretical knowledge of how they work. I started searching for example articles and videos to get some practical knowledge, and a few of them helped a lot in understanding the convoluted concepts behind web scrapers. While there are plenty of theoretical tutorials, only a handful of articles apply those concepts to a practical problem. Many scraping libraries, like BeautifulSoup and Scrapy, are available to help us. For me BeautifulSoup was the easiest to understand, so I started with Beautiful Soup.
There are a handful of libraries you need to install beforehand.
To install them on Linux, type the following command:
pip3 install -U beautifulsoup4 pandas
I am using Jupyter Notebooks, as it is more convenient to have live output of the code and it also offers many other features. You could work with any editor or IDE of your choice, but keep in mind that we are working with Python 3 and some of the libraries may not be the same in Python 2.
Disclaimer: I won’t be showing the name of the website I am scraping as there are many legal implications to this and there is a thin gray line between collecting information and stealing information.
Finally, let’s CODE:
Let’s break this down:
We need Python packages like pandas and Beautiful Soup, plus PIL (installed as the Pillow package) if you need to download the images posted on the website you are scraping.
First we import all the packages. After that we store the URL of the website in a variable, use urlopen() to open that link, and pass the result to the BeautifulSoup function, which parses the HTML of the page and returns a soup object.
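The step above could look roughly like the following. This is a minimal sketch and not the original notebook code: the helper name `get_soup`, the `SAMPLE_HTML` string and the choice of the built-in `html.parser` are my assumptions, and the sample markup stands in for a real page so the snippet runs without touching any website.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_soup(url):
    """Open `url` and return the parsed page as a BeautifulSoup object.

    Hypothetical helper; the name is not taken from the original code.
    """
    with urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


# Stand-in markup (an assumption) so the example works offline.
SAMPLE_HTML = """
<html><body>
  <h3 class="entry-title"><a href="someLink">TITLE:</a></h3>
</body></html>
"""

# Parsing a string works the same way as parsing a downloaded page.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.h3.a["href"])  # someLink
print(soup.h3.a.text)     # TITLE:
```

For a real page you would call `get_soup()` with the URL of the site you are allowed to scrape and work with the returned soup object in the same way.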
Once we have the soup object, we can get our desired data using the various function calls provided by BeautifulSoup. To get the data you need from a website, you must have a basic understanding of HTML tags and classes and how they work.
Here in my case, the website I was scraping had titles and links inside an <a> tag, which was inside an <h3> tag with the class entry-title:
<h3 class="entry-title"><a href="someLink">TITLE:</a></h3>
So I used the find_all() method available in BeautifulSoup to get all the <h3> tags with class equal to entry-title.
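As a rough illustration of that lookup, here is a small sketch. The sample markup is made up for the example (an assumption, since the real site is not named), and `title_links` is used as the name of the resulting list.

```python
from bs4 import BeautifulSoup

# Made-up listing page: two matching headings and one that should be skipped.
SAMPLE_HTML = """
<h3 class="entry-title"><a href="link-one">First title</a></h3>
<h3 class="other">Not a post title</h3>
<h3 class="entry-title"><a href="link-two">Second title</a></h3>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# `class` is a reserved word in Python, so BeautifulSoup takes `class_`.
title_links = soup.find_all("h3", class_="entry-title")

print(len(title_links))  # 2
```

Only the headings whose class is entry-title are returned, so the unrelated `<h3>` is left out.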
Now, from those tags I had to find the link and title associated with each one, then open those links like we did earlier using urlopen() and again scrape all the content (the review) from each link.
I knew that every h3 tag with the class entry-title had an <a> tag with the link available inside its href, so inside a for loop over the title_links list I looked up the href and stored its value in a _link list. In the same loop I got the title related to every link by calling .text.strip() on the tag (.text comes from BS4, and strip() is the usual Python string method). Now I had the title of the review/restaurant and the link to the whole review document.
To get the whole review document, I used the same for loop to open all the URLs I had stored in the _link list with urlopen() and passed each response to BeautifulSoup to get a soup object. Scraping the review document was easy because the textual data was in <p> tags, so again I used the find_all() function. Now here is the catch: the text from the <p> tags is stored in a list, and to get the whole document we need to join all the strings in that list. To achieve this, we first create an empty string and then, using a for loop over the list of strings, keep adding each string to it, so that in the end it holds the whole content.
In the second image you may have seen that I declared a Python dictionary, links, which hasn’t been used until now. We will add the title, link and whole post as a list of values inside the links dictionary, with the indices as the keys.
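To make the shape of that dictionary concrete, here is a small sketch with made-up values. The `titles`, `_link` and `posts` lists are placeholders I am assuming for the scraped data; only the `links` dictionary and its index keys come from the description above.

```python
# Placeholder data standing in for what the scraper collected.
titles = ["First title", "Second title"]
_link = ["review-one", "review-two"]
posts = ["Great food. Slow service.", "Lovely ambience."]

links = {}
for index, (title, link, post) in enumerate(zip(titles, _link, posts)):
    # Key: running index. Value: [title, link, whole post].
    links[index] = [title, link, post]

print(links[0])  # ['First title', 'review-one', 'Great food. Slow service.']
```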
Once you run the script you will clearly understand what I mean.
We will store the scraped data in a .csv file using the above code.
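One possible way to do this with pandas is sketched below. The column names, the sample rows and the reviews.csv file name are assumptions for the example, and the file is written to the system temp directory just to keep the sketch tidy.

```python
import os
import tempfile

import pandas as pd

# Placeholder dictionary in the shape described earlier:
# index -> [title, link, whole post].
links = {
    0: ["First title", "review-one", "Great food. Slow service."],
    1: ["Second title", "review-two", "Lovely ambience."],
}

# Each dictionary key becomes a row; the list values become the columns.
df = pd.DataFrame.from_dict(
    links, orient="index", columns=["title", "link", "post"]
)

# Hypothetical output location; any writable path works.
csv_path = os.path.join(tempfile.gettempdir(), "reviews.csv")
df.to_csv(csv_path, index=False)

print(df.shape)  # (2, 3)
```

Passing `index=False` leaves the numeric keys out of the file; drop it if you want to keep them as the first column.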
This is my first article, so if there are any mistakes, feel free to point them out by commenting down below.
If you liked the article, please give a clap or two, or any amount you can afford 😁.
To know more about me, please click here, and if you find something interesting, just shoot me a mail; if possible, we could have a chat over a cup of ☕️.