AI-generated image with Clip Diffusion, prompt: ‘pythonic soup’

One example of scraping with Beautiful Soup

Iva @ Tesla Institute · Published in Artificialis · 5 min read · Mar 17, 2022


Part two of the Sentiment Analysis with BERT experiment: scoring Yelp reviews scraped from the web with Beautiful Soup

The incredible amount of data on the Internet is a rich resource for any field of research or personal interest. To effectively harvest the data, skill in web scraping is needed.

Web scraping a web page involves fetching it and extracting data from it. Fetching is the downloading of a page (which a browser does when a user views a page). Web crawling is therefore a main component of web scraping: it fetches pages for later processing. Once fetched, extraction can take place. The content of a page may be parsed, searched, and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else.

Automated web scraping can be a solution to speed up the data collection process. The code is written once and can then gather the information many times and from many pages.

Is web scraping legal?

Unfortunately, there’s no cut-and-dried answer to this question. Some websites explicitly allow web scraping. Others explicitly forbid it. Many websites don’t offer any clear guidance one way or the other.

Before scraping any website, it is necessary to look for a terms and conditions page to see if there are explicit rules about scraping. If there are not, then it becomes more of a judgement call.

Remember, though, that web scraping consumes server resources for the host website. If we’re just scraping one page once, that isn’t going to cause a problem. But if our code is scraping 1,000 pages once every ten minutes, that could quickly get expensive for the website owner.

Thus, in addition to following any and all explicit rules about web scraping posted on the site, it’s also a good idea to follow these best practices:

  • Consider caching the content you scrape so that it’s only downloaded once.
  • Build pauses into your code using functions like time.sleep() to keep from overwhelming servers with too many requests too quickly, as sketched below.
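
As a rough illustration of both practices, here is a minimal sketch of a polite scraper; the fetch helper, the in-memory cache, and the two-second delay are assumptions for this example, not code from the original experiment:

# a minimal sketch: cache pages and pause between requests
import time
import requests

cache = {}  # url -> raw HTML, so each page is downloaded only once

def fetch(url, delay=2.0):
    if url not in cache:
        response = requests.get(url)
        response.raise_for_status()  # fail loudly on HTTP errors
        cache[url] = response.text
        time.sleep(delay)  # pause so the host server is not overwhelmed
    return cache[url]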

Beautiful Soup

Beautiful Soup is a Python library for parsing structured data. It allows us to interact with HTML in much the same way we interact with a web page using a browser’s developer tools.

The library exposes a couple of intuitive functions you can use to explore the HTML that’s received. First, Beautiful Soup has to be installed from the terminal:

$ python -m pip install beautifulsoup4
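
The walkthrough below also relies on the requests library to fetch pages; if it is not already available, it can be installed the same way:

$ python -m pip install requests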

In the last blog post, the experiment with BERT showed how the transformer can be used for scoring in the realm of sentiment analysis. The sentiment of a text is scored from 1 to 5, where 1 is the worst mark and 5 is the best. With BERT already instantiated and working, in this part the comments of a web page will be scraped and their sentiment scored.
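
For readers starting from this post alone, a minimal way to instantiate such a model is sketched below; the checkpoint name is an assumption for illustration, and part one may use a different one:

# sketch of loading a 1-to-5-star sentiment model (checkpoint name assumed)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'nlptown/bert-base-multilingual-uncased-sentiment'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# torch is used later for the argmax over the model's logits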

First of all, the URL has to be inserted (just paste the URL into the script):

# fetch the page and parse its HTML with Beautiful Soup
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.yelp.com/biz/the-local-american-saloon-belgrade')
soup = BeautifulSoup(r.text, 'html.parser')

Inspecting the page’s HTML structure (for example, with the browser’s developer tools) shows that each comment is stored in a paragraph element whose class name contains ‘comment’.

A quick check can be performed to see what the actual output of the request looks like:

output of the request
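
In a notebook, an equivalent quick check is to print the status code and the first few hundred characters of the raw HTML:

# 200 means the request succeeded
print(r.status_code)
# peek at the beginning of the raw HTML
print(r.text[:500])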

This output will be passed to Beautiful Soup, and the next few lines of code will extract the components of interest.

# extract the specific components of interest
import re

# match any class name that contains 'comment'
regex = re.compile('.*comment.*')
# look for paragraph tags with such a class
results = soup.find_all('p', {'class': regex})
reviews = [result.text for result in results]

So Beautiful Soup allows us to create a soup object, and we can observe this simply by typing soup in a Colab or Jupyter notebook; you will notice the parsed format that Beautiful Soup is able to search through. All of the reviews are wrapped in the <p> tag (a paragraph) whose class contains ‘comment’.

Secondly, we want to clean the output, since each raw result still carries markup and extra text that cannot be passed directly into the model. This is simply solved by taking the .text attribute:

# the .text attribute strips the markup from the first result
results[0].text

Lastly, for completeness, the comments can be neatly arranged in a DataFrame from the Pandas module before the sentiment scoring:

import pandas as pd
import numpy as np

# collect the scraped reviews into a one-column DataFrame
df = pd.DataFrame(np.array(reviews), columns=['review'])

Checking the output of the DataFrame and the first five comments:
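
The preview itself is one line of standard Pandas:

# show the first five scraped reviews
df.head()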

With the sentiment_score function, all of the reviews will be scored:

# tokenizer and model are instantiated as in part one of this experiment
def sentiment_score(review):
    # tokenize the review into PyTorch tensors
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    # the predicted class index (0-4) maps to a 1-5 star rating
    return int(torch.argmax(result.logits)) + 1

This function is applied to every comment review stored in the DataFrame:

# truncate each review to its first 512 characters as a rough guard
# against BERT's 512-token sequence limit
df['sentiment'] = df['review'].apply(lambda x: sentiment_score(x[:512]))

The result of this sentiment scoring with BERT is a DataFrame in which all of the scores are calculated. That can be simply checked by typing df:

Sentiment scoring with BERT: result in a Pandas DataFrame
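
As an illustrative follow-up (not in the original post), standard Pandas operations apply to the scores; for example, the average rating across all scraped reviews:

# illustrative addition: the mean sentiment score of the scraped reviews
print(df['sentiment'].mean())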

Resources and the code:

Real Python — Beautiful Soup

Wikipedia — Web_scraping

Colab Notebook

https://colab.research.google.com/drive/1JlVk_YIhslV6RgF6zfSxTB5NC2QwPsmf#scrollTo=qSeDG58L_975
