Web scraping by a tableau consultant gone rogue...

Learning Web Scraping with Python, Requests, & BeautifulSoup

Did you know learning web scraping w/ Python, Requests, and Beautiful Soup is easy...

Tyler Garrett

--

Learning web scraping with Python, the Requests library, and BeautifulSoup is a tall glass of milk without the right blog. I got stuck on a few terrible blogs, and wrote this to help people get past the ramp-up. You can jump on a free example of the exact code here!

Yeah web scraping is easy…

Unless you are me.

Because if you’re me, web scraping with python, requests, and beautifulsoup may open a massive can of worms.

If you’re me, you will find some blog written by someone who has no earthly idea how to teach or write python, and it will waste the next 5 hours of your life. If you’re me, you will find some way to overcomplicate learning web scraping!

Luckily you’re you, and you will be learning web scraping much faster than usual because you’re here.

Others think this is a good starting place for web scraping too!
Thanks Kab1983.

I built this blog for people who overcomplicate simple.
Questions? Msg me on Twitter.

And because of my major strikeout when learning, I’ve decided to explain everything in the code, line by line. By the end of the blog, you will know how to do web scraping with Python and Requests, utilizing BeautifulSoup too.

Let’s get started. This blog will cover the basics of web scraping, what you need to get started scraping the web, and an explanation of the code used.

Not a lot of code — We will be walking through only 10 lines of code.

from bs4 import BeautifulSoup
import requests
url = 'https://tylergarrett.com'
r = requests.get(url)
s = BeautifulSoup(r.content, 'html.parser')
a = []
for text in s.find_all('p'):
    b = text.get_text()
    a.append(b)
print(a)

The python code above performs 5 core steps. You can change the website URL to your domain. I’m going to cautiously use my personal domain because I understand its HTML a little better than some random website’s.

Let’s discuss the code from a high level, because we need to understand each step to begin building our very own web scraping bot.

The 10 lines of code are our algorithm, or you could even call these the comments in your code. It helps to write out what each step will be first; after that it’s just googling, copy, paste, debug. No brainer.

(You can jump on a free example of the exact code here! No cost; I think it will email me for permission if you don’t make a copy/duplicate for yourself. Requires a gmail login to dive deeper.)

Okay here’s our web scraping algorithm.

  1. requests www.tylergarrett.com
  2. finds all the text
  3. cleans code from the text
  4. appends text to an array
  5. prints the array
Probably seems a little crazy — but now it’s extremely easy to start scoring these words using sentiment analysis or some sort of sentiment word scoring solution.
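
For instance, here’s a minimal sketch of that kind of scoring. The positive/negative word sets below are made up for illustration; a real project would use a proper sentiment lexicon.

```python
# A toy sentiment score: count positive words minus negative words.
# These word sets are invented for the example, not a real lexicon.
positive = {"easy", "beautiful", "free", "awesome"}
negative = {"terrible", "stuck", "crazy"}

def score(paragraphs):
    # Flatten all scraped paragraphs into lowercase words, then tally.
    words = " ".join(paragraphs).lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

scraped = ["web scraping is easy", "that blog was terrible"]
print(score(scraped))  # one positive word, one negative word: 0
```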

Why Web scraping w/ Python?

Python is big on community and the community has built solutions to all of your web scraping needs.

Because web scraping offers an endless source of data.

Learning web scraping offers an endless source to learn. Mostly because no two people write HTML the same way, and that makes web scraping challenging.

Luckily, people like Leonard Richardson have your back. Leonard developed Beautiful Soup, a library to help you parse HTML & XML, and his documentation will likely be a source of truth moving forward.

So, to answer WHY web scraping with Python:

Well, truly there are many websites with interesting sets of data to consume, and I’d like to start writing that data to a data source. I want to track the growth, change, and sentiment, and I don’t want to do the task manually.

The basics of web scraping.

To begin your web scraping journey, it’s important to know where you’re coming from and where you’re going.

Right now, reading these words, you’re consuming rendered HTML.

In the future, you will want to convert HTML into meaningful insights.

HTML is being rendered in your browser; that’s why you’re looking at beautiful text and not code.

To see for yourself, right click some text or data in your browser and inspect the element. Look at the underlying HTML!

What do I need to start scraping the web?

You will need to complete 3 of these 4 steps to begin scraping the web (steps 1 and 2 overlap: an IDE like PyCharm can set up a Python interpreter for you).

  1. Download python — we are using python 3.7 (or skip to step 2)
  2. Download an IDE like pycharm (it can set up a Python interpreter for you)
  3. pip install requests (install guide/website/docs)
  4. pip install beautifulsoup4 (install guide/website/docs)
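
On the command line, steps 3 and 4 look like this. I’m using the `python3 -m pip` form out of habit; plain `pip install` works too if pip is on your PATH.

```shell
# Confirm Python is installed, then grab both libraries with pip.
python3 --version
python3 -m pip install requests
python3 -m pip install beautifulsoup4
```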

Once you have your environment setup, let’s discuss the code!

Let’s begin looking at the code!

Call your libraries.

Importing your libraries opens up access to their tools.

from bs4 import BeautifulSoup
import requests

Line 1: from bs4 import BeautifulSoup

Your pip install will set up a library of code developed by Leonard Richardson called BeautifulSoup.

Line 1 needs to be at the top of your code; it gives you access to the Beautiful Soup parsing capability & tons of other features.

Line 2: import requests

Your pip install will set up a library of code developed by Kenneth Reitz called Requests.

Line 2 needs to be at the top of your code; it gives you access to the Requests library: “Requests’ simple API means that all forms of HTTP request are as obvious.”
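
As a quick sketch of that simple API, here’s a fetch on its own, using my example URL (swap in any reachable site). The timeout is my addition, a small guard so a slow site can’t hang your script:

```python
import requests

# One call fetches the page; the response object carries everything:
# the status code, the headers, and the raw HTML bytes.
r = requests.get('https://tylergarrett.com', timeout=10)

print(r.status_code)                  # 200 on success
print(r.headers.get('Content-Type'))  # e.g. 'text/html; charset=UTF-8'
print(len(r.content))                 # size of the raw HTML in bytes
```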

Moving on, line 3, line 4, line 5…

Next, we need to request our URL & build our soup object. Requests makes this process easy & Beautiful Soup plugs in perfectly behind it.

url = 'https://tylergarrett.com'
r = requests.get(url)
s = BeautifulSoup(r.content, 'html.parser')

Line 3: url = 'https://tylergarrett.com'

Similar to math class, where X and Y stand for other values, we are using url to stand for our link to https://tylergarrett.com. We can now say url instead of typing 'https://tylergarrett.com', which can be a hassle to repeat multiple times. The variable also saves us time later: imagine having to go fix https://tylergarrett.com in 100 places in your code. Generating this line of code offers a lot of flexibility as we progress in python.

Line 4: r = requests.get(url)

Line 4 works great with our ‘url’ variable, but you can also write the line without it. Line 3 and Line 4 could be re-written as the following single line of code.

r = requests.get('https://tylergarrett.com')

Line 5: s = BeautifulSoup(r.content, 'html.parser')

Here we are constructing the soup object. What that means for you: this line of code gives us a re-usable object. The soup object is essentially the parsed HTML document.

The object is usually called soup, but I prefer using the letter s.

s = BeautifulSoup(r.content, 'html.parser')

We will use the letter s later down the script. Also, in Line 5, you will notice r.content. Notice the “r”… The r is loaded with our Line 4 request! We bring r.content into the BeautifulSoup call, and we pass 'html.parser' to parse our HTML too.

Basic Easy: Don’t worry about this part until you need extra help. Utilize this line of code and keep up with the ‘s.’ You only need to worry about the s value passing through the script.

Advanced Hard: BeautifulSoup is beautiful in that it will try to pick the best available parser for you. This is helpful because if you’re new, we already have a lot to learn; letting the tool do the work for you is likely a best practice. For everything else, you may want to read up on the differences between parsers. You can quickly specify the parser you need to use in your project.
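
If you do want to pin the parser down yourself, the second argument to BeautifulSoup is where you do it. 'html.parser' ships with Python; 'lxml' and 'html5lib' are optional pip installs (generally faster and more forgiving of broken markup, respectively). A tiny sketch:

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b></p>"

# 'html.parser' needs no extra install; swap in 'lxml' or 'html5lib'
# here if you have them installed and want their behavior instead.
s = BeautifulSoup(html, "html.parser")
print(s.p.get_text())  # Hello world
```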

Building an array with our web scraped text data.

Generating an array is rather straightforward with Python, but if you’re not familiar with loops or list comprehensions, you may need a light ramp-up.

Easy method to make a simple list in python.

x = [i for i in range(10)]
print(x)

#Output
## >>>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Complex method to make a simple list in python.

Used in this blog!

x2 = []
for x3 in range(10):
    x2.append(x3)
print(x2)

#Output
## >>>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In today’s blog, I’m going to use the loop method above. If you follow along to my list comprehensions blog, you will learn the other method.

Let’s begin building our array with Beautiful Soup’s find_all().

a = []
for text in s.find_all('p'):
    b = text.get_text()
    a.append(b)
print(a)

Before we dive into line 6 through line 10, let’s peek at our entire code again!

Not bad!

What is this saying in english?

For each segment of HTML code containing <p class…, get ONLY the text from these segments, and append that text to our a=[] array.
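
You can watch that sentence happen on a tiny made-up page. The HTML string below is mine, standing in for a real site:

```python
from bs4 import BeautifulSoup

# A miniature page: one heading and two paragraphs.
html = "<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"
s = BeautifulSoup(html, "html.parser")

a = []
for text in s.find_all('p'):   # only the <p> segments match
    a.append(text.get_text())  # keep the text, drop the tags
print(a)  # ['First paragraph.', 'Second paragraph.']
```

Notice the <h1> never shows up; find_all('p') only returns the paragraph tags.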

Line 6: a=[]

This little bit of code lets you generate the beginning of your array. I like to consider this our data holder. Data will go into our array.

EXAMPLE = ['data', 'is awesome', 'right']

Line 7: for text in s.find_all(‘p’):

Line 7 kick starts your loop over the page. Later, you could also learn how to use “list comprehensions” here; although they may seem complex at first glance, the more you play with list comprehensions, the quicker they will become a powerful tool!

Line 8: b = text.get_text()

By now, you’re familiar with passing variables, so I don’t need to explain a lot here. Rather, I’d like to discuss the get_text() method; it’s specific to BeautifulSoup. text is the loop variable from line 7 & it passes each “<p class… ” container of data into line 8.

Grabbing the data within these <p class… </p> tags gives us the “body text.”
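
To see what get_text() actually does with one of those containers, here’s a single made-up <p class…> tag with more markup nested inside it:

```python
from bs4 import BeautifulSoup

# get_text() flattens everything inside the tag into plain text,
# even when other tags like <b> are nested in the paragraph.
p = BeautifulSoup('<p class="body">Data is <b>awesome</b>, right?</p>',
                  'html.parser').p
print(p.get_text())  # Data is awesome, right?
```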

Line 9: a.append(b)

Appending the data to your array; be clear on the usage of the letters here too. The letter a is our array from line 6. The letter b is the get_text() output from the previous step. We are looping through the HTML, iterating through the code, and appending the text to our array!

Line 10: print(a)

Last but not least, let’s print our array “a.”
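
The raw print shows Python’s list syntax, brackets and quotes included. If you’d rather read the result like a page, a small join (my preference, not required) cleans it up:

```python
a = ['First paragraph.', 'Second paragraph.']  # stand-in for scraped text

print(a)             # the raw list: ['First paragraph.', 'Second paragraph.']
print("\n".join(a))  # one paragraph per line, no brackets or quotes
```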

There are a lot of opinions on web scraping and on how to get started. I like this direction because…

  1. web scraping with python, requests, and beautifulsoup is free.
  2. web scraping with python, requests, and beautifulsoup is open source.
  3. web scraping with python, requests, and beautifulsoup is automated.

By now, you should have all the necessary steps to generate a simple array of text coming from a website URL; utilizing python, requests, and beautiful soup.
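
If you want those ten lines packaged up for reuse, here’s one way to fold them into a function. The timeout and raise_for_status() call are my additions (sensible guards, not something the original ten lines had):

```python
import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url):
    # Fetch the page, bail out early on HTTP errors (404s, etc.),
    # and return the text of every <p> tag as a list.
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    s = BeautifulSoup(r.content, 'html.parser')
    return [p.get_text() for p in s.find_all('p')]

# Example: paragraphs = scrape_paragraphs('https://tylergarrett.com')
```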

Here’s my future logo for these bots. lol. He is wearing a red hat…

Anyways, happy holidays.

Thanks for your time.
Typos by Tyler Garrett

Maybe this was over your head?

Learn how to become an analytics professional in my free analytics automation training series @ https://knime.dev
