Beautiful Soup Tutorial

Liang Xiao
7 min read · Jul 26, 2018


CS373 Summer: Liang Xiao

Introduction: what is the tool for, what was it used for on your project

Beautiful Soup is a Python library that lets users scrape data from websites. It is similar to Python's built-in regular expressions, but it is easier to use for scraping web data because it was specially designed for HTML and XML. In my project, I used Beautiful Soup to enrich my database, since there aren't many APIs about endangered species available online.

History: who created it, when was it created, where was it created

Leonard Richardson created Beautiful Soup in 2004; the project is hosted on Launchpad. The newest version of Beautiful Soup is 4.6.0, which was released in May 2017.

Installation: step by step instructions on how to install it on different environments, especially on Docker

In both Docker and a regular terminal, you can install bs4 with the following commands:

For Python 2.x users, use

 pip install beautifulsoup4

For Python 3.x users, use

pip3 install beautifulsoup4

Or you can download the tarball from https://www.crummy.com/software/BeautifulSoup.
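For Docker specifically, a minimal Dockerfile sketch might look like the following; the base image, the file names, and the requests dependency are my own assumptions, not part of the original setup.

FROM python:3

# install Beautiful Soup (and requests, which the tutorial below also uses)
RUN pip install beautifulsoup4 requests

# scraper.py is a hypothetical name for the script you will write
COPY scraper.py /app/scraper.py

CMD ["python", "/app/scraper.py"]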

Use: step by step instructions on how to use it

1. You need to import bs4 and requests; requests can also be installed with pip.

from bs4 import BeautifulSoup
import requests  # get this from pip
import json

2. Then we can pass the URL to requests to get the HTML of the website. For example, let's scrape the WWF website to get animal information. You can choose the parser depending on the format of the response.

url = "https://www.worldwildlife.org/species/directory?direction=desc&sort=extinction_status"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

3. Now you have the website's entire HTML in the variable soup. Recall that this target URL is a directory, or list, of all the animals, so we want to follow each specific animal link from the directory and scrape all the data at once. To see how the website looks and to locate the relevant source code, open the browser's developer tools (F12) and use search (Ctrl+F).

4. Now we locate the tags we want to select. The <a> tags contain the animal links in their "href" attribute, so we can use these two criteria to lock onto all the links. First, we find all <a> tags in the HTML, and then we extract the "href" attribute. Note that this is a loop body and link is the loop variable, so you are dealing with a single <a>/"href" combo each time.

for link in soup.find_all("a"):
    a = link.get("href")

5. Let's print out the variable a in Python.
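Adding a print to the loop from step 4 shows every href on the page:

for link in soup.find_all("a"):
    a = link.get("href")
    print(a)  # one href per line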

6. Now we have every href from the <a> tags, and we want to limit them to /species/ links only, because those are the pages we want to visit. Let's do a condition check on the string; once we get /species/some_animal, we combine it with the HTTP host so that we have the full URL of that particular animal.

if a[0:9] == "/species/" and len(a) < 45:
    url = "https://www.worldwildlife.org" + a
    animal_dict = scrape(url)

7. Print out the URL.
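Inside the if block from step 6, this is just one line:

print(url)  # the full link for a single animal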

8. Now that we have all the URLs we want to visit, let's make a function that runs the Beautiful Soup search on each URL.

animal_dict = scrape(url)

9. The function looks like this; inside it, we make a request again for the individual animal page.

def scrape(url):
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        # ... extract the fields shown in steps 10-21 here ...
    except Exception:
        return {}

10. Now we want to get some information about each animal; let's get the name first. Just inspect the page element with the browser: you can find that <title> contains the name.

11. In bs4, you can also access the text enclosed by a tag with <tag>.text. Let's do that and parse the title.

title = soup.title.text
name = title.split("|")[0]

12. Simple output: in the loop, you get all the animal names.

13. Next, we want to get the image link for the animal. First, inspect it with the browser; the image is usually tagged <img class=xxxx> in HTML.

14. Indeed, it happened that the first search result for "img" returned the picture of the animal, so we can do the following to get the first img result:

img = soup.find_all("img")[0]

15. Simple output.

16. You can check the image by pasting the link into the browser; the link itself is in the tag's "src" attribute.
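To pull the link out of the tag:

print(img.get("src"))  # the image URL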

17. Now let's get the text description of the animal. This is going to be tricky: there are many description sections on the web page, so it is extremely important to spot a unique tag that the animal description has and the other sections don't. First, we locate the description by searching.

18. It happened that the class "wysiwyg lead" is unique to the text description, so when I search for wysiwyg lead, the first result is always the description of that animal. So I can do:

description = soup.find("div", {"class": "wysiwyg lead"}).text

19. Simple output

20. It ends up working well. However, sometimes you can't find a unique tag for a certain element, but you do find a nearby unique tag. In that case, you can walk through a few tags to get what you want. For example:

21. The class "container" is everywhere on the web page, so it is impossible to get "Vulnerable" by searching for class="container". However, <strong class="hdr"> is unique, and it appears two lines before the actual conservation status, so we can use next_element in bs4.

for item in soup.find_all("strong", {"class": "hdr"}):
    if item.text == "Status":
        status = item.next_element.next_element.next_element.text
        print(status)

22. Simple output.

23. So if you combine the three attributes we have so far, you get one complete record per animal.
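As a sketch, steps 9-21 can be combined into a single scrape() function; the dictionary keys (name, image, description, status) are my own naming, not from the original post:

def scrape(url):
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")

        # step 11: the animal name comes from the <title> tag
        name = soup.title.text.split("|")[0].strip()

        # step 14: the first <img> on the page happens to be the animal picture
        img = soup.find_all("img")[0].get("src")

        # step 18: the "wysiwyg lead" class uniquely marks the description
        description = soup.find("div", {"class": "wysiwyg lead"}).text

        # step 21: walk from the unique <strong class="hdr"> to the status text
        status = ""
        for item in soup.find_all("strong", {"class": "hdr"}):
            if item.text == "Status":
                status = item.next_element.next_element.next_element.text

        return {"name": name, "image": img,
                "description": description, "status": status}
    except Exception:
        # pages whose layout differs a little are skipped (see step 25)
        return {}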

24. This is database-ready in some sense; you just have to use json.dumps to write the output to a JSON file, so you can easily load it into your SQL database.
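A sketch of the final loop, collecting every animal and writing the result to a file (animals.json is a hypothetical name):

import json

animals = []
for link in soup.find_all("a"):
    a = link.get("href")
    if a and a[0:9] == "/species/" and len(a) < 45:
        animals.append(scrape("https://www.worldwildlife.org" + a))

# json.dumps turns the list of dicts into database-ready JSON
with open("animals.json", "w") as f:
    f.write(json.dumps(animals, indent=2))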

25. One last reminder: the page format is often the same within one website, so once you find the HTML rule for a single instance, you can loop over all instances and check whether you get the information correctly. Of course, sometimes the format varies a little, so you have to adjust your search rule to be more specific about tags and attributes.

Use cases: details about why to use it

The previous section pretty much covered the use cases, so let me just summarize the key points here.

1. Request a URL so you can parse it into the desired format:

url = "https://www.worldwildlife.org/species/directory?direction=desc&sort=extinction_status"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

2. Find a certain tag and extract data from an attribute:

for link in soup.find_all("a"):
    a = link.get("href")

3. Directly print out the text enclosed by a tag:

title = soup.title.text

4. Walk down a few elements from a certain tag; use this when you can't find a unique tag on the desired content but do find a unique tag nearby:

status = item.next_element.next_element

5. Some HTML is well organized, so you can restrict your search to a single section, as in the sketch below.
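A hypothetical example (the section class name is an assumption):

# restrict the search to one <section> instead of the whole page
section = soup.find("section", {"class": "content"})
links = section.find_all("a") if section else []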

6. This is just a simple walkthrough of how to scrape a website; check out more Beautiful Soup methods at https://www.crummy.com/software/BeautifulSoup/.

Alternatives: what other tools are equivalent, why did you choose this tool

As I mentioned before, you can use Python regular expressions for scraping as well, but it is a lot easier with Beautiful Soup, since Beautiful Soup code reads more like English, whereas regular expressions are made up of "confusing" (at least to me) symbols. The toy comparison below shows the difference.
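Here is a small, self-contained comparison on a made-up snippet of HTML (not real WWF markup):

import re
from bs4 import BeautifulSoup

html = '<a href="/species/tiger">Tiger</a>'

# regular-expression approach: you describe the markup symbol by symbol
href_re = re.search(r'href="([^"]+)"', html).group(1)

# Beautiful Soup approach: reads much closer to English
href_bs = BeautifulSoup(html, "html.parser").a.get("href")

assert href_re == href_bs == "/species/tiger"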
