The Data Scientist Journey, Chapter 8

Scrapin’ the Web

Jacob Menashi
3 min read · Apr 29, 2019

This is the 8th installment in a ???-part series about my journey to becoming a data scientist in the field of sports analytics. With a background in sports production and an education as a web developer, follow me as I learn new languages, work on projects, and have a grand ol’ time doing it.

One aspect of learning Python that I was really looking forward to was web scraping, the act of extracting data from websites by parsing their HTML. There's plenty of free data out there to scrape, including many options when it comes to sports (there's also a lot of data you're not supposed to scrape, so you have to be careful!). I'm going to quickly walk through the process of web scraping with Python today, before diving deeper into it next week with some projects.

Using BeautifulSoup to Scrape

Like most of the things I've shown on this blog in the last couple of installments, the ability to scrape sites comes from a package — this one is called Beautiful Soup. We have some example HTML here to show how BeautifulSoup can parse through it:
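The exact markup isn't critical; a hypothetical snippet with the shape described below (a couple of <div> tags and an ordered list), saved into a variable called 'html', might look like this:

```python
# Hypothetical example HTML, saved into a variable for parsing later.
html = """
<div class="special">
  <p>Hello there!</p>
</div>
<ol>
  <li class="special">First item</li>
  <li>Second item</li>
  <li>Third item</li>
</ol>
<div>
  <p>Goodbye!</p>
</div>
"""
```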

With this HTML, you can see there are a few different elements: a couple of <div> tags as well as an <ol> ordered list containing some <li> list items. Using BeautifulSoup, we can access these elements in many different ways.

This is the key line in using the package: it specifies what we are parsing (saved into a variable called 'html') and then how we want to parse it, using html.parser in this case, though you can also parse XML with this package.
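As a minimal sketch (the variable name 'html' follows the text; the markup itself is illustrative), that key line looks like this:

```python
from bs4 import BeautifulSoup

# The markup we want to parse, saved into a variable called 'html'.
html = "<div><p>Hello there!</p></div>"

# The key line: the first argument is WHAT we parse,
# the second is HOW we parse it (html.parser here; an XML
# parser could be used instead for XML documents).
soup = BeautifulSoup(html, "html.parser")
```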

For those of you who have a JavaScript background like me, the methods you use to pull certain elements or data out of the HTML will look very familiar, as they match up quite nicely with what you see in plain JS or jQuery. For example, let's say you wanted to grab all the elements with a class of "special".
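One way to do that (the HTML here is a made-up fragment for illustration) is with find_all, which is roughly the BeautifulSoup equivalent of jQuery's $('.special'):

```python
from bs4 import BeautifulSoup

html = """
<div class="special"><p>Hello there!</p></div>
<ol>
  <li class="special">First item</li>
  <li>Second item</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Grab every element with class "special" — much like $('.special')
# in jQuery. find_all returns a list of every match; here that's
# the <div> and the first <li>.
specials = soup.find_all(class_="special")
```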

Looks familiar, right? The get_text method is like calling ".innerHTML" or ".innerText", and the name attribute works much the same way as ".tagName". The attrs attribute is helpful as well when an element has a class, an id, or even a data attribute.
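A quick sketch of all three on a single made-up element (note that in BeautifulSoup, get_text is a method, while name and attrs are plain attributes):

```python
from bs4 import BeautifulSoup

html = '<li class="special" id="first">First item</li>'
soup = BeautifulSoup(html, "html.parser")

li = soup.find("li")
li.get_text()   # "First item" — like .innerText in JS
li.name         # "li" — the tag name, like .tagName
li.attrs        # {'class': ['special'], 'id': 'first'}
```

Note that attrs stores class as a list, since an element can carry several classes at once.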

Continuing with that analogy, you can find parents, children, and siblings of HTML elements just like you can do in JavaScript:

In this example, we are using select, which finds every element matching the parameter and returns a list. The [1] signifies that we want the second element in that list, and find_previous_sibling is a method that returns the ordered list, since it is the previous sibling of the selected <div> element. Navigating isn't too hard!
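Put together, that navigation might look like this (again with a hypothetical fragment, laid out so the ordered list sits just before the second <div>):

```python
from bs4 import BeautifulSoup

html = """
<div class="special"><p>Hello there!</p></div>
<ol>
  <li>First item</li>
  <li>Second item</li>
</ol>
<div><p>Goodbye!</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# select takes a CSS selector and returns a list of every match;
# [1] picks out the second <div>.
second_div = soup.select("div")[1]

# The previous tag-level sibling of that <div> is the <ol>,
# so find_previous_sibling walks us back to the ordered list.
ol = second_div.find_previous_sibling()
```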

Next week, some real web scraping projects!
