Requests-HTML: The modern way of web scraping.

David Kowalk · Published in Analytics Vidhya · 4 min read · Apr 16, 2020

Python is an excellent tool in your toolbox and makes many tasks way easier, especially in data mining and manipulation. As a freelancer, people often come to me with the same complaints: Python is difficult, the code is about as understandable as a bowl of spaghetti, and it is generally inaccessible to beginners. “Why do I need three different libraries just to download some data off a website?” That's unfortunate, because web scraping is something data scientists in particular can use almost every day.

In this tutorial, I will show you the basics of web scraping with requests-html, the modern way of scraping data off of websites.

The Setup

After you’ve installed Python, you’ll need to install the library we’ll use here with pip. Open your terminal (PowerShell on Windows, Terminal on Mac) and type:

pip install requests-html

Then create a text file with the name app.py. This is your application file. You can use any ordinary text editor to edit it; I would recommend Atom, Sublime Text, or Visual Studio Code. Write into your file:

import requests_html
import csv

requests-html has an advantage over urllib and Beautiful Soup: it’s much easier to use and combines the features of the two into a single library. (Note that the package is installed as requests-html but imported as requests_html, since hyphens aren’t allowed in Python module names.)

Extracting Data

Now, that we’re all set up, we can go ahead and download data:

from requests_html import HTMLSession
session = HTMLSession()
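
Under the hood, the session carries a set of default request headers, including a browser-like User-Agent, which you can inspect and override. A quick sketch (the override value below is purely illustrative):

# The session ships with a browser-like User-Agent by default.
print(session.headers["User-Agent"])

# Any header can be overridden; this value is just an example.
session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"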

The session requests websites like a normal web browser, and most importantly, it looks like a browser to the website as well. Let’s give our session a website to scrape:

url = "https://github.com/davidkowalk?tab=repositories"
response = session.get(url)
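
Before parsing anything, it’s worth checking that the request actually succeeded. A minimal check (treating any non-200 status code as a failure is a simplification, but it works here):

if response.status_code != 200:
    raise RuntimeError(f"Request failed with status code {response.status_code}")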

This will download the so-called source of the page, in this case my GitHub profile. Let’s download all the names of my repositories! Open the website in the browser of your choice, right-click on one of the list items and select Inspect Element.

You should now see the rendered page source.

The list is in a <div> element with the id "user-repositories-list". Let’s search for that.

container = response.html.find("#user-repositories-list", first=True)
repos = container.find("li")
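
The find method accepts any standard CSS selector, so the same approach works for other elements as well. Two examples (these selectors are for illustration and aren’t used in this tutorial):

links = container.find("a")  # every anchor tag inside the container
first_link = container.find("a", first=True)  # only the first match, or None if there is none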

The call to find("li") returns a list of all the repository entries; the # in the first selector is the CSS prefix for an id. (The variable is named repos rather than list to avoid shadowing Python’s built-in list type.) Now, for every list item, we want to find the name of the repository and its language and write them to a file. First, we’ll prepare the data structure: the list should contain a header row, and then we’ll append our data points to it.

sheet = [["Name", "Language"]]

Now let’s iterate over all the items in the list and extract the name and the language of each.

for item in repos:
    elements = item.text.split("\n")  # each line of the item's text becomes one list entry
    name = elements[0]
    lang = elements[2]
    sheet.append([name, lang])

elements is a list of all the lines of text in each item. For the first item, for example, it would look like this:

['FunctionSynthesizer', 'Description', 'Python', '2', 'License', 'Apr 15, 2020']

The first element (index 0) is the name. The language is in third place, so it has index 2.
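
In other words, picking the fields out of that example would look like this (a toy illustration reusing the values above):

elements = ["FunctionSynthesizer", "Description", "Python", "2", "License", "Apr 15, 2020"]
name = elements[0]  # "FunctionSynthesizer"
lang = elements[2]  # "Python"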

Saving Your Data

Now we have a two-dimensional list of all the info we wanted. All we have to do is write it to a file. CSV is an open table format which can be opened with Excel or similar programs. The csv package from Python’s standard library can be used to read and write these files easily:

with open("Filename.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerows(sheet)
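
To sanity-check the result, you can read the file back with the same csv package. A quick sketch:

# Read the file back and print each row to verify the output.
with open("Filename.csv", newline="", encoding="utf-8") as file:
    for row in csv.reader(file):
        print(row)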

You can now run your application in the terminal:

python app.py

This should produce a CSV file you can open with Excel.

You may notice that the output isn’t 100% clean. Not all fields are extracted correctly, because this method depends on a specific page format that isn’t always followed. As an exercise, try to fix this yourself.

You can refer to the documentation of requests-html for more details.

Requests-HTML has many more features, like asynchronous data collection, JavaScript support and automatic redirects. It’s certainly a great tool if you want to get further into web scraping.
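
For example, pages that build their content with JavaScript can be handled with render(), which executes the page in a headless Chromium browser (downloaded automatically on first use). A minimal sketch, assuming the target page needs JavaScript to display its content:

from requests_html import HTMLSession

session = HTMLSession()
response = session.get("https://example.com")

# Run the page's JavaScript; the first call downloads Chromium.
response.html.render()

print(response.html.find("title", first=True).text)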

If you have any further questions, feel free to comment below and I’ll try to get back to you.
