Web Scraping and Coursera

Siddharth M · Published in Analytics Vidhya · May 25, 2020

Hi! I was working on a course recommendation project during a hackathon and needed to build a dataset of courses. Very few such datasets are publicly available, so the only practical option turned out to be web scraping.

Web scraping image from Google Images

Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. Unlike the mundane, mind-numbing process of extracting data manually, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.

That was just a general definition. In practice, as the image above suggests, you need a few techniques to step through the web application and pull out the features needed to build our dataset.

The best way to start is by using a library that already does this. Yes, that’s right: we use the Beautiful Soup library for this purpose. You can find the documentation here: https://pypi.org/project/beautifulsoup4/
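If Beautiful Soup and requests are not already installed, a typical setup step (a shell command, not shown in the original post) is:

pip install beautifulsoup4 requests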

To understand how it works, let’s code through an example: scraping a website and generating a dataset from it. We will scrape the well-known learning platform Coursera. It gives us a great opportunity for scraping because there is a single URL that lists every course, each with its details shown as a card: https://www.coursera.org/courses

Website showing all courses available.

First and foremost, we need to import the required libraries for scraping.

from bs4 import BeautifulSoup
import requests

Here, BeautifulSoup is used for the scraping itself, and the requests library helps us send HTTP requests from Python to the website to fetch the web content.

response = requests.get("https://www.coursera.org/courses")

We use the get method, passing the URL of the page we want to parse. requests.get(“…”) fetches the content of the page, and we save it to a variable, say response.
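As a quick sanity check (optional, and not part of the original post), we can confirm the request succeeded before parsing:

print(response.status_code)  # 200 means the page was fetched successfully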

html_soup = BeautifulSoup(response.content, 'html.parser')

From here we make our soup. The BeautifulSoup constructor takes two parameters: the content we want to parse and the parser to use; we use html.parser here. We assign the result to a variable called html_soup. Looking at the interface, the courses are listed roughly 10 per page across about 100 pages. So we can’t simply run our soup over a single page: to collect all the courses we need two loops, one to fetch each page and another to iterate through the course cards on that page.

url = "https://www.coursera.org/courses?page=" + str(i) + "&index=prod_all_products_term_optimization"

This URL is the same on every page; only the number after page= changes. So in the outer loop we build the URL dynamically by substituting the page number.

This lets us scrape across all the courses simply by passing the tag we want to parse, the list to fill, and the CSS class. To make this easier to understand, let’s take an example.

Look at the highlighted element: it is the heading, i.e. the course title. The tag associated with it is ‘h2’ and the class is ‘color-primary-text card-title headline-1-text’; these become the first and third parameters. As the second parameter we pass an empty list, which the function appends to for each course. The content we scrape is collected in that list and later becomes an entire column of the data frame.
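As a small illustration (not from the original post; the class name is taken from the article and may change if Coursera updates its markup), grabbing the first course title from a single page’s soup would look like this:

first_title = html_soup.find_all('h2', class_='color-primary-text card-title headline-1-text')[0].get_text()
print(first_title)  # e.g. the title of the first course card on the page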

You can see in the function (sketched below) that I used a for loop over 100 pages. This may vary for you; I only intended to collect a modest amount of course data as a demo. The second loop runs from 0 to 9 to iterate through the courses on a single page. As mentioned earlier, we build the soup for each page by passing the content and the parser to BeautifulSoup(…).

x = soup.find_all(html_tag, class_ = tag_class)[j].get_text()

We then use the soup’s find_all method to get the exact text we referenced and save it to a variable x. This text is appended to the list on every iteration, building up an entire column of the dataset.
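The complete scraper function is in the code linked at the end. Here is a minimal sketch of what auto_Scrapper_Class could look like, based on the description above (the parameter names and the IndexError handling are my assumptions):

# requests and BeautifulSoup are already imported above
def auto_Scrapper_Class(html_tag, output_list, tag_class):
    # Sketch only: loop over the result pages, parse each one, and collect
    # the text of every matching element into output_list.
    for i in range(100):  # one iteration per results page
        url = "https://www.coursera.org/courses?page=" + str(i) + "&index=prod_all_products_term_optimization"
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        for j in range(0, 10):  # up to 10 course cards per page
            try:
                x = soup.find_all(html_tag, class_=tag_class)[j].get_text()
                output_list.append(x)
            except IndexError:
                break  # fewer matching elements than expected on this page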

course_title = []
course_organization = []
course_Certificate_type = []
course_rating = []
course_difficulty = []
course_students_enrolled = []

We create an empty list for each field we want to scrape from the website.

auto_Scrapper_Class('h2',course_title,'color-primary-text card-title headline-1-text') 
auto_Scrapper_Class('span',course_organization,'partner-name m-b-1s')
auto_Scrapper_Class('div',course_Certificate_type,'_jen3vs _1d8rgfy3')
auto_Scrapper_Class('span',course_rating,'ratings-text')
auto_Scrapper_Class('span',course_difficulty,'difficulty')
auto_Scrapper_Class('span',course_students_enrolled,'enrollment-number')

Now we pass all the required arguments and wait for the function to do its work.

Then we use pandas to build a data frame called course_df, with one column for each of the lists we filled, and sort it by the course title.
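A minimal sketch of that step (the column names here are my assumptions; the original code is linked at the end) might look like:

import pandas as pd

course_df = pd.DataFrame({
    'course_title': course_title,
    'course_organization': course_organization,
    'course_Certificate_type': course_Certificate_type,
    'course_rating': course_rating,
    'course_difficulty': course_difficulty,
    'course_students_enrolled': course_students_enrolled,
})
course_df = course_df.sort_values(by='course_title').reset_index(drop=True)

Finally, we write it out to a CSV file: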

course_df.to_csv('UCoursera_Courses.csv')

This writes the scraped data to a new CSV file, giving us a fresh dataset that can later be used for data analysis and visualization.
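As a quick usage check (not part of the original post; it reuses the pandas import from the sketch above), the file can be loaded straight back into pandas:

df = pd.read_csv('UCoursera_Courses.csv')
print(df.head())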

The sample output looks like this:

Dataset sample in a nutshell.

If you loved this article and found it helpful, do follow. I would love to collaborate and work on projects with you.

The code used for scraping:

The scraped dataset:

Thanks again.. and enjoy scraping… :)
