End-to-End Data Scraping

Felix Pratamasan
5 min read · Aug 30, 2023


In this project, I created an app that uses Streamlit for the UI and Scrapy to scrape Coursera data.

Tech Stack:

  1. Scrapy
  2. Streamlit
  3. Pandas
  4. Subprocess

How to Use:

  1. Clone this repository https://github.com/lixx21/coursera-data-scarping.git
  2. Go to the spiders directory with cd coursera_data_scraping/coursera_data_scraping/spiders
  3. Start the Streamlit app with the following command: streamlit run streamlit_app.py (this assumes the dependencies are installed; see the note below)
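
The app expects Scrapy, Streamlit, and Pandas to be available in your environment. If they are not, installing them with pip (the usual PyPI package names) should be enough:

pip install scrapy streamlit pandas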

Project Explanation

1. Data Scraping

To scrape the data I used Scrapy, plus Pandas to store the results in a .csv file. To use Scrapy, we need to define one class and its functions like this:

import scrapy
import pandas as pd

class coursera_data(scrapy.Spider):

    name = "coursera_data"  # this name must be unique within the project

    def __init__(self, course, page, **kwargs):
        super(coursera_data, self).__init__(**kwargs)
        self.course = course
        self.page = int(page)

    def start_requests(self):
        # create the course path so we can use it in coursera's query string
        course_format = self.course.replace(" ", "+")

        for index in range(self.page):
            url = f"https://www.coursera.org/search?query={course_format}&page={index+1}"

            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        course_title = response.xpath('//h3[@class="cds-119 cds-CommonCard-title css-e7lgfl cds-121"]//text()').extract()
        course_by = response.xpath('//p[@class="cds-119 cds-ProductCard-partnerNames css-dmxkm1 cds-121"]//text()').extract()
        course_link = response.xpath('//a[@class="cds-119 cds-113 cds-115 cds-CommonCard-titleLink css-1smvlxt cds-142"]/@href').extract()
        # course_link only contains the endpoint of each URL,
        # so we prepend https://www.coursera.org to build full links
        clean_coursera_urls = []
        for course_url in course_link:
            clean_format = f"https://www.coursera.org{course_url}"
            clean_coursera_urls.append(clean_format)

        skills_you_gain = response.xpath('//div[@class="cds-CommonCard-bodyContent"]/p[@class="cds-119 css-dmxkm1 cds-121"]/text()').extract()
        course_ratings = response.xpath('//div[@class="product-reviews css-pn23ng"]/p[@class="cds-119 css-11uuo4b cds-121"]/text()').extract()
        course_reviews = response.xpath('//div[@class="product-reviews css-pn23ng"]/p[@class="cds-119 css-dmxkm1 cds-121"]/text()').extract()
        course_levels = response.xpath('//div[@class="cds-CommonCard-metadata"]/p[@class="cds-119 css-dmxkm1 cds-121"]/text()').extract()

        data = {
            "course_title": course_title,
            "course_by": course_by,
            "course_link": clean_coursera_urls,
            "skills_you_gain": skills_you_gain,
            "course_ratings": course_ratings,
            "course_reviews": course_reviews,
            "course_level": course_levels
        }

        if len(course_title) > 0:
            # Append the new rows to "output.csv"
            first_df = pd.read_csv("output.csv")
            new_df = pd.DataFrame(data)

            pd.concat([first_df, new_df]).to_csv('output.csv', index=False)
        else:
            pass

        self.logger.info(data)

        yield data

The name variable needs to be unique within the project because, when we want to start scraping with Scrapy, we call that name like this:

scrapy crawl coursera_data
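
Because __init__() expects course and page arguments, they are passed with Scrapy's -a flag when crawling from the terminal (the values here are only an illustration):

scrapy crawl coursera_data -a course="data engineer" -a page=2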
  1. The first function, __init__(), receives the user's input and stores it in self.course and self.page so that the other functions can use it.
  2. The second function, start_requests(), builds the URLs we want to scrape and hands them off to the parse() callback, which does the actual scraping.
  3. The last function, parse(), holds the main scraping logic. We use XPath to pull the data out of the elements we want, then concatenate it onto our .csv file so that every scraped page is appended and stored there (see the sketch below for preparing that file).
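
One thing to note: parse() reads output.csv before appending to it, so that file needs to exist with the expected columns before the first crawl. If it is missing from your clone, a minimal sketch like this (my own addition, not part of the original spider) can create an empty one:

import pandas as pd

# Columns must match the keys of the data dict built in parse()
columns = ["course_title", "course_by", "course_link", "skills_you_gain",
           "course_ratings", "course_reviews", "course_level"]

# Write just the header row so the first pd.read_csv("output.csv") inside parse() does not fail
pd.DataFrame(columns=columns).to_csv("output.csv", index=False)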

2. Create User Interface

To create the user interface, I used Streamlit. Streamlit is very useful when you want to build a quick app on top of your data. The Streamlit code for this project looks like this:

def start_scrapping(course, page):
    # run the scraping spider using subprocess,
    # passing the user's input through Scrapy's -a arguments
    call(["scrapy", "crawl", "coursera_data",
          "-a", f"course={course}", "-a", f"page={page}"])

    return "success"

def download_file(dataframe):
    def convert_df(df):
        return df.to_csv().encode('utf-8')

    csv = convert_df(dataframe)

    return csv

def clear_data(df):
    # columns = ['course_title','course_by','course_link','skills_you_gain','course_ratings','course_reviews','course_level']
    empty_df = df.iloc[0:0]
    empty_df.to_csv("output.csv", index=False)

    return empty_df

I have three helper functions here, defined before the Streamlit components are used. They are:

  1. start_scrapping(): Scrapy has to be started from a shell command, so the Subprocess library is used to run that command from within the Python script. This function kicks off the scraping (see the note on its return value after this list).
  2. download_file(): this function lets the user download the data they have scraped as a .csv file, using the Pandas library.
  3. clear_data(): this function clears the data in the dataframe and in the .csv file, in case the user wants to delete the data they have scraped.
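
A small note on start_scrapping(): call() blocks until the crawl finishes and returns the process exit code, so the hard-coded "success" string does not actually tell you whether the spider ran cleanly. If you want to surface failures in the UI, a variation along these lines (my own sketch, not the original code) could check the exit code instead:

from subprocess import call

def start_scrapping(course, page):
    # call() blocks until "scrapy crawl" exits and returns its exit code
    exit_code = call(["scrapy", "crawl", "coursera_data",
                      "-a", f"course={course}", "-a", f"page={page}"])

    # exit code 0 means the crawl process finished without an error
    return "success" if exit_code == 0 else "failed"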

After that, we can start using the Streamlit components. The first thing to do is to create the title and caption with st.title() and st.caption().

st.title('Coursera Data Scraping')
st.caption('This app will scraping courses data from coursera')

Then, I use st.form() to create a form where the user can enter the course they want to scrape with st.text_input() and how many pages they want to scrape with st.number_input(). The last part of the form is a submit button, created with st.form_submit_button(), which starts the scraping via the start_scrapping() function.

with st.form(key = 'user_info'):
    course = st.text_input('What course you want to scrap ? (i.e. data engineer)')
    page = st.number_input('How many page you want to scrap ?', 1, 100)

    submit_form = st.form_submit_button(label="Submit", help="Click to start scraping")
    if submit_form:
        start_scrapping(course, page)

After that, we can display the dataframe of the data the user has scraped. But first, we need to create columns with st.columns(), which places the Clear Data button side by side with the Download data as CSV button.

Then, if the user clicks the Clear Data button, the app runs the clear_data() function and shows the now-empty dataframe with st.dataframe(); otherwise it shows the data currently in the dataframe, i.e. the data that has been scraped. If the user clicks the Download data as CSV button, the data is downloaded through the download_file() function and st.download_button().

df = pd.read_csv("output.csv")

colmn1, colmn2 = st.columns([1,1])
with colmn1:
    clear_button = st.button('Clear Data', type="primary")

with colmn2:
    csv = download_file(df)
    st.download_button(
        label="Download data as CSV",
        data=csv,
        file_name='output.csv',
        mime='text/csv',
        type="secondary"
    )

if clear_button:
    df = clear_data(df)
    st.dataframe(df)
else:
    st.dataframe(df)

This is the complete code for Streamlit:

import streamlit as st
import pandas as pd
from subprocess import call

def start_scrapping(course, page):
    # run the scraping spider using subprocess,
    # passing the user's input through Scrapy's -a arguments
    call(["scrapy", "crawl", "coursera_data",
          "-a", f"course={course}", "-a", f"page={page}"])

    return "success"

def download_file(dataframe):
    def convert_df(df):
        return df.to_csv().encode('utf-8')

    csv = convert_df(dataframe)

    return csv

def clear_data(df):
    # columns = ['course_title','course_by','course_link','skills_you_gain','course_ratings','course_reviews','course_level']
    empty_df = df.iloc[0:0]
    empty_df.to_csv("output.csv", index=False)

    return empty_df

st.title('Coursera Data Scraping')
st.caption('This app will scraping courses data from coursera')

with st.form(key = 'user_info'):
    course = st.text_input('What course you want to scrap ? (i.e. data engineer)')
    page = st.number_input('How many page you want to scrap ?', 1, 100)

    submit_form = st.form_submit_button(label="Submit", help="Click to start scraping")
    if submit_form:
        start_scrapping(course, page)

df = pd.read_csv("output.csv")

colmn1, colmn2 = st.columns([1,1])
with colmn1:
    clear_button = st.button('Clear Data', type="primary")

with colmn2:
    csv = download_file(df)
    st.download_button(
        label="Download data as CSV",
        data=csv,
        file_name='output.csv',
        mime='text/csv',
        type="secondary"
    )

if clear_button:
    df = clear_data(df)
    st.dataframe(df)
else:
    st.dataframe(df)

App Overview

AND FINALLY 🎉🎉: all the steps to build an end-to-end data scraping app are done !!! 👍👨🏻‍💻👏🏻

If you have any question and want to keep contact with me you can reach me on LinkedIn: https://www.linkedin.com/in/felix-pratamasan/
