Extract data from any webpage using this technique!
In this article, we are going to take a look at how to scrape data from the web using BeautifulSoup and Selenium, and how to clean that data for further analysis.
By: Daksh Bhatnagar
INTRODUCTION
Extracting data from external sources is really important these days, since data mostly lives in databases, on hosted web pages, or behind an API. For scraping data from a database we generally use SQL (Structured Query Language) (HELLO, ARTICLE IDEA FOR THE FUTURE!!), but for extracting data from websites we use packages built specifically for such use cases. The skill is in high demand in organizations where the data is large and stored in complex formats.
Today, we are going to look at two of the most widely used packages, BeautifulSoup and Selenium, to scrape data from a web page that is already hosted, and we will use Python to get the job done.
Selenium is browser-automation software; BeautifulSoup isn't, it is a library for parsing HTML.
While there are great benefits to web scraping (cost-effectiveness and easy access to data), there is one thing to keep in mind before you start: the legal side. Web scraping isn't illegal in itself, but not all websites will allow you to scrape their data, mainly because automated requests add load and can slow the site down.
APPLICATION
Let’s now start scraping the data. We will use BeautifulSoup first, clean the data and then visualize it, and then we will repeat the same process by using Selenium.
WEB SCRAPING JOB POSTINGS USING BEAUTIFULSOUP
First of all, let's make some imports. We are using the requests library for getting a response object from the website, the re package for regular expressions, pandas for converting the data into a data frame, and matplotlib and seaborn for visualization.
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
First of all, what you want to do is get a response object using the requests library. This tells us whether we will be able to scrape the data on the website or not: if it comes back with a 200 status code, you are good to go.
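A minimal sketch of that check (the shine.com search URL below is just a placeholder, swap in the page you actually want to scrape):
#Placeholder URL for the job listings page we want to scrape
url = 'https://www.shine.com/job-search/data-science-jobs'
#Getting the response object from the page
response = requests.get(url)
#200 means the page can be fetched
print(response.status_code)
Now, let's create a BeautifulSoup object and use an HTML parser: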
soup = BeautifulSoup(response.text,'html.parser')
Creating this soup object helps us navigate the website, going through the various HTML tags and classes used to build the page so we can extract the content we are interested in. We will be scraping job postings from shine.com.
#Fetching Job titles
req = soup.select('div h2[itemprop="name"]')
#Cleaning the data using list comprehension
titles = [r.text for r in req]
What's happening above is that we use the select method to go into the div and h2 tags and fetch every result where itemprop="name". The raw results are not clean, so we extract just the text from each item with a list comprehension.
While there is a host of things you could do to clean up this data, here we simply get rid of the pipe symbol and any extra spaces in the text. Below is the code for that:
#getting rid of the pipe symbol
titles1 = [t.replace("|","") for t in titles]
#getting rid of any extra spaces
titles = [" ".join(t.split()) for t in titles1]
Next, we will fetch the firm name. The find_all method returns every element with the class name you pass in, whereas the find method returns only the first element that matches that class.
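To make the difference concrete, here is a small sketch using the same class name as below (assuming at least one job card is present on the page; the class names on shine.com can change over time):
#find returns only the first matching element
first_firm = soup.find('div', class_='jobCard_jobCard_cName__mYnow')
#find_all returns a list of every matching element
all_firms = soup.find_all('div', class_='jobCard_jobCard_cName__mYnow')
print(first_firm.text, len(all_firms))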
#fetching all the results that contain the employer name
results = soup.find_all('div', class_='jobCard_jobCard_cName__mYnow')
#Getting only the text from the results
cleanresults = [o.text for o in results]
sub_str = "Hiring"
#Splitting data based on condition and getting first element
companies = [o.split(sub_str)[0] for o in cleanresults]
We will follow a similar approach for the job location, experience, and number of positions.
#Collecting the Job Locations results
loc = soup.find_all('div', class_='jobCard_jobCard_lists__fdnsc')
#fetching the text from the results collected
results = [l.div.text for l in loc]
#replacing the + sign with ,
locations = [l.replace("+", ",") for l in results]
#Using regex to get rid of any numbers
pattern = r'[0-9]'
locations = [re.sub(pattern, '', l) for l in locations]
Let’s follow a similar process for the rest of the data
#Fetching the experience requirements
experience = [l.find_all('div')[-1].text for l in loc]
#Getting the number of vacancies
vacancies = soup.find_all('ul', class_='jobCard_jobCard_jobDetail__jD82J')
#Cleaning up the vacancies results
vac = [v.text.split("Positions")[0][-3:] for v in vacancies]
vac = [v.replace('lar', '1') for v in vac]
strpattern = r'[a-z]'
vac = [re.sub(strpattern, '', v) for v in vac]
vacancies = [v.replace(' ', '') for v in vac]
The split method, as before, splits the text on the string 'Positions'; from that we take the first element and keep its last three characters on each iteration. The remaining lines strip out letters and spaces so that only the number is left.
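To make that concrete, here is a tiny worked example on a made-up string (the real text scraped from shine.com will differ):
#Made-up example of a scraped vacancy string
sample = 'Full Time5 Positions'
#Keep the text before 'Positions', then its last three characters
piece = sample.split("Positions")[0][-3:]
print(piece) #prints 'e5 '
#Removing letters and spaces leaves just the number
print(re.sub(r'[a-z]', '', piece).replace(' ', '')) #prints '5'
With the lists cleaned up, we can now convert them into a data frame.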
df = pd.DataFrame({'Titles':titles, 'Firm Name': companies,
                   'Job Location':locations, 'Experience':experience, 'Positions': vacancies})
Here is what the resulting data frame looks like: one row per job posting, with the title, firm name, location, experience, and number of positions as columns.
Next, we do some pre-processing: duplicates based on the Titles column are dropped and the Positions column is converted to an integer type. We also add a Category column that tells whether the job is for a fresher or an experienced professional.
df = df.drop_duplicates(subset=['Titles'])
df['Positions'] = df['Positions'].astype('int32')
#Creating a New Column
df['Category'] = ['Fresher' if '0' in i else 'Experienced' for i in df['Experience']]
Finally, we can plot the data to see, for example, how the postings split between freshers and experienced professionals.
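As a minimal sketch (using the df built above and the seaborn/matplotlib imports from earlier), charts along these lines do the job:
#Countplot of Fresher vs Experienced postings
plt.figure(figsize=(10,5))
sns.countplot(x=df['Category'])
plt.title('Job Postings by Category')
plt.show()
#Total open positions per category
df.groupby('Category')['Positions'].sum().plot(kind='bar', figsize=(10,5))
plt.ylabel('Open Positions')
plt.title('Open Positions by Category')
plt.show()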
WEB SCRAPING AMAZON PRODUCTS USING SELENIUM
Let's now move on to Selenium and see how we can scrape product names, prices, and the number of reviews from Amazon. Before coding, be sure to download the ChromeDriver executable that matches your Chrome version and operating system, and save it in your working directory.
Let’s start off with the imports.
from selenium import webdriver
from time import sleep
We will now define the path of the ChromeDriver and set ourselves up for automating the web scraping
driver_path = 'chromedriver.exe'
#Opens the browser
browser = webdriver.Chrome(executable_path=driver_path)
#Goes to the amazon website
browser.get('https://www.amazon.in')
#Maximize the window
browser.maximize_window()
#Finding the elements
input_search = browser.find_element_by_id('twotabsearchtextbox')
search_button = browser.find_element_by_xpath("(//input[@type='submit'])[1]")
#Inserting the search keyword
input_search.send_keys("Smartphones under 50000")
#Waits for 2 seconds
sleep(2)
#Click on the search button to get the results
search_button.click()
What's happening in the above code is that we open the website, maximize the window, find the elements for the search box and the search button, and then enter the keyword, which in our case is 'Smartphones under 50000'. After entering the keyword we wait for two seconds and then click the search button to fetch results for that keyword.
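As a side note, instead of a fixed sleep you can use Selenium's explicit waits; here is a minimal sketch that waits for the product title spans (the same XPath used in the loop below) to appear:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

#Wait up to 10 seconds until at least one product title is present on the page
wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_element_located(
    (By.XPATH, "//span[@class='a-size-medium a-color-base a-text-normal']")))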
Next comes the part where we are getting the results using a loop.
#Lists where we want to store the data
products = []
prices = []
numReviews = []

for i in range(10):
    #Printing the page number
    print('Scraping page', i+1)
    #Getting the product names on the page
    product = browser.find_elements_by_xpath("//span[@class='a-size-medium a-color-base a-text-normal']")
    #Getting the product prices on the page
    price = browser.find_elements_by_xpath("//span[@class='a-price-whole']")
    #Getting the number of reviews on the page
    numReview = browser.find_elements_by_xpath("//span[@class='a-size-base s-underline-text']")
    #Iterating through each page to get the text of individual product names, prices and number of reviews
    for p in product:
        products.append(p.text)
    for pr in price:
        prices.append(pr.text)
    for n in numReview:
        numReviews.append(n.text)
    #Setting up the next button once a page is scraped
    next_button = browser.find_element_by_xpath("//span[@class='s-pagination-strip']")
    #Clicking on the next button
    next_button.click()
    #Waiting for 2 seconds for the page to load
    sleep(2)

#Closing the browser window once the loop is finished running
browser.quit()
Now that the data has been scraped, we will convert it into a data frame, do some pre-processing, and then visualize the columns.
#List of lists of products, prices and reviews
data = [products, prices, numReviews]

#Creating a dataframe
import pandas as pd
df = pd.DataFrame(data).T

#Renaming the columns
df.columns = ['Products', 'Prices', 'NumReviews']

#Dropping any rows with nan values
df.dropna(inplace=True)

#Getting the brand name of the mobile phone
df['Brand_Name'] = [i.split(' ')[0] for i in df.Products]

#Removing the comma in the prices column
df.Prices = [i.replace(',', '') for i in df.Prices]
#Converting the prices column to integer type
df['Prices'] = df['Prices'].astype('int64')

#Removing the comma in the reviews column
df.NumReviews = [i.replace(',', '') for i in df.NumReviews]
#Converting the reviews column to integer type
df['NumReviews'] = df['NumReviews'].astype('int64')
Our data looks ready to be plotted and we can now generate some insights from it. We will be using the Seaborn and Matplotlib libraries for plotting.
import seaborn as sns
import matplotlib.pyplot as plt
df['Brand_Name'].value_counts().plot(kind='bar', figsize=(22,7))
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.xlabel('Brands')
plt.ylabel('Count')
plt.title('Mobile Phones countplot, Brand Wise')
plt.show()
The chart above tells us how many phones each brand has in the data; Redmi leads, followed by Samsung.
plt.figure(figsize=(22,7))
sns.kdeplot(df['Prices'])
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.title('Mobile Prices distribution')
plt.show()
The chart above shows that the majority of the phones are priced around 20,000 INR.
plt.figure(figsize=(22,7))
sns.histplot(df['NumReviews'])
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.title('NumReviews distribution')
plt.show()
Most of the review counts fall below 50,000. Looking at the first bar, for example, around 58 phones have approximately 10,000 reviews.
plt.figure(figsize=(22,7))
sns.barplot(x=df['Brand_Name'], y=df['Prices'])
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.title('Mobile Prices distribution')
plt.show()
Brand-wise mobile phone prices
CONCLUSION
- Data mining is an important skill to develop because, unlike in the old days, data is now stored in databases and complex formats.
- While web scraping isn't illegal, not every website allows it, mainly because automated scraping adds load and can slow the site down.
- We have learned how to scrape data using BeautifulSoup and Selenium.
- Up next, I'll be covering more data mining tools and libraries that help automate the process of web scraping.
- If you liked these tips and they proved helpful to you, I'd appreciate it if you could give the article a clap and follow me for more upcoming Data Science, Machine Learning, and Artificial Intelligence articles.
Final Thoughts and Closing Comments
There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and looking for a way to cover these gaps, check out the certification programs provided by INSAID on their website. If you liked this article, I recommend you go with the Global Certificate in Data Science & AI, as it covers your foundations, machine learning algorithms, and deep neural networks (basic to advanced).