Web Scraping Glassdoor Job Listings for Data Analysis
Introduction
Glassdoor is a popular platform for job seekers and employers alike, offering a wealth of information on job listings, salaries, and company reviews. In this article, we will demonstrate how to extract job listings from Glassdoor using Python and Beautiful Soup for data analysis. This information can help job seekers make data-driven decisions and gain insights into job market trends.
Requirements
To follow along with this tutorial, you will need the following:
- Python 3.x installed on your system.
- Beautiful Soup 4 and the Requests library installed. You can install them using pip:
pip install beautifulsoup4 requests
Step 1: Setting Up the Project
First, create a new directory for your project and navigate to it:
mkdir glassdoor_scraper
cd glassdoor_scraper
Next, create a Python file to write your code:
touch glassdoor_scraper.py
Step 1.5: Logging in to Glassdoor
Some Glassdoor features may require you to log in before accessing job listings or additional information. To handle the login process, you can use the requests.Session object, which allows you to persist certain parameters, such as cookies, across requests.
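To illustrate what the session buys you, the sketch below sets a cookie by hand (the cookie name is invented; in practice Glassdoor's server sets cookies in the login response) and shows that requests made through the session carry it automatically:

```python
# Illustration only: cookies stored on a Session are attached to every later
# request made through it. "example_cookie" is a made-up name; real cookies
# are set by the server when you log in.
import requests

session = requests.Session()
session.cookies.set("example_cookie", "abc123")

# prepare_request shows the headers a request through this session would send:
prepared = session.prepare_request(requests.Request("GET", "https://www.glassdoor.com/"))
print(prepared.headers.get("Cookie"))  # example_cookie=abc123
```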
Note that Glassdoor is a constantly growing and evolving website, so some user journeys, class names, or page URLs may already have changed since this was written.
With that caveat in mind, let's get back to the subject. First, you need to create a new session:
import requests

session = requests.Session()
Next, you need to obtain your Glassdoor email and password. For security reasons, it is best not to store this information directly in your script. Instead, you can use environment variables:
import os
email = os.environ.get("GLASSDOOR_EMAIL")
password = os.environ.get("GLASSDOOR_PASSWORD")
To set environment variables, you can use the following commands in your terminal:
export GLASSDOOR_EMAIL="your_email@example.com"
export GLASSDOOR_PASSWORD="your_password"
Now, you can log in to Glassdoor by submitting your email and password through a POST request to the login endpoint:
login_url = "https://www.glassdoor.com/profile/login_input.htm"
login_data = {
    "username": email,
    "password": password,
}
response = session.post(login_url, data=login_data, headers={'User-Agent': 'Mozilla/5.0'})
# Check if the login was successful
if response.status_code == 200 and "Logout" in response.text:
    print("Login successful")
else:
    print("Failed to log in")
Once you are logged in, you can use the session object to send requests that require authentication. Replace the requests.get() call in Step 2 with the following line:
response = session.get(url, headers={'User-Agent': 'Mozilla/5.0'})
Now your script should be able to handle the login process and access job listings that require authentication. Make sure to follow the same approach when submitting any subsequent requests to Glassdoor within your script.
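The login steps above can be wrapped into a small reusable helper. This is a sketch under the same assumptions as above: the login_input.htm endpoint and the "Logout" success marker may both have changed on the live site.

```python
# Sketch of a reusable login helper; the endpoint URL and the "Logout"
# success marker are the assumptions described above and may have changed.
LOGIN_URL = "https://www.glassdoor.com/profile/login_input.htm"
HEADERS = {"User-Agent": "Mozilla/5.0"}

def build_login_payload(email, password):
    """Assemble the form fields the login endpoint expects."""
    return {"username": email, "password": password}

def glassdoor_login(session, email, password):
    """POST credentials on the given session; True if a Logout link appears."""
    response = session.post(LOGIN_URL, data=build_login_payload(email, password),
                            headers=HEADERS)
    return response.status_code == 200 and "Logout" in response.text
```

Call it as glassdoor_login(session, email, password) with the session and the environment-variable credentials from earlier.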
Step 2: Fetching Job Listings
To fetch job listings, you need to construct a URL with the desired search query, location, and page number. Replace the query and location variables with your preferred job title and location:
import requests
base_url = "https://www.glassdoor.com/Job/jobs.htm"
query = "data scientist"
location = "new york"
page = 1
url = f"{base_url}?q={query}&l={location}&p={page}"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
# Check if the request was successful
if response.status_code == 200:
    page_content = response.text
    print(page_content)
else:
    print(f"Failed to fetch content from {url}")
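One caveat about the f-string URL: values like "data scientist" contain spaces that should be encoded. Requests can build the query string for you via its params argument; the sketch below shows what that produces, using the same q/l/p parameter names as above.

```python
# Letting Requests (or urlencode) build the query string handles spaces and
# special characters automatically instead of interpolating them raw.
from urllib.parse import urlencode

base_url = "https://www.glassdoor.com/Job/jobs.htm"
params = {"q": "data scientist", "l": "new york", "p": 1}

# Equivalent to requests.get(base_url, params=params, ...):
url = f"{base_url}?{urlencode(params)}"
print(url)  # https://www.glassdoor.com/Job/jobs.htm?q=data+scientist&l=new+york&p=1
```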
Step 3: Parsing HTML Content with Beautiful Soup
Now that we have the web page content, we can use Beautiful Soup to parse the HTML and extract job listings.
from bs4 import BeautifulSoup
soup = BeautifulSoup(page_content, "html.parser")
print(soup.prettify())
Step 4: Extracting Job Listings
After inspecting the HTML structure of the Glassdoor job listing page, we can see that each job listing is contained within a ‘div’ element with the class ‘jobContainer’. Let’s extract all job listings using Beautiful Soup’s find_all method:
job_listings = soup.find_all("div", class_="jobContainer")

for job in job_listings:
    title = job.find("a", class_="jobLink").text
    company = job.find("div", class_="jobInfoItem jobEmpolyerName").text.strip()
    location = job.find("span", class_="jobInfoItem jobLocation").text.strip()
    print(f"{title}\n{company}\n{location}\n")
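One caution: any of these elements can be missing from a listing (and the class names drift over time), in which case calling .text on a find() that returned None raises an AttributeError. A small defensive helper avoids that; the HTML fragment below is an invented, simplified stand-in for a real jobContainer:

```python
# Defensive extraction: return a default instead of raising when an element
# is missing. The HTML fragment here is a made-up example for illustration.
from bs4 import BeautifulSoup

def safe_text(parent, tag, class_name, default="Not provided"):
    """Stripped text of the first matching child, or the default."""
    element = parent.find(tag, class_=class_name)
    return element.text.strip() if element else default

html = '<div class="jobContainer"><a class="jobLink">Data Scientist</a></div>'
job = BeautifulSoup(html, "html.parser").find("div", class_="jobContainer")
print(safe_text(job, "a", "jobLink"))                     # Data Scientist
print(safe_text(job, "span", "jobInfoItem jobLocation"))  # Not provided
```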
Step 5: Extracting Additional Information (Optional)
You may want to extract additional information such as salary estimates, job description, or company ratings. To do this, you can inspect the HTML structure of the job listing page further and use Beautiful Soup’s methods to extract the desired information. For example, to extract the salary estimate:
salary_estimate = job.find("span", class_="jobInfoItem jobSalaryEstimate")
if salary_estimate:
    salary_estimate = salary_estimate.text.strip()
else:
    salary_estimate = "Not provided"

print(salary_estimate)
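The extracted salary estimate is a raw string. If your listings use the common "$80K-$120K (Glassdoor est.)" style (an assumption; the format varies by page and locale), a small parser can turn it into numbers for analysis:

```python
# Sketch of a salary-string parser; it assumes the "$80K-$120K" style and
# should be adapted if your listings use a different format.
import re

def parse_salary_range(text):
    """Return (low, high) in dollars, or None if no "$NNK" range is found."""
    matches = re.findall(r"\$(\d+)K", text)
    if len(matches) >= 2:
        return int(matches[0]) * 1000, int(matches[1]) * 1000
    return None

print(parse_salary_range("$80K-$120K (Glassdoor est.)"))  # (80000, 120000)
print(parse_salary_range("Not provided"))                 # None
```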
Step 6: Exporting the Data
Once you have extracted the desired job listing information, you can export the data to a CSV file for further analysis:
import csv
def main():
    # Fetch and parse the page content, then extract job listings (steps 1.5, 2-4)
    with open("glassdoor_jobs.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["title", "company", "location", "salary_estimate"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for job in job_listings:
            title = job.find("a", class_="jobLink").text
            company = job.find("div", class_="jobInfoItem jobEmpolyerName").text.strip()
            location = job.find("span", class_="jobInfoItem jobLocation").text.strip()
            salary_estimate = job.find("span", class_="jobInfoItem jobSalaryEstimate")
            if salary_estimate:
                salary_estimate = salary_estimate.text.strip()
            else:
                salary_estimate = "Not provided"
            writer.writerow({"title": title, "company": company, "location": location, "salary_estimate": salary_estimate})

if __name__ == "__main__":
    main()
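With the CSV in hand, a first-pass analysis needs nothing beyond the standard library. The sketch below counts listings per location; it uses a small in-memory sample in place of glassdoor_jobs.csv so it is self-contained (the rows are invented):

```python
# Quick tally of listings per location, shown on an in-memory sample CSV
# (invented data) so the snippet runs without the scraped file.
import csv
import io
from collections import Counter

sample = io.StringIO(
    "title,company,location,salary_estimate\n"
    "Data Scientist,Acme,New York,$100K\n"
    "Data Analyst,Beta,New York,Not provided\n"
    "ML Engineer,Gamma,Brooklyn,$120K\n"
)
rows = list(csv.DictReader(sample))
by_location = Counter(row["location"] for row in rows)
print(by_location)  # Counter({'New York': 2, 'Brooklyn': 1})
```

To run it against your real export, replace the StringIO sample with open("glassdoor_jobs.csv", newline="", encoding="utf-8").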
Step 7: Paginating Through Job Listings
To scrape multiple pages of job listings, you can use a loop that iterates through the desired number of pages and updates the page variable in the URL:
num_pages = 5
for page in range(1, num_pages + 1):
    url = f"{base_url}?q={query}&l={location}&p={page}"
    # Fetch and parse the page content, then extract job listings (steps 2-4)
Make sure to add a delay between requests using the time.sleep() function to avoid overwhelming the Glassdoor servers or getting blocked:
import time
# ...
for page in range(1, num_pages + 1):
    # ...
    # Sleep for a few seconds before the next request
    time.sleep(5)
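A fixed five-second delay is a fine start. Randomizing the delay and retrying transient failures makes the crawler a little more polite and robust; here is a sketch (the defaults are arbitrary and worth tuning):

```python
# Politeness sketch: randomized delays plus a simple retry with linear
# backoff, so one transient failure does not abort the whole crawl.
import random
import time

def polite_sleep(base=5, jitter=3):
    """Sleep base seconds plus a random 0..jitter extra; return the delay."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

def fetch_with_retry(fetch, retries=3, backoff=10):
    """Call fetch(); on an exception, wait and retry up to `retries` times."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))
```

In the loop above you would call fetch_with_retry(lambda: session.get(url, headers={'User-Agent': 'Mozilla/5.0'})) and then polite_sleep() between pages.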
Conclusion
In this article, we demonstrated how to scrape Glassdoor job listings for data analysis using Python and Beautiful Soup, including handling the login process. You can use this information to gain insights into job market trends, analyze salary data, or optimize your job search. Please remember to respect Glassdoor’s terms of service and robots.txt file when using web scraping, and do not use the extracted data for commercial purposes without permission. Happy coding!
If you found this article valuable or insightful, I’d greatly appreciate your support by following me, Jonathan Mondaut, here on Medium. I’m committed to sharing more engaging and practical content that will help you stay ahead in today’s fast-paced world. Don’t miss out on my upcoming articles — follow me now, and let’s learn and grow together!