Scraping Indeed job postings

Option 1: Subscribe to Specrom’s Indeed Scraper API

Option 2: Full-service web scraping

Option 3: Scrape indeed.com on your own

# Using Selenium to extract Indeed.com's raw HTML source
import time

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

test_url = 'https://www.indeed.com/'

option = webdriver.ChromeOptions()
option.add_argument("--incognito")

# Path to the ChromeDriver executable; adjust it for your system.
chromedriver = r'chromedriver.exe'
# Selenium 4 takes the driver path through a Service object.
browser = webdriver.Chrome(service=Service(executable_path=chromedriver), options=option)
browser.get(test_url)

# Fill in the "what" (keywords) and "where" (location) search boxes.
text_area = browser.find_element(By.ID, 'text-input-what')
text_area.send_keys("Web scraping")
text_area2 = browser.find_element(By.ID, 'text-input-where')
text_area2.send_keys("Atlanta, GA")

# Submit the search form.
element = browser.find_element(By.XPATH, '//*[@id="whatWhereFormId"]/div[3]/button')
element.click()

# Give the results page a moment to load before reading its source.
time.sleep(3)
html_source = browser.page_source
browser.quit()

Using BeautifulSoup to extract Indeed job postings

Extracting job titles

# extracting job titles
soup = BeautifulSoup(html_source, "html.parser")
# attrs must be a dict ({'class': 'title'}), not a set ({'class', 'title'})
job_title_src = soup.find_all('h2', {'class': 'title'})
job_title_list = []
for val in job_title_src:
    job_title_list.append(val.get_text())
job_title_list

# Output
['Data Analyst',
 'Data Analyst',
 'Data/Reporting Analyst',
 'Sr. Data Analyst',
 'Data Analyst',
 'Data Analyst',
 'Behavior Data Analyst - Marcus Autism Center - Behavioral Analysis Core',
 'Data Analyst',
 'Data Analyst',
 'Police Analyst',
 'Data Analyst (2021-1614)',
 'Employee Data Analyst',
 'Data Analyst',
 'Data and Research Analyst',
 'Data Analyst (13255)']

Extracting company names

# extracting company names
company_name_src = soup.find_all('span', {'class': 'company'})
company_name_list = []
for val in company_name_src:
    company_name_list.append(val.get_text())
company_name_list

# Output
['KIPP Foundation',
 'Emory University',
 'City of Atlanta, GA',
 'The Coca-Cola Company',
 'KIPP Metro Atlanta Schools',
 'Spartan Technologies',
 "Children's Healthcare of Atlanta",
 'ARK Solutions',
 'Anthem',
 'City of Forest Park, GA',
 'Atrium CWS',
 'Salesforce',
 'Sovos Compliance',
 'Southern Poverty Law Center',
 'Baer Group']

Extracting snippets

# extracting the snippet from each job posting
snippet_src = soup.find_all('div', {'class': 'summary'})
snippet_list = []
for val in snippet_src:
    snippet_list.append(val.get_text())
snippet_list[:3]

# Output
['\n\nMaintain and troubleshoot the integrity of data linkages between data source systems and data warehouse.\nCollaborate with other Data Team members to develop and...\n',
 '\n\nCreates and maintains a data dictionary and meta data.\nAnalyzing data reporting data for clinical outcomes, qualitative and other types of research.\n',
 '\n\n3 years of work experience in creation, reporting, and/or management of data or closely related tasks (not including data entry).\n']
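The three lookups above each build a separate list, so a posting that lacks a company or a snippet would shift every later entry out of alignment. One way around this is to walk the page one posting at a time and pull all three fields from the same container. The sketch below runs on a small inline HTML sample rather than the live page; the `job-card` container class and the helper names `text_or_none` and `parse_cards` are assumptions made for illustration, not Indeed's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for html_source.
# The second posting deliberately has no company element.
sample_html = """
<div class="job-card">
  <h2 class="title"><a>Data Analyst</a></h2>
  <span class="company">Emory University</span>
  <div class="summary">Creates and maintains a data dictionary.</div>
</div>
<div class="job-card">
  <h2 class="title"><a>Police Analyst</a></h2>
  <div class="summary">3 years of work experience.</div>
</div>
"""


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


def parse_cards(html, card_class="job-card"):
    """Build one dict per posting so the fields stay aligned."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_=card_class):
        rows.append({
            "title": text_or_none(card, "h2", "title"),
            "company": text_or_none(card, "span", "company"),
            "snippet": text_or_none(card, "div", "summary"),
        })
    return rows


rows = parse_cards(sample_html)
```

A missing field shows up as `None` in that posting's row instead of silently shortening one of the lists.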

Converting into CSV file
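
The job titles, company names and snippets collected above can be written out as one CSV row per posting. What follows is a minimal sketch using Python's standard `csv` module, not necessarily the approach originally used here; the function name `write_jobs_csv`, the column names and the abbreviated sample values are assumptions made for illustration.

```python
import csv
import io


def write_jobs_csv(titles, companies, snippets, fileobj):
    """Write one row per posting and return the number of rows written."""
    writer = csv.writer(fileobj)
    writer.writerow(["job_title", "company", "snippet"])
    count = 0
    # zip() stops at the shortest list, so unmatched trailing items are dropped.
    for title, company, snippet in zip(titles, companies, snippets):
        # Collapse the newlines and padding that get_text() leaves behind.
        writer.writerow([title.strip(), company.strip(), " ".join(snippet.split())])
        count += 1
    return count


# In-memory demonstration with abbreviated sample values.
buffer = io.StringIO()
n_rows = write_jobs_csv(
    ["Data Analyst", "Data Analyst"],
    ["KIPP Foundation", "Emory University"],
    ["\n\nMaintain and troubleshoot data linkages.\n",
     "\n\nCreates and maintains a data dictionary.\n"],
    buffer,
)
csv_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open("indeed_jobs.csv", "w", newline="", encoding="utf-8")` (the file name is only an example). Since pandas is already imported, building a `DataFrame` from the three lists and calling `to_csv(..., index=False)` would work equally well.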

Scaling up to a full crawler for extracting all Indeed job postings

  • Once you scale up to thousands of requests to fetch every page, the indeed.com servers will either block your IP address outright or flag you and start serving CAPTCHAs.
  • To improve your chances of fetching data for the whole USA, you will have to:
      • rotate proxy IP addresses, preferably using residential proxies
      • rotate user agents
      • use an external CAPTCHA-solving service such as 2captcha or anticaptcha.com
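
The proxy and user-agent rotation listed above can be sketched as a small helper that hands out a different combination for each request. This is only an illustration under stated assumptions: the proxy addresses are placeholders, the user-agent strings are examples, and the helper name `next_request_settings` is hypothetical. CAPTCHA solving is not covered.

```python
import itertools
import random

# Placeholder proxy endpoints; replace with addresses from your proxy provider.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

# Example browser user-agent strings to pick from.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

# cycle() walks the proxy list round-robin and wraps around at the end.
proxy_pool = itertools.cycle(PROXIES)


def next_request_settings():
    """Return headers and proxies for the next request."""
    proxy = next(proxy_pool)
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
    }


first = next_request_settings()
second = next_request_settings()
```

The returned dictionary matches the `headers` and `proxies` keyword arguments of `requests.get`, so it could be unpacked as `requests.get(url, **next_request_settings())`. With Selenium, the proxy and user agent would instead have to be set through the browser options.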

Jay M. Patel

Cofounder and principal data scientist at Specrom Analytics (specrom.com); natural language processing and web crawling/scraping expert. Personal site: JayMPatel.com