Scraping Indeed job postings


Option 1: Subscribe to Specrom’s Indeed Scraper API

We have an Indeed Scraper API that extracts the pertinent job posting information, such as company name, city, snippet, and job title. You only need to specify a search query and a location.

Option 2: Full-service web scraping

If you just need job postings data as a CSV or Excel file, then simply contact us for our full-service web scraping. You can sit back and let us handle all the backend issues to get the data you need.

Option 3: Scrape indeed.com on your own

Python is great for web scraping, and we will use a library called Selenium to extract job postings from Indeed for Atlanta, GA.

    ### Using Selenium to extract Indeed.com's raw html source
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select
    import time
    from bs4 import BeautifulSoup
    import numpy as np
    import pandas as pd

    test_url = 'https://www.indeed.com/'

    option = webdriver.ChromeOptions()
    option.add_argument("--incognito")

    # Path to the ChromeDriver binary, wrapped in a Service object (Selenium 4 style)
    chromedriver = r'chromedriver.exe'
    browser = webdriver.Chrome(service=Service(chromedriver), options=option)
    browser.get(test_url)

    # Fill in the "what" (search query) box
    text_area = browser.find_element(By.ID, 'text-input-what')
    text_area.send_keys("Web scraping")

    # Fill in the "where" (location) box
    text_area2 = browser.find_element(By.ID, 'text-input-where')
    text_area2.send_keys("Atlanta, GA")

    # Submit the search form
    element = browser.find_element(By.XPATH, '//*[@id="whatWhereFormId"]/div[3]/button')
    element.click()

    # Give the results page a moment to load before grabbing the HTML
    time.sleep(3)
    html_source = browser.page_source
    browser.close()

Using BeautifulSoup to extract Indeed job postings

Once we have the raw HTML source, we can parse it with a Python library called BeautifulSoup.

Extracting job titles

From inspecting the HTML source, we see that job titles are in h2 tags of class ‘title’.

    # extracting job titles
    soup = BeautifulSoup(html_source, "html.parser")
    # attrs is a dict mapping the attribute name to the value to match
    job_title_src = soup.find_all('h2', {'class': 'title'})
    job_title_list = []
    for val in job_title_src:
        job_title_list.append(val.get_text())
    job_title_list

    #Output
    ['Data Analyst',
     'Data Analyst',
     'Data/Reporting Analyst',
     'Sr. Data Analyst',
     'Data Analyst',
     'Data Analyst',
     'Behavior Data Analyst - Marcus Autism Center - Behavioral Analysis Core',
     'Data Analyst',
     'Data Analyst',
     'Police Analyst',
     'Data Analyst (2021-1614)',
     'Employee Data Analyst',
     'Data Analyst',
     'Data and Research Analyst',
     'Data Analyst (13255)']

Extracting company names

The next step is extracting company names. Each one is in a span tag of class ‘company’.

    # extracting company names
    company_name_src = soup.find_all('span', {'class': 'company'})
    company_name_list = []
    for val in company_name_src:
        company_name_list.append(val.get_text())
    company_name_list

    # Output
    ['KIPP Foundation',
     'Emory University',
     'City of Atlanta, GA',
     'The Coca-Cola Company',
     'KIPP Metro Atlanta Schools',
     'Spartan Technologies',
     "Children's Healthcare of Atlanta",
     'ARK Solutions',
     'Anthem',
     'City of Forest Park, GA',
     'Atrium CWS',
     'Salesforce',
     'Sovos Compliance',
     'Southern Poverty Law Center',
     'Baer Group']
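
The title and company lists above come from two separate find_all calls, so they only line up row by row if every search result contains both tags. A possible way around that is to loop over each result container and pull both fields from it. The sketch below runs on hand-written sample markup, not real Indeed output; the 'result' container class and the extract_cards helper are hypothetical names, while 'title' and 'company' follow the classes used in this article.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup for illustration; the second card
# deliberately has no company tag.
sample_html = """
<div class="result">
  <h2 class="title">Data Analyst</h2>
  <span class="company">Emory University</span>
</div>
<div class="result">
  <h2 class="title">Sr. Data Analyst</h2>
</div>
"""


def extract_cards(html, card_class="result"):
    """Return one dict per result card, using None for missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.find_all("div", class_=card_class):
        title_tag = card.find("h2", class_="title")
        company_tag = card.find("span", class_="company")
        records.append({
            "title": title_tag.get_text(strip=True) if title_tag else None,
            "company": company_tag.get_text(strip=True) if company_tag else None,
        })
    return records


records = extract_cards(sample_html)
print(records)
```

With this approach a missing company shows up as None in its own row instead of silently shifting every later company up by one position.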

Extracting snippets

Snippets are a couple of sentences of text that briefly describe each job posting. Along with the job title and company name, they are among the most important pieces of information to extract from each Indeed search result.

    # extracting snippets from each job posting
    snippet_src = soup.find_all('div', {'class': 'summary'})
    snippet_list = []
    for val in snippet_src:
        snippet_list.append(val.get_text())
    snippet_list[:3]

    # Output
    ['\n\nMaintain and troubleshoot the integrity of data linkages between data source systems and data warehouse.\nCollaborate with other Data Team members to develop and...\n',
     '\n\nCreates and maintains a data dictionary and meta data.\nAnalyzing data reporting data for clinical outcomes, qualitative and other types of research.\n',
     '\n\n3 years of work experience in creation, reporting, and/or management of data or closely related tasks (not including data entry).\n']
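
The snippet output above still carries leading, trailing, and embedded newlines. If you want single-line text before exporting, one simple option is to collapse every run of whitespace into a single space. This is a minimal sketch using only built-in string methods; clean_snippet is a hypothetical helper name, and the sample strings are copied from the output above.

```python
def clean_snippet(raw_text):
    """Collapse newlines and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any whitespace run and
    # drops empty pieces, so leading/trailing whitespace disappears too.
    return " ".join(raw_text.split())


snippet_list = [
    '\n\nCreates and maintains a data dictionary and meta data.\nAnalyzing data reporting data for clinical outcomes, qualitative and other types of research.\n',
    '\n\n3 years of work experience in creation, reporting, and/or management of data or closely related tasks (not including data entry).\n',
]

clean_snippet_list = [clean_snippet(text) for text in snippet_list]
for text in clean_snippet_list:
    print(text)
```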

Converting into CSV file

You can take the lists above and load them into a pandas DataFrame. Once you have the DataFrame, you can easily export it to CSV, Excel, or JSON.
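
As a rough sketch of that step, the snippet below builds a DataFrame from the first three entries of each list shown earlier and serializes it to CSV. The column names and the build_jobs_frame helper are assumptions made for this example, and the length check is there because lists of unequal length would produce misaligned rows.

```python
import pandas as pd

# First three values of each list, copied from the outputs above.
job_title_list = ['Data Analyst', 'Data Analyst', 'Data/Reporting Analyst']
company_name_list = ['KIPP Foundation', 'Emory University', 'City of Atlanta, GA']
snippet_list = [
    '\n\nMaintain and troubleshoot the integrity of data linkages between data source systems and data warehouse.\nCollaborate with other Data Team members to develop and...\n',
    '\n\nCreates and maintains a data dictionary and meta data.\nAnalyzing data reporting data for clinical outcomes, qualitative and other types of research.\n',
    '\n\n3 years of work experience in creation, reporting, and/or management of data or closely related tasks (not including data entry).\n',
]


def build_jobs_frame(titles, companies, snippets):
    """Combine the three lists into one DataFrame, one row per posting."""
    if not (len(titles) == len(companies) == len(snippets)):
        raise ValueError("lists differ in length; rows would be misaligned")
    return pd.DataFrame({
        "job_title": titles,
        "company": companies,
        "snippet": snippets,
    })


jobs_df = build_jobs_frame(job_title_list, company_name_list, snippet_list)

# With no path argument, to_csv returns the CSV text as a string.
# Pass a file name instead (for example "indeed_jobs.csv") to write to disk.
csv_text = jobs_df.to_csv(index=False)
print(csv_text)
```

The same DataFrame also has to_json and to_excel methods for the other two formats; to_excel needs an Excel writer package such as openpyxl installed.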

Scaling up to a full crawler for extracting all Indeed job postings

  • Once you scale up to thousands of requests to fetch all the pages, the indeed.com servers will either block your IP address outright or flag you and start serving CAPTCHAs.
  • To improve your chances of fetching data for the entire USA, you will have to:
      • rotate proxy IP addresses, preferably using residential proxies
      • rotate user agents
      • use an external CAPTCHA-solving service like 2captcha or anticaptcha.com
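
The proxy and user-agent rotation in the list above could be sketched roughly as follows. Everything concrete here is a placeholder: the proxy addresses come from an IP range reserved for documentation, the user-agent strings are just examples, and rotating_settings and chrome_arguments are hypothetical helper names. The sketch only produces Chrome command-line switches (--proxy-server and --user-agent); it does not launch a browser or handle CAPTCHAs.

```python
import itertools
import random

# Placeholder values for illustration only; substitute your own.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def rotating_settings(proxies, user_agents):
    """Yield an endless stream of proxy/user-agent pairs.

    Proxies are used round-robin; the user agent is picked at random.
    """
    for proxy in itertools.cycle(proxies):
        yield {"proxy": proxy, "user_agent": random.choice(user_agents)}


def chrome_arguments(settings):
    """Translate one settings dict into Chrome command-line switches."""
    return [
        "--incognito",
        f"--proxy-server={settings['proxy']}",
        f"--user-agent={settings['user_agent']}",
    ]


settings_stream = rotating_settings(PROXIES, USER_AGENTS)
for _ in range(3):
    # Show which proxy each new browser session would use.
    print(chrome_arguments(next(settings_stream))[1])
```

Each returned string can be passed to add_argument on a fresh ChromeOptions object before starting a new browser session, in the same way "--incognito" is passed in the Selenium code earlier.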

Jay M. Patel

Cofounder/principal data scientist at Specrom Analytics (specrom.com); natural language processing and web crawling/scraping expert. Personal site: JayMPatel.com