Scraping Indeed job postings

Jay M. Patel
Web Data Extraction
4 min readMay 14, 2021
Photo by KOBU Agency on Unsplash

So, what is the easiest way to get all the job posting details from Indeed.com?

Option 1: Subscribe to Specrom’s Indeed Scraper API

We have an Indeed Scraper API that will extract all the pertinent job posting information such as company name, city, snippet, job title etc. by just specifying a search query and a location.

The API is available on RapidAPI and there is a free trial with no credit card information required. Paid plans start at just $10.

Option 2: Full service web scraping service.

If you just need job postings data as a CSV or excel file, than simply contact us for our full service web scraping service. You can simply sit back and let us handle all the backend issues to get the data you need.

Option 3: Scrape indeed.com on your own

Python is great for web scraping and we will be using a library called Selenium to extract Job postings from Indeed for Atlanta, GA.

### Using Selenium to extract Indeed.com's raw html source from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import Select import time from bs4 import BeautifulSoup import numpy as np import pandas as pd test_url = 'https://www.indeed.com/' option = webdriver.ChromeOptions() option.add_argument("--incognito") chromedriver = r'chromedriver.exe' browser = webdriver.Chrome(chromedriver, options=option) browser.get(test_url) text_area = browser.find_element_by_id('text-input-what') text_area.send_keys("Web scraping") text_area2=browser.find_element_by_id('text-input-where') text_area.send_keys("Atlanta, GA") element = browser.find_element_by_xpath('//*[@id="whatWhereFormId"]/div[3]/button') element.click() html_source = browser.page_source browser.close()

Using BeautifulSoup to extract Indeed job postings

Once we have the raw html source, we should use a Python library called BeautifulSoup for parsing the raw html files.

Figure 1: Inspecting the source of HTML source code for Indeed.com search query webpage.

Extracting job titles

From inspecting the html source, we see that job titles have h2 tags and belong to class ‘title’.

# extracting job titles soup=BeautifulSoup(html_source, "html.parser") job_title_src = soup.find_all('h2', {'class','title'}) job_title_list = [] for val in job_title_src: try: job_title_list.append(val.get_text()) except: pass job_title_list #Output ['Data Analyst', 'Data Analyst', 'Data/Reporting Analyst', 'Sr. Data Analyst', 'Data Analyst', 'Data Analyst', 'Behavior Data Analyst - Marcus Autism Center - Behavioral Analysis Core', 'Data Analyst', 'Data Analyst', 'Police Analyst', 'Data Analyst (2021-1614)', 'Employee Data Analyst', 'Data Analyst', 'Data and Research Analyst', 'Data Analyst (13255)']

Extracting company names

The next step is extracting company names. We see that it is span tag of class ‘company’.

# extracting Indeed addresses company_name_src = soup.find_all('span',{'class', 'company'}) company_name_list = [] for val in company_name_src: company_name_list.append(val.get_text()) company_name_list # Output ['KIPP Foundation', 'Emory University', 'City of Atlanta, GA', 'The Coca-Cola Company', 'KIPP Metro Atlanta Schools', 'Spartan Technologies', "Children's Healthcare of Atlanta", 'ARK Solutions', 'Anthem', 'City of Forest Park, GA', 'Atrium CWS', 'Salesforce', 'Sovos Compliance', 'Southern Poverty Law Center', 'Baer Group']

Extracting snippets

Snippets are couple of sentences of text that briefly explain the job postings. Along with job title and company name, these are one of the most important pieces of information to extract from each Indeed job posting result.

As a reference, refer to the figure below.

Figure 2: individual job postings from Indeed.com search results.

# extracting snippets from each job postings snippet_src = soup.find_all('div', {'class', 'summary'}) snippet_list = [] for val in snippet_src: snippet_list.append(val.get_text()) snippet_list[:3] # Output ['\n\nMaintain and troubleshoot the integrity of data linkages between data source systems and data warehouse.\nCollaborate with other Data Team members to develop and...\n', '\n\nCreates and maintains a data dictionary and meta data.\nAnalyzing data reporting data for clinical outcomes, qualitative and other types of research.\n', '\n\n3 years of work experience in creation, reporting, and/or management of data or closely related tasks (not including data entry).\n']

Converting into CSV file

You can take the lists above, and read it as a pandas DataFrame. Once you have the Dataframe, you can convert to CSV, Excel or JSON easily without any issues.

Scaling up to a full crawler for extracting all Indeed job postings

  • Once you scale up to make thousands of requests to fetch all the pages, the indeed.com servers will start blocking your IP address outright or you will be flagged and will start getting CAPTCHA.
  • To make it more likely to successfully fetch data for all USA, you will have to implement:
  • rotating proxy IP addresses preferably using residential proxies.
  • rotate user agents
  • Use an external CAPTCHA solving service like 2captcha or anticaptcha.com

After you follow all the steps above, you will realize that our pricing for managed web scraping or our Indeed scraper API is one of the most competitive in the market.

Originally published at https://www.specrom.com.

--

--

Jay M. Patel
Web Data Extraction

Cofounder/principal data scientist at Specrom Analytics (specrom.com) natural language processing and web crawling/scraping expert. Personal site: JayMPatel.com