Build a Jobs Database using Indeed’s API

Alberto Mendonça E Moura
12 min read · Mar 14, 2018


Automating Your Job Search with Python (Part 1). Next: Building a Web Scraper to improve our Database.

For most of us, humble peasants of the real world, finding a job is an unavoidable catch that comes with adult existence. Part of an endless pursuit for an ever better life. Or, simply put: mere survival.

If you’re new to automation, buckle up. You are in for a treat! Remember those endless hours of job hunting in your younger years? The relentless copying and pasting, the link-clicking marathons, the back-and-forth email frenzy? Let’s make all of this a thing of the past. Once and for all.

This article is part of a web series on Automating Your Job Search with Python.

In this Article:

We will be gathering job offers from the world’s biggest job aggregator website: Indeed.com.

Part 1: Building a Jobs Database using Indeed’s API

Part 2: Building a Web Scraper to improve our Database.

Next in this Web Series:

Automating your full job submission process. (Coming late March, 2018)

But before we begin, a solid word of advice: always read a website’s Terms of Service before automating anything against it. We will come back to this once we start talking to Indeed’s API.

Getting Started

In the first part of the project we will be using the following tools:

  • Python 3
  • BeautifulSoup
  • Requests
  • PyMySQL
  • MySQL

Before we start coding our application you should install Python and the remaining libraries mentioned above. We will be using Python 3 throughout this series. This article assumes that you have MySQL up and running on your machine. If that’s not the case, please follow these installation guidelines.

Installing Python:

  • Windows users should install Python through the official website.
  • For Mac users, a version of Python comes pre-installed with OS X. The best way to manage different Python versions is to install Homebrew. Then open your terminal and type:
$ brew update
$ brew install python3

Now you can run Python 3.5 by invoking python3. If you prefer to set python3 as your default Python version, please follow these instructions.
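Either way, a quick sanity check (my addition, not from the original steps) is to ask for the version from your terminal:

$ python3 --version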

Installing BeautifulSoup:

You should now have pip installed (be sure that you are using the Python 3 version of pip). If pip is not installed please follow this official guide. Now installing BeautifulSoup should be as easy as typing this into your Windows Cmd or Mac terminal:

$ pip install beautifulsoup4
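If the pip on your machine happens to point at Python 2, you can invoke the Python 3 version explicitly instead; this is an equivalent, slightly more verbose form:

$ python3 -m pip install beautifulsoup4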

Installing Requests and PyMySQL libraries:

Same goes for our last tools:

$ pip install requests
$ pip install pymysql

Using an API

It’s time to start fetching our jobs data from the biggest Job Aggregator website on the planet: Indeed.com. But before you head on to site inspection there is always something you should ask yourself first: Do they have an API? And, as it turns out, they do.

Application programming interfaces come in handy: they provide nice, convenient interfaces between multiple disparate applications. (…) They are designed to serve as a lingua franca between different pieces of software that need to share information with each other.

— Ryan Mitchell, Web Scraping with Python (O’Reilly Media)

There are pros and cons to using an API to fetch data. APIs provide a much more stable process for retrieving information than a web scraper: the latter relies on HTML/CSS fields to capture data and will therefore break whenever those front-end labels change. Besides, APIs usually serve their data in a strictly regulated syntax (JSON or XML, rather than HTML). The downsides are the imposed query limitations and the fact that an API might disappear overnight, whenever its owner decides to pull it!
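To make that concrete, here is a simplified, made-up sketch of the kind of result an API like Indeed’s returns for a single job posting. The field names are the ones we will actually use further down; the values are placeholders, and the real response contains more fields:

{
    'jobtitle': 'Python Developer',
    'formattedLocation': 'Austin, TX',
    'snippet': 'We are looking for a Python developer to join our team…',
    'url': 'https://www.indeed.com/viewjob?…',
    'indeedApply': True,
    'jobkey': '…',
    'date': 'Wed, 14 Mar 2018 …'
}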

But right now we’re on a lucky streak! There is even a Python Client Library available for this API. Please take your time to go through the README.md file and install it accordingly.

  • Next, we will be using Indeed’s API to gather job postings information based on search keywords and date.
  • Then we will be populating a MySQL database with the retrieved results.
  • Lastly, in Part 2 of this article, the main web scraping process: we will fetch data not provided through the client’s API and assign it to each job posting in our database.

Jump into the Code

Now let’s get going! We first need to create a Python script with the necessary API requests. This will be a very important code file. Call it indeed_scraper_api.py

#import Indeed Python API module
from indeed import IndeedClient

At this stage we need to be careful. You should read Indeed’s Terms and Conditions and make sure you do not infringe any of their rules. You do not want to have your IP blocked or, even worse, get into some kind of legal trouble.

You should also read Indeed’s API documentation in order to understand the code we will be creating.

Before you start to interact with their API you will need to register as a publisher on their website. They will then provide you with a publisher ID number. You’ll need it to initiate an IndeedClient object. Insert the publisher number that you have received after registration as a single parameter.

client = IndeedClient(publisher = ************3506)

Now we need to define the search parameters that we will be using. Let’s just say you were on the hunt for a Python Developer job in Austin, Texas.

parameters = {
    'q' : "python developer",
    'l' : "Austin, TX",
    'sort' : "date",
    'fromage' : "5",
    'limit' : "25",
    'filter' : "1",
    'userip' : "192.186.176.550:60409",
    'useragent' : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}

Hmm. What is going on here?

If you haven’t read the Indeed Python API documentation, take your time to go through the available parameter definitions. I will explain just a few important values here:

  • ‘l’ : Location. City, State abbreviation. This will be important later on.
  • ‘limit’ : Maximum results returned by query. Our first API limitation! Anyway, we will be using the maximum setting of 25 (default is 10);
  • ‘filter’ : Filter duplicate results. 0 turns off duplicates filtering. Default is 1.
  • ‘useragent’ : The browser User-Agent of the end-user to whom the job results will be displayed. This field is required.

The location parameter is pretty straightforward. Bear in mind its formulation; it will be important later on. The useragent is the web browser description. Most websites use it to customize content for the capabilities of a particular device and its software, and it also comes up in privacy-related matters. More importantly, you may have noticed an enforced query limit. This is a rule you cannot break: Indeed will never return more than 25 results per query.

Time to get back to the code. Let’s now define our main search function and call it get_offers()

#our main search function
def get_offers(params):

We are giving it params (our parameters) as a single argument. They will be needed to perform our search.

Now it’s time to start fetching those job offers! Let’s go!

def get_offers(params):

    #perform search
    search_results = client.search(**params)

In case you’ve forgotten, client is the variable to which we assigned the IndeedClient object, initialized with our publisher number as its required argument. Here we are using its search method and giving it our search parameters: client.search(**params). This call returns an offers dictionary, which we assign to search_results.
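If you want to sanity-check what came back before parsing it, a quick throwaway inspection (my addition, not part of the final script) could look like this:

search_results = client.search(**params)

#the response is a dictionary; our offers live under its 'results' key
print(search_results.keys())
#at most 25 entries, per Indeed's per-query limit
print(len(search_results['results']))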

Moving on, let’s now take care of each offer details. We need to parse each of these offer elements for later use. Remember: we do not need to watch for repeated results. Indeed’s API filter does this for us.

    #loop through each offer element
    for elm in search_results['results']:

        #let's parse the offer
        offer = (elm['jobtitle'],
                 elm['formattedLocation'],
                 elm['snippet'],
                 elm['url'],
                 elm['indeedApply'],
                 elm['jobkey'],
                 elm['date'])

What do we have here? Each offer is itself a dictionary, nested under the parent dict’s ‘results’ key. The individual offer keys are pretty straightforward: title, location, URL, date… But what about the others?

  • ‘snippet’ : initial part of the job description string (trimmed, you will not get the full job description through their API);
  • ‘jobkey’ : the offer internal ID. This can be very useful later on, when you need to reference a specific offer;
  • ‘indeedApply’ : True or False. True if the offer has Indeed Easy Apply, false otherwise. Crucial for the future of this application.

Important: The offers that will be valuable to us are those that come with Indeed Easy Apply. Those where you can submit your proposal directly from Indeed. All the others just forward you to third party job boards or employer company websites.
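We won’t filter anything out in this tutorial, but if you only cared about Easy Apply offers you could skip the rest as early as the parsing loop. A minimal, purely illustrative tweak:

    for elm in search_results['results']:
        #skip offers that can't be submitted directly through Indeed
        if not elm['indeedApply']:
            continue
        #...then parse and store the offer exactly as shown above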

Perfect. What should we do next? All of this would be of no use if we couldn’t store the information somewhere. So it’s time to put our MySQL module to work and design the database.

Designing The Database

Open a new file in your editor and name it database.py

#import the MySQL Python library
import pymysql

#we will be using this function to add our offers to the DB
def addToDatabase(offer):
    #open a database connection
    db = pymysql.connect(host="localhost",
                         user="username",
                         password="my_password",
                         db="indeed",
                         charset="utf8")
    #prepare a cursor object using the cursor() method
    cursor = db.cursor()

OK, so we started by giving our addToDatabase function the offer argument. Makes sense. Then the pymysql.connect method connects to our database using the required arguments. Strictly speaking, charset is not mandatory, but you are strongly advised to set it: defining your character set will avoid errors when writing to your DB or when printing to your console.

Now let’s add the SQL command that writes each offer into the database!

    try:
        #execute SQL command (offer is passed separately so pymysql escapes the values)
        cursor.execute("INSERT INTO indeed(job_title, \
                                           location, \
                                           description, \
                                           offer_url, \
                                           offer_skills, \
                                           indeed_apply, \
                                           job_key, \
                                           proposal, \
                                           proposal_sent, \
                                           date) \
                        VALUES (%s, %s, %s, %s, 'Undefined', \
                                %s, %s, 'None', 'False', %s)", offer)
        #commit to database
        db.commit()
    except:
        #rollback in case there is any error
        db.rollback()
    #close the connection when we are done
    db.close()

If you are intrigued by the odd, capitalized syntax, it’s time to get familiar with SQL commands. Note that we pass offer as a separate argument to cursor.execute() instead of formatting it into the string ourselves; this lets pymysql escape the values, so a stray apostrophe in a job description won’t break the query.

Nevertheless, didn’t our offer have only seven elements? Yes, it did. We changed a couple of column names to make them more intuitive (e.g. ‘snippet’ became ‘description’). Then we added three new columns: offer_skills, proposal and proposal_sent.

And we set those to 'Undefined', 'None' and 'False'. We will need these new fields in the second part of this tutorial.
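One thing the code above takes for granted is that the indeed database and table already exist. Creating them is outside the scope of this article, but a minimal sketch of a matching table (the column types here are my own assumptions, so adjust them to your needs) would look something like this, run from the MySQL client:

CREATE TABLE indeed (
    id INT AUTO_INCREMENT PRIMARY KEY,
    job_title VARCHAR(255),
    location VARCHAR(255),
    description TEXT,
    offer_url VARCHAR(500),
    offer_skills TEXT,
    indeed_apply VARCHAR(10),
    job_key VARCHAR(50),
    proposal TEXT,
    proposal_sent VARCHAR(10),
    date VARCHAR(100)
);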

Alright, time to get back to our indeed_scraper_api.py file and import the pymysql module as well as the addToDatabase function. Let’s not forget to add the function call.

Here is our complete indeed_scraper_api.py file!

#import Indeed Python API module
from indeed import IndeedClient
#import MySQL Python Library
import pymysql
#from our database.py file, import our function
from database import addToDatabase

client = IndeedClient(publisher = ************3506)

parameters = {
    'q' : "python developer",
    'l' : "Austin, TX",
    'sort' : "date",
    'fromage' : "5",
    'limit' : "25",
    'filter' : "1",
    'userip' : "192.186.176.550:60409",
    'useragent' : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}

#our main search function
def get_offers(params):
    #perform search
    search_results = client.search(**params)
    #loop through each offer element
    for elm in search_results['results']:
        #parse the offer
        offer = (elm['jobtitle'],
                 elm['formattedLocation'],
                 elm['snippet'],
                 elm['url'],
                 elm['indeedApply'],
                 elm['jobkey'],
                 elm['date'])
        #add offer to DB (call to our function)
        addToDatabase(offer)

Now it’s time to start writing your job offers into your database. You just need to run your main function:

get_offers(parameters)

Et voilà! There you have it.
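If you want to double-check that the offers actually made it into MySQL, a quick query from the MySQL client (using the table and column names from our code) will do:

SELECT job_title, location, date FROM indeed LIMIT 5;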

Uff! Is this finally complete?

Well, it could be. But why don’t we take this a step further? Imagine that relocation was not an issue for you. You then wanted to follow all Python developer jobs in the United States!

Going a Step Further:

Well, if you wanted to get all jobs posted for Python Developer in the US you would have two issues:

  • Query Limit: Indeed is only going to provide a maximum of 25 results per query;
  • Location parameter: Indeed requires you to specify a single city per search query.

Well, would you search every American city individually? And be content with the limited results you got? Of course not! We can do better than that!

Let’s think about this one for a second. Could we simply send query after query until we have all the results we need? It’s a valid idea, and you could certainly try it. Unfortunately, you would soon run into another rule under the hood: overall query limits. Time for a reality check, my good friend: APIs are not created to just hand out whatever you want. They exist to provide what their creators want.

Well, well, well… are we stuck then?

No. Not at all. Look closely at the search parameters we defined at the beginning of our tutorial. Is there something that we could use to our advantage? Let’s open one of Indeed’s city offers page and see if we can find some hints.

San Francisco Indeed Offers Page

Hmm. A job offers column in the center, an email subscription form on the right, and a left column with job search options (a.k.a. parameters). Also, some very useful sorting options at the top.

Well, how are they sorting this job list? Let’s scroll down and find out.

Exactly. They are sorting it by relevance. You would need to click the date option to get the jobs list sorted by date posted.

As you’ve probably noticed, we have this same option in our parameters dictionary. If we send a query with the sort parameter set to date ('sort' : "date"), what will we get? That’s right: the last 25 job offers posted on that city’s page.

Here goes a simple, yet totally legal idea: what if we queried each city page n times per day in order to fetch all offers posted within those 24 hours? Can we do that?

Yes, we can.

Sadly, we don’t have enough time to implement this feature in this tutorial. If you do your research, you will be able to create an automated solution for running your script multiple times a day.
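If you want a head start: on macOS or Linux, cron is the simplest way to run a script on a schedule. A hypothetical crontab entry that runs the script every six hours might look like this (the paths are placeholders, and it assumes the script calls get_offers() when executed):

0 */6 * * * /usr/bin/python3 /path/to/indeed_scraper_api.py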

Well, how about searching all offers in the country?

If you browse Indeed closely, you will find that there is a page that lists all the US states.

Indeed Browse Jobs Page

Clicking each state link takes you to another page listing the main cities within that state. Each city, in turn, links to its own job offers page.

That said, let’s start by scraping those state URLs. Our first taste of web scraping! Create a Python file and call it indeed_get_cities.py

#we import our modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

#we define the variables for our URL sources
BASE_URL = "http://www.indeed.com"
states_URL = "http://www.indeed.com/find-jobs.jsp"

#the lists that will contain our States and Cities
states_URL_list = []
cities_name_list = []

#our function for fetching and creating the states list
def getStateLinks(states_URL):
    #get the page source from the states URL page
    html = urlopen(states_URL).read()
    #create the BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")
    #we use a Bs method to find the link section
    states_page = soup.find_all(id="states")
    #then a loop to get all state links
    #and we construct each state link using the relative ones
    for states in states_page:
        links = states.findAll('a')
        for a in links:
            if a.text != '':
                states_URL_list.append(BASE_URL + a['href'])
    #finally we return the States URL list
    return states_URL_list

Now we have a function that returns all the state page URLs. Inside each of those pages, you will find the links to that state’s cities.

Here’s how we are going to find each city name and the state it belongs to. If you inspect each city link, you will see that the city name and state abbreviation are both part of the relative URL string. We just need to grab all those URL strings and extract the two elements.

#the function to get those CITY, STATE names to use in our API
def getCityNames():
    #get the states URL list
    states_URL_list = getStateLinks(states_URL)
    #loop through all state pages
    for page in states_URL_list:
        html = urlopen(page).read()
        soup = BeautifulSoup(html, "lxml")
        #use Bs to find the cities' relevant HTML elements
        cities_page = soup.find_all('p', attrs={'class':'city'})
        #loop through each element to get the city URL
        for p in cities_page:
            links = p.findAll('a')
            #open a txt file to store city names
            f = open('cities', 'a', encoding='utf8')
            #get each city state link
            for a in links:
                city_state = a['href']
                #some links already carry the CITY, STATE text we want
                if city_state[:5] == '/jobs' or '%' in city_state:
                    f.write(a.text + '\n')
                #otherwise parse CITY, STATE from the URL string and format as needed
                else:
                    city_state = city_state.lstrip('/l-').replace('-', ' ').split(',')
                    city = city_state[0]
                    state_raw = city_state[1]
                    state = ''
                    #keep only the uppercase characters (the state abbreviation)
                    for char in state_raw:
                        if char.isupper():
                            state += char
                    #join CITY, STATE abbreviation strings
                    location = city + ', ' + state
                    #write them to the file
                    f.write(location + '\n')
            #close the file
            f.close()

#build the cities file
getCityNames()
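To see what that string surgery does in isolation, here is a tiny standalone example with a hypothetical href value (the real hrefs on Indeed may look slightly different):

#hypothetical relative URL for a city page
href = "/l-San-Francisco,-CA.html"
city_state = href.lstrip('/l-').replace('-', ' ').split(',')
city = city_state[0]                                   #"San Francisco"
state_raw = city_state[1]                              #" CA.html"
state = ''.join(c for c in state_raw if c.isupper())   #"CA"
print(city + ', ' + state)                             #prints "San Francisco, CA"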

Run the code and you should now have a text file with all the US cities (and their states) you need! Not that hard, right?

Now let’s go back to our indeed_scraper_api.py file and add our final function.

def searchAllCities():
    #city counter
    current_city = 0
    #open cities text file
    with open('cities', 'r', encoding='utf8') as myfile:
        #get all cities into a list (dropping any empty lines)
        locations = [line for line in myfile.read().split('\n') if line]
    #get total cities number
    city_number = len(locations)
    #loop through all cities
    while current_city < city_number:
        #define city search location
        parameters['l'] = locations[current_city]
        #run main search function and get offers
        get_offers(parameters)
        #update city counter
        current_city += 1

There you have it! Now you just need to run searchAllCities() and watch your database fill up with job offers!
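One last, optional tweak of my own (not part of the original script): the cities file will hold hundreds of locations, so you may want to be gentle with the API and pause briefly between queries.

#at the top of the file
import time

#inside the while loop, right after the get_offers(parameters) call
time.sleep(1)   #pause for about a second between queries; adjust to taste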

Coming Next:

In part 2 of this tutorial I will show you how to improve this job offers database with the help of a web scraper. We will build upon this small sample and get you up and running with Python web scraping! (Coming Soon)
