How to Build a Job Scraper Using Python

Faith Ojeabulu
Published in The Startup
5 min read · Sep 19, 2020

Building a job scraper was fairly technical and took some brain work. With a little knowledge of HTML, CSS, and Python, it took me about 3 to 5 days to get the code running well. I never knew I could build something like this until I took the first step, with the help of my code coach. The job scraper collects the following from the website and stores it in an Excel sheet:

1) Job title
2) Link that redirects you to apply for the job
3) Function
4) Timestamp
5) Salary
6) Location, etc.

CHOOSE AN EDITOR
The first thing to do is choose an editor you are comfortable with; I used VS Code and Sublime Text. Then create a folder for the project in the editor and save your script in it (for example, example.py).

REVIEW THE WEBSITE
Review the website you are about to scrape and write down the content you would like to scrape, as I listed above. To view the HTML behind the website, right-click on any empty space on the page and click "Inspect" in the menu that appears. As you scroll through the HTML, you will notice that clicking on any line highlights the corresponding element on the page.

DOWNLOAD LIBRARIES
The first thing your code needs in order to run is a set of Python libraries: Beautiful Soup, pandas, and requests. Check YouTube for how to install them from your command prompt (typically with pip: pip install beautifulsoup4 requests pandas), so that Python can import them and scrape the website.

APPLYING THE LIBRARY TO THE CODE

I'm going to explain what is required and what each part of the code does. The project is broken down into 6 sections for better understanding:

1) The first section imports all the Python libraries the code needs. These are third-party packages (requests, Beautiful Soup, and pandas) that provide functionality, such as downloading and parsing web pages, that is not built into Python itself.
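
As a minimal sketch, the import section of a scraper like this might look as follows. The pd alias and the exact set of imports are assumptions, since the original screenshot is not reproduced here; re is included because a regex is used later to clean up white space.

```python
# Third-party libraries (installed with pip, not part of the standard library)
import requests                # downloads the page over HTTP
from bs4 import BeautifulSoup  # parses the downloaded HTML
import pandas as pd            # tabulates the results and writes the CSV file

# Standard library
import re                      # regular expressions, used for cleaning white space
```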

2) The second section takes the link of the website you want to scrape, saves it in a variable, and can print out the page's HTML in the editor. This works because the requests library fetches the HTML from the web server, and Beautiful Soup helps pull data out of that HTML.
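
A rough sketch of this step, under some assumptions: fetch_soup is a hypothetical helper name, the timeout value is arbitrary, and a small hard-coded HTML string stands in for a live page so the example runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    """Download a page and return it as a parsed BeautifulSoup object."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")

# Parsing works the same way on any HTML string,
# so a tiny sample is used here instead of a real website.
sample_html = "<html><body><h1>Jobs</h1></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
heading = soup.find("h1").get_text()
print(heading)
```

For a real site you would call fetch_soup with the URL of the listings page you reviewed earlier.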

3) The next part of the code gets the container (in my project the main container is 'search-result') holding all the content I wish to scrape from the website, such as the job title, location, salary, timestamp, etc.
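
One way this step could look, assuming each job listing sits in its own element with the class 'search-result'. The sample markup below is made up for illustration; the real site's structure will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="search-result"><h3 class="search-result__job-title">Artist</h3></div>
<div class="search-result"><h3 class="search-result__job-title">Musician</h3></div>
<div class="search-result"><h3 class="search-result__job-title">Teacher</h3></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every element carrying the given class, one per job listing.
items = soup.find_all(class_="search-result")
print(len(items))
```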

4) In the fourth part of the project, a single sub-container from the main container is examined. It has index 0 because it is the first item in the container. The first line of this part finds and prints the job title from that sub-container in the main container "search-result", and the same is done for the other fields in the sub-container.
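
A small sketch of inspecting the first sub-container. Only the class 'search-result__job-title' comes from my project; the 'search-result__location' class, the link paths, and the sample values are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="search-result">
  <a class="search-result__job-title" href="/jobs/artist"> Artist </a>
  <span class="search-result__location"> Lagos </span>
</div>
<div class="search-result">
  <a class="search-result__job-title" href="/jobs/musician"> Musician </a>
  <span class="search-result__location"> Abuja </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find_all(class_="search-result")

# Index 0 is the first job listing on the page.
first = items[0]
title_tag = first.find(class_="search-result__job-title")
title = title_tag.get_text().strip()  # text inside the tag, without padding
link = title_tag["href"]              # where the job title points to
location = first.find(class_="search-result__location").get_text().strip()

print(title, link, location)
```

Getting one listing right first makes it easier to spot a wrong class name before looping over the whole page.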

5) After getting all the required items for the first sub-container, the next step is to get the same items from every sub-container in the main container. You can do this using a for loop, a while loop, or a list comprehension. This list comprehension loops through each sub-container, gets the job title, and puts the results in a list (for example: artist, musician, teacher).

titles = [item.find(class_='search-result__job-title').get_text().strip() for item in items]

The line above finds every job title in the main container ("search-result") and collects them in a list, and the same is done for the other fields, such as salaries and locations.
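
To see the list comprehension run end to end, here is a self-contained sketch. The sample markup and the 'search-result__location' class are assumptions; the titles comprehension is the same line as above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="search-result">
  <h3 class="search-result__job-title"> Artist </h3>
  <span class="search-result__location"> Lagos </span>
</div>
<div class="search-result">
  <h3 class="search-result__job-title"> Musician </h3>
  <span class="search-result__location"> Abuja </span>
</div>
<div class="search-result">
  <h3 class="search-result__job-title"> Teacher </h3>
  <span class="search-result__location"> Ibadan </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find_all(class_="search-result")

# One list per field, built by visiting every job listing in turn.
titles = [item.find(class_="search-result__job-title").get_text().strip() for item in items]
locations = [item.find(class_="search-result__location").get_text().strip() for item in items]

print(titles)
print(locations)
```

Note that find returns None when a listing lacks a field, so on a real page a missing salary or location would raise an AttributeError here and needs a check.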

6) The last step is to put the lists in a dictionary and store it in a variable, which is then converted to a CSV file that can be opened in Excel (to display it in a readable format).
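
A minimal sketch of this last step using pandas. The column names, the sample values, and the file name jobs.csv are all made up for illustration.

```python
import pandas as pd

# Made-up sample data; in the scraper these lists come from the previous step.
jobs = {
    "title": ["Artist", "Musician", "Teacher"],
    "location": ["Lagos", "Abuja", "Ibadan"],
    "salary": ["Not specified", "Not specified", "Not specified"],
}

# Each dictionary key becomes a column and each list becomes that column's values.
df = pd.DataFrame(jobs)

# index=False leaves out the row numbers so the sheet has only the job columns.
df.to_csv("jobs.csv", index=False)
print(df)
```

All lists in the dictionary must have the same length, otherwise pandas raises a ValueError when building the table.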

MAJOR PROBLEM ENCOUNTERED
One of the major problems I encountered was that the code printed out excess white space, which made the output unreadable. I used the .strip() method to remove the leading and trailing space, and a regex to replace runs of multiple white spaces with a single space.
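
The cleanup can be sketched like this. clean_text is a hypothetical helper name, and the raw string imitates the kind of padded text a scraped page often returns.

```python
import re

def clean_text(text):
    """Collapse runs of white space into one space and trim both ends."""
    # \s+ matches one or more spaces, tabs, or newlines in a row.
    return re.sub(r"\s+", " ", text).strip()

raw = "\n   Senior   Data\n      Analyst  \n"
cleaned = clean_text(raw)
print(cleaned)  # Senior Data Analyst
```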

IMAGE

POINTS TO NOTE

1) You can use the prettify() method to print out your HTML in a readable form, as shown in fig 2 line 10.
IMAGE
2) Ensure you install and import the required Python libraries (pandas, Beautiful Soup, etc.).
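
For the first point, a short sketch of prettify() on a made-up one-line HTML string; the markup is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Everything on one line, as raw HTML often arrives.
html = "<div class='search-result'><h3>Teacher</h3></div>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the document as a string with one tag per line,
# indented to show which tags are nested inside which.
pretty = soup.prettify()
print(pretty)
```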
