Selenium and BeautifulSoup WebScraping: Linkedin job description

Saulo Toledo Pereira · Published in Analytics Vidhya · 9 min read · Aug 22, 2021

Project link: https://github.com/saulotp/linkedin-job-description-scrap

This article was updated by

and can be read here.

We’ve worked together to make this project work again. Some parts of this article no longer work, but you can check the updated code on GitHub. For more information, visit the updated article!

An important skill for us, data science students, is getting data, in this case from websites. Also called web scraping, this process uses lines of code that access selected web pages and scrape whatever data you desire.

In this project, we will access the LinkedIn webpage, send our login credentials, access data scientist jobs, scrape the job requirements, and plot a word cloud that shows us which requirements companies request most often. Knowing these requirements, we can focus our studies on the most important topics.

This project will serve both for me to practice my Python skills and to help anyone interested in learning the art of web scraping.

To simplify the scrape process, we will use two python libraries:
- Selenium
- BeautifulSoup

Selenium

Selenium is a powerful Python framework that lets us automate a web browser with just a few lines of code. To install Selenium, run the command below in your terminal:

pip install selenium

Selenium needs a ‘web driver’ to work. Google, for example, ‘selenium chrome web driver download’ (replace chrome with the browser you want to use), download the file, and put it in the same directory as your Python script. In my case I used Google Chrome, but the same process works for other browsers.

Check your browser version before downloading the proper file.

BeautifulSoup

We will use BeautifulSoup to get the data (text, tables, images, etc.) from websites. Command to install:

pip install beautifulsoup4

This tool doesn’t work well on its own when a website relies on JavaScript. That’s why we will use Selenium + BeautifulSoup: the first handles inputs, clicks, and scrolls, and the second gets all the data we want.

Now we are ready to start our code. First, we’ll import the Python libraries that we’ll use:
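The original import snippet was embedded and doesn’t render here, so below is a minimal sketch of the imports used throughout the rest of this walkthrough (pandas, wordcloud, and matplotlib are assumed from the later steps):

# libraries used throughout this walkthrough
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt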

The code below creates our variables. Replace the example values with your login email and password; this step is necessary for Selenium to be able to log in to the website. You can change ‘position’ to make the script search for the job you want (in this case I wrote ‘data scientist’), and for ‘local’ I selected Brazil, as in the example below.
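A minimal sketch of those variables (the values are placeholders; disc_list is the list the scraping loop appends to later):

# login credentials (replace with your own)
email = 'youremail@example.com'
password = 'yourpassword'
# job search parameters
position = 'data scientist'
local = 'Brazil'
# list that will store every scraped job description
disc_list = []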

With our variables ready, we can put the browser to work.

For Selenium to open the browser, we first store the path of the web driver file we downloaded earlier in a variable, then execute the command:

# select web driver file
driver_path = "chromedriver.exe"
# select your browser (Chrome in this case), and it will open
driver = webdriver.Chrome(executable_path=driver_path)

To open a website with Selenium, just write the command:

driver.get('website link')

When I was coding this script, many bugs popped up, and I made changes according to each problem. The first was page load time: the script tried to find the login and password inputs before the page had fully loaded, and stopped with an error saying that the inputs I told it to send my login and password to don’t exist. The line below forces the script to wait x seconds before moving to the next step; in my case, two seconds is enough to completely load the webpage. This problem happened in other parts of the algorithm too, so every time you see the command time.sleep(x), it’s because the same problem happened there.

time.sleep(x)

Another situation that gave me error messages is that Selenium opens the browser windowed at a low resolution. With that, some pages completely change their behavior: changing the layout, hiding texts, buttons, and other elements. The lines below avoid this kind of error:

driver.set_window_size(1024, 600)
driver.maximize_window()

Locating elements with selenium

Documentation here

Finding page elements requires some basic knowledge of HTML structure.

The LinkedIn login page is simple: we have to find the email and password inputs, send the credentials, and click the sign in button.

If we right-click on the email input and select Inspect, a window will appear on the right:

This window shows us the HTML structure that our browser reads to render the page, so we can see the entire page and its functionality.
The highlighted HTML code is our login input:

<input id="username" name="session_key" type="text" aria-describedby="error-for-username" required="" validation="email|tel" class="" autofocus="" aria-label="Email or Phone">

To locate elements with selenium, we have some command options:

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text
  • find_element_by_partial_link_text
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector

In this case, the HTML code shows us the ‘id’ of the login input (highlighted below):

<input id="username" name="session_key" type="text" aria-describedby="error-for-username" required="" validation="email|tel" class="" autofocus="" aria-label="Email or Phone">

Then we can select the login input element using the command:

driver.find_element_by_id('username')

After the input has been selected, we can send the login we stored earlier in the variable ‘email’:

driver.find_element_by_id('username').send_keys(email)

With this command, Selenium will find the input and type your email. The same is done for the password input. After that, just tell Selenium to press ‘return’ on the keyboard and we are done!

driver.find_element_by_id('username').send_keys(Keys.RETURN)
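For completeness, the password step described above would look like the sketch below; the ‘password’ id is an assumption based on inspecting the same login form:

# send the password stored earlier (the 'password' id is assumed)
driver.find_element_by_id('password').send_keys(password)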

After login, we can open the jobs page:

# Opening jobs webpage
driver.get(f"https://www.linkedin.com/jobs/search/?currentJobId=2662929045&geoId=106057199&keywords={position}&location={local}")
# waiting load
time.sleep(2)

Getting data

LinkedIn shows at most 40 pages of jobs, so we can create a loop that will run 40 times. Thus, the script will get the job descriptions from all pages.

# loop that will run 40 times (for 40 pages)
for i in range(1,41):
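The article doesn’t show how to move between result pages; one hedged option, assuming LinkedIn’s pagination buttons carry an aria-label of the form ‘Page N’ (an assumption about the page markup), is to click the button for page i at the start of each iteration:

# runs inside the loop: open result page i (the aria-label selector is an assumption)
driver.find_element_by_xpath(f"//button[@aria-label='Page {i}']").click()
time.sleep(2)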

In this section of the script I found another problem: the number of jobs shown per page varies. Some pages show 25, others 13 or 17; there is no pattern. So, after opening the jobs page, it was first necessary to count how many jobs appear and then run a loop according to that number. Otherwise the script errors out because it tries to find an element that doesn’t exist on the page: for example, if the page shows 20 jobs but our loop runs 25 times, the loop gives an error at step 21 and everything stops.

# here we select the list of jobs being shown
jobs_lists = driver.find_element_by_class_name('jobs-search-results__list')
# and here we select each job so we can count them
jobs = jobs_lists.find_elements_by_class_name('jobs-search-results__list-item')
# thus we can use len(jobs) to find how many jobs are shown on each page

After this, we can create another loop that runs exactly as many times as the job count, so Selenium can click on each job to open its full description. BeautifulSoup will get all the description text, and then we can store all the data in a list.

for job in range(1, len(jobs)+1):
    # job click
    driver.find_element_by_xpath(
        f'/html/body/div[5]/div[3]/div[3]/div[2]/div/section[1]/div/div/ul/li[{job}]/div/div/div[1]/div[2]/div[1]/a').click()
    # select job description
    job_desc = driver.find_element_by_class_name('jobs-search__right-rail')
    # get text
    soup = BeautifulSoup(job_desc.get_attribute('outerHTML'), 'html.parser')
    # add text to list
    disc_list.append(soup.text)

Cleaning data

After the script has gotten all the job descriptions, we will put all the data into a DataFrame and begin cleaning out the useless data we won’t use.
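A minimal sketch of that step, assuming the descriptions are sitting in the disc_list built earlier (the column name is arbitrary):

# put the scraped descriptions into a DataFrame
df = pd.DataFrame(disc_list, columns=['description'])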

In this case, we got jobs in both English and Portuguese, so, using regular expressions, we can delete all text up to a certain string. For example:

[...] "Most companies try to meet expectations, dunnhumby exists to defy them. Using big data, deep expertise and AI-driven platforms to decode the 21st century human experience – then redefine it in meaningful and surprising ways that put customers first. Across digital, mobile and retail. For brands like Tesco, Coca-Cola, Procter & Gamble and PepsiCo.

We’re looking for Senior Applied Data Scientist who expects more from their career. It’s a chance to apply your expertise to distil complex problems into compelling insights using the best of machine learning and human creativity to deliver effective and impactful solutions for clients. Joining our advanced data science team, you’ll investigate, develop, implement and deploy a range of complex applications and components while working alongside super-smart colleagues challenging and rewriting the rules, not just following them.

What We Expect From You:
-Degree in a relevant subject
-Programming skills (Hadoop, Spark, SQL, Python)
-Prototyping
-Statistical Modelling
-Analytical Techniques and Technology
-Quality Assurance and Testing
We won’t just meet your expectations. We’ll defy them. So you’ll enjoy the comprehensive rewards package you’d expect from a leading technology company. But also, a degree of personal flexibility you might not." [...]

With the description above, we want to delete as much useless information as possible, so we look for a word that serves as a trigger, one that signals that from that point on the requirements for the job will be listed. In this case we can see that after the word ‘Expect’, the requirements are shown. Then we can use the following logic: ‘Python, read the job description, and when you see the word “Expect”, delete everything you read before it.’

df = df.replace('(?s)^.*?Expect', 'Expect', regex=True)

Applying this code in our example, the output will be:

Expect From You:
-Degree in a relevant subject
-Programming skills (Hadoop, Spark, SQL, Python)
-Prototyping
-Statistical Modelling
-Analytical Techniques and Technology
-Quality Assurance and Testing
We won’t just meet your expectations. We’ll defy them. So you’ll enjoy the comprehensive rewards package you’d expect from a leading technology company. But also, a degree of personal flexibility you might not." [...]

Much better, right? We cleaned a lot of useless words. But the listed jobs don’t follow a single pattern in their descriptions, so we have to build a list with the most commonly used trigger words, like qualifications, requirements, experience, looking for, and others, as in the sketch below. If you are searching in other languages, you will need to add more words for each language.
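One way to apply that idea is to loop over a list of trigger words and reuse the same regex for each one; the words below are only illustrative and should be adjusted for your searches:

# illustrative trigger words; extend this list for other languages
triggers = ['Expect', 'Qualifications', 'Requirements', 'Experience', 'Requisitos']
for word in triggers:
    # delete everything before each trigger word, keeping the word itself
    df = df.replace(f'(?s)^.*?{word}', word, regex=True)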

Word Cloud

DataFrame cleaned! What can we do now? Our word cloud with the most relevant words.

The code below builds our word cloud. The variable ‘badwords’ holds the leftover words that were difficult to clean in the previous step and aren’t relevant to our final plot.
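The original snippet was embedded as a gist; here is a minimal sketch of it using the wordcloud package, with an illustrative badwords list:

# leftover words we don't want in the plot (illustrative list)
badwords = ['experience', 'work', 'team', 'company', 'knowledge']
stopwords = set(STOPWORDS)
stopwords.update(badwords)
# join every description into one text and build the cloud
text = ' '.join(df['description'].astype(str))
wordcloud = WordCloud(stopwords=stopwords, background_color='white',
                      width=800, height=400).generate(text)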

And finally, we plot our word cloud and export our data to a CSV file.
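A sketch of that final step (the file name is arbitrary):

# plot the word cloud
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
# export the cleaned descriptions to a csv file
df.to_csv('job_descriptions.csv', index=False)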

The resulting word cloud shows us that a data scientist should have skills like:

  • knowledge in machine learning,
  • python,
  • data analysis,
  • development,
  • code,
  • frameworks,
  • business,
  • SQL,
  • AWS,
  • and others.

We did it ! xD

I hope this article was helpful in your scraping journey. If you have any corrections, suggestions, or information about this topic, feel free to contact me:

saulodetp@gmail.com

See ya.

This article was updated by

and can be read here.

We worked together to make this project work again. Some parts of this article no longer work, but you can check the updated code on GitHub. For more information, visit the updated article!

Project link: https://github.com/saulotp/linkedin-job-description-scrap
