Selenium and BeautifulSoup Web Scraping: LinkedIn Job Descriptions — Updated.

Gabriella Ramos
10 min read · Feb 28, 2023

By definition, web scraping is the process of using code to access selected web pages and extract whatever data you need.

Updated Project link: https://github.com/Gabbyroba/webscrap-linkedin-jobs

This is an updated version of the original article; you can read the original by clicking here.

In this project, we will access the LinkedIn webpage, send login credentials, open the search results for the desired job position, scrape the job requirements, and plot a word cloud that shows us which requirements companies request most. Knowing these requirements, our studies can be focused on the most important topics.

To simplify the scraping process, we will use two Python libraries:
- Selenium
- BeautifulSoup

Another important note: if you’re using Python 3.11, you’ll need to work with an earlier version (3.10.5) for this project, since the wordcloud library doesn’t seem to support the most recent version yet.

You’ll also need to install Microsoft Visual Studio version 14 or newer, depending on which is available for your system.

You can download it by clicking here.

Selenium

Selenium is a powerful Python framework that lets us automate a web browser with just a few lines of code. To install Selenium, run the command below in your terminal:

pip install selenium

Selenium needs a ‘web driver’ to work. Search, for example, for ‘selenium chrome web driver download’ (replace chrome with the browser you want to use), download the file, and put it in the same directory as your Python script. In my case, I used Google Chrome, but you can follow the same process with other browsers.

Check your browser version before downloading the proper file.

BeautifulSoup

We will use BeautifulSoup to get the data (text, tables, images, etc.) from websites. Command to install:

pip install beautifulsoup4
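The script later relies on pandas, matplotlib, and wordcloud as well; assuming you install them from PyPI under their usual package names, one extra command covers them:

pip install pandas matplotlib wordcloud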

BeautifulSoup alone doesn’t work well when a website relies on JavaScript. That’s why we will combine Selenium and BeautifulSoup: the first will handle inputs, clicks, and scrolls, and the second will extract the data we want.

Now we are ready to start our code. First, we’ll import the Python libraries that we’ll use:
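The import block itself isn’t reproduced in the text, so here is a minimal sketch covering everything the snippets below rely on:

# browser automation
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# HTML parsing
from bs4 import BeautifulSoup
# dataframe handling and waits
import pandas as pd
import time
# word cloud and plotting
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt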

The code below creates our variables. Replace the examples with your login email and password; this step is necessary for Selenium to be able to sign in to the website. You can change ‘position’, which determines the job the script will search for (in this case I wrote ‘data scientist’), and for ‘local’ I selected Brazil, as in the example below.
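The variable values aren’t shown here, so this is a sketch with placeholder credentials; ‘disc_list’, which the scraping loop appends to later, is also initialized at this point:

# login credentials (placeholders; replace with your own)
email = 'your_email@example.com'
password = 'your_password'
# job search parameters
position = 'data scientist'
local = 'Brazil'
# list that will receive the scraped job descriptions
disc_list = []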

With our variables ready, we can start to put the browser to work:

For Selenium to open the browser, we first store the path of the driver file we downloaded earlier in a variable, then create the driver:

# select web driver file
driver_path = "chromedriver.exe"
# select your browser (Chrome in this case), and it will open
driver = webdriver.Chrome(executable_path=driver_path)
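A note in case the line above fails for you: Selenium 4.10 and newer removed the executable_path argument, so on recent versions the driver path goes through a Service object instead; a sketch:

from selenium.webdriver.chrome.service import Service

# newer Selenium releases take the driver path via a Service object
driver = webdriver.Chrome(service=Service(driver_path))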

To open a website with Selenium, just run the command:

driver.get('website link')

When I was coding this script, a lot of bugs popped up while running, and I made changes according to each problem. The first of them was page load time: the script tries to find the login and password inputs before the page has fully loaded. In this situation, the script stops with an error saying that the inputs I told it to send my login and password to don’t exist. The code below forces the script to wait x seconds before moving to the next step. In my case, two seconds is enough for the webpage to load completely. This problem happened in other parts of the algorithm, so every time you see the command time.sleep(x), it’s because the same problem happened there.

Keep in mind that the waiting time may differ depending on your internet speed, so if the error keeps happening even with the command, you can adjust the time and test again.

time.sleep(x)
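If a fixed sleep still isn’t reliable, an alternative not used in the original script is Selenium’s explicit waits, which poll the page until the element actually exists:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the username input to show up before moving on
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'username')))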

Another situation that caused error messages is that Selenium opens the browser window at a low resolution. Some pages completely change their behavior at small sizes, altering the layout and hiding texts, buttons, and other elements. The lines below avoid this kind of error:

driver.set_window_size(1024, 600)
driver.maximize_window()

Locating elements with Selenium

Documentation here

To find page elements, some basic knowledge of HTML structure is necessary.

The LinkedIn login page is simple: we have to find the email and password inputs, send the credentials, and click the sign-in button.

If we right-click on the email input and select Inspect, a window will appear on the right:

This window shows us the HTML structure that the browser reads, so we can see the entire page and its functionality.
The highlighted HTML code is our login input:

<input id="username" name="session_key" type="text" aria-describedby="error-for-username" required="" validation="email|tel" class="" autofocus="" aria-label="Email or Phone">

To locate elements with selenium, we have some command options:

find_element(By.ID, "id")
find_element(By.NAME, "name")
find_element(By.XPATH, "xpath")
find_element(By.LINK_TEXT, "link text")
find_element(By.PARTIAL_LINK_TEXT, "partial link text")
find_element(By.TAG_NAME, "tag name")
find_element(By.CLASS_NAME, "class name")
find_element(By.CSS_SELECTOR, "css selector")

In this case, the HTML code shows us the ‘id’ of the login input (highlighted below):

<input id="username" name="session_key" type="text" aria-describedby="error-for-username" required="" validation="email|tel" class="" autofocus="" aria-label="Email or Phone">

Then we can select the login input element using the command:

driver.find_element(By.ID,'username')

Once the input is selected, we can send the login we stored earlier in the variable ‘email’:

driver.find_element(By.ID,'username').send_keys(email)

With this command, Selenium will find the input and type your email. The same is done for the password input. After that, we just tell Selenium to press ‘Return’ on the keyboard and we are done!

driver.find_element(By.ID,'password').send_keys(Keys.RETURN)
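Putting the login steps together, a sketch of the whole sequence (the login URL and two-second waits are assumptions on my part):

# open the LinkedIn login page (assumed URL)
driver.get('https://www.linkedin.com/login')
time.sleep(2)
# send the credentials stored in our variables
driver.find_element(By.ID, 'username').send_keys(email)
driver.find_element(By.ID, 'password').send_keys(password)
# press Enter to sign in
driver.find_element(By.ID, 'password').send_keys(Keys.RETURN)
time.sleep(2)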

After login, we can open the jobs page:

# Opening jobs webpage
driver.get(f"https://www.linkedin.com/jobs/search/?currentJobId=2662929045&geoId=106057199&keywords={position}&location={local}")
# waiting load
time.sleep(x)

Getting data

LinkedIn shows results in a pattern of 40 pages of jobs, so we can create a loop that runs 40 times. That way, the script will get the job descriptions from all pages.

# loop that will run 40 times (for 40 pages)
for i in range(1,41):

In this section of the script I found another problem: the number of jobs shown per page varies. Some pages show 25, others 13 or 17, and there is no pattern. So, after opening a jobs page, it was first necessary to count how many jobs appeared and then run a loop over exactly that number. Otherwise, the script would fail by trying to find an element that doesn’t exist on the page: for example, if the page shows 20 jobs but our loop runs 25 times, the loop would reach step 21, raise an error, and stop everything.

# click the button to change to jobs page i
driver.find_element(By.XPATH, f'//button[@aria-label="Page {i}"]').click()
time.sleep(4)

# each page shows a different number of jobs, sometimes 25, others 13 or 21 ¯\_(ツ)_/¯
jobs_lists = driver.find_element(By.CLASS_NAME,
    'scaffold-layout__list-container')  # here we select the job list container

jobs = jobs_lists.find_elements(By.CLASS_NAME,
    'jobs-search-results__list-item')  # here we select each job so we can count them

After this, we can create another loop that runs exactly as many times as the job count, and Selenium can click on each job to open its description. BeautifulSoup will get all the description text, and then we can store all the data in a list.

Another problem happened when I noticed the LinkedIn layout had changed and was blocking the reading of certain elements, so it was necessary to add a command to also click the job’s URL.

# waiting load
time.sleep(4)
# the loop below makes the algorithm click exactly the number of jobs shown in the list,
# in order to avoid errors that would stop the automation

for job in range(1, len(jobs)+1):

    # job click
    driver.find_element(By.XPATH, f'/html/body/div[5]/div[3]/div[4]/div/div/main/div/section[1]/div/ul/li[{job}]').click()
    time.sleep(3)
    # click the job URL too (the new layout hides some elements otherwise)
    driver.find_element(By.XPATH, f'/html/body/div[5]/div[3]/div[4]/div/div/main/div/section[1]/div/ul/li[{job}]/div/div[1]/div[1]/div[2]/div[1]/a').click()
    # waiting load
    time.sleep(3)
    # select job description
    job_desc = driver.find_element(By.ID, 'job-details')
    # get text
    soup = BeautifulSoup(job_desc.get_attribute('outerHTML'), 'html.parser')
    # add text to list
    disc_list.append(soup.text)

Cleaning data

After the script has collected all the job descriptions, we will put the data into a DataFrame and begin ‘cleaning’ out the useless data we won’t use.

In this scenario, we’ve got jobs in English and Portuguese, so, using regular expressions, we can select all text up to a certain string, for example:

[...] "Most companies try to meet expectations, dunnhumby exists to defy them. Using big data, deep expertise and AI-driven platforms to decode the 21st century human experience – then redefine it in meaningful and surprising ways that put customers first. Across digital, mobile and retail. For brands like Tesco, Coca-Cola, Procter & Gamble and PepsiCo.
We’re looking for Senior Applied Data Scientist who expects more from their career. It’s a chance to apply your expertise to distil complex problems into compelling insights using the best of machine learning and human creativity to deliver effective and impactful solutions for clients. Joining our advanced data science team, you’ll investigate, develop, implement and deploy a range of complex applications and components while working alongside super-smart colleagues challenging and rewriting the rules, not just following them.What We Expect From You:
-Degree in a relevant subject
-Programming skills (Hadoop, Spark, SQL, Python)
-Prototyping
-Statistical Modelling
-Analytical Techniques and Technology
-Quality Assurance and TestingWe won’t just meet your expectations. We’ll defy them. So you’ll enjoy the comprehensive rewards package you’d expect from a leading technology company. But also, a degree of personal flexibility you might not." [...]

With the description above, we want to delete as much useless information as possible, so we can find a word that serves as a trigger, indicating that from that point on the requirements for the job will be listed. In this case, we can see that after the word ‘Expect’, the requirements are shown.

So we can create a DataFrame from our list of descriptions, build a list of trigger words, and then create a loop that searches for each of those words and replaces everything up to it with an empty string, thus deleting everything we won’t need to use.

# Creating a DataFrame from the list
df = pd.DataFrame(disc_list)

# Setting the trigger words
word_list = ['Expect', 'Qualifications', 'Required', 'expected', 'Responsibilities', 'Requisitos', 'Requirements', 'Qualificações', 'QualificationsRequired1', 'você deve ter:', 'experiência', 'você:',
    'Desejável', 'great', 'Looking For', 'll Need', 'Conhecimento', 'se:', 'habilidades', 'se:', 'REQUISITOS']

# deleting line breaks
df = df.replace('\n', '', regex=True)

# deleting everything before each trigger word
for i in range(0, len(word_list)):
    df = df.replace(f'^.*?{word_list[i]}', '', regex=True)

Applying this code in our example, the output will be:

Expect From You:
-Degree in a relevant subject
-Programming skills (Hadoop, Spark, SQL, Python)
-Prototyping
-Statistical Modelling
-Analytical Techniques and Technology
-Quality Assurance and TestingWe won’t just meet your expectations. We’ll defy them. So you’ll enjoy the comprehensive rewards package you’d expect from a leading technology company. But also, a degree of personal flexibility you might not." [...]

Much better, right? We cleaned out a lot of useless words. But the listed jobs don’t follow a pattern in their descriptions, so we have to build a list with the most common trigger words, like qualifications, requirements, experience, looking for, and others. If you are searching in other languages, you will need to add more words for each language.

Word Cloud

DataFrame cleaned! What can we do now? Build our word cloud with the most relevant words.

The code below will build our word cloud. The variable ‘badwords’ holds the remaining words that were difficult to clean in the previous step and aren’t relevant to our final plot.

# setup wordcloud
stopwords = set(STOPWORDS)
# selecting useless words
badwords = {'gender', 'experience', 'application', 'Apply', 'salary', 'todos', 'os', 'company', 'identity', 'sexual', 'orientation',
'de', 'orientação', 'sexual', 'gênero', 'committed', 'toda', 'client', 'conhecimento',
'world', 'year', 'save', 'São', 'Paulo', 'information', 'e', 'orientação', 'sexual', 'equal', 'oppotunity', 'ambiente', 'will',
'Experiência', 'national origin', 'todas', 'work', 'de', 'da', 'years', 'pessoa', 'clients', 'Plano', 'creating',
'employer', 'saúde', 'em', 'working', 'pessoas', 'mais', 'data', 'people', 'dia', 'one', 'knowledges', 'plataforma',
'ou', 'benefício', 'para', 'software', 'opportunity', 'tecnologia', 'você', 'mais', 'solution', 'national', 'origin',
'trabalhar', 'option', 'negócio', 'empresa', 'o', 'sicence', 'team', 'é', 'veteran', 'status', 'etc', 'raça', 'cor', 'belive',
'nossa', 'uma', 'como', 'Scientist', 'ferramenta', 'projeto', 'que', 'job', 'benefícios', 'knowledge', 'toll', 's', 'modelo',
'desconto', 'cultura', 'serviço', 'time', 'se', 'solutions', 'mercado', 'das', 'somos', 'problema', 'mundo', 'race', 'color',
'vaga', 'pelo', 'ser', 'show', 'Seguro', 'Se', 'um', 'Um', 'tool', 'regard', 'without', 'make', 'ao', 'técnica', 'life',
'interested', 'diversidade', 'proud', 'ability', 'sobre', 'options', 'using', 'área', 'nosso', 'na', 'seu', 'product', 'produto',
'building', 'skill', 'model', 'religion', 'Share', 'receive', 'consideration', 'Aqui', 'vida', 'ferramentas', 'Vale', 'Refeição',
'Strong', 'Pay', 'range', 'available', 'part', 'trabalho', 'Alimentação', 'employment', 'qualified', 'applicants', 'gympass',
'está', 'comprometida', 'forma', 'Transporte', 'Yes', 'gente', 'melhor', 'lugar', 'believe', 'moment', 'próximo', 'deasafio',
'dos', 'oportunidade', 'idade', 'new', 'Try', 'Premium', 'deficiência', 'sempre', 'criar', 'employee', 'problemas', 'unavailable',
'Brasil', 'dado', 'hiring', 'trends', 'equipe', 'recent', 'temos', 'build', 'career', 'nós', 'diferencial', 'ma',
'total', 'oferecemos', 'contato', 'tem', 'não', 'free', 'Full','global','computer','crossover','great','plus','customer','sua','including',
'per week','help','including','check','grow','day','Cliente','Customer','processo','position','email','e-mail','platform','application'}
# deleting the useless words on plot
stopwords.update(badwords)

# plot parameters
wordcloud = WordCloud(background_color='black',
    width=1600, height=800,
    stopwords=stopwords,
    max_words=100,
    max_font_size=250,
    random_state=42).generate(" ".join(df[0]))

# Plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.savefig('wordcloud-job.png', dpi=300)

# exporting our dataframe to a csv file
df.to_csv('wordcloud-job.csv', sep=';')

And finally, we plot our word cloud and export our data to a CSV file. I’ve also added a command to save the image as a PNG automatically, but that’s optional.

The image above shows us that the selected job position (data scientist) calls for skills like:

  • knowledge in machine learning,
  • python,
  • data analysis,
  • development,
  • code,
  • frameworks,
  • business,
  • SQL,
  • AWS,
  • and others.

We did it! 😄

I hope this article was helpful in your scraping journey. If you have any corrections, suggestions, or information about this topic, feel free to contact me:

gabbyramosbr2@gmail.com

Thanks for reading! 😊

Credits and references go to the creator of the original project.

Updated Project link: https://github.com/Gabbyroba/webscrap-linkedin-jobs

Original Project link: https://github.com/saulotp/linkedin-job-description-scrap


Gabriella Ramos

TOTVS Intern, nerd about all subjects involving technology.