Web Scraping Indeed.fi for Key Job Skills in Finland
As many of you probably know, being a data scientist requires a large skill set.
To master all of that at a high level would probably take a lifetime!
So which of these skills are employers in Finland actually looking for? (I currently live in Helsinki.)
To answer that question, I am going to scrape data science job postings from Indeed.fi and see which skills employers want the most. Python or R? Are they interested in Spark yet? How dominant are NoSQL databases? Are companies using proprietary software like SAS, or do they prefer open source now? To make it even better, I will write the program so that it gives a detailed breakdown by city and by job title.
Program Set Up
The basic workflow of the program is:
- Enter the city and the job title whose skills we want to look up on Indeed.fi (the search term goes in quotes so that it is an exact match)
- See the list of job postings displayed by the website
- Access the link to each job posting
- Scrape all of the content in the job posting
- Filter it to only include words
- Reduce the words to a set so that each word is only counted once
- Keep a running total of the words and see how often a job posting included them
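The last three steps of the workflow can be sketched in a few lines. This is only an illustration, written for Python 3 and run on two made-up posting texts; the helper names `words_in_posting` and `posting_frequency` are hypothetical and are not taken from the notebook.

```python
import re
from collections import Counter

def words_in_posting(text):
    """Return the set of lowercase word tokens in one posting's text."""
    # Keep letters plus '+' and '#' so tokens like "c++" or "c#" survive.
    # ASCII-only: characters such as ä or ö act as separators here.
    tokens = re.findall(r"[a-z+#]+", text.lower())
    return set(tokens)

def posting_frequency(postings):
    """Count, for every word, how many postings mention it at least once."""
    totals = Counter()
    for text in postings:
        # Updating with a set adds at most 1 per word per posting.
        totals.update(words_in_posting(text))
    return totals

# Two invented postings, for illustration only.
postings = [
    "Python, Python and more Python. SQL is a plus.",
    "We use R and SQL; Spark experience is welcome.",
]
totals = posting_frequency(postings)

# Share of postings (in percent) that mention each word.
share = {word: 100.0 * n / len(postings) for word, n in totals.items()}
print(totals["python"], totals["sql"], share["spark"])  # prints: 1 2 50.0
```

Reducing each posting to a set first is what stops a posting that repeats "Python" three times from outweighing one that mentions it once.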
The program is written in Python 2.7 using the Jupyter Notebook.
I will create two functions:
- The first will scrape the HTML of an individual job posting, clean it up to keep only the words, then output the final list of words.
- The second will manage which URLs to access by following the job posting links on Indeed's website, count the required skills and plot the skill frequency as the output.
Import the necessary libraries
In this post, I will use the urllib2 library to connect to the websites, the BeautifulSoup library to scrape the page content, the re library for parsing the words and filtering out other markup based on regular expressions, and pandas to manage and plot the final results.
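import re  # regular expressions for filtering the words
import urllib2  # Python 2 library for connecting to the websites
from bs4 import BeautifulSoup  # parsing the page content
from time import sleep
from collections import Counter
from nltk.corpus import stopwords  # needs the nltk stopwords corpus
import pandas as pd  # managing and plotting the final results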
First Function: text_cleaner(website):
This function will be called every time we access a new job posting. Its input is a URL for a website, while the output will be a final set of words collected from that website.
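As a rough idea of what such a function can look like, here is a minimal sketch. It is not the notebook's code, and it makes several assumptions: it is written for Python 3 (so `urllib.request` replaces `urllib2`), it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages, and `STOP_WORDS` is a tiny stand-in for the nltk stop-word list. The helper `clean_html` and the class `VisibleText` are hypothetical names.

```python
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Tiny stand-in for nltk's English stop words (assumption, illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "with", "for", "we"}

class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def clean_html(html):
    """Turn an HTML string into a set of lowercase words without stop words."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    words = re.findall(r"[a-z+#]+", text)  # letters plus '+' and '#'
    return set(words) - STOP_WORDS

def text_cleaner(website, timeout=10):
    """Fetch a URL and return its word set; empty set if the fetch fails."""
    request = Request(website, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(request, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="ignore")
    except OSError:  # covers URLError and socket timeouts
        return set()
    return clean_html(html)

# Offline check on a made-up page, so no network access is needed.
sample = ("<html><head><style>p{color:red}</style><script>var x=1;</script>"
          "</head><body><p>We need Python and SQL.</p>"
          "<p>Python is a must!</p></body></html>")
print(sorted(clean_html(sample)))  # prints: ['must', 'need', 'python', 'sql']
```

Splitting the fetch from the cleaning keeps the cleaning step testable on a plain string, and returning an empty set on a failed request lets the crawl carry on past a broken link.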
Second Function: skill_info_fi(city = None, job = None):
This function will take a desired city and job title and look for all new job postings on Indeed.fi. It will crawl all of the job postings and keep track of how many mention each skill in a preset list of typical data science skills. The final percentage for each skill is then displayed at the end of the collation.
- Inputs: The city and the job title. Both are optional. If they are omitted, the function will assume a nationwide search for data scientist jobs (this can take a while!). Pass both as strings, such as skill_info_fi(city = 'Helsinki', job = 'Data Analytics').
- Output: A bar chart showing the most commonly desired skills in the job market for the job title. The function also exports the plot to a PNG file for later use.
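To make the optional inputs concrete, the search URL could be assembled along the following lines. This is a sketch under assumptions, not the notebook's code: the base address `https://www.indeed.fi/jobs`, the query parameters `q`, `l` and `start`, and the helper name `build_search_url` are all guesses for illustration and should be checked against the live site.

```python
from urllib.parse import urlencode

# Assumed search endpoint; verify against the live site before relying on it.
BASE_URL = "https://www.indeed.fi/jobs"

def build_search_url(city=None, job=None, start=0):
    """Build a search URL for an exact-match job title, optionally per city."""
    job = job or "data scientist"      # default search when no job is given
    params = {"q": '"%s"' % job}       # quotes ask for a direct match
    if city:
        params["l"] = city             # leaving this out searches nationwide
    if start:
        params["start"] = start        # assumed offset of the result page
    return BASE_URL + "?" + urlencode(params)

print(build_search_url(city="Helsinki", job="Data Analytics"))
# prints: https://www.indeed.fi/jobs?q=%22Data+Analytics%22&l=Helsinki
```

`urlencode` takes care of escaping the quotes and spaces, which is easier to get right than concatenating the pieces by hand.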
Let’s now try running our new function on Helsinki, Espoo and Suomi (meaning Finland, for a nationwide result) for three job titles, ‘Data scientist’, ‘Data analytics’ and ‘Machine learning’, to see what results we get. Just as a note, all of these results were run on September 11, 2018.
Data Scientist Job
- Helsinki: There were 19 Data Scientists jobs found in Helsinki
- Espoo: There were 19 Data Scientists jobs found in Espoo
- Nationwide: There were 23 Data Scientists jobs found nationwide
Data Analytics Job
- Helsinki: There were 25 Data Analytics jobs found in Helsinki
- Espoo: There were 25 Data Analytics jobs found in Espoo
- Nationwide: There were 38 Data Analytics jobs found nationwide
Machine Learning Job
- Helsinki: There were 83 Machine Learning jobs found in Helsinki
- Espoo: There were 91 Machine Learning jobs found in Espoo
- Nationwide: There were 107 Machine Learning jobs found nationwide
There are not many job listings related to ‘Data analytics’, ‘Data scientist’ and ‘Machine learning’ on Indeed.fi (compared to Monster.fi or Indeed.com). In the next post, I will try to scrape some other major job listing websites in Finland to gather more information for analysis.
The Jupyter Notebook and all the plots can be found on my GitHub.