Job Market Trend Analysis

Hongri Jia
Passion for Data Science
6 min read · Mar 23, 2018

The purpose of this project is to help our client build a talent database containing background and experience information on people working in the data science industry. The client hopes to use this database to select candidates for specific job postings and gain insight into job market trends. In this project, we scrape the web with Selenium, process the data with Pandas, and visualize it with several Python visualization libraries.

Project Background

Our client is an executive recruiting firm focusing on data science related job positions. Previously, they had full-time employees manually searching LinkedIn for potential candidates for the positions they handle. However, this manual work is slow, and hiring more people for the purpose would be costly. They therefore want an automated data collection program and a talent database, which can help them understand job market trends and locate target candidates more efficiently.

Project Content

There are three main steps in the pipeline of our project:

  • Data collection from employment-oriented websites
  • Data analysis of job market trends
  • Machine learning models to match candidate profiles with job descriptions

In this blog, I will mainly focus on the first two parts. The third part will be introduced in a separate blog post to be published later.

Data Collection

Our client used to visit every profile related to a given job position and record the information of potential candidates by hand. In this project, we automate that process with Python-based web scraping and save each profile's full HTML structure into a local text file.

Usually, people use a library named BeautifulSoup for this purpose. However, BeautifulSoup only extracts content from static HTML pages, so it cannot scrape websites that actively discourage automated access, such as LinkedIn. To solve this issue, we use a more advanced package called Selenium to mimic a real human user interacting with the web pages. It controls the browser directly, programmatically clicking links and filling in login information, which reduces the probability of being blocked by LinkedIn.
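The login-and-save flow can be sketched roughly as follows. This is a minimal illustration, not our production scraper: the element IDs ("username", "password") reflect LinkedIn's login form at one point in time and may change, and it assumes Selenium and a matching ChromeDriver are installed (the imports are kept inside the functions so the sketch loads even without them).

```python
def login_linkedin(email, password):
    """Open a browser, log in to LinkedIn, and return the driver.

    The element IDs below are assumptions about LinkedIn's login form
    and may need updating if the page changes.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://www.linkedin.com/login")
    driver.find_element(By.ID, "username").send_keys(email)
    driver.find_element(By.ID, "password").send_keys(password)
    driver.find_element(By.XPATH, "//button[@type='submit']").click()
    return driver


def save_profile_html(driver, url, path):
    """Visit a profile URL and dump the rendered page source to a text file."""
    driver.get(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(driver.page_source)
```

Saving the raw page source, rather than parsed fields, lets us re-extract information later without re-scraping.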

The video below demonstrates the automated process.

The first challenge we met is making sure each action our program performs looks exactly like a real human operation. For example, if you don't wait long enough for a page to load fully, you will lose part of the information. And if you want to click a link or button on the page, you have to make sure it is actually visible in the browser. Otherwise, LinkedIn will regard you as a robot and block your access, since humans cannot click on anything invisible.
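These precautions can be bundled into a small helper like the sketch below: wait explicitly until the element is clickable instead of sleeping a fixed time, scroll it into the viewport before clicking, and pause briefly so the action rate looks human. Again this assumes Selenium is installed (hence the imports inside the function), and the exact timings are illustrative.

```python
def safe_click(driver, locator, timeout=10):
    """Wait for an element, scroll it into view, then click it.

    `locator` is a (By.<...>, value) pair. A sketch of the
    "act like a human" precautions described above.
    """
    import time
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Block until the element is clickable instead of sleeping blindly,
    # so slow page loads don't cost us information.
    element = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable(locator)
    )
    # Bring the element into the viewport -- clicks on off-screen
    # elements are a giveaway that a bot is driving the browser.
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    time.sleep(1.5)  # brief pause so the action rate looks human
    element.click()
```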

The most challenging part is LinkedIn itself, which has various defenses against external scraping. First, each LinkedIn account can only access the first 100 pages of search results. Second, some profiles appear repeatedly across different result pages, which causes duplicate records in our data. These problems limit the amount of data we can collect.

We solved this with a networking approach. Every profile page has a section called “People Also Viewed”, which lists ten people who work in fields similar to the profile owner's. We treat the profiles we have already collected as seeds and extend the network through these connections. Since LinkedIn does not impose a visit limit on individual profiles, the amount of data grows exponentially this way.
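The expansion is essentially a breadth-first traversal with a visited set, which also removes the duplicate profiles mentioned above. In this sketch, `fetch_also_viewed(url)` is a hypothetical stand-in for the Selenium step that returns the ~10 profile URLs listed on a page, so the traversal logic can be shown on its own:

```python
from collections import deque

def crawl_network(seed_urls, fetch_also_viewed, limit=1000):
    """Breadth-first expansion over "People Also Viewed" links.

    `fetch_also_viewed(url)` is assumed to return the profile URLs
    listed on that page; `limit` caps how many profiles we collect.
    """
    visited = set()
    frontier = deque(seed_urls)
    collected = []
    while frontier and len(collected) < limit:
        url = frontier.popleft()
        if url in visited:
            continue  # skip profiles already seen on earlier pages
        visited.add(url)
        collected.append(url)
        frontier.extend(fetch_also_viewed(url))
    return collected

# Toy network standing in for real profiles.
toy = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"], "D": []}
print(crawl_network(["A"], lambda u: toy.get(u, [])))  # ['A', 'B', 'C', 'D']
```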

Figure-1 “People Also Viewed” Section on the LinkedIn Profile

Data Analysis

Before analyzing the data, we have to extract the useful information from the raw data collected in the last step, since it is saved as HTML in text files. As I mentioned before, BeautifulSoup can extract content from static HTML pages directly, so we apply it here. The extracted data is then placed into a Pandas data frame. Pandas is a Python library designed for data manipulation and analysis; a data frame is a table-like data structure, and Pandas provides many powerful functions for processing data frames. On this basis, we can conduct a variety of analyses. Figure-2 shows the output of the data preprocessing.
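The extraction step looks roughly like this. The HTML snippet and CSS class names below are invented for illustration; real LinkedIn markup differs, so the selectors would need to be adapted to the saved pages.

```python
from bs4 import BeautifulSoup
import pandas as pd

# A tiny stand-in for one saved profile page (real markup differs).
html = """
<div class="profile">
  <h1 class="name">Jane Doe</h1>
  <span class="title">Data Scientist</span>
  <ul class="skills"><li>Python</li><li>SQL</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
record = {
    "name": soup.select_one(".name").get_text(strip=True),
    "title": soup.select_one(".title").get_text(strip=True),
    "skills": [li.get_text(strip=True) for li in soup.select(".skills li")],
}

# One dict per profile; the list of dicts becomes the data frame.
df = pd.DataFrame([record])
print(df)
```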

Figure-2 Pandas Data Frame with the Information on the LinkedIn Profiles

Now we can do some job market trend analysis for our client. First, we made a word cloud of the top skills of data scientists in Toronto. It shows that their skillset mainly consists of hard skills such as Python, R, SQL and machine learning. In other words, a candidate for a data scientist position in Toronto should be proficient in these skills.
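Under the hood, a word cloud is driven by a frequency count over the flattened skill lists. The skill list below is a made-up sample, not our real data; in practice the input would come from the `skills` column of the data frame built earlier.

```python
from collections import Counter

# Hypothetical flattened list of skills from the scraped profiles.
skills = (["Python"] * 40 + ["R"] * 30 + ["SQL"] * 25
          + ["Machine Learning"] * 20 + ["Excel"] * 5)

counts = Counter(skills)
print(counts.most_common(3))  # [('Python', 40), ('R', 30), ('SQL', 25)]

# The wordcloud library can render these frequencies directly, e.g.:
# WordCloud().generate_from_frequencies(counts)
```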

Figure-3 Skillset of Data Scientists in Toronto

Second, we built two pie charts and a bar plot to explore their education levels and backgrounds, respectively. Figure-4 shows that most data scientists in Toronto hold a master's degree. It also suggests that companies prefer employees with graduate degrees.
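A pie chart like Figure-4 reduces to computing the share of each education level. The counts below are invented for illustration; the real input is the education column of the profile data frame.

```python
import pandas as pd

# Hypothetical education column from the profile data frame.
education = pd.Series(
    ["Master"] * 55 + ["Bachelor"] * 25 + ["PhD"] * 15 + ["Other"] * 5,
    name="education",
)

# Proportion of each education level across all profiles.
shares = education.value_counts(normalize=True)
print(shares)

# Rendering the pie chart is then one call away:
# shares.plot.pie(autopct="%.0f%%")
```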

Figure-4 Education Level of Data Scientists in Toronto

As for background, our client wants to know whether the relevance of a candidate's academic background to data science matters when companies choose employees. We treat computer science, statistics and some similar majors as data science related backgrounds. A person who studied only in these fields is labelled “All related”, and a person who studied in both related and unrelated fields is labelled “Existing related”. According to our results, the major is not a very important factor in the recruitment process. This may be because data science is a fairly new field and companies focus more on an employee's skills.
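The labelling rule can be written as a small function. The set of related majors below is an illustrative subset of what we actually used, and the “Not related” fallback for candidates with no related major is my naming, not a label quoted from the analysis:

```python
# Illustrative subset of the majors we counted as data science related.
RELATED = {"computer science", "statistics", "mathematics", "data science"}

def label_background(majors):
    """Label a candidate by how related their majors are to data science."""
    majors = {m.lower() for m in majors}
    related = majors & RELATED
    if not related:
        return "Not related"
    if related == majors:
        return "All related"      # every major is data science related
    return "Existing related"     # a mix of related and unrelated majors

print(label_background(["Computer Science"]))         # All related
print(label_background(["Statistics", "Economics"]))  # Existing related
print(label_background(["History"]))                  # Not related
```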

Figure-5 Background Relevance of Data Scientists in Toronto

Then, we helped the client explore how many people moved to Toronto from other cities to become data scientists. The figure shows that more than half of Toronto's data scientists did not work in Toronto in their previous jobs. This is because data science is developing rapidly in Toronto and attracting lots of people. It also means our client should widen their candidate search if they want to optimize their choices.

Figure-6 Movement Condition of Data Scientists in Toronto

Besides, this bar chart shows the distribution of data scientists across companies in Toronto. It suggests that our client should establish connections with top employers like Scotiabank, RBC and Capital One if they want to expand their business.

Figure-7 Top Companies having Data Scientists in Toronto

Finally, we conducted an industry analysis to help the client understand which industries have higher demand for data scientists. Figure-8 shows that a wide variety of industries are hiring data scientists, with banking, technology and software companies showing relatively high demand. The recruiter can therefore focus on these industries when looking for ideal candidates.

Figure-8 Industry Distribution of Data Scientists in Toronto

Based on the analysis above, our client obtained a deeper insight into the data science industry, which provides more hints for targeting candidates. The next step is to develop machine learning models to match job candidate profiles with job descriptions, which can offer a search function for recruiting purposes. This part will be introduced in a later post. If you are interested in the technical details of how this project is implemented, please feel free to contact me. In the meantime, if you want to know more about what students learn from WeCloudData's data science courses, check out this website:

www.weclouddata.com
