Data Science job search: Using NLP and LDA in Python

Thomas Caffrey
Published in Analytics Vidhya · 9 min read · May 11, 2020

Scraping job adverts on Indeed and using topic modelling to find hidden topics in job postings

As someone recently in the market for a new data science job, I’ve been looking at job adverts on a number of sites and thought it would be interesting to analyse the posts on Indeed and apply Latent Dirichlet Allocation (LDA) to the job descriptions. The topics generated by LDA should hopefully give an indication of the key data science skills sought by employers.

To do this, the following process was followed:

  1. Scrape data from Indeed using Beautiful Soup
  2. Analyse the job postings
  3. Clean the data and apply topic modelling

For now I’ve only scraped jobs in Greater London as that’s the area relevant to me, but I could also scrape jobs from other UK locations to increase the data set. All code used for this analysis can be found on GitHub.

1. Scraping Job Adverts

The information in each job advert on Indeed includes:

  • Job Title
  • Company
  • Location
  • Contract Type*
  • Salary*
  • Job Description

*not included in every job advert. Contract Type (permanent, contract etc.) was only provided 40% of the time and Salary approximately 50% of the time.

Scraping of the job adverts was done using the Beautiful Soup library; the full code for this is contained in the GitHub repository linked at the end of this post.
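Purely as an illustration, here is a minimal sketch of how one page of search results might be scraped with requests and Beautiful Soup. The search URL parameters and CSS class names are assumptions made for this sketch; Indeed's markup changes frequently, so the selectors in the actual repository code will differ.

```python
import requests
from bs4 import BeautifulSoup

# Illustrative search URL and parameters (not taken from the repository code)
url = "https://www.indeed.co.uk/jobs"
params = {"q": "data scientist", "l": "London, Greater London"}

response = requests.get(url, params=params)
soup = BeautifulSoup(response.text, "html.parser")

jobs = []
# "jobsearch-SerpJobCard" is an assumed class name for a result card
for card in soup.find_all("div", class_="jobsearch-SerpJobCard"):
    title = card.find("h2")
    company = card.find("span", class_="company")
    location = card.find(class_="location")
    salary = card.find(class_="salary")  # not present on every advert
    jobs.append({
        "title": title.get_text(strip=True) if title else None,
        "company": company.get_text(strip=True) if company else None,
        "location": location.get_text(strip=True) if location else None,
        "salary": salary.get_text(strip=True) if salary else None,
    })
```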

2. Analysing the data

Volume of posts from Companies / Recruiters

Unsurprisingly, most of the companies posting frequently are recruiters. In Greater London, Harnham and Datatech Analytics appear to have the largest number of roles, so it may be worth speaking directly with these two to learn more about the roles they handle.

Removing duplicate postings

Sometimes recruiters or companies post the same advert for a job, which results in duplicate data. Exact duplicates can simply be removed based on the job description; however, when competing recruiters post adverts for the same job there can be slight differences between the postings.

The easiest solution for removing these closely matched jobs seemed to be comparing cosine similarities based on their word counts. Any pair with a score close to, but not equal to, 1 is almost certainly a repetition of the same job.

Removing similar job descriptions with cosine similarity

Cosine similarity is a metric used to determine how similar the documents are irrespective of their size. It measures the cosine of the angle between two vectors (containing word counts) projected in a multi-dimensional space, where each dimension corresponds to a word in the document. The cosine similarity captures the orientation (the angle) of the documents and not the magnitude.

Cosine similarity is advantageous because two similar documents that are far apart in Euclidean distance (due to differing lengths) can still have a small angle between them, which indicates they are similar.

Each job advert is transformed into a vector of word counts before the cosine similarities are calculated. This method will certainly catch documents that are almost exactly the same, but it should also catch adverts where an extra section has changed the length of the post (for example, another recruiter posting the same job description with additional information about their recruitment company).

I ran a few checks by inserting a reasonably long block of unrelated recruiter contact information into adverts that had a high cosine similarity with another (close to 1). With this “noise” added, the cosine similarity dropped but still remained above 0.99, which suggests these similar job descriptions would still be picked up. The Euclidean distance, however, increased noticeably, which illustrates why a distance measure is less suitable for removing similar documents.

Some exploration of closely matched scores led to a cutoff of 0.98 cosine similarity being used for removal: anything above this is treated as a duplicate. This cutoff captures adverts that have slight variations but are essentially the same job posting.
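As a rough sketch of this de-duplication step (assuming the adverts sit in a pandas DataFrame called df and using scikit-learn for the vectorisation and similarity calculation, neither of which is specified in the post):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = df["job_description"].tolist()  # assumed column name

# Vector of word counts for each advert, then pairwise cosine similarities
counts = CountVectorizer().fit_transform(descriptions)
similarity = cosine_similarity(counts)

# Keep the first advert of any pair scoring above the 0.98 cutoff, drop the later one
to_drop = set()
n = similarity.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if similarity[i, j] > 0.98:
            to_drop.add(j)

df_deduplicated = df.drop(df.index[list(to_drop)])
```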

Job Titles

Data Scientist and Machine Learning Engineer are the most common job titles. However, of the approximately 600 results returned from the search there is a lot of variation in the titles used, with around 400 job titles appearing only once.

For now I will class each job title into the following categories:

  • Lead: Any title containing lead, chief, head or manager
  • Senior: Any title containing senior or principal
  • Graduate: Any title containing graduate
  • Regular: Will class anything else as a regular role

I’m sure some roles will end up classed as Regular when in reality they are more senior or more junior. However, this currently seems a reasonable way to narrow down jobs based on their titles: at the very least I know for certain that if a role is Senior, Lead or Graduate it doesn’t belong in the shortlist for my own job search.
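A simple way to implement these rules (assuming the scraped adverts are in a pandas DataFrame with a job_title column; both names are my own) is a small keyword-matching function:

```python
def classify_title(title):
    """Map a raw job title to Lead / Senior / Graduate / Regular."""
    t = title.lower()
    if any(word in t for word in ("lead", "chief", "head", "manager")):
        return "Lead"
    if any(word in t for word in ("senior", "principal")):
        return "Senior"
    if "graduate" in t:
        return "Graduate"
    return "Regular"

df["role_type"] = df["job_title"].apply(classify_title)
```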

Contract Type

Indeed contains a mixture of contract and permanent roles, although only 40% of roles have a contract type associated with them. It may be possible to investigate the ones without a contract type further and infer more from their descriptions. For now I will class contract types as follows:

  • Contract: Anything that has ‘contract’ or ‘temporary’ associated with it
  • Apprenticeship
  • Internship
  • Part-Time
  • Permanent: Any contract that is classed as ‘Full-time’ or ‘Permanent’

Python vs R

Looking into whether Python or R is specified in the job description is mostly out of curiosity.

Python is overwhelmingly the language of choice in the job descriptions. Its share is higher than I expected and makes the case for using Python over R when producing data science analysis and models.
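For reference, the check itself can be as simple as a keyword search; the only subtlety is matching R on a word boundary so that every word containing the letter r doesn't count (the column names are assumptions):

```python
import re

def mentions_python(description):
    return "python" in description.lower()

def mentions_r(description):
    # \b ensures "R" is matched as a standalone word, not as part of another word
    return re.search(r"\bR\b", description) is not None

df["mentions_python"] = df["job_description"].apply(mentions_python)
df["mentions_r"] = df["job_description"].apply(mentions_r)
```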

Salary

On Indeed, permanent roles have an annual salary associated with them, whereas contractor roles specify a rate (hourly, daily, weekly or monthly).

The median salary for Regular permanent roles is higher than I would expect, which leads me to think some senior roles are included but the seniority hasn’t been reflected in the title. For contractor roles there appears to be an error in some of the salaries, shown by the roles listed at £0–200 a day. These have been uploaded as £400–500 a week when it is likely they meant per day (which would put them in line with the other roles).

3. Topic Modelling

Pre-processing text data

There is a wealth of material available regarding pre-processing techniques for text data. The method taken for this work involves:

  • Removing special characters & whitespaces
  • Converting to lowercase
  • Tokenising the document
  • Removing stop words
  • Lemmatising the tokens
  • Removing words that are only 1 character
  • Removing numbers but not words that are numbers

All of this pre-processing is done through the Python library NLTK (Natural Language Toolkit).
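A minimal sketch of these steps with NLTK is shown below; the exact regular expressions and ordering are my own assumptions, and the punkt, stopwords and wordnet resources need downloading with nltk.download() first:

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # strip special characters and digits (words like "two" survive)
    text = re.sub(r"\s+", " ", text).lower()   # collapse whitespace and lowercase
    tokens = word_tokenize(text)               # tokenise
    tokens = [t for t in tokens if t not in stop_words]    # remove stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]     # lemmatise
    return [t for t in tokens if len(t) > 1]   # drop single-character tokens

docs = [preprocess(d) for d in df["job_description"]]
```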

Using Gensim to create a dictionary and bag of word corpus

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

Gensim requires the words (tokens) to be converted to unique ids, which can be done by creating a dictionary that maps words to ids. Once the dictionary has been created, a bag-of-words corpus can be built containing each word id and its frequency in each document. This is effectively the equivalent of a document-term matrix.
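In code this amounts to a couple of Gensim calls on the pre-processed documents; the filter_extremes thresholds below are illustrative rather than values taken from the post:

```python
from gensim import corpora

dictionary = corpora.Dictionary(docs)                  # maps each token to a unique id
dictionary.filter_extremes(no_below=5, no_above=0.5)   # optional pruning of very rare / very common tokens
corpus = [dictionary.doc2bow(doc) for doc in docs]     # list of (token id, count) pairs per document
```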

Topic modelling with LDA

Topic modelling is an unsupervised machine learning technique that detects words and phrases within a set of documents and clusters word groups that best represent the documents.

To summarise the process of LDA in a simple way:

  • The number of topics, K, to be used is selected.
  • LDA goes through each word in each document and assigns it to one of the K topics (initially at random).
  • For each document, the percentage of its words currently assigned to each topic is analysed.
  • For each word in a document, the percentage of times that word has been assigned to each topic (over all the documents) is also analysed.

P(topic t | document d) = % of words in a document d that are currently assigned to Topic t

P(word w | topic t) = % of times the word w was assigned to Topic t over all documents

LDA will move a word w from Topic A to Topic B when:

P(topic A | document d) * P(word w | topic A) < P(topic B | document d) * P(word w | topic B)

After a specified number of passes, LDA “converges”: the topic representations, and the documents represented in terms of those topics, reach a more acceptable state.
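With the dictionary and corpus from the previous step, training the model in Gensim is a single call. The num_topics and passes values here are placeholders to be tuned (the next section covers choosing the number of topics):

```python
from gensim.models import LdaModel

lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=9,      # placeholder; chosen via the coherence score below
    passes=10,         # number of passes over the corpus
    random_state=42,   # for reproducibility
)

# Inspect the top words in each topic
for topic_id, words in lda_model.print_topics(num_words=10):
    print(topic_id, words)
```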

Coherence Score

The topic coherence of a model is determined using the following steps:

  1. Select the top n most frequently occurring words in each topic.
  2. Calculate pairwise scores for each of the words selected above and generate the coherence score for each topic by aggregating these pairwise scores.
  3. The topic model score is calculated as the mean of the coherence scores per topic.

An approach to finding the optimal number of topics is to build a variety of models with different numbers of topics (k) and choose the model with the highest coherence score.
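A sketch of that search, assuming the c_v coherence measure (the post doesn't state which measure was used):

```python
from gensim.models import LdaModel, CoherenceModel

coherence_scores = {}
for k in range(2, 16):  # illustrative range of topic numbers
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                     passes=10, random_state=42)
    cm = CoherenceModel(model=model, texts=docs, dictionary=dictionary, coherence="c_v")
    coherence_scores[k] = cm.get_coherence()

best_k = max(coherence_scores, key=coherence_scores.get)
```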

Comparing coherence scores across different numbers of topics, the optimum comes at 9 topics. The topics and associated keywords can be visualised with the excellent pyLDAvis package (based on the LDAvis package in R).
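Producing the visualisation only takes a couple of lines; note that older pyLDAvis releases expose the Gensim helper as pyLDAvis.gensim, while newer ones rename it to pyLDAvis.gensim_models:

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # pyLDAvis.gensim in older versions

vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")
```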

Each bubble on the plot represents one of the topics. The larger the bubble, the more prevalent the topic. A good topic model will have relatively large, non-overlapping bubbles scattered across the chart. From this and the coherence score it seems that this particular model can be improved (increasing the data size should yield improvements), but I would consider it a good starting point as a baseline model.

Topics Contained in the Job Adverts

Two of the nine topics relate to recruiter information (such as equal opportunities statements or referral schemes). The remaining topics relate to data science and cover the following areas:

  1. Analytics & Statistical Modelling
  2. Deep Learning, NLP & Computer Vision
  3. Machine Learning, Big Data & Cloud Computing
  4. Building data products
  5. Project Management and supporting clients

The topics found in the job adverts are hardly surprising, although it’s reassuring that processing the text in these adverts and applying LDA to it has produced the expected results. Being able to demonstrate skills and experience in the topics outlined above is therefore important for a data scientist in the current market.

Assigning topics to job postings

The LDA model calculates the contribution of each topic to a job advert. I’ve extracted the top three topics, in order of dominance, for each role so that a snapshot of what each job involves can be seen.
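As a sketch, the top three topics per advert can be pulled out with Gensim's get_document_topics (the new column name is my own):

```python
top_topics = []
for bow in corpus:
    doc_topics = lda_model.get_document_topics(bow)           # list of (topic id, probability)
    doc_topics.sort(key=lambda pair: pair[1], reverse=True)   # most dominant topic first
    top_topics.append([topic_id for topic_id, _ in doc_topics[:3]])

df["top_3_topics"] = top_topics  # assumes df rows are in the same order as the corpus
```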

By filtering on the job role information (Role Type and Contract Type in particular) and assigning the top three topics, I can create a reduced list of adverts more relevant to my search to read through in detail.

Thanks for reading. If you have any feedback, please feel free to reach out by commenting on this post or messaging me on LinkedIn.

LinkedIn: https://www.linkedin.com/in/t-caffrey/

Github Repository: https://github.com/tcaffrey/LDA_Job_Search
