Extracting keywords from text: A case study with Japanese job postings

Sachin Kumar S
Published in The HumAIn Blog · Jul 20, 2021

Extracting keywords from text can be a challenging task. Today we’ll take a close look at this problem using a Japanese text dataset of job postings.

Let’s take a magnifying glass to it.

This problem statement can be approached in 3 ways:

  1. Supervised Approach
  2. Semi-Supervised Approach
  3. Unsupervised Approach

Supervised Learning Approach

For the supervised learning approach to work, we need training data, i.e., a list of extracted skills and their corresponding job descriptions. Using the training data, we can train a Named Entity Recognition (NER) model to identify skill keyword entities, tag them, and extract the skill keywords from the job descriptions.
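As a rough illustration of this approach, here is a minimal sketch of training a spaCy NER model with a custom SKILL label. The two training examples, their character offsets, and the label are hypothetical placeholders, not data from this project; a Japanese pipeline additionally needs the sudachipy and sudachidict_core packages, and a real model would require far more annotated job descriptions.

import spacy
from spacy.training import Example

# Hypothetical labelled data: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Pythonと機械学習の経験が必要です", {"entities": [(0, 6, "SKILL"), (7, 11, "SKILL")]}),
    ("AWSでのCI/CD構築経験を歓迎します", {"entities": [(0, 3, "SKILL"), (5, 10, "SKILL")]}),
]

nlp = spacy.blank("ja")          # blank Japanese pipeline (uses the SudachiPy tokenizer)
ner = nlp.add_pipe("ner")
ner.add_label("SKILL")

optimizer = nlp.initialize()
for epoch in range(20):
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("JavaとDockerの実務経験がある方")
print([(ent.text, ent.label_) for ent in doc.ents])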

Pros of this approach:

  1. The entity recognizer can combine rule-based parsing, dictionary lookups, POS tagging, and dependency parsing.
  2. It is robust to a certain extent, and the robustness largely depends on the size of the training data.
  3. It is easy to build.

Cons of this approach:

  1. We do not have a readily available list of skills in the Japanese language.
  2. It is hard for the model trained on a small sample to extract unseen keywords when the context is tricky.
  3. It is computationally expensive for big datasets.

Semi-Supervised Learning Approach

We can define category lists of skills, with at most 10 seed skills per category. Then we can train a semi-supervised word2vec word-embedding model to extract skill keywords under each skill category and create a skills dictionary.

The logic behind this approach is that word embeddings represent the meaning of a word as a word vector. Word vectors can, in turn, be used to mine the underlying relationship between two words using the cosine similarity between their vectors. The cosine similarity score between two word vectors tells us how similar or synonymous the two words are. Based on the learned similarity to seed words describing a particular skill category, a broad set of words and phrases describing that category can be identified and extracted, as sketched after the example list below.

Example Category List:

{'Technology Skills': ['C', 'Python', 'JavaScript', 'Scala', 'Microsoft Office', 'github', 'Machine Learning', 'Docker', 'CICD', 'Automation'],

 'Communication Skills': ['English', 'Etiquette', 'Listening', 'Speaking', 'Friendliness', 'Confidence'],

 'Cognitive Skills': ['Problem Solving', 'Research', 'Analytical', 'Critical Thinking', 'Math', 'Statistics']}
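A minimal sketch of this idea using gensim's word2vec on a toy tokenized corpus; in practice the model would be trained on the tokenized job descriptions and seeded with the category lists above.

from gensim.models import Word2Vec

# toy tokenized corpus; in practice, use the tokenized job descriptions
tokenized_docs = [
    ["python", "docker", "cicd", "automation", "machine", "learning"],
    ["java", "python", "statistics", "analytical", "problem", "solving"],
    ["english", "listening", "speaking", "friendliness", "confidence"],
] * 50

model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)

# cosine similarity between two word vectors
print(model.wv.similarity("python", "docker"))

# expand each seed category with the most similar words learned from the corpus
seed_categories = {
    "Technology Skills": ["python", "docker", "cicd"],
    "Communication Skills": ["english", "listening", "speaking"],
}
skills_dictionary = {}
for category, seeds in seed_categories.items():
    seeds_in_vocab = [w for w in seeds if w in model.wv]
    similar = model.wv.most_similar(positive=seeds_in_vocab, topn=10)
    skills_dictionary[category] = seeds_in_vocab + [w for w, _ in similar]

print(skills_dictionary)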

Pros of this approach:

  1. The semi-supervised word2vec model has been tested on similar problem statements and proven effective.
  2. We can extract not only words but also word phrases with this approach.

Cons of this approach:

  1. We do not know the optimal number of skill-topic clusters that should be used to identify skills under different categories.
  2. The skill keywords in our Japanese job postings dataset are a mix of English and Japanese words, and I am not skilled in reading or writing Japanese.

Unsupervised Learning Approach

This is the most suitable approach for our current dataset. The skill keywords mined with this approach can be used to build an NER model. We can also cluster the extracted skill keywords to learn optimal topic clusters and seed keywords, build a semi-supervised word2vec model to extract skill keywords per document, and score the topic clusters of each document to determine the diversity of skill requirements in a job posting. Let us take a look at this approach and our dataset in the next section.

THE DATASET

We have 70,589 unique job descriptions collected from 53 job posting platforms. The job postings are openings at 1,246 Tokyo Stock Exchange-listed Japanese firms.

Descriptive Stats of the dataset:

 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   ID                              70589 non-null   float64
 1   English company name            70589 non-null   object
 2   company_name                    70589 non-null   object
 3   corp_name                       70589 non-null   object
 4   industry                        70589 non-null   object
 5   Industry major classification   70589 non-null   object
 6   Industry minor classification   70589 non-null   object
 7   website                         70589 non-null   object
 8   url                             70589 non-null   object
 9   job_postedlocation              70589 non-null   object
 10  job_headoffice_location         70589 non-null   object
 11  website_classification          70589 non-null   object
 12  website_categories              70589 non-null   object
 13  jobtype_classification          70589 non-null   object
 14  jobfunction_tags                70589 non-null   object
 15  job_descr                       70589 non-null   object
 16  application_conditions          70589 non-null   object

ID unique len: 1246

English company name unique len: 1244

company_name unique len: 1661

corp_name unique len: 1246

industry unique len: 9

Industry major classification unique len: 5

Industry minor classification unique len: 19

website unique len: 53

url unique len: 70180

job_postedlocation unique len: 48

job_headoffice_location unique len: 48

website_classification unique len: 78

website_categories unique len: 11093

jobtype_classification unique len: 10

jobfunction_tags unique len: 18420

job_descr unique len: 70589

application_conditions unique len: 36667

The challenging parts of extracting skill keywords from the job description documents in our dataset are as follows:

  1. Many job postings in Japan do not focus on listing stringent skill requirements. Instead, they focus on informing applicants about the challenging tasks of the position and the benefits offered. The document mostly describes “what we offer” rather than “what we look for in you”.
  2. Some job postings are not wordy enough to extract skill keywords from.
  3. The documents are a mix of both Japanese and English words.

Text Cleaning Pipeline

The text cleaning pipeline plays a crucial role in improving the accuracy of the keyword extractor. We used the following Python tools to perform text cleaning: re, nltk, MeCab, neologdn, and spaCy.

Sample Job Description:

'お一人暮らし希望の方は必見! ・独身寮あります! 23、000円/月の家賃で新生活ができます! 病院・看護職員さん募集! 療養上の患者さんのお世話・診察の補助病務を対応していただきます。 多職種と協働し、チームで患者中心の個別的医療・看護を行って頂きます。 診療(治療)を行う上での補助や患者の日々の観察・社会復帰に向けての支援や援助を行って頂きます。 ※当直業務が月に5回程度ございます。 (回数は相談可能です) ※ブランクのある方・未経験の方もOK! 採用時から丁寧に指導して頂けるので安心です。 ※時間外勤務はほとんどありませんが、業務多忙の際には、残っていただく場合あり(月5時間程度) 【施設概要】 一般病棟入院基本料 急性期一般入院料5 50床(DPC対象) 地域包括ケア病棟:42床 回復期リハビリテーション病棟:46床 医療療養病棟:34床 (診療科目) 外科/肛門外科/消化器外科/内科/脳神経外科/ 整形外科 産婦人科/泌尿器科/麻酔科/リハビリテーション科/放射線科 循環器内科/神経内科/眼科/皮膚科/耳鼻咽喉科/乳腺外科 リウマチ科 【ツクイスタッフでは高収入なお仕事多数!】 □家事育児と両立したい □介護・医療に興味がある □高収入なお仕事がしたい □介護・看護免許あるけど… □短時間で働きたい! そんな、主婦(主夫)ママ女性男性大活躍中!あなたのライフスタイルにあったお仕事を見つけるのは、ツクイスタッフ♪'

As you can see in the above example, the description document comes with some structure, and the structure is determined by categorical labels enclosed in 【 and 】. There are also bullet-like pointers (such as ・ and □) that mark individual requirements. Understanding the categories and the structure of the description document is important for eliminating unwanted words and improving the accuracy and time complexity of the text analytics pipeline.

We found 7280 unique encoded categories across all the 70k description documents.
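One way such labels can be collected is with a simple regular expression over the raw descriptions; a minimal sketch, where job_descriptions stands in for the 70k job_descr values:

import re
from collections import Counter

sample = "【施設概要】一般病棟入院基本料 … 【ツクイスタッフでは高収入なお仕事多数!】"
job_descriptions = [sample]  # in practice, the 70k job_descr values

label_counter = Counter()
for doc in job_descriptions:
    label_counter.update(re.findall(r"【(.+?)】", doc))

print(len(label_counter))             # number of unique category labels
print(label_counter.most_common(20))  # the top 20 labels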

Top 20 category labels

It would be beneficial to categorize these labels to understand relevant context fields and filter out words contained in the irrelevant context fields from the document. However, we skipped this step since it requires manually reading and understanding at least a few hundred labels.

The gist below performs text cleaning of the job description column.
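The original gist is embedded in the post; since it does not render here, the following is a minimal sketch of such a cleaning pipeline using the tools listed above (neologdn for normalization, re for stripping noise, MeCab for tokenization). The exact steps in the original gist may differ.

import re

import MeCab
import neologdn

tagger = MeCab.Tagger("-Owakati")  # whitespace-tokenize Japanese text

def clean_job_description(text: str) -> str:
    text = neologdn.normalize(text)                        # normalize full-/half-width characters and variants
    text = re.sub(r"https?://\S+", " ", text)              # drop URLs
    text = re.sub(r"[0-9０-９,、]+円?", " ", text)          # drop numbers and prices
    text = re.sub(r"[【】「」『』■□◆・※♪!！?？/]", " ", text)  # drop brackets, bullets and decorations
    text = re.sub(r"\s+", " ", text).strip()               # collapse whitespace
    return tagger.parse(text).strip()                      # MeCab wakati tokenization

sample = "お一人暮らし希望の方は必見! ・独身寮あります! 23、000円/月の家賃で新生活ができます!"
print(clean_job_description(sample))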

After cleaning the job descriptions, the next crucial step is to annotate and remove irrelevant words from each document. For this exercise, we used spaCy to annotate a set of documents, and it was conclusive that the skill keywords are nouns. However, single nouns alone don't always make perfect sense. Hence, we extracted noun phrases from the documents using spaCy's noun_chunks.

Visualizing to understand POS tag word annotations
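A minimal sketch of noun-phrase extraction with noun_chunks, shown with an English pipeline for illustration; the mixed Japanese/English documents would need a Japanese-capable pipeline, since plain spaCy Japanese models may not implement noun_chunks.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

doc = nlp("Experience with machine learning pipelines and cloud infrastructure is required.")
noun_phrases = [chunk.text for chunk in doc.noun_chunks]
print(noun_phrases)  # e.g. ['Experience', 'machine learning pipelines', 'cloud infrastructure']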

After applying the text cleaning and noun-chunk filters to the job description column, most of the job description fields were left with few or no words due to the inferior quality of the job descriptions. The dataset was reduced to half its size after dropping all documents with fewer than 25 noun chunks. In this subset, most of the jobs were nursing and driving jobs, which mentioned no good-quality skill keywords. Hence, a manual filter was added to choose documents that mention skills such as Python and Java.
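A hedged sketch of this filtering step with pandas, assuming a hypothetical noun_chunks column holding the list of noun phrases per document; the toy DataFrame is only there to make the snippet run.

import pandas as pd

df = pd.DataFrame({
    "job_descr": ["python と docker の開発経験", "看護職員さん募集"],
    "noun_chunks": [["python", "docker", "開発経験"], ["看護職員"]],
})

df = df[df["noun_chunks"].apply(len) >= 25]                    # keep documents with at least 25 noun chunks
it_mask = df["job_descr"].str.contains("python|java", case=False)
df_it = df[it_mask]                                            # manual filter for IT-related postings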

UNSUPERVISED KEYWORD EXTRACTOR

There are multiple choices for this step: from the corpus-dependent TF-IDF algorithm, to graph-based algorithms like TextRank and LexRank, to corpus-independent algorithms such as RAKE and YAKE.

The YAKE solution was chosen to extract skill keywords from the list of job descriptions. This method of skill keyword extraction worked well when applied to technical jobs; in other scenarios, non-skill keywords were also identified as potential keywords by the YAKE algorithm. Hence, if the documents are of poor quality, with few or no topic-dominant keywords to extract, the algorithm fails.
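A minimal sketch with the yake package; since YAKE splits candidate terms on whitespace, the Japanese text here is assumed to have been tokenized already (e.g. with MeCab).

import yake

# already whitespace-tokenized (wakati) job description text
text = "python 機械 学習 モデル の 開発 経験 と docker による デプロイ 経験 を 歓迎 し ます"

# lan is only a hint for stopword removal; yake falls back to a default list if unavailable
kw_extractor = yake.KeywordExtractor(lan="ja", n=2, dedupLim=0.9, top=10)
keywords = kw_extractor.extract_keywords(text)  # list of (keyword, score); lower score = more relevant
for keyword, score in keywords:
    print(f"{keyword}\t{score:.4f}")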

After extracting keywords from a document, the next step is to create a dictionary list of skills. We can do this with gensim's Dictionary class, which lets us filter out keywords based on how rarely or how frequently they appear across the documents.
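A minimal sketch of building the skills dictionary with gensim, where yake_keywords_per_doc stands in for the keyword lists extracted per job description.

from gensim.corpora import Dictionary

# hypothetical per-document keyword lists produced by YAKE
yake_keywords_per_doc = [
    ["python", "機械学習", "docker"],
    ["java", "spring", "sql"],
    ["python", "sql", "機械学習"],
]

dictionary = Dictionary(yake_keywords_per_doc)
# keep keywords appearing in at least 2 documents but in no more than 80% of them
dictionary.filter_extremes(no_below=2, no_above=0.8)
skills = sorted(dictionary.token2id)
print(skills)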

From all the IT jobs, the YAKE extractor was able to extract 900 unique keywords. A sample is shown below:

Gensim Dictionary of keywords

The above dictionary is very close to a list of skill keywords. Its quality can be improved further by applying corpus-dependent weights to the yake_tokenized_sentences and choosing the top n keywords per document to form the skill dictionary. Further, we can apply:

  1. A word splitter to separate meaningful words that are joined to one another. This can be achieved with the wordninja Python package for English words (see the snippet after this list). However, the documents we have are in both Japanese and English, so we need a more suitable solution here.
  2. A spell checker to rectify or eliminate misspelled keywords. John Snow Labs provides predefined Japanese and English spell-checker pipelines that can be used to rectify or eliminate misspelled words.
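A minimal sketch of the first point using the wordninja package (English-only; the expected outputs in the comments are illustrative).

import wordninja

# splits concatenated English words using word-frequency statistics
print(wordninja.split("machinelearningengineer"))  # expected: ['machine', 'learning', 'engineer']
print(wordninja.split("problemsolvingskills"))     # expected: ['problem', 'solving', 'skills']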

In this blog, we tried to tackle a complex NLP problem statement: extracting skill-topic keywords from job descriptions written in both Japanese and English. We used spaCy, MeCab, and YAKE to solve this problem. The resulting list of skill keywords does come with some noisy keywords. However, this method can be extended further: we can cluster the skill keywords with gensim topic modeling to determine the optimal number of skill topics and the seed keywords for each topic, and then use the outcome of topic modeling to develop the semi-supervised word2vec skill keyword and word-phrase extraction tool. Further, we can manually clean this list of skill keywords to some extent and develop an NER model to extract skills from job descriptions in a supervised manner.
