Resume Information Extraction (Part 2)

Research Papers That You Should Not Let Slip

Pei Seng Tan
ViTrox-Publication
6 min read · May 5, 2022



This article is the sequel to Resume Information Extraction (Part 1) — Research Papers That You Should Not Let Slip. If you haven’t read it yet, please do so first.

Paper 4: Resume Information Extraction with A Novel Text Block Segmentation Algorithm (Zu and Wang, 2019)

The authors propose an end-to-end resume parsing pipeline built on neural network classifiers and distributed embeddings, which exploits position-wise line information and integrated word representations inside each text block. Their motivation for this design is threefold:

  1. Neural network-based feature extraction outperforms hand-crafted features in capturing more semantics from texts.
  2. Word embeddings, serving as “universal feature extractors”, can represent words better than human-designed features.
  3. Pre-trained word embeddings are convenient to use.

Figure 1 depicts the whole pipeline for the proposed resume information extraction technique.

Figure 1: Pipeline for the proposed resume information extraction technique.

The proposed text block classification method trains two kinds of line classifiers: a line type classifier and a line label classifier. The line type classifier breaks resumes into basic parts according to four generic layout types: header, content, metadata, and footer. The line label classifier then segments six general information fields: personal, education, work, project, skill, and publication. Table 1 lists the categories and subcategories of these six fields.

Table 1: Predefined information field.

Pre-trained word embeddings specific to the resume domain are generated with the Word2Vec model from the Gensim toolkit, the default choice, by training it on phrase collections or corpora so that the line lists of text resumes can be transformed into word vectors. Other word representation models, such as GloVe and BERT, are also tested. These embeddings are fine-tuned during training. Text-CNN, RCNN, Adversarial LSTM, Attention BLSTM, and Transformer are the neural network architectures considered for text block classification.
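
As a rough illustration of this embedding step, the sketch below trains Word2Vec on tokenized resume lines with Gensim; the corpus file name and hyperparameters are assumptions made for the example, not the authors’ settings.

```python
# A minimal sketch of training resume-domain word embeddings with Gensim.
# The corpus path and hyperparameters are illustrative assumptions, not the
# authors' exact configuration.
from gensim.models import Word2Vec

# Each resume is split into lines, and each line into tokens, e.g.
# [["bachelor", "of", "computer", "science"], ["python", "java", "sql"]]
with open("resume_lines.txt", encoding="utf-8") as f:  # hypothetical corpus file
    sentences = [line.lower().split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimensionality
    window=5,          # context window size
    min_count=2,       # drop rare tokens
    workers=4,
)
model.save("resume_w2v.model")

# Lines can then be mapped to sequences of vectors for the classifiers.
vector = model.wv["python"]             # 100-d vector for a token
similar = model.wv.most_similar("sql")  # sanity check on domain semantics
```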

Text sequence labeling is carried out after text block classification is complete. The sub-categories listed in Table 1 serve as the tag set for named entity recognition (NER), which tags phrases in each sentence. Four sequence labeling classifiers are trained and assessed in terms of performance and decoding speed: Bi-LSTM-CRF, Bi-GRU-CRF, IDCNN-CRF, and BLSTM-CNNs-CRF.
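
For a concrete picture of one such tagger, here is a minimal Bi-LSTM-CRF sketch in PyTorch using the third-party pytorch-crf package; the tag set size, layer widths, and random inputs are illustrative assumptions, not the paper’s configuration.

```python
# A minimal Bi-LSTM-CRF tagger sketch in PyTorch, using the third-party
# pytorch-crf package (pip install pytorch-crf). Sizes and the tag set are
# illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(hidden_dim, num_tags)   # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)  # tag transition modeling

    def loss(self, tokens, tags, mask):
        emissions = self.fc(self.lstm(self.embedding(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, tokens, mask):
        emissions = self.fc(self.lstm(self.embedding(tokens))[0])
        return self.crf.decode(emissions, mask=mask)  # best tag sequences

# Usage sketch: a batch of 2 padded sentences of length 5.
model = BiLSTMCRF(vocab_size=5000, num_tags=9)  # e.g. BIO tags over 4 entities
tokens = torch.randint(1, 5000, (2, 5))
tags = torch.randint(0, 9, (2, 5))
mask = torch.ones(2, 5, dtype=torch.bool)
nll = model.loss(tokens, tags, mask)
pred = model.decode(tokens, mask)
```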

To map the named entity candidates to standard attribute names, the authors used the k-means technique to cluster the detected named entities, computing the cosine similarity between them over Term Frequency–Inverse Document Frequency (TF-IDF) features.
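
A minimal sketch of this normalization step with scikit-learn might look as follows; the entity strings and standard attribute descriptions are made-up examples, not data from the paper.

```python
# A sketch of mapping extracted entity strings to standard attribute names
# via TF-IDF vectors, k-means clustering, and cosine similarity. The entity
# and attribute lists below are hypothetical examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

entities = ["B.Sc Computer Science", "Bachelor of CS",
            "Python programming", "Java developer"]
standard_attrs = ["education_degree", "skill"]
attr_descriptions = ["bachelor master degree science",            # hypothetical
                     "python java programming developer skill"]   # hypothetical

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(entities + attr_descriptions)
entity_vecs, attr_vecs = X[:len(entities)], X[len(entities):]

# Cluster the entity candidates, then assign each entity to the most
# similar standard attribute by cosine similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(entity_vecs)
sims = cosine_similarity(entity_vecs, attr_vecs)
for ent, cluster, sim in zip(entities, kmeans.labels_, sims):
    print(ent, "-> cluster", cluster, "->", standard_attrs[sim.argmax()])
```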

As demonstrated in Figure 2, Attention BLSTM outperformed the other text classifiers in line type classification, with F-1 values of 0.96, 0.93, 0.96, and 0.97 for header, content, metadata, and footer, respectively.

Figure 2: Precision (P), Recall (R), and F-1 Measure (F-1) of line type classification performed by five text classifiers for four generic layouts.

With regard to the line label classification for the six general information fields, as shown in Figure 3,

  1. Attention BLSTM and Adversarial LSTM outperform other classifiers in classifying long sentences with higher recalls and F-1 measures.
  2. Text-CNN outperforms the other text classifiers for short phrases.
  3. RCNN achieves better classification performance than Text-CNN on long sentences.

Figure 3: Precision (P), Recall (R), and F-1 Measure (F-1) of line label classification performed by five text classifiers for six general information fields.

Given Attention BLSTM’s favorable classification performance and strong robustness against both short and long sentences, the authors decided to use Attention BLSTM to segment text blocks in practical implementation.
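
For readers who want to see the shape of such a model, below is a minimal attention BiLSTM classifier sketch in PyTorch; the attention formulation (a learned scorer over BiLSTM hidden states) and all sizes are assumptions, not the paper’s exact architecture.

```python
# A minimal attention BiLSTM text classifier sketch in PyTorch. The attention
# is a learned per-timestep scorer over the BiLSTM hidden states; sizes are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class AttentionBLSTM(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)         # scores each timestep
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, tokens):
        h, _ = self.lstm(self.embedding(tokens))         # (B, T, 2H)
        weights = torch.softmax(self.attn(h), dim=1)     # (B, T, 1)
        context = (weights * h).sum(dim=1)               # attention pooling, (B, 2H)
        return self.fc(context)                          # class logits

# Usage sketch: classify a batch of 2 lines into 6 information fields.
model = AttentionBLSTM(vocab_size=5000, num_classes=6)
logits = model(torch.randint(1, 5000, (2, 20)))
pred = logits.argmax(dim=-1)
```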

Table 2 compares the results of the four sequence labeling classifiers on identifying resume facts. The authors drew three findings from this comparison.

  1. BLSTM-CNNs-CRF outperforms the other three sequence labeling classifiers.
  2. The greedy ID-CNN outperforms both Bi-LSTM and Bi-GRU paired with Viterbi decoding.
  3. Bi-GRU slightly outperforms Bi-LSTM in the NER task when paired with the CRF.

Regarding decoding speed, Table 2 also shows that IDCNN-CRF is the fastest.

Table 2: The F-1 Measures for resume facts identification by four sequence labeling classifiers.

Other findings provided by the authors include:

  1. The CNN layer in BLSTM-CNNs-CRF is a more effective text feature extractor than an LSTM layer.
  2. The proposed approach outperformed the Writing-Style and CHM baselines.
  3. BERT is the most advanced word representation method, outperforming randomized initialization, Word2Vec, and GloVe.

Paper 5: Resume Parsing Framework for E-recruitment (Sajid et al., 2022)

The disadvantage of supervised learning, as used in Paper 4, is that it requires a substantial amount of resume annotation, and gathering a large annotated dataset is difficult in the real world. To address this, the authors propose a resume parsing framework that tackles the shortcomings of earlier rule-based, supervised, and semantics-based methods. The proposed structure for resume processing is shown in Figure 4.

Figure 4: Proposed structure for resume processing.

Raw text is extracted from the resumes first; PDFBox, a utility for converting PDF files to text files, handles this step. Non-ASCII characters, missing or extra spaces, and punctuation marks are then cleaned up. For text block categorization, the blocks are classified with Boolean Naive Bayes (BNB). The entities are then extracted using BERT for NER.
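
A rough sketch of the cleaning and block classification steps is shown below, approximating Boolean Naive Bayes with scikit-learn’s BernoulliNB over binary word-presence features; the labeled blocks are made-up examples, and the BERT NER stage could similarly be prototyped with an off-the-shelf Hugging Face NER pipeline.

```python
# A sketch of the cleaning and block classification steps. Boolean Naive
# Bayes is approximated with scikit-learn's BernoulliNB over binary
# word-presence features; the labeled training blocks are made-up examples.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

def clean(text: str) -> str:
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII chars
    text = re.sub(r"[^\w\s]", " ", text)            # strip punctuation
    return re.sub(r"\s+", " ", text).strip()        # normalize whitespace

# Hypothetical labeled text blocks for training the block classifier.
blocks = ["bsc computer science 2018",
          "python java docker",
          "software engineer at acme"]
labels = ["education", "skills", "experience"]

vectorizer = CountVectorizer(binary=True)           # Boolean word features
X = vectorizer.fit_transform(clean(b) for b in blocks)
clf = BernoulliNB().fit(X, labels)

new_block = clean("M.Sc. Data Science, 2021!")
print(clf.predict(vectorizer.transform([new_block])))  # -> ['education']
```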

Unlike prior works, the named entities are further enriched with a custom-built ontology to address data sparsity, drawing on skills from online job portals, notably the Computer Science Ontology (CSO) and the European Skills, Competences, Qualifications and Occupations (ESCO) ontology. An ontology is a representation of a knowledge base that allows a semantic model of the data to be built around specific domain knowledge; it is also used to establish relationships between the various kinds of semantic knowledge in a domain.

The process of building an ontology includes:

  1. Ontology Building: specify the motivation, define informal competency questions, build the schema, and construct axioms.
  2. Skill Enrichment and Normalization (an example is shown in Figure 5).

Figure 5: Skill Enrichment.

The conceptual models of the skill ontology are shown in Figure 6. Circles and rectangles depict classes and instances, respectively, and edges represent the relationships between skills. Panel (a) shows the overall schema of the skill ontology, containing classes, subclasses, and instances, while panel (b) details the relationships between multiple instances. The SameAs property expresses the equivalence of two skills whose real-world representations may differ, for instance different surface forms of Microsoft Office.

Figure 6: Conceptual models of skill ontology.
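
To make the SameAs idea in Figure 6 concrete, here is a toy skill-ontology sketch built with rdflib; the namespace, class names, and skill instances are hypothetical illustrations, not the paper’s actual CSO/ESCO-derived ontology.

```python
# A toy skill-ontology sketch with rdflib, mirroring the SameAs idea from
# Figure 6. The namespace, classes, and instances are hypothetical
# illustrations, not the paper's actual CSO/ESCO-derived ontology.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import OWL, RDF, RDFS

SKILL = Namespace("http://example.org/skill/")  # hypothetical namespace
g = Graph()
g.bind("skill", SKILL)

# Class hierarchy: OfficeSoftware is a subclass of Skill.
g.add((SKILL.Skill, RDF.type, RDFS.Class))
g.add((SKILL.OfficeSoftware, RDFS.subClassOf, SKILL.Skill))

# Two surface forms of the same real-world skill, linked with owl:sameAs.
g.add((SKILL.MicrosoftOffice, RDF.type, SKILL.OfficeSoftware))
g.add((SKILL.MSOffice, RDF.type, SKILL.OfficeSoftware))
g.add((SKILL.MicrosoftOffice, RDFS.label, Literal("Microsoft Office")))
g.add((SKILL.MSOffice, OWL.sameAs, SKILL.MicrosoftOffice))

# Normalization lookup: resolve an extracted skill to its canonical node.
canonical = g.value(subject=SKILL.MSOffice, predicate=OWL.sameAs)
print(canonical)  # -> http://example.org/skill/MicrosoftOffice
```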

Other Related Papers:

The papers listed below are ones I’ve read but didn’t include in this or earlier articles.

  1. An Ontology-Based Information Extraction Approach for Résumés (Celik and Elci, 2012)
  2. Combination of Neural Networks and Conditional Random Fields for Efficient Resume Parsing (Ayishathahira et al., 2018)
  3. CATAPA Resume Parser: End to End Indonesian Resume Extraction (Lumban et al., 2019)
  4. Resume Extraction with Conditional Random Field Method (Yu et al., 2020)
  5. A Contextual Model for Information Extraction in Resume Analytics Using NLP’s Spacy (Channabasamma et al., 2021)

Conclusion:

  1. [Paper 4] Attention BLSTM and BLSTM-CNNs-CRF are the best candidates for text-block classification and NER tasks respectively.
  2. [Paper 4] Word embeddings significantly improve the performance of neural network classifiers and recognizers.
  3. [Paper 5] An ontology helps address data sparsity by incorporating skills from online job portals.

References:

  1. Zu, S., & Wang, X. (2019). Resume information extraction with a novel text block segmentation algorithm. International Journal on Natural Language Computing (IJNLC), 8, 29–48.
  2. Sajid, H., Kanwal, J., Bhatti, S. U. R., Qureshi, S. A., Basharat, A., Hussain, S., & Khan, K. U. (2022, January). Resume Parsing Framework for E-recruitment. In 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM) (pp. 1–8). IEEE.
