Resume Information Extraction (Part 1)

Research Papers That You Should Not Let Slip

Pei Seng Tan
ViTrox-Publication
8 min read · Mar 24, 2022



Recruitment has shifted rapidly from traditional job fairs to web-based e-recruiting platforms over the last 10 years. Well-known third-party e-recruitment sites, such as Monster and LinkedIn, publish more than 300 million resumes each year (Zu and Wang, 2019).

A resume is a formal document used by job seekers to demonstrate their skills and competencies to the human resources departments of targeted firms, or to headhunters, in order to secure desired positions. Information extraction is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents and other electronically represented sources.

Because of the large volume of personal data in various formats, efficiently accessing the content of each resume is a serious challenge, one that has piqued the interest of academicians (Chen et al., 2018). The motivation for this article is therefore to dissect the resume information extraction pipeline from the academicians' perspective.

Paper 1: Resume Information Extraction With Cascaded Hybrid Model (Yu et al., 2005)

According to studies on how human beings prepare their resumes, resume information can typically be defined as a two-layered hierarchical structure. The first layer consists of sequential general information blocks such as personal information, education, and so on. Inside each general information block, detailed information pieces can be further extracted. For instance, in the personal information block, details such as name, address, and email can be extracted. Table 1 shows the information types mentioned by the authors in the paper.

Table 1: Information types (Yu et al., 2005)

The authors proposed a cascaded information extraction framework to perform resume information extraction. Instead of searching the entire resume, as is done with a flat model, a resume is first segmented into consecutive blocks labeled with their information types, followed by the identification of detailed information within each specific block.
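The two-stage idea can be sketched in a few lines of Python. Everything below is illustrative: the heading strings, block labels, and regular expression are my own assumptions, not the paper's actual (statistical) models.

```python
import re

# Hypothetical section headings; the paper learns block boundaries
# statistically rather than matching fixed strings.
HEADINGS = {"personal information": "personal", "education": "education",
            "work experience": "work"}

def segment_into_blocks(lines):
    """First pass: split the resume into labeled blocks at heading lines."""
    blocks, label = {}, "personal"  # assume the resume opens with personal info
    for line in lines:
        key = line.strip().lower()
        if key in HEADINGS:
            label = HEADINGS[key]
            continue
        blocks.setdefault(label, []).append(line)
    return blocks

def extract_personal_details(block_lines):
    """Second pass: detailed extraction restricted to one block."""
    text = "\n".join(block_lines)
    email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return {"email": email.group(0) if email else None}

resume = ["Alice Tan", "alice@example.com", "Education",
          "B.Sc. Computer Science, 2015-2019"]
blocks = segment_into_blocks(resume)
details = extract_personal_details(blocks["personal"])
print(details)  # {'email': 'alice@example.com'}
```

The point of the cascade is visible even in this toy: the email regex only ever runs over the personal block, not the whole document.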

Figure 1 shows the whole pipeline of resume information extraction using the proposed cascaded framework.

Figure 1: Structure of cascaded hybrid models (Yu et al., 2005)

Experimenting with 1,200 Chinese resumes, the results reveal that the hierarchical structure of resume information extraction with the suggested cascaded framework enhances the average F-score of detailed information extraction, as shown in Table 2.

Table 2: Result with cascaded and flat models (Yu et al., 2005)

Combining the right extraction models at different layers is also effective for attaining excellent precision and recall. Based on the results shown in Table 3 and Table 4, the Hidden Markov Model (HMM) is good at handling general information extraction and detailed educational information extraction, whereas the Support Vector Machine (SVM) is recommended for detailed personal information extraction.

Table 3: General information extraction with different models (Yu et al., 2005)
Table 4: Detailed information extraction with different models (Yu et al., 2005)

Paper 2: Information Extraction from Resume Documents in PDF Format (Chen et al., 2016)

Similar to Paper 1, the authors adopt a hierarchical information extraction approach to extract data from resume documents.

Initially, a page is divided into blocks based on heuristic rules. Following that, a trained classification model, a Support Vector Machine (SVM), is used to categorize each block into pre-defined categories. Detailed information extraction is then treated as a sequence labeling problem and performed with a Conditional Random Fields (CRF) model.
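Assuming scikit-learn is available, the block classification stage might be prototyped as below. The training blocks and labels are toy data, and this sketch omits the layout features and the CRF step described in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training blocks with made-up labels; the paper trains on real resumes.
blocks = [
    "Name: Alice Tan Email: alice@example.com Phone: 012-3456789",
    "B.Sc. Computer Science, National University, GPA 3.8",
    "Software engineer at Acme Corp, built data pipelines",
    "M.Sc. Data Science, State University, thesis on NLP",
]
labels = ["personal", "education", "work", "education"]

# Character n-grams cope reasonably well with varied resume wording.
clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LinearSVC())
clf.fit(blocks, labels)
print(clf.predict(["Ph.D. Physics, Tech University"]))
```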

Figure 2 shows the flow of the entire resume information extraction pipeline.

Figure 2: Flow of resume information extraction pipeline (Chen et al., 2016)

Different from Paper 1, the authors proposed to take advantage of the structure and layout information of PDF documents to improve the accuracy of resume information extraction by feeding the classification model with two kinds of features: content-based and layout-based. The content-based features are usually texts, whereas the layout-based features refer to the font size, font style, bounding box, and line position information.
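As a rough illustration of mixing the two feature families, consider a hand-rolled feature dictionary for a single block. The field names and keyword list are my own inventions; the paper derives its layout features directly from the PDF.

```python
# Sketch: combining content-based and layout-based features for one text block.
def block_features(text, font_size, is_bold, y_top, line_count):
    words = text.lower().split()
    return {
        # content-based features
        "has_email": any("@" in w for w in words),
        "has_degree_word": any(w in {"b.sc.", "m.sc.", "ph.d."} for w in words),
        "num_words": len(words),
        # layout-based features
        "font_size": font_size,   # headings tend to use larger fonts
        "is_bold": is_bold,       # bold often marks section titles
        "y_top": y_top,           # vertical position on the page
        "line_count": line_count,
    }

feats = block_features("B.Sc. Computer Science, 2019", font_size=11,
                       is_bold=False, y_top=320.5, line_count=1)
print(feats["has_degree_word"])  # True
```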

The experimental results, verified on 400 resume documents, show that introducing layout-based features into the model significantly increases precision, recall, and F1-score, as shown in Table 5.

Table 5: Contribution of features (Chen et al., 2016)

Because the page is divided into blocks based on heuristic rules, the performance of the entire resume information extraction pipeline may vary with the block sizes. To confirm this, the authors conducted a study on the effect of block size.

The block size of the flat structure is the entire document. The hierarchical structure with small blocks treats each paragraph as a text block. The hierarchical structure with large blocks is the result of page segmentation based on the heuristic rules defined by the authors.

Going deeper, the heuristic rules are designed as a recursive bottom-up algorithm. The sizes of the blank spaces between lines are ranked, and small blocks are merged vertically and horizontally based on certain thresholds, bounded by constraints to prevent mis-segmentation or over-segmentation.
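A minimal sketch of such bottom-up merging, assuming each line comes with a vertical coordinate; the gap threshold here is an arbitrary stand-in for the ranked-gap logic in the paper:

```python
# Hypothetical bottom-up merging: lines whose vertical gap is below a
# threshold are merged into one block; larger gaps start a new block.
def merge_lines(lines, gap_threshold=12.0):
    """lines: (text, y_top) tuples sorted top-to-bottom; y grows downward."""
    blocks, current = [], [lines[0]]
    for prev, cur in zip(lines, lines[1:]):
        gap = cur[1] - prev[1]
        if gap <= gap_threshold:
            current.append(cur)          # small gap: same block
        else:
            blocks.append(current)       # big gap: close the block
            current = [cur]
    blocks.append(current)
    return [" ".join(t for t, _ in b) for b in blocks]

lines = [("Education", 100), ("B.Sc. CS", 110), ("GPA 3.8", 120),
         ("Work Experience", 150), ("Engineer, Acme", 160)]
print(merge_lines(lines))
# ['Education B.Sc. CS GPA 3.8', 'Work Experience Engineer, Acme']
```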

As shown in Table 6, the hierarchical structure outperforms the flat structure in terms of precision and F1-score, with only a minor loss in recall. This reinforces the view of the authors of Paper 1.

Table 6: The effect of the different block sizes (Chen et al., 2016)

In addition, the experimental results demonstrate that the hierarchical model with large blocks is better in terms of F1-score: choosing small blocks increases the number of mistakes in the block classification stage.

By further performing error analysis on the built pipeline, the authors managed to improve the performance by addressing specific problems in detailed information extraction, including:

  1. Job seekers whose undergraduate university and graduate university are the same.
  2. The name attribute not appearing in the personal block, but instead appearing as the title of the document.

Paper 3: A Two-Step Resume Information Extraction Algorithm (Chen et al., 2018)

Similar to Papers 1 and 2, the authors applied a two-step resume information extraction method to extract data from resume documents.

In the first step, the raw text of the resume is segmented into separate blocks. Different from the aforementioned papers, a unique feature, namely Writing Style, is generated from sentence syntax information modeling to further improve resume information extraction. The Writing Style encodes not only word and punctuation indexes but also word lexical attributes and the prediction results of classifiers. In other words, the Writing Style is a syntax feature describing the structure of a line in a resume. In the second step, multiple classifiers are adopted to recognize distinct facts within each separate resume block.

Figure 3 shows the proposed pipeline.

Figure 3: Proposed Pipeline (Chen et al., 2018)

To obtain the Writing Style features, the authors use Apache Tika to process the raw resume documents. A limitation of Tika is that the extracted raw text does not follow the original layout, and there is a lot of noise within the lines of each text file, such as continuous blanks, incorrect newlines, and missing essential spaces. Three types of operations, namely merging, splitting, and trimming, are therefore applied to the raw resume documents. More details about the heuristic rules for data cleaning are stated in Table 7.

Table 7: Heuristic rules for data cleaning (Chen et al., 2018)
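The merging and trimming operations might look roughly like this in Python; these rules are simplified stand-ins for the actual rules in Table 7, and the sentence-ending punctuation set is my own guess.

```python
import re

def trim(line):
    """Collapse runs of whitespace and strip the ends (continuous blanks)."""
    return re.sub(r"\s+", " ", line).strip()

def merge_broken_lines(lines):
    """Repair incorrect newlines: a line that does not end a sentence is
    joined with the next one."""
    merged, buffer = [], ""
    for line in lines:
        buffer = (buffer + " " + line).strip() if buffer else line
        if line.rstrip().endswith((".", ":", ";")) or not line.strip():
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)
    return merged

raw = ["Worked   on data", "pipelines at Acme.", "Skills:   Python,  SQL"]
cleaned = [trim(l) for l in merge_broken_lines(raw)]
print(cleaned)  # ['Worked on data pipelines at Acme.', 'Skills: Python, SQL']
```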

After the raw resume documents are processed, a simple multi-class classifier is trained to perform a task similar to named entity recognition. The labels include university name, job position name, department name, ID number, address, and date. The probability distribution over these classes is combined with the positions of phrases and symbols to form the Writing Style features. In other words, the class probability distribution is obtained to enrich the inputs of both the text block classification and resume facts identification steps.
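One way to picture a Writing Style feature vector is as a dictionary mixing phrase positions and symbols with the phrase classifier's probability outputs. The classifier below is a hard-coded stand-in returning made-up probabilities, purely to show the shape of the features.

```python
def phrase_class_probs(phrase):
    """Stand-in for the trained multi-class phrase classifier."""
    if any(ch.isdigit() for ch in phrase):
        return {"date": 0.7, "id_number": 0.2, "university": 0.1}
    return {"university": 0.6, "job_position": 0.3, "date": 0.1}

def writing_style_features(line):
    phrases = [p.strip() for p in line.split(",")]
    feats = {"num_phrases": len(phrases), "has_colon": ":" in line}
    for i, phrase in enumerate(phrases):
        # attach the class probability distribution of each phrase,
        # indexed by the phrase's position in the line
        for cls, prob in phrase_class_probs(phrase).items():
            feats[f"p{i}_{cls}"] = prob
    return feats

feats = writing_style_features("National University, 2015-2019")
print(feats["num_phrases"])  # 2
```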

In the step of text block classification, the authors define three types of different lines to facilitate the follow-up work as follows:

  1. Simple — The line is a short text and may contain a few blanks.
  2. KeyValue — The line follows the key and value structure, with comma punctuation.
  3. Complex — The line is a long text, which contains more information.
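A crude heuristic version of this three-way line typing could look like the following; the word-count thresholds are illustrative guesses, not the authors' rules.

```python
def line_type(line):
    """Classify a resume line as KeyValue, Simple, or Complex."""
    if ":" in line or ("," in line and len(line.split()) <= 8):
        return "KeyValue"   # key-and-value structure with punctuation
    if len(line.split()) <= 5:
        return "Simple"     # short text
    return "Complex"        # long text with richer content

print(line_type("Email: alice@example.com"))  # KeyValue
print(line_type("Alice Tan"))                 # Simple
print(line_type("Led a team of five engineers building data pipelines"))  # Complex
```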

The results displayed in Tables 8–10 reveal that the Writing Style features help to improve the performance of text block classification when compared to PROSPECT (Singh et al., 2010) and CHM (Yu et al., 2005).

Table 8: Education Block Classification (Chen et al., 2018)
Table 9: Work Experience Block Classification (Chen et al., 2018)
Table 10: Basic Information Block Classification (Chen et al., 2018)

In the step of resume facts identification, the lines with key-value structures are considered candidate attributes. The cosine similarity is computed based on Term Frequency-Inverse Document Frequency (TF-IDF) vectors, and K-means is applied to cluster the attributes for standard attribute name matching.
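Using scikit-learn, this clustering step can be sketched as follows. The attribute strings are toy examples; note that TfidfVectorizer produces L2-normalized vectors, so Euclidean K-means behaves much like cosine-based clustering.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy candidate attribute names pulled from key-value lines; the intent is
# that variants of the same attribute land in the same cluster.
attributes = ["e-mail", "email address", "mail",
              "phone", "telephone number", "phone no"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vec.fit_transform(attributes)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```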

The proposed method shows significant improvement in terms of the annotation hours required. As presented in Figure 4, the proposed system does not need a large manually annotated training set, which saves a lot of human effort and time, good news for small project teams.

Figure 4: Hours of annotations (Chen et al., 2018)

Conclusion

The key ideas that I got from these three papers on resume information extraction include:

  1. [Paper 1] The hierarchical cascaded model structure performs better than the flat model structure.
  2. [Paper 2] The layout-based features help to improve the performance of resume information extraction.
  3. [Paper 2] The size of the resume blocks matters, as it affects the performance of block classification.
  4. [Paper 3] Syntax features, including word lexical attributes and the prediction results of text classifiers, help to improve the performance of resume information extraction.
  5. [Paper 3] The use of a clustering algorithm, like K-means, reduces the human work in annotation.

References:

  1. Yu, K., Guan, G., & Zhou, M. (2005). Resume information extraction with cascaded hybrid model. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05) (pp. 499–506).
  2. Chen, J., Gao, L., & Tang, Z. (2016). Information extraction from resume documents in pdf format. Electronic Imaging, 2016(17), 1–8.
  3. Chen, J., Zhang, C., & Niu, Z. (2018). A two-step resume information extraction algorithm. Mathematical Problems in Engineering, 2018.
  4. Zu, S., & Wang, X. (2019). Resume information extraction with a novel text block segmentation algorithm. International Journal on Natural Language Computing, 8, 29–48.
