Case Study: NLP Based Resume Parser Using BERT in Python
The main objective of the Natural Language Processing (NLP)-based Resume Parser in Python project is to extract the required information about candidates without having to go through every resume manually, which makes the process more time- and energy-efficient.
Resumes are commonly presented in PDF or MS Word format, and there is no standard structure for creating one. As a result, each candidate's resume is laid out differently.
It is easy for humans to read and understand such unstructured, or rather differently structured, data because of our experience and understanding, but machines don’t work that way. Machines cannot interpret it as easily as we can.
Converting a CV/resume into formatted text or structured information, so that it is easy to review, analyze, and understand, is essential when dealing with large volumes of data. Taking an unstructured resume/CV as input and producing structured information as output is known as resume parsing.
A resume parser is an NLP model that can extract information such as Skill, University, Degree, Name, Phone, Designation, Email, social media links, Nationality, etc., irrespective of the resume's structure.
To create an NLP model that can extract this information from resumes, we have to train it on a suitable dataset. Creating such a dataset is difficult if every resume has to be tagged manually.
To reduce the time required to create the dataset, we used various techniques and Python libraries that helped us identify the required information in resumes. However, not everything can be extracted by a script, so we had to do a lot of manual work too. For manual tagging we used Doccano, which was very helpful in reducing the time spent on it.
Below are the approaches we used to create a dataset.
Approaches to create Dataset
- Natural Language Processing (NLP)
- Python libraries/Python Packages
- Predictive Analytics
- Regular Expression/Rule-Based Parsing
- Named Entity Recognition (NER)
- spaCy’s NER
- BERT NER
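As an illustration of the regular-expression/rule-based approach in the list above, here is a minimal sketch of how a script might pre-tag e-mail addresses and phone numbers before manual review. The patterns, the function name pre_tag, and the label names are assumptions made for this example, not the exact rules used in the project.

```python
import re

# Illustrative patterns only; real resumes need broader rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional "+", a digit, then digits or common separators, ending in a digit.
# Newlines are excluded so a match never runs across two lines.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def pre_tag(text):
    """Return sorted (start, end, label) spans for e-mails and phone numbers."""
    spans = []
    for match in EMAIL_RE.finditer(text):
        spans.append((match.start(), match.end(), "EMAIL"))
    for match in PHONE_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        # Date ranges such as "2015 - 2019" look like phone numbers to the
        # pattern, so keep only candidates with a plausible digit count.
        if 10 <= len(digits) <= 15:
            spans.append((match.start(), match.end(), "PHONE"))
    return sorted(spans)
```

Character offsets like these can be reviewed and corrected by hand afterwards, which is where rule-based pre-tagging saves time.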
PDF to text conversion
Converting PDF data to text looks easy, but converting resume data to text is not an easy task at all.
We tried various open-source Python libraries, including pdf_layout_scanner, pdfplumber, python-pdfbox, pdftotext, PyPDF2, pdftotext-layout, and pdfminer.six (with its modules pdfminer.pdfparser, pdfminer.pdfdocument, pdfminer.pdfpage, pdfminer.converter, and pdfminer.pdfinterp). Each has its own pros and cons. Another challenge we faced was converting multi-column resume PDFs to text.
After trying many approaches, we concluded that python-pdfbox worked best across the different types of PDF resumes.
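To show why multi-column resumes are hard, here is a minimal sketch of one way to restore reading order for a two-column page. It assumes the extraction library can return each word with its position (for example, dicts with "text", "x0" and "top" keys, similar to what pdfplumber's extract_words() produces). The function name words_to_text and the fixed split at the middle of the page are assumptions for this example; this is not how python-pdfbox works internally.

```python
from collections import defaultdict


def words_to_text(words, page_width, line_tol=3.0):
    """Rebuild reading order for a two-column page.

    words: iterable of dicts with "text", "x0" (left edge) and "top" keys.
    The page is split at its horizontal middle, which is a simplification.
    """
    middle = page_width / 2
    columns = {"left": [], "right": []}
    for word in words:
        side = "left" if word["x0"] < middle else "right"
        columns[side].append(word)

    lines_out = []
    for side in ("left", "right"):
        lines = defaultdict(list)
        for word in columns[side]:
            # Words whose top edges fall in the same bucket form one line.
            lines[int(word["top"] // line_tol)].append(word)
        for key in sorted(lines):
            row = sorted(lines[key], key=lambda w: w["x0"])
            lines_out.append(" ".join(w["text"] for w in row))
    return "\n".join(lines_out)
```

A naive top-to-bottom read of the same words would interleave the two columns line by line, mixing, say, a skills list with an unrelated job description.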
DOCX to text conversion
At first we used the python-docx library, but we later found that table data was missing from the output.
Our second approach was to use the Google Drive API. The results looked good, but it made us dependent on Google's resources, and token expiration was another problem.
Eventually we returned to our original python-docx approach and added code to retrieve table data. This gives excellent output and removes the dependency on the Google platform. Note that e-mail addresses were sometimes not fetched either, and we had to fix that too.
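A minimal sketch of reading both paragraphs and table cells with python-docx might look like the following. The helper names document_text and docx_to_text are hypothetical, and this is one possible way to add table retrieval rather than the project's actual code. Note that it emits all paragraphs first and all tables afterwards, so the original interleaving of the document is not preserved.

```python
def document_text(doc):
    """Collect paragraph text, then table cell text, from a python-docx Document."""
    parts = [p.text for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        for row in table.rows:
            cells = []
            for cell in row.cells:
                text = cell.text.strip()
                # A horizontally merged cell is reported once per grid column,
                # so skip a cell whose text repeats the previous one. This also
                # drops genuinely duplicated neighbours, which is a trade-off.
                if text and (not cells or cells[-1] != text):
                    cells.append(text)
            if cells:
                parts.append(" | ".join(cells))
    return "\n".join(parts)


def docx_to_text(path):
    # Imported here so document_text can be exercised without python-docx.
    from docx import Document

    return document_text(Document(path))
```

Calling docx_to_text("resume.docx") (the file name is only a placeholder) would then return the paragraph text followed by one line per table row.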
It is easy to find addresses that share a similar format (for example, US or European addresses), but making this work for any address in the world is very difficult, especially for Indian addresses. Some resumes contain only a location, while others contain a full address.
We tried various Python libraries for extracting address information: geopy, address-parser, address, pyresparser, pyap, geograpy3, address-net, geocoder, and pypostal.
Finally, we used a combination of static code and the pypostal library, which gave us higher accuracy.
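The combination of pypostal and static rules could be sketched roughly as follows. pypostal's parse_address returns a list of (value, label) pairs; the grouping helper, the is_full_address rule, and the function names are assumptions for illustration, not the project's actual static code. Running parse_resume_address requires the libpostal C library and the postal Python bindings to be installed.

```python
def group_components(pairs):
    """Turn [(value, label), ...] into {label: "value value ..."}."""
    grouped = {}
    for value, label in pairs:
        grouped.setdefault(label, []).append(value)
    return {label: " ".join(values) for label, values in grouped.items()}


def is_full_address(components):
    """Hypothetical static rule: a street-level part or postcode means a full address."""
    return any(key in components for key in ("house_number", "road", "postcode"))


def parse_resume_address(text):
    # Imported here because it needs the libpostal C library at import time.
    from postal.parser import parse_address

    return group_components(parse_address(text))
```

A rule like is_full_address is one way to separate resumes that give only a location (city, country) from those that give a complete address.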
Manual label tagging is far more time-consuming than one might expect. We not only have to review all the data tagged by the libraries, but also check whether each tag is accurate, remove tags that are wrong, add tags that the script missed, and so on.
We used Doccano, which is an efficient tool for creating a dataset where manual tagging is required. We highly recommend it.
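Once tagging is finished, the annotations have to be turned into token-level labels for NER training. The sketch below assumes a Doccano JSONL export in which each line has a "text" field and a "label" (or, in some versions, "labels") field holding [start, end, label] spans; check the export of your Doccano version before relying on these field names. The whitespace tokenization and the function names are simplifications chosen for this example.

```python
import json
import re


def spans_to_bio(text, spans):
    """Whitespace-tokenize text and assign a BIO tag to each token."""
    tokens, tags = [], []
    started = set()  # indices of spans that already received a B- tag
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for i, (start, end, label) in enumerate(spans):
            # A token is tagged if it overlaps the span at all, so trailing
            # punctuation glued to a word does not break the match.
            if match.start() < end and match.end() > start:
                tag = ("I-" if i in started else "B-") + label
                started.add(i)
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


def read_doccano_jsonl(lines):
    """Yield (tokens, tags) for each non-empty JSONL line."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        spans = record.get("label") or record.get("labels") or []
        yield spans_to_bio(record["text"], spans)
```

Passing an open export file to read_doccano_jsonl yields one (tokens, tags) pair per resume.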
Nationality tagging can be tricky because the same word can also name a language. For example, Chinese is both a nationality and a language, so we had to be careful while tagging nationality.
Limitations in creating Dataset
The parser cannot extract information such as:
- Year of graduation
- Strengths and weaknesses
- Objective / Career Objective: the parser returns the objective only if the text appears directly below the Objective heading; otherwise the field is left blank
- CGPA/GPA/Percentage/Result: regular expressions can extract a candidate’s results to some extent, but not with 100% accuracy
- Date of birth:
- A resume contains many dates, so we cannot easily tell which one is the date of birth.
- One possible approach is to take the date with the earliest year, but if the candidate has not mentioned a date of birth in the resume, this returns the wrong output.
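The last two limitations can be made concrete with a short sketch. The regular expression for results and the earliest-year heuristic below are illustrative assumptions (including the names find_results and guess_birth_year), not the project's actual rules.

```python
import re

# Matches e.g. "CGPA: 8.7", "GPA 3.6", "Percentage - 82".
RESULT_RE = re.compile(
    r"\b(?P<kind>CGPA|GPA|Percentage)\s*[:\-]?\s*(?P<value>\d{1,3}(?:\.\d{1,2})?)",
    re.IGNORECASE,
)
# Four-digit years from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def find_results(text):
    """Return (kind, value) pairs for every result-like mention."""
    return [
        (m.group("kind").upper(), float(m.group("value")))
        for m in RESULT_RE.finditer(text)
    ]


def guess_birth_year(text):
    """Earliest-year heuristic: returns the smallest year found, or None."""
    years = [int(y) for y in YEAR_RE.findall(text)]
    return min(years) if years else None
```

When the resume states a date of birth, the earliest year is usually the birth year. When it does not, the heuristic silently returns some other year, such as the start of a degree, which is exactly the failure mode described above.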
Training the model
Instead of creating a model from scratch, we used a pre-trained BERT model so that we could leverage its existing NLP capabilities.
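One practical detail when training a BERT model for NER is that BERT splits words into sub-word tokens, so word-level tags have to be aligned to those tokens. The sketch below shows a common way to do this, assuming the tokenizer can report which word each token came from (Hugging Face fast tokenizers expose this as word_ids()). The article does not say which toolkit was used, so treat the function name and label scheme as assumptions.

```python
IGNORE_INDEX = -100  # label id that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids, word_labels, label_to_id):
    """Map word-level tags onto sub-word tokens.

    word_ids: for each sub-word token, the index of the word it came from,
              or None for special tokens such as [CLS] and [SEP].
    Only the first sub-word of each word keeps the label; the rest are ignored
    when the loss is computed.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            aligned.append(label_to_id[word_labels[word_id]])
        else:
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned
```

Labelling only the first sub-word is one convention; another is to repeat the word's label on every sub-word, and either can work as long as training and evaluation agree.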
Our NLP-based Resume Parser demo is available online here for testing.
Currently, the demo is capable of extracting Name, Email, Phone Number, Designation, Degree, Skills, and University details, as well as various social media links such as GitHub, YouTube, LinkedIn, Twitter, Instagram, and Google Drive.
Flaws in the result of NLP Model
Fetching the address
- Even after tagging addresses properly in the dataset, we were not able to get a proper address in the output.
- One major reason is that only 10% of the resumes we used to create the dataset contained an address.
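An under-represented label like this can be spotted before training by measuring, for each entity type, the share of resumes that contain it at least once. The sketch below assumes each resume is available as a list of BIO tags; the function name label_coverage is hypothetical, and any numbers it prints depend entirely on the dataset passed in.

```python
from collections import Counter


def label_coverage(documents):
    """Fraction of documents that contain each entity label at least once.

    documents: list of per-resume tag lists, e.g. ["B-NAME", "I-NAME", "O"].
    """
    counts = Counter()
    for tags in documents:
        # Strip the B-/I- prefix and count each label once per document.
        labels = {tag.split("-", 1)[1] for tag in tags if tag != "O"}
        counts.update(labels)
    total = len(documents)
    return {label: n / total for label, n in counts.items()}
```

Labels with low coverage are candidates for collecting more examples before expecting the model to extract them reliably.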
Future work
- Improve the accuracy of the model to extract all the data.
- Test the model further and make it work on resumes from all over the world.
- Make the resume parser multi-lingual
- Improve the dataset to extract more entity types such as Address, Date of birth, Companies worked for, Working Duration, Graduation Year, Achievements, Strengths and weaknesses, Nationality, Career Objective, and CGPA/GPA/Percentage/Result.
- Make Resume Parser available via API
Originally published at NLP Based Resume Parser Using BERT In Python.