Information Extraction from CV

Priya Sarkar
5 min read · Mar 5, 2018

People from different fields and backgrounds have varied personalities, and their CV writing styles vary just as much. They have worked on different types of projects, and each of them has their own way of writing these down, which makes every CV unique.

I once worked with an HR consulting startup. Every day they crawled hundreds of CVs from the internet. After gathering the CVs, their calling executives would summarise each CV, enter specific details into their database and then call the candidate for job consulting. An executive took around 10–15 minutes per CV to summarise it and enter the details into the database. My job was to automate this process.

I worked on this project with my friend Abhinav Garg. You can find the GitHub link at the end of the article. The program can read CVs in several file formats stored inside the resume folder. It uses basic Natural Language Processing techniques such as word parsing, chunking and a regex parser. Running it lets you capture information such as name, email ID, address, educational qualification and experience from a large number of documents in seconds.

Step 1: Converting miscellaneous resume formats into text format

The standard formats in which people write their resumes are pdf, rtf or simple docx. For Python to extract information from them, our first step is to convert them to .txt format.

We use pdfminer to convert pdf to text.

We use Rtf15Reader to convert rtf to text.

We use docx to convert docx files to text.
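The conversion step could be organised roughly as follows. This is a minimal sketch, not the project's actual code: it assumes the pdfminer.six, python-docx and pyth packages are installed, and the helper names (pdf_to_text, docx_to_text, rtf_to_text, read_txt, convert_to_text) are hypothetical. The third-party imports are kept inside the functions so that only the converter you actually call needs to be installed.

```python
import os


def pdf_to_text(path):
    # Assumes pdfminer.six, which provides the high-level extract_text helper.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def docx_to_text(path):
    # Assumes python-docx; joins the text of every paragraph in the document.
    import docx
    document = docx.Document(path)
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def rtf_to_text(path):
    # Assumes the pyth package, which ships Rtf15Reader and PlaintextWriter.
    from pyth.plugins.rtf15.reader import Rtf15Reader
    from pyth.plugins.plaintext.writer import PlaintextWriter
    with open(path, "rb") as handle:
        document = Rtf15Reader.read(handle)
    return PlaintextWriter.write(document).getvalue()


def read_txt(path):
    # Plain text needs no conversion.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return handle.read()


# Hypothetical mapping from file extension to converter.
CONVERTERS = {
    ".pdf": pdf_to_text,
    ".docx": docx_to_text,
    ".rtf": rtf_to_text,
    ".txt": read_txt,
}


def convert_to_text(path):
    """Pick a converter based on the file extension and return plain text."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in CONVERTERS:
        raise ValueError(f"Unsupported resume format: {extension!r}")
    return CONVERTERS[extension](path)
```

Dispatching on the extension keeps the rest of the pipeline independent of the input format: every later step only ever sees a plain string.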

Step 2: Breaking the entire document into sentences, lines and tokens

We break the entire document on the basis of new lines, tokenise each line and tag the tokens with their POS tags as (<word>, <tag>) pairs. We name this variable lines.

We create another variable named sentences, which does the same as above; the only difference is that it is created using a sentence tokenizer. Finally, we create our last variable, tokens, which is a list of tokenised sentences.
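The three variables could be built along these lines. This is a sketch under assumptions: it uses NLTK's word_tokenize, sent_tokenize and pos_tag by default (which need the NLTK tokenizer and tagger data downloaded), and the function name preprocess and its optional parameters are hypothetical, added so the tokenizers and tagger can be swapped out.

```python
def preprocess(text, word_tokenize=None, sent_tokenize=None, pos_tag=None):
    """Return (lines, sentences, tokens) for one resume's text."""
    if None in (word_tokenize, sent_tokenize, pos_tag):
        # Fall back to NLTK for whichever piece was not supplied.
        import nltk
        word_tokenize = word_tokenize or nltk.word_tokenize
        sent_tokenize = sent_tokenize or nltk.sent_tokenize
        pos_tag = pos_tag or nltk.pos_tag

    # lines: split on new lines, tokenise each line, tag as (word, tag) pairs.
    raw_lines = [line.strip() for line in text.split("\n") if line.strip()]
    lines = [pos_tag(word_tokenize(line)) for line in raw_lines]

    # tokens: a list of tokenised sentences.
    tokens = [word_tokenize(sentence) for sentence in sent_tokenize(text)]

    # sentences: the same tagging as lines, but per sentence.
    sentences = [pos_tag(sentence_tokens) for sentence_tokens in tokens]

    return lines, sentences, tokens
```

Keeping both a line-based and a sentence-based view is useful because CVs mix free-flowing prose with short, list-like lines that a sentence tokenizer tends to merge.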

Step 3: Information Extraction

Extracting Email address and Phone number from CVs

Email addresses and phone numbers follow well-defined patterns, so we use regular expressions to capture them in the CV.

But even then we sometimes capture noise such as date values (2012–09–12), year ranges (1990–2000) or pin codes, so we need to clean our matches.
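A rough sketch of this capture-then-clean idea is shown below. The patterns are illustrative assumptions rather than the project's actual expressions: the email pattern is deliberately loose, and a phone candidate is kept only if it is not a date or year range and has 10 to 13 digits (which also rules out 6-digit pin codes).

```python
import re

# Loose email pattern: local part, "@", then a dotted domain.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Phone candidates: digits separated by spaces, dots, brackets or dashes.
PHONE_CANDIDATE_RE = re.compile(r"\+?\d[\d ().–-]{6,}\d")

# Patterns that look like phone numbers but are not.
NOISE_RES = [
    re.compile(r"\d{4}[-–]\d{1,2}[-–]\d{1,2}"),              # dates, e.g. 2012-09-12
    re.compile(r"(?:19|20)\d{2} *[-–] *(?:19|20)\d{2}"),     # year ranges, e.g. 1990-2000
]


def extract_emails(text):
    return EMAIL_RE.findall(text)


def extract_phones(text):
    phones = []
    for match in PHONE_CANDIDATE_RE.finditer(text):
        candidate = match.group().strip()
        # Drop dates and year ranges.
        if any(noise.fullmatch(candidate) for noise in NOISE_RES):
            continue
        # Keep only candidates with a plausible number of digits.
        digit_count = len(re.sub(r"\D", "", candidate))
        if 10 <= digit_count <= 13:
            phones.append(candidate)
    return phones
```

The digit-count bounds are an assumption suited to Indian mobile numbers with an optional country code; other locales would need different limits.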

Pattern Used for capturing Experience:

People usually use the term “experience” when they explicitly mention their years of experience in the CV. So, within the entire CV, we look for lines which contain the term “experience” and capture the cardinal number from the same sentence.
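Given POS-tagged lines in the (<word>, <tag>) form described earlier, this lookup might be sketched as below. The function name extract_experience is hypothetical, and the sketch assumes Penn Treebank tags, where cardinal numbers are tagged CD.

```python
def extract_experience(tagged_lines):
    """Return (numbers, line_text) for lines that mention 'experience'."""
    results = []
    for line in tagged_lines:
        words = [word.lower() for word, _ in line]
        if "experience" not in words:
            continue
        # Cardinal numbers carry the CD tag in the Penn Treebank tag set.
        numbers = [word for word, tag in line if tag == "CD"]
        if numbers:
            results.append((numbers, " ".join(word for word, _ in line)))
    return results
```

Lines that mention experience without a number (for example a section heading) are skipped, since there is nothing to capture from them.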

Pattern Used for capturing Name:

We use a regex parser to capture potential names from the CV. Names are made up of two or three types of noun tags (i.e. NN, NNP etc.). So we create a parser which searches the entire CV and outputs word phrases that are in the form of 3 or more continuous nouns.

But we can get several candidates in the form of 3 continuous nouns; for example, an address can also be captured. We therefore downloaded a file containing potential Indian names, and we check the candidates captured by the regex parser against it.
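The two stages, collecting runs of nouns and then checking them against a name list, can be sketched in plain Python. Note that this stands in for the chunk parser rather than reproducing it: it groups consecutive noun-tagged tokens with itertools.groupby, and the function names and the known_names argument are hypothetical.

```python
from itertools import groupby


def noun_runs(tagged_tokens, min_len=3):
    """Return runs of at least min_len consecutive noun-tagged words."""
    runs = []
    for is_noun, group in groupby(tagged_tokens, key=lambda pair: pair[1].startswith("NN")):
        words = [word for word, _ in group]
        if is_noun and len(words) >= min_len:
            runs.append(words)
    return runs


def pick_names(tagged_lines, known_names, min_len=3):
    """Keep only noun runs that contain at least one word from the name list."""
    known = {name.lower() for name in known_names}
    names = []
    for line in tagged_lines:
        for run in noun_runs(line, min_len):
            if any(word.lower() in known for word in run):
                names.append(" ".join(run))
    return names
```

Here known_names plays the role of the downloaded name file; an address made of three nouns is dropped because none of its words appear in that list.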

Alternate approach: if your names can be recognised by a Named Entity Recogniser (NER) tagger, simply use the tagger to identify names from potential sentences.

Pattern Used for capturing Qualification details:

We created this function to find the details of a particular qualification, for example “CA”, “ICSE” or “B.Tech”. The function takes as input the qualification we are looking for.

Within the entire document, we search for only those lines which contain either of the input terms, D1 or D2. Such a line becomes a suitable candidate for finding the other details of that qualification, such as university name, year of passing and marks received.
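A line filter of this kind might look like the sketch below. It is an assumption-laden illustration: the function name find_qualification_lines is hypothetical, the search terms play the role of D1 and D2, and matching is case-insensitive on whole terms so that “CA” does not match inside a word such as “Education”.

```python
import re


def find_qualification_lines(lines, *terms):
    """Return the lines that contain any of the given qualification terms."""
    if not terms:
        return []
    # Whole-term match: the term must not be glued to other word characters.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(term) for term in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [line for line in lines if pattern.search(line)]
```

Passing two spellings, for example "B.Tech" and "BTech", covers the common case where candidates write the same qualification in different ways.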

To capture the qualification details (university name, percentage, year) from the CV, we follow an approach similar to the one we used for names. We use a regex parser to capture the name of the institute, so we first try to identify patterns in university names. A university name can start with “The” or with any proper noun, for example The Institute Of Chartered Accountancy or The Bishop’s College. It might also contain a connecting word like ‘of’ or ‘and’ followed by proper nouns. Finally, all university names have one thing in common: they contain a word like ‘university’, ‘college’, ‘vidyapeeth’ or ‘institute’. So we create a dictionary containing all such words and match it against the potential university name candidates. Our regex parser is built from these patterns.
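The original parser grammar is not reproduced here. As a rough illustration of the same idea, the sketch below works on raw text and treats a capitalised word as a stand-in for a proper-noun tag; the keyword set and the function name find_institutes are assumptions.

```python
import re

# Words that typically appear in an institute's name (extend as needed).
INSTITUTE_WORDS = {"university", "college", "vidyapeeth", "institute"}

# A capitalised word stands in for a proper noun; "The" is covered by this too.
PROPER = r"[A-Z][\w'’]*"

# One proper noun, then more proper nouns, each optionally preceded by of/and.
CANDIDATE_RE = re.compile(rf"{PROPER}(?:\s+(?:(?:of|and)\s+)?{PROPER})*")


def find_institutes(line):
    """Return capitalised phrases in the line that contain an institute word."""
    institutes = []
    for match in CANDIDATE_RE.finditer(line):
        candidate = match.group()
        if any(word.lower() in INSTITUTE_WORDS for word in candidate.split()):
            institutes.append(candidate)
    return institutes
```

The keyword check is what separates real institute names from other capitalised phrases on the same line, such as a degree name or a city.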

To capture the graduation year, we use a regular expression within the same line.

We can similarly create a regex to capture the marks. You can use similar concepts to capture several other characteristics from the CV.
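For illustration, the year and marks patterns could be as simple as the sketch below. Both expressions are assumptions: years are limited to 1900–2099, and marks are assumed to be written as a percentage.

```python
import re

# Four-digit years starting with 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

# Percentages such as 82%, 82.5% or 82.5 %.
MARKS_RE = re.compile(r"\b\d{1,3}(?:\.\d+)?\s*%")


def extract_years(line):
    return YEAR_RE.findall(line)


def extract_marks(line):
    # Normalise "82.5 %" to "82.5%".
    return [re.sub(r"\s+", "", match) for match in MARKS_RE.findall(line)]
```

Marks written as a CGPA or a grade would need their own pattern, built the same way.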

You can find the entire code at:

https://github.com/divapriya/Language_Processing

Priya Sarkar

Data Scientist at JP Morgan, graduate of IIT Bombay. Love creating new products and trying new technology. https://www.linkedin.com/in/priya-sarkar-60248171