Smart Recruitment — Cracking Resume Parsing through Deep Learning (Part-I)

Ankit Gupta
Feb 19 · 4 min read

AI has helped us solve problems that were earlier considered either unsolvable or too computationally expensive. Recruitment, or Talent Acquisition, is currently being disrupted by AI. One such ‘hard to crack’ problem in this domain is Resume Parsing, which, if solved with precision, could save recruiters considerable time on the repetitive, tedious task of manually screening resumes.

We at Skillate are building Smart Recruitment technology to help identify, engage and hire the best candidates using AI. Resume Parsing is the first step towards achieving this goal.

It took us about a year and a half to develop a state-of-the-art (SoTA) Resume Parser that achieves more than 90% accuracy even on the most complex resumes (after testing on thousands of resumes). As you can guess by now, solving this problem with such high accuracy required us to leverage cutting-edge Deep Learning technologies. In this post, we would like to share the knowledge and experience gained while building this Artificially Intelligent piece of software.

Resume Parsing, formally speaking, is the conversion of a free-form CV/resume document into structured information — suitable for storage, reporting, and manipulation by a computer.
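To make "structured information" concrete, here is a minimal sketch of what a parsed record could look like. The class name and every field name are hypothetical, chosen only for illustration; the article does not describe the schema Skillate actually uses.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ParsedResume:
    # Hypothetical fields: the article does not specify a schema.
    name: str
    email: str
    skills: List[str] = field(default_factory=list)
    employers: List[str] = field(default_factory=list)


def to_json(resume: ParsedResume) -> str:
    """Serialize the structured record so it can be stored or reported on."""
    return json.dumps(asdict(resume), sort_keys=True)


record = ParsedResume(
    name="Jane Doe",
    email="jane@example.com",
    skills=["Python", "NLP"],
    employers=["Acme Corp"],
)
print(to_json(record))
```

Once a resume is in a form like this, it can be filtered, searched, and compared like any other structured data, which is what a free-form document does not allow.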

The problem of Resume Parsing can be broken into two major subproblems: (1) Text Extraction and (2) Information Extraction. To build a SoTA resume parser, both need to be solved with the highest possible accuracy. In this post, we will talk about Text Extraction; Information Extraction will be discussed in upcoming articles.

Text Extraction

The Challenges

Almost everyone uses a unique template for their CV. Even templates that seem indistinguishable to the human eye are processed differently by a computer. This creates the possibility of hundreds of thousands of templates in which resumes are written worldwide. Not all templates are straightforward to read from. For example, a resume can contain tables, graphics, and columns, and every such entity needs to be read in a different manner. It is therefore easy to conclude that rule-based parsers do not stand a chance and that an intelligent algorithm is required to extract text in a meaningful manner from raw documents (pdf, doc, docx, etc.).
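A toy example shows why layout matters. Suppose a page has been reduced to words with coordinates (the tuple format and the fixed `split_x` boundary are assumptions made for this sketch, not part of any particular library). Reading strictly top to bottom interleaves a two-column resume, while reading one column at a time keeps each section together.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (x, y, text); y grows downward


def _reading_key(word: Word) -> Tuple[float, float]:
    # Sort by vertical position first, then horizontal position.
    return (word[1], word[0])


def naive_order(words: List[Word]) -> str:
    """Read top-to-bottom, left-to-right, ignoring columns entirely."""
    return " ".join(text for _, _, text in sorted(words, key=_reading_key))


def column_aware_order(words: List[Word], split_x: float) -> str:
    """Read everything left of split_x first, then everything right of it."""
    left = sorted((w for w in words if w[0] < split_x), key=_reading_key)
    right = sorted((w for w in words if w[0] >= split_x), key=_reading_key)
    return " ".join(text for _, _, text in left + right)


# A tiny two-column page: skills on the left, experience on the right.
page = [
    (0, 0, "Skills:"), (300, 0, "Experience:"),
    (0, 20, "Python"), (300, 20, "Acme"),
]

print(naive_order(page))
print(column_aware_order(page, split_x=150))
```

The naive ordering mixes the two section headings before either section's content, which is the kind of output that makes downstream information extraction unreliable. A real page would also need the column boundary to be detected rather than supplied by hand.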

Our Approach

We explored several libraries for extracting text from pdf, doc, docx, and similar documents, but none of them provided the quality of results we were aiming for. It became evident that text extraction could not be solved by a single type of algorithm alone.
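As an illustration of what direct extraction looks like, and where it falls short, here is a minimal sketch that reads paragraph text from a .docx file using only the Python standard library. It relies on the fact that a .docx file is a zip archive whose main content lives in `word/document.xml`. This is not one of the libraries we evaluated, and `make_sample_docx` is a hypothetical helper that builds a stripped-down archive purely for demonstration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(data: bytes) -> list:
    """Return the text of each non-empty <w:p> paragraph, in document order."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(t.text or "" for t in paragraph.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_sample_docx() -> bytes:
    """Build a tiny in-memory, docx-like archive (demonstration only)."""
    body = (
        "<w:p><w:r><w:t>Jane Doe</w:t></w:r></w:p>"
        "<w:tbl><w:tr>"
        "<w:tc><w:p><w:r><w:t>Skills</w:t></w:r></w:p></w:tc>"
        "<w:tc><w:p><w:r><w:t>Python</w:t></w:r></w:p></w:tc>"
        "</w:tr></w:tbl>"
    )
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>' + body + "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


print(docx_paragraphs(make_sample_docx()))
```

Note that the two table cells come back as separate, unrelated paragraphs: the text is recovered, but the fact that "Skills" and "Python" sat side by side in one row is lost. That loss of structure is typical of straightforward extraction on complex templates.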

So we first created an entirely new classification system to segregate resumes into different types based on their template, and to tackle each type differently. Some of the types were straightforward, but most of them (like the ones that contain tables, partitions, etc.) required higher-order intelligence from the software. For such complex types, we decided to use Optical Character Recognition (OCR), with some Deep NLP algorithms on top, to extract text.
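The classify-then-route idea can be sketched as a small dispatcher. Everything below is a hypothetical stand-in: the two layout types, the hand-written rule in `classify_layout`, and the placeholder extractors are assumptions made for illustration and do not reflect the actual classifier or its categories.

```python
from enum import Enum
from typing import Callable, Dict


class Layout(Enum):
    SIMPLE = "simple"    # single column, no tables
    COMPLEX = "complex"  # tables, partitions, multiple columns


def classify_layout(has_tables: bool, column_count: int) -> Layout:
    """Hypothetical rule standing in for the real template classifier."""
    if has_tables or column_count > 1:
        return Layout.COMPLEX
    return Layout.SIMPLE


def extract_direct(path: str) -> str:
    # Placeholder: a real version would read the document's text layer.
    return f"[direct text extraction of {path}]"


def extract_with_ocr(path: str) -> str:
    # Placeholder: a real version would run OCR plus NLP post-processing.
    return f"[OCR + NLP post-processing of {path}]"


# One extraction strategy per layout type.
EXTRACTORS: Dict[Layout, Callable[[str], str]] = {
    Layout.SIMPLE: extract_direct,
    Layout.COMPLEX: extract_with_ocr,
}


def extract_text(path: str, has_tables: bool, column_count: int) -> str:
    """Classify the resume, then hand it to the matching extractor."""
    layout = classify_layout(has_tables, column_count)
    return EXTRACTORS[layout](path)


print(extract_text("simple_cv.docx", has_tables=False, column_count=1))
print(extract_text("two_column_cv.pdf", has_tables=False, column_count=2))
```

Keeping the routing table separate from the classifier means a new resume type can be supported by adding one enum member and one extractor, without touching the other paths.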

For every problem, there is a hard way and a smart way to solve it, and we decided to go with the latter. OCR is a very generic problem that has been researched and solved by the biggest tech companies in the world. The best part is that this technology has been open-sourced as well! Therefore, in this context, the hard way would have been to build a deep learning model from scratch for OCR and NLP, and the smart way was to use the power of open source and deploy an off-the-shelf model for the task.
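As one example of the off-the-shelf route, the sketch below calls the open-source Tesseract OCR engine through its command-line interface. Tesseract is an assumption here: this post does not name the OCR engine used, and the language and page-segmentation defaults are illustrative. The `--psm` spelling applies to Tesseract 4 and later, and a PDF page would first have to be rendered to an image, which this sketch does not cover.

```python
import shutil
import subprocess
from typing import List


def tesseract_command(image_path: str, lang: str = "eng", psm: int = 6) -> List[str]:
    """Build a Tesseract CLI call that writes recognized text to stdout.

    psm 6 asks Tesseract to assume a single uniform block of text.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path: str) -> str:
    """Run Tesseract on one page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


# Show the command that would be executed for a page image.
print(" ".join(tesseract_command("resume_page_1.png")))
```

The raw OCR output is only the starting point; as described above, further NLP processing is still needed on top to turn it into clean, correctly ordered text.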

Conclusion

With the help of our classification algorithm to segregate the resumes, we were able to combine different technologies and obtain the best of each, building a highly accurate and fast text extraction method. Currently, we are able to extract text accurately from about 98% of simple resumes and 90% of complex ones.

In the next article, we will talk about the deep learning technology we built ourselves, from scratch, for the Information Extraction task. Stay tuned!

Thank You

Link to the second part: https://bit.ly/2TX4iMz

Written by Ankit Gupta: NLP Engineer by profession, Musician at heart. Work: skillate.com

Skillate is an AI-enabled recruitment platform helping companies to optimize their recruitment operations.
