Streamlining Resume Parsing with Google Document AI on Google Cloud AI Platform

Introduction

Have you ever read a job description so long and winding that you lost your way? Or faced the terms & conditions of a product that would take at least an hour of your time to read thoroughly?

In today’s competitive job market, both job seekers and employers are inundated with resumes. For recruiters and HR professionals, manually sifting through hundreds or thousands of resumes can be a daunting and time-consuming task. To address this challenge, organizations are increasingly turning to automation and artificial intelligence (AI) solutions to streamline the hiring process. Google Cloud’s powerful AI tools and services offer a comprehensive platform for building automated pipelines for resume parsing, making it easier to identify the most qualified candidates quickly and efficiently.

Google Document AI: The Foundation of Resume Parsing

Google Document AI is a state-of-the-art tool developed by Google Cloud that utilizes machine learning and natural language processing (NLP) techniques to extract valuable information from unstructured documents, such as resumes. It has the capability to understand and extract text, tables, and even handwriting from documents. By leveraging the capabilities of Document AI, organizations can create a resume parser that can automatically extract key information from resumes, such as contact details, work experience, skills, and education.
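As a rough illustration of what "extracting key information" can yield, the sketch below groups the entities of a Document AI response into a simple resume record. It assumes the response has already been converted to a dictionary that uses the Document JSON field names entities, type, mentionText and confidence. The entity labels (email, skill), the sample values and the 0.5 confidence threshold are hypothetical choices for illustration, not labels shipped with any particular processor.

```python
from collections import defaultdict


def group_entities(document, min_confidence=0.5):
    """Group entity mentions by type, skipping low-confidence ones."""
    record = defaultdict(list)
    for entity in document.get("entities", []):
        # Entities without a confidence score are treated as 0.0 and dropped.
        if entity.get("confidence", 0.0) < min_confidence:
            continue
        text = entity.get("mentionText", "").strip()
        if text:
            record[entity["type"]].append(text)
    return dict(record)


# Hypothetical response for one resume; the labels come from an assumed custom schema.
sample_document = {
    "entities": [
        {"type": "email", "mentionText": "jane@example.com", "confidence": 0.98},
        {"type": "skill", "mentionText": "Phlebotomy", "confidence": 0.91},
        {"type": "skill", "mentionText": "Patient care ", "confidence": 0.87},
        {"type": "skill", "mentionText": "Juggling", "confidence": 0.21},
    ]
}

resume_record = group_entities(sample_document)
print(resume_record)
```

Filtering on confidence before grouping keeps doubtful extractions out of the candidate record; where to set the threshold is a judgment call that depends on the processor and the documents.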

In this work, I started by exploring the different AI tools that the Google Cloud AI platform offers. The Document AI service was the most intriguing of all: it automates the processing and extraction of data from a range of formats, such as PDFs, JPGs, PNGs and DOCs, using Natural Language Processing (NLP). It also gives the option of using a readily available pre-built model or creating a custom model that fits the criteria and requirements of our own business use case, all while staying in the cloud!

Secondly, I looked into the industry where this multi-purpose model could contribute the most. After researching and gathering data on resumes from various sectors and studying the existing resume parsing models out there, I decided to create a resume parsing model that employers in the healthcare industry can use to ease their hiring process. Using only tools & services available on the Google Cloud platform, I wanted to show how adaptable Google Cloud is across industries and businesses, as well as its capability to let us create beneficial AI models at the lowest possible cost.

The key components of the automated pipeline that I experimented with can all be built by integrating Google Cloud services and tools, given the right authentication & authorisation:

1. Google Cloud AI Platform:

The Google Cloud AI Platform serves as the core infrastructure for building and deploying machine learning models. It provides a managed environment for training and serving machine learning models at scale. In the context of resume parsing, the AI Platform can be used to develop and fine-tune machine learning models that can extract specific information from resumes effectively.

2. Google Document AI Warehouse:

The Document AI Warehouse is a centralized repository for storing and managing document data. It allows organizations to securely store resumes, job applications, and other related documents in a structured and easily accessible manner. This warehouse serves as the data source for the resume parsing pipeline and can also integrate with BigQuery to run queries on the uploaded resumes, using pre-built or custom schemas.

3. Application Integration:

To fully automate the resume parsing process, it’s essential to integrate the system with the organization’s job application platform or applicant tracking system (ATS). This integration ensures that resumes are automatically processed as they are submitted, saving time and reducing the risk of overlooking qualified candidates. Application Integration is a low-code tool: we drag & drop tasks onto a canvas and connect the tools & services to one another using connectors to create a pipeline.

4. Kubernetes:

Kubernetes is a container orchestration platform that can be used to manage and scale containerized applications. By deploying the resume parsing application in Kubernetes, organizations can ensure high availability, scalability, and efficient resource utilization. Starting from a cluster of nodes, you can deploy the model onto the nodes and connect it to the systems you already run.

5. Workflows:

Google Cloud Workflows is a serverless orchestration service that allows organizations to define, execute, and manage complex workflows. In the context of resume parsing, Workflows can be used to create a streamlined process for document ingestion, parsing, and storage. It also lets you create Eventarc triggers that listen for events from Google sources and start the workflow, automating the resume upload process, provided the trigger's attributes and operators are written with the correct syntax and values. I created a simple Firebase Realtime Database as an event provider to test the Workflows integration.

6. Compute Engine VM:

Compute Engine VM instances provide the computational power needed to run various components of the resume parsing pipeline, including machine learning models, data preprocessing, and post-processing tasks. These VMs can be easily provisioned and managed through the Google Cloud Console. I experimented with a Python script that automatically triggers the workflow as soon as a resume is uploaded to the server.
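The script itself is not reproduced here, but one way such a script could start a workflow execution is to call the Workflow Executions REST API. The sketch below only builds the HTTP request with the standard library and does not send it. The project, location, workflow name, bucket path and access token are all hypothetical placeholders, and the resume_uri argument is an assumed input that the workflow would need to be written to accept.

```python
import json
import urllib.request

EXECUTIONS_API = "https://workflowexecutions.googleapis.com/v1"


def build_execution_request(project, location, workflow, resume_uri, access_token):
    """Build (but do not send) a request that starts one workflow execution."""
    url = (
        f"{EXECUTIONS_API}/projects/{project}/locations/{location}"
        f"/workflows/{workflow}/executions"
    )
    # The workflow's input is passed as a JSON-encoded string in "argument".
    body = {"argument": json.dumps({"resume_uri": resume_uri})}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_execution_request(
    project="my-project",                          # hypothetical project ID
    location="us-central1",                        # hypothetical region
    workflow="resume-parsing",                     # hypothetical workflow name
    resume_uri="gs://my-bucket/resumes/jane.pdf",  # hypothetical upload path
    access_token="ACCESS_TOKEN",                   # placeholder, not a real token
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

In practice the access token would come from the VM's service account rather than being hard-coded, and an official client library could replace the hand-built request.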

Creating an Automated Pipeline for Resume Parsing

The process of creating an automated pipeline for resume parsing using Google Document AI and Google Cloud tools involves several key steps:

1. Data Ingestion: Resumes and job applications are submitted by candidates and stored in the Document AI Warehouse. These documents are ingested automatically through application integration with the ATS.

2. Document Preprocessing: Before feeding the documents into the resume parser, it’s essential to perform preprocessing tasks like text extraction, format normalisation, and quality assurance to ensure accurate parsing.

3. Resume Parsing: The core component of the pipeline is the resume parser, which is built and deployed on Google Cloud AI Platform. This parser uses machine learning models to extract structured information from resumes, including contact details, work experience, skills, and education.

4. Data Enrichment: After parsing, the extracted data can be further enriched by integrating external data sources or performing additional NLP analysis to identify keywords and contextual information.

5. Workflow Orchestration: Google Cloud Workflows is used to define the workflow for document processing. It automates the sequence of tasks, ensuring that documents are processed efficiently and consistently.

6. Kubernetes Deployment: The entire pipeline, including the resume parser, preprocessing scripts, and workflow orchestration, is containerized and deployed on Kubernetes for scalability and fault tolerance.

7. Results Storage: The parsed data is stored back in the Document AI Warehouse, making it easily accessible for further analysis and integration with the ATS.
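The data-handling steps above (preprocessing, parsing, enrichment and storage) can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not the pipeline I deployed: the parser is a regular expression that finds only an email address, standing in for a Document AI processor; the list named warehouse stands in for the Document AI Warehouse; and the sample resume text and keywords are made up. Orchestration and the Kubernetes deployment are left out.

```python
import re


def preprocess(raw_text):
    """Normalise whitespace so the parser sees consistent input."""
    return re.sub(r"\s+", " ", raw_text).strip()


def parse(text):
    """Stand-in parser: a regex for email only, in place of a Document AI processor."""
    match = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
    return {"email": match.group(0) if match else None, "text": text}


def enrich(record, keywords):
    """Tag the record with the keywords of interest found in the resume text."""
    lowered = record["text"].lower()
    record["matched_keywords"] = [k for k in keywords if k.lower() in lowered]
    return record


def run_pipeline(raw_text, keywords, store):
    """Run the stages in order and append the result to the given store."""
    record = enrich(parse(preprocess(raw_text)), keywords)
    store.append(record)
    return record


warehouse = []  # stand-in for the Document AI Warehouse
result = run_pipeline(
    "Jane Doe\n  jane@example.com \nRegistered Nurse,   ICU experience",
    keywords=["ICU", "paediatrics"],
    store=warehouse,
)
print(result["email"], result["matched_keywords"])
```

Keeping each stage as its own function mirrors the way the real pipeline separates concerns, so a stage can be swapped (for example, the regex parser for a call to a Document AI processor) without touching the others.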

Benefits of an Automated Resume Parsing Pipeline

So, what advantages does the Google Cloud AI platform offer for solving a business use case in a scalable and cost-saving way?

Implementing an automated resume parsing pipeline using Google Document AI and Google Cloud tools offers several advantages to organisations:

1. Efficiency: The automation of resume parsing reduces the time and effort required to screen candidates, allowing HR professionals to focus on more strategic tasks.

2. Accuracy: Machine learning models powered by Document AI ensure high accuracy in extracting and structuring resume data, reducing the risk of human error.

3. Scalability: The use of Kubernetes and Google Cloud infrastructure enables organizations to scale the pipeline as the volume of resumes and job applications grows.

4. Consistency: Workflow orchestration ensures that all resumes are processed consistently, following the same predefined steps.

5. Integration: Seamless integration with the organization’s ATS or job application platform ensures a smooth candidate experience.

In the competitive, AI-driven job market of the 21st century, automating the resume parsing process is therefore essential for organisations looking to streamline their hiring processes and identify the most qualified candidates efficiently. Google Document AI, coupled with Google Cloud tools and services like AI Platform, Kubernetes, Workflows, and Compute Engine VM, provides a comprehensive platform for building an automated resume parsing pipeline. By leveraging these technologies, organizations can save time, improve accuracy, and enhance their ability to identify top talent, ultimately leading to more successful and efficient hiring practices. As the job market continues to evolve, the automation of resume parsing will remain a critical tool for organizations seeking to stay competitive and make data-driven hiring decisions.

Written by Jessica John Posko

Cloud Technology Associate

Google Cloud