Beyond Keywords: How AI and Semantic Search Can Help Canadian PR Applicants and Job Seekers

Mar 6, 2024

TL;DR: I built a semantic search and RAG web app to help permanent resident applicants navigate a tedious yet crucial part of the application process. You can access it here.

The Problem

Becoming a Canadian Permanent Resident is a significant milestone for immigrants wishing to make Canada their new home. Most current PR holders will remember that feeling of eagerly (yet carefully) tearing open the official envelope from Immigration, Refugees and Citizenship Canada (IRCC), skipping past the information leaflet and prying off their shiny new PR card — a powerful piece of plastic that marks an end to standing in long immigration queues at the airport and, more importantly, moves us a little further down the journey to attaining Canadian citizenship. Unfortunately, like most things worth celebrating, this big win does not come easily and this one in particular lies at the end of a lengthy and tedious application process.

One component that is integral to assessing eligibility in at least two of the major immigration streams is the requirement to select the National Occupational Classification (NOC) code that best matches an applicant’s work experience. For those unfamiliar with the term, the NOC is a standardized system that assigns a code to every occupation in the Canadian labour market. It describes each job according to the training, education, experience and responsibilities (TEER) needed to work in it. Some immigration streams limit entry to those who have experience in occupations identified by specific NOC codes. To raise the stakes even higher, selecting the wrong NOC code could lead to refusal of your PR application.

Okay, I get what an NOC is and why it’s important to find the right one. So what makes finding the right NOC difficult?

Glad you asked — Here’s why in a nutshell:

The National Occupational Classification is based on a five-tiered hierarchical structure. The last level of the hierarchy comprises 516 unit groups. This means that there are 516 unique NOC codes each identifying a unique job profile and associated job duties. That last level is where we’ll need to search for our NOC code.
The current search functionality on the official NOC website is limited to searching by ‘Title’ or ‘NOC-code’. Yes, you could use advanced options like the hierarchical navigation tree, but even then there are a couple of problems: 1) The job title is not really important. For immigration eligibility purposes, you have to have performed most of the ‘main duties’ listed under that specific job title. In some cases, your previous work experience might actually fall under a couple of different NOC codes, or your official job title might be associated with a NOC code whose main duties don’t actually match your experience. 2) How would I go about searching by NOC code when that’s exactly what I’m trying to find in the first place?
Keyword search is available but it often feels like trying to find a needle in a haystack — there may be a number of NOCs that have that same keyword mentioned somewhere in the associated job profile data. The title “Manager” could apply to a wide range of roles across different industries and NOC codes, each with its unique set of duties and responsibilities. Relying solely on job titles or narrow keywords can lead you down the wrong path, potentially aligning you with a NOC code that looks right on the surface but doesn’t accurately reflect your professional experience.

“There must be a better way” (author unknown, but probably a frustrated user)

The Solution

If you’ve read this far, hopefully you’ll appreciate that finding the right NOC code comes down to matching your job duties to the main duties of one of 516 occupations in the NOC database. You may think, then, that searching by job duties would be a logical next step, and it would be. However, there are many different ways of phrasing the same job duty. For example, one of the duties of a software engineer may be:

“Design, develop, and maintain software applications based on client requirements.”

This same duty could alternatively be phrased as:

“Create and implement software solutions in accordance with customer needs.”

Despite the difference in wording, both phrases describe the same fundamental responsibility of a software engineer. If we ran a search for matching job duties based simply on keywords or phrases, we could overlook completely valid NOCs whose duties are phrased differently from the query. This limitation calls for a more advanced approach, one that can capture the essence of a job duty beyond its superficial lexical composition.
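To make the limitation concrete, here is a minimal sketch that measures how many distinct words the two phrasings above actually share. The helper names (`tokens`, `jaccard`) are my own for illustration, and the word-overlap measure is only a stand-in for keyword matching in general, not how the official NOC search works.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase a sentence and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Fraction of distinct words that two sentences have in common (0 to 1)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


duty_a = "Design, develop, and maintain software applications based on client requirements."
duty_b = "Create and implement software solutions in accordance with customer needs."

# The only words the two sentences share.
print(sorted(tokens(duty_a) & tokens(duty_b)))  # ['and', 'software']

# Word overlap as a score: 2 shared words out of 18 distinct words.
print(round(jaccard(duty_a, duty_b), 2))  # 0.11
```

Two sentences that mean the same thing overlap on just "and" and "software", so any matching scheme built purely on shared words would rate them as nearly unrelated.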

Enter sentence embeddings:

This is a gross simplification and I won’t go into all the underlying detail for fear of losing less technical readers too early, but imagine if we could represent the complexity of a sentence’s meaning as an array of numbers — a vector — that we could plot as a single point on a graph. This is essentially what a sentence embedding model allows us to do. In fact, let’s see an illustration of this concept:

A graphical illustration of how different job duties may be represented as points on a 2-dimensional axis using sentence embeddings

The image above shows how different job duties may be converted into vectors on a simple two-dimensional axis using sentence embeddings. Let’s assume that point (3.8, 2) is the vector representation of an input query into the search, while point (1, 9) and point (4, 1.2) are the vector representations of existing job duties in the NOC database associated with a video producer and a software engineer respectively. Notice how the vector representing the input query is significantly closer to the vector representing the software engineering job duty than to the vector of the video production job duty. Because sentence embeddings allow us to transform the rich, nuanced language we humans communicate with into numbers that computers can understand and process, we can quantify exactly how similar two sentences (in this case, job duties) are by calculating the distance and angle between their vector representations.
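The "distance and angle" comparison can be checked by hand with the three points from the illustration. This is a minimal sketch using only Python's standard library. Real sentence embeddings have hundreds of dimensions rather than two, and the variable names are mine, but the arithmetic is the same.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points (smaller means more similar)."""
    return math.dist(a, b)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (closer to 1 means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# The three points from the illustration.
query = (3.8, 2.0)
video_producer = (1.0, 9.0)
software_engineer = (4.0, 1.2)

for name, vec in [("video producer", video_producer),
                  ("software engineer", software_engineer)]:
    print(name,
          round(euclidean(query, vec), 2),
          round(cosine_similarity(query, vec), 2))
# video producer 7.54 0.56
# software engineer 0.82 0.98
```

Both measures agree: the query sits much closer to the software engineering duty (distance 0.82, cosine similarity 0.98) than to the video production duty (distance 7.54, cosine similarity 0.56).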

Okay, you get the basic principles behind sentence embeddings. Let’s see a demonstration of the search tool I built:

To explain what’s happening at a high level:

The user enters two job duties as input into the search. Specifically, they enter: “Create molds of patients’ oral structures, including teeth, gums, and jaws, for accurate denture fitting.” and “Oversee the design and fabrication of dentures, guiding the work of technicians involved in the construction process.”
The system creates embeddings of these queries (transforms them into vectors) and compares them to embeddings of job duties from the 516 NOC job profiles.
The system returns the top 3 matches. A user can use the arrow keys, swipe or click on the tab headings to view the different matches. Each tab contains a link to the official NOC webpage for that job profile and the list of main duties specified by the NOC job profile.
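The steps above can be sketched in a few lines. To be clear about assumptions: this is not the app's actual code. The scoring rule (average, over the query duties, of the best cosine similarity within each profile) is one plausible way to combine several input duties, and the 3-dimensional vectors and profile names are made up so the example runs without an embedding model. In a real system each vector would come from a sentence embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank_profiles(query_vectors, profiles, top_k=3):
    """Return the top_k (title, score) pairs, best match first.

    Assumed scoring rule: for each query duty, take its best cosine
    similarity against any duty in the profile, then average those
    best scores across all query duties.
    """
    scored = []
    for title, duty_vectors in profiles.items():
        best_per_query = [
            max(cosine_similarity(q, d) for d in duty_vectors)
            for q in query_vectors
        ]
        scored.append((title, sum(best_per_query) / len(best_per_query)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up vectors standing in for real embeddings of each profile's duties.
profiles = {
    "Denturists": [(0.9, 0.1, 0.1), (0.8, 0.3, 0.0)],
    "Dentists": [(0.7, 0.6, 0.1)],
    "Software engineers": [(0.0, 0.1, 0.9)],
    "Video producers": [(0.1, 0.9, 0.2)],
}

# Made-up vectors standing in for the two embedded query duties.
query_vectors = [(0.9, 0.2, 0.1), (0.8, 0.2, 0.0)]

for title, score in rank_profiles(query_vectors, profiles):
    print(f"{title}: {score:.2f}")
```

With 516 profiles a brute-force loop like this is still fast enough. A vector index only becomes worth the extra complexity for much larger collections.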

Before seeing the results of the above search, the majority of us might’ve naively identified the query job duties as those of a dentist. However, it would appear that the NOC system has a more granular categorization in this particular case — ‘denturists’. Here’s hoping I’m not the only one to whom this wasn’t immediately apparent.

What about the ‘Compare with AI’ button at the bottom of the results?

Up to now, the tool has helped a user narrow down their search to the top 3 matching job profiles in the NOC database. But this is far from a perfect outcome and there’s more we can do to help. As you might see from the above video, some of the matching job profiles in the results have a long list of job duties. Additionally, the job duties the user entered may be separated across one or more of the job profiles in the results. A user now has to read each matching job profile and then decide which profile has job duties matching the majority of their input job duties. Making this decision can be particularly challenging for those whose primary language is not English, which may be the case for a significant number of users of this tool. Here’s where Retrieval Augmented Generation (RAG) and Large Language Models can help.

When you think of LLMs, ChatGPT is likely to be the first one that springs to mind. For the majority of readers, it may have been the first LLM you interacted with. However, since its release, a number of capable newcomers have entered the chat (sorry, I couldn’t resist the pun). One that piqued my interest, and that I wanted to experiment with for this project, was Cohere’s command model. With LLMs proving themselves capable in a range of time-saving tasks, it’s natural to ask: why don’t we just ask an LLM to give us the NOC that best matches our list of input job duties?

Here’s an experiment I ran using Cohere’s model playground:

I gave the model a job duty and asked it to tell me what NOC code it would fall under. Take a look at the model’s response in the OUTPUT section at the bottom. Now, while you may be initially surprised that the command model knew enough about what an NOC code was to produce a reasonable-looking answer, after searching through the official NOC website for code 5611, you’ll find that you hit a wall: this particular code does not exist! The model has convincingly fabricated a number and a group title that, to an unsuspecting user, may easily pass as a real NOC code on first inspection. The same experiment with ChatGPT showed similar results. While ChatGPT was able to come up with a real job profile name, the NOC code it gave does not exist in the latest NOC classification. This experiment highlights two crucial shortcomings of LLMs: 1) they are susceptible to hallucination, i.e. they make up information, and 2) the accuracy of their output is greatly limited by the freshness of their underlying training data.

Retrieval augmented generation is a way of improving or ‘augmenting’ the generated text output of a language model by grounding it in information retrieved from a source of knowledge outside its training data. More concretely, we can instruct the model to run the search function first and then compare the input job duties with the job duties of the top 3 matching job profiles.
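Here is a minimal sketch of the "augmenting" step, assuming the search has already returned its top matches. The function name, prompt wording and data layout are hypothetical, and the main-duty strings are placeholders rather than real NOC text. The call to the language model itself is left out because it depends on the provider's API.

```python
def build_grounded_prompt(user_duties, retrieved_profiles):
    """Assemble a prompt that restricts the model to the retrieved profiles.

    user_duties: list of duty strings entered by the user.
    retrieved_profiles: list of dicts with 'title' and 'main_duties' keys
    (a hypothetical layout for the search results).
    """
    lines = [
        "Compare the applicant's job duties with the occupation profiles below.",
        "Use only the profiles provided; do not suggest any other NOC code.",
        "",
        "Applicant's job duties:",
    ]
    lines += [f"- {duty}" for duty in user_duties]
    for profile in retrieved_profiles:
        lines += ["", f"Profile: {profile['title']}", "Main duties:"]
        lines += [f"- {duty}" for duty in profile["main_duties"]]
    lines += ["", "Which profile matches the most applicant duties, and why?"]
    return "\n".join(lines)


user_duties = [
    "Create molds of patients' oral structures, including teeth, gums, "
    "and jaws, for accurate denture fitting.",
    "Oversee the design and fabrication of dentures, guiding the work of "
    "technicians involved in the construction process.",
]

# Placeholder search result; real duty text would come from the NOC profile.
retrieved_profiles = [
    {"title": "Denturists",
     "main_duties": ["<main duty 1 from the NOC profile>",
                     "<main duty 2 from the NOC profile>"]},
]

prompt = build_grounded_prompt(user_duties, retrieved_profiles)
print(prompt)
```

Because the profile text is pasted into the prompt at query time, the model works from whatever the search retrieved rather than from its memory of older NOC versions.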

Here’s a diagram to illustrate this:

And here’s a video demonstrating this functionality in the tool:

Notice that by using RAG we are able to get the LLM to reference the retrieved search results when helping a user decide which of the top 3 results has the most matching job duties. In this way, we are able to point the model towards an up-to-date source of knowledge that it can reference when producing a response.

There’s more to the search function than I’ve gone into detail about here but perhaps I’ll save the technical deep dive for another blog post. For now I hope I’ve left you with an appreciation for how semantic search and LLMs have transformed our ability to retrieve and process information.

Want to give it a try? Access it here:

https://noc-finder.streamlit.app/

Closing remarks:

When I initially started this project, it was intended to support Canadian PR applicants. However, along the way, I recognized its potential for a wider audience — job seekers. Imagine a tool integrated into platforms like LinkedIn, where you could tell an AI assistant about the work you’re passionate about, and it could then match you with job titles by understanding the nuances in job descriptions. This idea extends beyond merely finding a job; it’s about helping individuals pinpoint roles that truly resonate with their interests and passions.

Thank you for staying with me through this exploration. If you have suggestions for enhancing the app, or if there’s an interesting idea you’d like to collaborate on, I’d love to hear from you. Please reach out to me at rayjayatunga@gmail.com