Does PII data influence candidate-to-job matching results? A case study

Oskar J. — Tue, 26 May 2026 09:20:00 GMT

By the Fastr.ai ML Team

Introduction

Privacy regulations are tightening around the world. GDPR in Europe, state-level privacy laws across the U.S., and growing candidate expectations all push in the same direction: handle personal data with extreme care. For recruitment technology companies like Fastr.ai, this raises a very practical engineering question: does our matching model treat personally identifiable information (PII) as a meaningful signal when matching candidates to jobs? And a follow-up question: can we strip PII from resumes and job descriptions without degrading the quality of our AI matching?

Photo by Jason Dent on Unsplash

The intuition behind the concern is straightforward. Our matching model reads the full text of a candidate profile and a job description, then projects both into a shared embedding space to calculate relevance. Names, locations, school names, and employer names may all carry contextual signals. Removing them could, in theory, change the matching results, provided our transformer matching model (a dense retriever) actually uses those signals.
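To make the "shared embedding space" idea concrete, here is a minimal sketch of relevance scoring between one job vector and several candidate vectors. It assumes cosine similarity as the relevance function, which this post does not confirm, and it uses hand-written toy vectors in place of real encoder outputs. The names `cosine_similarity` and `rank_candidates` are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch (assumption)
    return dot / (norm_a * norm_b)

def rank_candidates(job_vec, candidate_vecs):
    """Return (candidate_id, score) pairs, most similar first."""
    scored = [(cid, cosine_similarity(job_vec, vec))
              for cid, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for encoder outputs (not real embeddings).
job = [0.9, 0.1, 0.0]
candidates = {
    "cand_a": [0.8, 0.2, 0.1],
    "cand_b": [0.0, 0.1, 0.9],
}
ranking = rank_candidates(job, candidates)
```

If PII tokens shifted these vectors in a way that mattered, the ranking would change once the tokens are removed. That is what the experiment below checks.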

We decided not to guess and ran an experiment instead. This post walks through what we did, what we measured, and what the numbers tell us.

Research questions

We framed the experiment around two focused, technical questions.

Q1: Does anonymization significantly influence the performance of the matching model trained on full, raw data?

Answering this tells us whether our current production matching model actually relies on PII - names, locations, employers, schools - as a meaningful matching signal, or whether it treats those entities as background noise. If scores drop once PII is removed at inference time, the transformer model is leaning on identifiers to reach its decision. If the scores stay flat, PII is not a signal the matching model depends on.

Q2: Does the matching model trained from scratch on anonymized data perform on par with a model trained on raw resume data?

While Q1 probes the existing production matching model, Q2 goes a step further. Even if the current matching model handles anonymized input without degradation, a dense retriever that has never seen raw PII during training is a different story. If both training regimes land on the same matching quality, then raw vs. anonymized training data becomes an architectural design choice on our side rather than a quality dealbreaker: either variant could be used interchangeably without a recruiter or candidate noticing a difference in the shortlists produced.

Under the Hood (For the Technically Curious)

For readers who want a bit more detail on the methodology:

Named Entity Recognition (NER) is the process of scanning text and identifying spans that refer to specific categories like people, places, or organizations. The open-source approach used a BERT-Large model. Our proprietary model is a spaCy transformer pipeline trained on recruitment-domain text; it recognizes nine entity types: first names, last names, locations, employers, schools, degrees, majors, job titles, and dates.

Presidio is Microsoft’s open-source framework for PII detection and anonymization. Once the NER model identifies where personal information sits in a text, Presidio handles the replacement with generic placeholders.

F-Score and NDCG are the two metrics we tracked. F-Score measures how well the model correctly identifies good candidate-job pairs (precision and recall combined). NDCG (Normalized Discounted Cumulative Gain) measures ranking quality — whether the best candidates appear at the top of the list rather than buried somewhere in the middle.
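For reference, both metrics can be written down in a few lines. This is a sketch under stated assumptions: the F-Score is taken as F1 (beta = 1) over binary labels, and NDCG uses linear gain with a log2 rank discount. The post does not specify which variants were used, so treat these choices as illustrative.

```python
import math

def f_score(y_true, y_pred, beta=1.0):
    """F-beta over binary labels (1 = good pair, 0 = not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def dcg(relevances):
    """Discounted cumulative gain; the first item has rank 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances_in_ranked_order, k=None):
    """DCG of the model's ordering divided by DCG of the ideal ordering."""
    ranked = list(relevances_in_ranked_order)[:k]
    ideal = sorted(relevances_in_ranked_order, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0
```

With these definitions, a ranking that already places the most relevant candidates first scores an NDCG of 1.0, and every misplaced relevant candidate pulls the score down by an amount that depends on how far it was pushed down the list.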

Retraining was performed on Azure ML using our standard supervised training pipeline. Both runs used identical transformer model architecture, hyperparameters, and evaluation splits. The only variable was the training data: original vs. anonymized.

Experiment Design

The Dataset

We started with a production-scale dataset exported from our Data Engine - the internal annotation tool our team uses to label and curate training data. This gold standard dataset consists of candidate-to-job pairs, manually labeled by our annotation team to serve as ground truth for matching quality. Details of Data Engine will be covered in future articles.

Anonymization Approaches

We ran two independent anonymization passes over the candidate texts, keeping job openings untouched:

Approach A: Open-source anonymization

We used a widely available NER (Named Entity Recognition) model from the open-source community paired with Microsoft’s Presidio framework to detect and replace personal information. This setup recognizes people, locations, and organizations, plus we added pattern-based detection for email addresses and phone numbers.
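Pattern-based detection of the kind mentioned above can be sketched with two regular expressions. The patterns below are assumptions made for illustration, not the ones used in the experiment. They are deliberately loose, so the phone pattern would also match things like year ranges ("2015 - 2019") and would need tightening before real use.

```python
import re

# Illustrative patterns only; real-world formats vary widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def find_pattern_pii(text):
    """Return (start, end, label) spans for emails and phone numbers."""
    spans = [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]
    spans += [(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]
    return sorted(spans)  # ordered by start offset

sample = "Contact: jane.smith@example.com or +1 (555) 010-9999."
found = find_pattern_pii(sample)
```

The returned spans have the same shape as NER output (offsets plus a label), so both kinds of detection can feed a single replacement step.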

Approach B: Fastr.ai’s own anonymization

We swapped in our proprietary NER model, which was trained specifically on recruitment data. Because it was built for our domain, it recognizes finer-grained categories like school names, degree types, majors, job titles, and employer names - on top of the standard person and location detection. We again used Presidio for the actual replacement step.

In both cases, every detected entity was replaced with a generic placeholder. A candidate named “Jane Smith” became a generic person label, and “Stanford University” became a generic school label. The goal was simple: make it impossible to identify a specific person from the text, while keeping the professional substance intact.
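The replacement step itself is simple once the spans are known. The sketch below is a plain-Python stand-in for that step, not Presidio's API. The bracketed placeholder format and the function name `replace_spans` are hypothetical, and the spans are assumed not to overlap.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) span with a '[LABEL]' placeholder.

    Assumes spans do not overlap. The placeholder format is hypothetical.
    """
    out = text
    # Work right to left so earlier offsets stay valid after each edit.
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + "[" + label + "]" + out[end:]
    return out

text = "Jane Smith studied at Stanford University."
spans = [(0, 10, "PERSON"), (22, 41, "SCHOOL")]
anonymized = replace_spans(text, spans)
# anonymized == "[PERSON] studied at [SCHOOL]."
```

Note what survives: the sentence structure and the fact that the candidate studied somewhere remain, while the identifying strings are gone.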

What we anonymized

Across both approaches, the following PII categories were targeted:

First and last names
Locations (cities, states, countries)
School and university names
Employer and company names
Phone numbers and email addresses
LinkedIn profile URLs

Plot 1: summary of anonymization approaches

Metrics

We tested two production matching model variants and tracked two key metrics:

F-Score: measures how accurately the model identifies the right candidate-job pairs (the higher, the better).
NDCG: measures how well the matching model ranks the best candidates toward the top of the list (again, higher is better).

Results

The first test was straightforward. We took our existing, already-trained matching models and ran them against the anonymized candidate data without any retraining. If anonymization destroyed important signals, we would see the numbers drop.
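One way to judge whether a difference between raw and anonymized scores is more than noise is a paired bootstrap over per-query metric values. This is only an illustration of a common technique; the post does not state which significance check, if any, was applied. The function name and the toy numbers below are made up for the example.

```python
import random

def paired_bootstrap_ci(raw_scores, anon_scores, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Confidence interval for mean(anon - raw) over paired per-query scores."""
    if len(raw_scores) != len(anon_scores) or not raw_scores:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    deltas = [a - r for r, a in zip(raw_scores, anon_scores)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Toy per-query NDCG values, invented for illustration only.
raw = [0.71, 0.64, 0.80, 0.58, 0.77]
anon = [0.70, 0.66, 0.80, 0.59, 0.76]
interval = paired_bootstrap_ci(raw, anon)
```

If the resulting interval contains zero, the data do not show a clear shift in either direction at the chosen confidence level.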

Model scores

The table and plot below present the final evaluation scores for the matching transformer models, with both metrics computed on the test data set.

Table 1: Evaluation scores — raw results

Plot 2: Comparison of the evaluation scores

What the numbers tell us

The short answer: anonymization did not hurt matching quality. In fact, F-Scores went up slightly across both matching models and both anonymization methods. NDCG stayed essentially flat.

Why would removing information actually help? Our working hypothesis is that names, locations, and school names act as noise for the matching task. The matching model’s job is to understand professional fit - skills, experience level, career trajectory. Once you strip out the personal identifiers, the remaining text is more densely packed with the signals that actually matter for matching a candidate to a job.

Think of it this way: whether a software engineer’s name is “John” or a generic placeholder has zero bearing on whether they are the right fit for your open role. The model agrees.

Training from scratch on anonymized data

The evaluation above showed that our existing transformer models handle anonymized input just fine. But we wanted to push the question further: what if the matching model has never seen raw personal data at any point in its life? Can you train a matching model entirely on anonymized text and still get the same quality?

We kicked off two training runs on Azure ML, one on the original dataset and one on the anonymized version, using identical settings for everything else.
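The "one variable" setup can be expressed as two run configurations that differ in a single field. Everything in this sketch is hypothetical: the field names, the values, and the dataset labels are placeholders, not the actual Azure ML settings.

```python
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class RunConfig:
    dataset: str        # which training data variant to use
    architecture: str   # placeholder model identifier
    learning_rate: float
    epochs: int
    seed: int

def differing_fields(a, b):
    """Names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])

# Placeholder values for illustration only.
raw_run = RunConfig(dataset="original", architecture="dense-retriever",
                    learning_rate=2e-5, epochs=3, seed=42)
anon_run = replace(raw_run, dataset="anonymized")

changed = differing_fields(raw_run, anon_run)
# changed == ["dataset"]
```

Deriving the second config from the first with `replace` makes it hard to change a hyperparameter by accident, which is the property a controlled comparison needs.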

Plot 3: Results of the training experiment

The outcome confirmed what the earlier evaluation hinted at. A matching model trained exclusively on anonymized data reached the same level of matching quality as the baseline dense retriever trained on original, non-anonymized text.

Conclusions

For talent acquisition leaders, these results boil down to three practical takeaways:

1. You can protect candidate privacy without sacrificing match quality. Anonymizing resumes before they enter the matching pipeline does not lower the quality of the shortlist your recruiters receive. The result is stricter data minimization, a better compliance posture, and the same hiring outcomes.

2. The model is straight to the point. These results also support the conclusion that our matching model, even without anonymization, focuses on the relevant parts of the candidate’s resume and disregards the entities we anonymized in this experiment. We attribute this to our investment in high-quality training data.

3. Leaving PII in the text neither changes the results much nor introduces bias. According to these results, our matching model does not use these entities for matching, even when they are present. Anonymization is available as an option that can be enabled for a client on request, but these results suggest it is not necessary for matching quality.

Privacy and performance are not a trade-off. Our experiment on 8.6 million records shows that you can anonymize candidate data and still deliver the same matching quality that recruiters count on. For Fastr.ai, this is another step toward building AI that is not just powerful, but trustworthy.

Last but not least, our matching model does not need anonymization in order to ignore PII when making its decision, which we attribute to high-quality training data.

If you want to see how our contextual matching works in practice, reach out to schedule a demo at Fastr.ai.

Originally published in Fastr.ai Tech on Medium.