Torture the data, and it will confess to anything

Role Titles Disambiguation

Kaan Karakeben
Beamery Hacking Talent
10 min read · Jul 22, 2020


Photo by Franki Chamaki on Unsplash

Disambiguation, as defined in the vocabulary.com dictionary, refers to the removal of ambiguity: making something clear by narrowing down its meaning. While data disambiguation is not an easy task, it is essential for all language processing and is directly correlated with perceived data quality.

In data integration, the goal is to consolidate data from disparate sources into a single homogenized set, ultimately providing users with consistent access to and delivery of data. However, real-world data is messy, inconsistent, and ambiguous. As a result, it needs to be processed and massaged to maximize its effectiveness. Disambiguation provides a framework in which integrating data and transforming it into a consistent format is scalable. This transformation generally involves creating a common vocabulary and a framework to extract valuable information out of the noise and added complexity.

Data In The Recruitment World

Data in the recruitment world consists of a mixture of structured and unstructured information of varying lengths e.g., job description, curriculum vitae, cover letter, etc. At Beamery, we have implemented various mechanisms to extract and structure information from these textual sources e.g., role titles, skills, experience descriptions, and company names. Among those, role titles stand out as the key that draws the lines around the skills and the knowledge the individual possesses.

Taking advantage of the rich background information around a role title is useful; however, role titles are also among the harder pieces of information to deal with in the recruitment space. They are ever-changing, prone to typos, and can carry additional knowledge such as seniority or location on top of the main phrase defining the work. Moreover, you are likely to encounter synonyms expressing the same set of skills and experience, pointing at the same semantic object. For example, “software engineer”, “software developer” and “software ninja” can be used interchangeably, all representing the same underlying experience. In this work, we are making a distinction between disambiguation and similarity. The following process does not make judgments about the similarities between role titles. The aim is to disambiguate the textual information without losing vital details that shape the expectations from the role title.

The Problem

Before we go into detail, it is important to paint a clear picture of the problem. We have sampled around 10 million distinct anonymous contacts from Beamery’s database. These “contacts” are individuals that have been in contact with our client companies for a job position. Role titles are among the data stored about these individuals. When we created the list of role titles from this sample, we were shocked by the staggering count of 7 million distinct role titles. Our clients hire from different nationalities and countries, so we would expect the inclusion of different languages to compound the final number. However, this still indicates a clear data cleanliness problem, which may be caused by typos or other technical errors in CV parsing or third-party integration systems.

Thinking about reducing the complexity in the role title space, an intuitive approach would be to map the raw role titles to a curated version, preferably from a taxonomy of role titles. This way we would reduce the diversity in 7 million role titles and translate them to their counterparts in a known space. There are public taxonomies available such as the efforts by ESCO and O*Net. It seems enticing at first to take advantage of a work that is incredibly costly both in experts’ time and money. Yet, mapping to a taxonomy poses a non-trivial search problem, and it isn’t really the first step in dealing with role titles. It became clear to us that we needed to deconstruct, clean, and understand the building blocks before moving into mapping or other downstream efforts.

The Solution

The response came as a multi-step disambiguation framework that features a role title vocabulary. There are two main parts to the process: cleaning and feature extraction. Cleaning includes basic preprocessing, spelling correction, and token removal by discarding out of vocabulary words. The outcome of the cleaning step is the “disambiguated title”. Feature extraction is focused on transforming a string into a set of features such as seniority levels and a set of phrases.

Vocabulary

At the core of the disambiguation process lies the role title vocabulary. Vocabularies allow us to define the boundaries of an entity by offering a set of acceptable building blocks; in this case, words are the building blocks for role titles. Common words such as manager, director, senior, and specialist are the usual suspects. We also have words that represent expertise. For example, “scientist” is a broad term that defines a set of skills: a “research scientist” likely belongs in academia, whereas a “data scientist” is likely to work in a commercial company. All of these words modify the meaning of the role title heavily. But how do we decide whether a word belongs in a role title and whether its presence enriches the role title’s meaning? We have chosen commonality as the acceptance criterion for the vocabulary. Selecting a count threshold for accepting a word is a balancing act. A lower threshold means a bigger vocabulary: the resulting disambiguation will be less aggressive and will have higher coverage, but it carries the risk of inviting many false positives. A higher threshold means a smaller vocabulary: coverage will suffer, but disambiguation will be more robust.

We have created a role title vocabulary of 8,330 words with a threshold of 100. It covers 96% of role titles that appear at least 5 times out of 28 million instances. The main assumption here is that any word that is not a part of the vocabulary is either noise, a typo (if we failed to correct it), or so obscure that downstream models/processes cannot make sense of it.
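The commonality criterion above can be sketched in a few lines. This is a minimal illustration, not our production code: it assumes titles are already preprocessed strings, and the toy data, the threshold value, and the helper names are all hypothetical.

```python
from collections import Counter

def build_vocabulary(titles, threshold=100):
    """Accept a word into the vocabulary only if it is common enough."""
    word_counts = Counter()
    for title in titles:
        word_counts.update(title.lower().split())
    return {word for word, count in word_counts.items() if count >= threshold}

def coverage(titles, vocab):
    """Fraction of titles whose words all fall inside the vocabulary."""
    covered = sum(1 for t in titles if all(w in vocab for w in t.lower().split()))
    return covered / len(titles)

# Toy sample: frequent clean titles, a rare typo, and a one-off oddity.
titles = (["data scientist"] * 60
          + ["senior data scientist"] * 40
          + ["dta scienist"] * 2
          + ["growth hacker"])
vocab = build_vocabulary(titles, threshold=5)
# Typos and rare words fall below the threshold and stay out of the vocabulary.
```

Raising `threshold` shrinks the vocabulary (and coverage) but keeps noise out, which is exactly the balancing act described above.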

Disambiguated Title

An important step in the cleaning process is creating a “fingerprint” of the role title. After preprocessing, spelling correction, and token removal, we create an ID of the role title from the remaining tokens. We have been influenced by the simple but efficient approach taken in clustering in OpenRefine, which simply orders the words alphabetically and keeps only the unique ones. Using the “fingerprint”, we can group role titles sharing the same ID and assign a “disambiguated title” by taking the most common version.

Sharing “Senior Data Scientist” as “disambiguated title”
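The OpenRefine-style fingerprint and the grouping step can be sketched as below. This is a simplified illustration (the toy titles and function names are hypothetical); the real pipeline runs preprocessing, spelling correction, and token removal before fingerprinting.

```python
from collections import Counter, defaultdict

def fingerprint(title):
    """Lowercase, deduplicate, and alphabetically sort the tokens."""
    return " ".join(sorted(set(title.lower().split())))

def disambiguated_titles(titles):
    """Group titles by fingerprint; the most common raw form wins."""
    groups = defaultdict(Counter)
    for title in titles:
        groups[fingerprint(title)][title] += 1
    return {fp: counts.most_common(1)[0][0] for fp, counts in groups.items()}

titles = (["Senior Data Scientist"] * 5
          + ["Data Scientist Senior"] * 2
          + ["senior data scientist"])
mapping = disambiguated_titles(titles)
# All three variants share the fingerprint "data scientist senior",
# so they all receive "Senior Data Scientist" as the disambiguated title.
```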

Phrase Detection

Role titles have different components and can include a few roles listed together, as in “Founder and Chief Executive Officer”. If we were to aim for a one-to-one mapping to a curated list of role titles, we would always lose vital information or would be forced to make arbitrary decisions to choose one role title to map to.

Instead of mapping raw role titles to a curated set of role titles, we have chosen to deconstruct the role title into a list of words and phrases. Understanding the vocabulary will enable us to work with any role title spanned by these entities. This is very similar to the way a human would understand any written text. Instead of mapping every possible sentence to an instance in our memory, we learn the words and the grammar. This is the type of understanding that will be enabled by our process.

To capture the phrases, we have trained an n-gram language model to qualify phrase candidates found in the role titles. The example role title below holds pockets of valuable information for assessing the expertise of the individual. We could map this role to a “Vice President” but in the process, we would lose most of the context. Instead, we are qualifying the phrases and keeping them as a set of tags. Features extracted from this role title, seniority, phrases, and the disambiguated title together would capture the complete context but we could use any of the features individually depending on the nature of the downstream solution.
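To give a flavour of phrase qualification: the sketch below scores bigram candidates by how much more often two words co-occur than chance, in the spirit of word2vec-style phrase detection. Note this is a stand-in illustration, not the n-gram language model we actually trained; the data and names are hypothetical.

```python
from collections import Counter

def phrase_scores(titles, min_count=2):
    """Score adjacent word pairs; high scores suggest real phrases."""
    unigrams, bigrams = Counter(), Counter()
    for title in titles:
        words = title.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total = sum(unigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count >= min_count:
            # PMI-style ratio: observed co-occurrence vs. chance.
            scores[(a, b)] = (count * total) / (unigrams[a] * unigrams[b])
    return scores

titles = ["vice president of data science",
          "vice president",
          "data science lead",
          "science teacher"]
scores = phrase_scores(titles)
# "vice president" and "data science" qualify; rare pairs are filtered out.
```

A thresholded version of such a score decides which candidates are kept as phrase tags.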

Role Title Disambiguation Process

The diagram below shows the different steps in the disambiguation process.

Role Title Disambiguation Process

Start with the raw role title

Detect the language of the role title

Every other step in the process depends on the language of the role title: the spelling correction, the vocabulary of words and phrases, and the seniority dictionary all change with the language. Therefore, detecting the language at the beginning is vital.

Preprocessing includes dealing with non-Latin characters, expanding acronyms, removing punctuation, and removing whitespace.

The spelling correction step allows us to catch any spelling errors before we look for the vocabulary words in the role title.

Token removal works on the assumption that words that are left out of the vocabulary are irrelevant to the granularity we are aiming for.

Fingerprinting creates a unique representation of the role title that will be used as an ID.

The disambiguated title is a clean version of the role title, and it is shared by all the role titles with the same fingerprint.

Seniority detection is the process of looking for seniority terms inside the role title. If found, seniority is extracted as a new feature from the role title.

Phrase detection step makes use of an n-gram language model to assign probabilities to word groups and qualify them in their ability to represent the role title.
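The steps above can be strung together in a toy end-to-end sketch. The dictionaries, example input, and function name here are all hypothetical placeholders, and spelling correction, language detection, and phrase detection are omitted for brevity.

```python
import re

# Toy stand-ins for the per-language seniority dictionary and vocabulary.
SENIORITY = {"senior", "junior", "lead", "principal", "vp"}
VOCAB = {"data", "scientist", "engineer", "software", "senior", "manager"}

def disambiguate(raw_title):
    """Single-language sketch: preprocess, extract features, fingerprint."""
    # Preprocessing: lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", raw_title.lower())
    tokens = text.split()
    # (Spelling correction would run here, before vocabulary lookup.)
    # Seniority detection: pull seniority terms out as a separate feature.
    seniority = [t for t in tokens if t in SENIORITY]
    # Token removal: drop out-of-vocabulary words as noise.
    kept = [t for t in tokens if t in VOCAB]
    # Fingerprinting: unique tokens in alphabetical order, used as an ID.
    fp = " ".join(sorted(set(kept)))
    return {"fingerprint": fp, "seniority": seniority, "tokens": kept}

result = disambiguate("Senior Data Scientist!! (NYC)")
# The location token "nyc" is out of vocabulary and gets dropped.
```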

After completing the steps given above, we end up with a list of features for a role title. Instead of establishing a one-to-one mapping, we have created a structure that captures the information available in the role title. Using this clean data, we can move to a structure where extracted entities are points in a vector space in which we can infer relationships between them, getting us closer to achieving the “understanding” that we are looking for.

Evaluation

Conceiving and creating such a process is valuable; however, adoption and consistent value creation depend on proving value and improvement. A process with many rules of varying complexity requires a lot of care, and identifying edge cases and failures is very significant. The stakeholders need to acknowledge that this is an iterative process: failed edge cases feed back into learnings, and over time the output quality will increase.

For this reason, we have created an internal evaluation UI. This allowed us to recruit a group of testers and ask them to go through a set of role titles and check the outcome of the modules at each step. The feedback exposes the potential shortcomings of the process but equally importantly it gives us a ground truth set for quantitative testing. This way, we can measure the performance of individual modules every time we release a new version.

Further Work

We are aware that rule-based modules cannot always capture the complexity of human-level tasks. However, the current performance gives us a competent baseline to beat. Depending on the criticality of the tasks and the performance expectations, we will prioritize the improvement efforts.

A possible addition to the process is the detection of different entities such as location, company names, and software/technology names. Currently, we are choosing to discard location and company names from the role title vocabulary. However, with enough training data, we should be able to train a performant named entity recognition model that can recognize seniority terms as well.

The spelling correction module depends on a select dictionary of words and phrases. However, we could get better results by leveraging a multilingual dataset of spelling mistakes. If we fail to correctly detect the language of the role title, we end up incorrectly “correcting” foreign-language words, as they are not present in the vocabulary.

Another important improvement area is phrase detection. We have started with a baseline model to score the phrases; however, language modeling is one of the most popular research areas. As long as we have a large enough dataset of role titles, we can allow deep networks to learn the grammar dictating the structure of the role title and the semantic world behind it. Yet, context and progression of role titles can be harder to capture.

Conclusion

This is our response to the diversity and noise in the role title space. Iterative improvement is at the heart of this process, and such an effort needs time to mature. This is one of the earlier steps in the data journey to build a platform on which to base future efforts. It certainly helps to contextualize the data problem within the problems of the business in order to keep it prioritized and supported. In our experience, the business strongly supports the disambiguation efforts as long as the context and the nature of the solution are well communicated.

We capture the gist of the solution under the umbrella term “disambiguation”; however, we respond to many different problems with every module. It is essential that every module gets enough attention in evaluation and improvement. In the end, a chain is only as strong as its weakest link.

We hope that our story in creating a disambiguation process can inspire you to address similar problems. In the series that follows we will continue with posts regarding our progress and provide in-depth information on the individual modules.
