Research Hub Updates: Skill Extraction

Is Facebook a job skill? It depends.

This post covers a recent update to the Data@Work Research Hub, a collection of public datasets about the US labor market. The Hub is produced by the Open Skills Project, an initiative based at the Center for Data Science and Public Policy at the University of Chicago. If you're unfamiliar with the project, check out our introductory post.

When we launched the initial version of the Data@Work Research Hub, we had a problem.

A snippet of job titles and their associated skills in the Research Hub

There are a few highly visible problems here. Why are Facebook, Twitter, and Instagram so popular? What is Forth? To understand why these show up so often, it helps to explain how skill extraction works in the first version of the Research Hub.

Naive Skill Extraction

The initial version of skill extraction for the Research Hub is rudimentary. We build on the commonly used ONET jobs and skills taxonomy, which is updated gradually over time based on survey data. For skills, ONET catalogs a mixture of hard skills, soft skills, and tools under categories it calls Knowledge, Skills, Abilities, and Technology Skills. For the purposes of extraction, we pool these together into one broad list that we refer to simply as skills.

We take this combined list (about 20,000 entries, most of them technology skills) and search each job posting's text for matches. If a list entry shows up, we tag it. This works in many cases, but sometimes the text appears in a posting for other reasons, such as a link to more information about the company ("facebook") or as a common English word ("forth").
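In code, the naive approach amounts to little more than a keyword search. Here is a minimal sketch (function and variable names are hypothetical; the real pipeline differs):

```python
import re

# A sketch of the naive matching step: scan each posting for
# case-insensitive, word-boundary occurrences of entries from the
# pooled ONET skill list.
def extract_skills(posting_text, skill_list):
    """Return every skill whose name appears verbatim in the posting."""
    text = posting_text.lower()
    return [
        skill for skill in skill_list
        if re.search(r"\b" + re.escape(skill.lower()) + r"\b", text)
    ]

skills = ["Python", "Facebook", "Forth"]
posting = "Follow us on facebook.com/acme. Bring forth your Python expertise."
print(extract_skills(posting, skills))  # ['Python', 'Facebook', 'Forth']
```

Both false positives from above show up here: "Facebook" matches inside a company URL, and "Forth" matches the ordinary English word.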

How do we fix this?

Figuring out whether a matched word or phrase from the ONET taxonomy is a real skill in context is not a simple problem. We are working on more robust solutions, which I'll mention later in the post. As a short-term fix, we can filter by the job posting's SOC (occupation) code to produce a much smaller list whose precision we can be more confident in.

Each posting carries an occupation code: our data partners classify the SOC code of the posting and send it to us, and ONET publishes a list of skills for each SOC code. Using this information, we can filter the skills we found down to those ONET considers relevant to the occupation. We will no longer indicate that (all jokes about work procrastination aside) cost accountants need to be well-versed in Facebook, but marketing managers still do. Nor will we tell you that cashiers, RNs, and English professors need to know a programming language invented in 1970.
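The SOC filter can be sketched as a simple lookup-and-intersect, assuming a hypothetical SOC_SKILLS mapping from occupation codes to the skills ONET lists for them (the entries below are illustrative, not ONET's actual data):

```python
# Illustrative mapping from SOC codes to ONET-listed skills.
SOC_SKILLS = {
    "13-2011": {"Microsoft Excel", "Accounting software"},  # Accountants and Auditors
    "11-2021": {"Facebook", "Google Analytics"},            # Marketing Managers
}

def filter_by_soc(extracted_skills, soc_code, soc_skills=SOC_SKILLS):
    """Keep only skills that ONET associates with the posting's occupation."""
    allowed = soc_skills.get(soc_code, set())
    return [skill for skill in extracted_skills if skill in allowed]

# "Facebook" survives for a marketing manager posting...
print(filter_by_soc(["Facebook", "Forth"], "11-2021"))  # ['Facebook']
# ...but not for an accountant posting.
print(filter_by_soc(["Facebook", "Forth"], "13-2011"))  # []
```

The design choice here is deliberate: anything ONET does not already associate with the occupation is discarded, which is exactly why this fix is precise but conservative.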

Is this a good solution?

With this approach, the list of skills we get is quite small. So small, in fact, that we rarely extract any skills from a job posting at all. This solution also pins us to what ONET currently knows about, when in reality we want to work towards extending ONET's taxonomy using insights gleaned from job postings.

We are improving precision at the expense of recall, and this highlights something important to understand: the language used in ONET rarely matches the language used in job postings. The skills are in there, and we need another way to extract them, but as a stopgap let's be pickier.
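The tradeoff can be made concrete with toy numbers (illustrative only, not measured results from our pipeline):

```python
# Standard precision/recall arithmetic over true positives (tp),
# false positives (fp), and false negatives (fn).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Naive matching: many extractions, many of them spurious.
print(precision(tp=40, fp=60), recall(tp=40, fn=10))  # 0.4 0.8
# SOC-filtered: few extractions, but mostly real skills.
print(precision(tp=8, fp=2), recall(tp=8, fn=42))     # 0.8 0.16
```

In this toy scenario the filter doubles precision while cutting recall by a factor of five, which matches the behavior described above: a much more trustworthy but much smaller skill list.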

If you really want the old version, it’s archived on the Research Hub.

What’s next?

We're working on far better solutions for generating candidate skill phrases, untethered from the current ONET taxonomy: rule-based methods, word embeddings, and conditional random fields. We're also building a web labeling tool to more effectively gauge the precision and recall of the various methods.

You’ll hear much more about these new approaches in later blog posts. Until then, don’t use version 1 of the Research Hub as an excuse to spend hours a day at work on Facebook: the US economy will thank you for it.