Measuring Workforce Greenness via Job Adverts: Part 1, Industry

India Kerle
Data science at Nesta
7 min read · Dec 5, 2023


At Nesta, we’re measuring the greenness of every job, business and sector to help facilitate the transition of the UK workforce to net zero.

Central to this research is our ever-growing database of online job postings. With the permission of job boards, we’ve been scraping posts since 2021 and we’ve collected almost 8 million postings so far.

As part of our research process we’re sharing our models, methods and tools for anyone to use, and to give an opportunity for feedback on how we might improve them.

This is the first in a series of methodological blog posts on how we are trying to measure labour market greenness.

How are we measuring greenness via job adverts?

We’re aiming to gauge the greenness of every job advert via three components:

  1. occupation
  2. industry
  3. skills

An output visual of job advert greenness across skills, occupations and industries

Across each component, we:

  1. extract relevant information from a given job advert
  2. standardise the extracted information to official, government-released standards and
  3. join the standardised information to publicly available datasets that report on occupation, industry and skill greenness

A high-level methodological framework of our greenness approaches

However, the information we need to extract from job adverts, the standards we want to map onto, and the publicly available datasets we want to use, all differ from each other. As a result, the methods we need to use to generate a multidimensional picture of greenness differ for each component (occupation, industry, skills).

In this first instalment of the series, we’ll focus on how we calculated the greenness of the industry component in each job advert.

Using standardised industrial classification codes to assess job ads

To understand the green labour market from an industry perspective, we needed to first assign standard industrial classification (SIC) codes to job adverts.

SIC codes are the Office for National Statistics’ (ONS) classification framework for economic activities. They’re hierarchical and range from manufacturing and sustainability consulting to recruitment.

Example SIC codes from the Office for National Statistics (ONS)

In addition to defining SIC codes, the ONS regularly reports on SIC-level statistics, from earnings and hours worked by industry, to industry turnover and employment size. Most relevantly, the ONS recently released experimental estimates of the green labour market, including emissions per employee by SIC code.

By assigning SIC codes to our database of job adverts, we’re able to link each job to existing measures of industry greenness, such as SIC-level greenhouse gas (GHG) emissions.
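To make that linkage concrete, the final join is straightforward. Here’s a minimal sketch with invented column names and values (the real figures come from the ONS datasets):

```python
import pandas as pd

# Job adverts with assigned SIC codes (hypothetical examples)
job_adverts = pd.DataFrame({
    "job_id": [101, 102],
    "sic_code": ["35.11", "47.11"],
})

# SIC-level greenness measures (invented values; the real figures are the ONS's)
sic_greenness = pd.DataFrame({
    "sic_code": ["35.11", "47.11"],
    "ghg_emissions_per_employee": [210.4, 3.2],
})

# A left join keeps adverts even when no greenness measure is available
job_greenness = job_adverts.merge(sic_greenness, on="sic_code", how="left")
```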

How we assigned SIC codes to job adverts

When someone registers their company via Companies House, the UK government agency responsible for incorporating all forms of UK companies, they must declare its SIC code. In other words, they must declare what type of business they’re starting.

Although this data is publicly available via an API, there are many issues with data quality. These range from user input error and company changes after the point of registration, to the overuse of vague SIC codes like “other business support activities not elsewhere classified”.
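For the curious, company profiles (including self-declared SIC codes) can be pulled from the Companies House API along these lines. This is a minimal sketch, assuming you have registered for an API key:

```python
import requests

API_KEY = "your-companies-house-api-key"
company_number = "01234567"  # placeholder company number

# The company profile endpoint returns registered details,
# including the self-declared SIC codes
response = requests.get(
    f"https://api.company-information.service.gov.uk/company/{company_number}",
    auth=(API_KEY, ""),  # the API key is sent as the basic auth username
    timeout=10,
)
response.raise_for_status()
print(response.json().get("sic_codes", []))  # eg ["82990"]
```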

Although this serves as a reasonable baseline, we wanted a more robust methodology for assigning SIC codes to job ads.

We therefore identified the greenness of the industry component of job ads by:

  1. extracting company or industry information from job ads by training a supervised machine learning model
  2. standardising job ads to SIC codes by identifying the SIC code descriptions most semantically similar to the extracted industry information and
  3. joining SIC codes to SIC-level greenness datasets, including the ONS’s total greenhouse gas emissions, greenhouse gas emissions per unit of economic output and carbon emissions per employee by industry.

Pipeline summary of assigning SIC codes to job ads

Let’s dig into those steps in more detail.

STEP 1: Extract industry information from job ads

Job ads often include information about the company hiring for the role. This information can provide clues as to the industry the company sits in. In order to extract sentences with industry information, we wanted to train a model to predict whether a sentence was a company description or not. To do so, we needed to generate labelled data and train an appropriate model for the task.

1. Generating training data

Our first step was to generate a sufficient number of labels to use as training data for our model. We also wanted to make use of large language models (LLMs) to expedite the labelling process.

Therefore, we chose Explosion AI’s Prodigy as our annotation tool. We also used LangChain, a Python framework for building LLM applications, and OpenAI’s gpt-3.5-turbo to predict labels.

Although Prodigy had a series of built-in OpenAI recipes at the time of development, we wanted a bit more flexibility with the gpt-3.5-turbo responses. As a result, we opted to build our own custom Prodigy recipes using LangChain.

We prompted OpenAI’s gpt-3.5-turbo to identify the start and end spans of company descriptions in job ads and to return this information in a JSON format to feed into the custom Prodigy recipe.
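A simplified sketch of that span-prediction step, using the 2023-era LangChain API (the prompt here is an illustrative stand-in, not our actual recipe prompt):

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# Stand-in prompt; the real one lives in our custom Prodigy recipe
PROMPT = """Identify the company description in the job advert below.
Return JSON with integer "start" and "end" character offsets of the span,
or {{"start": null, "end": null}} if there is none.

Job advert:
{advert}"""

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

def predict_company_span(advert_text: str) -> str:
    """Ask the LLM for a company description span, returned as a JSON string."""
    message = HumanMessage(content=PROMPT.format(advert=advert_text))
    return llm([message]).content  # parsed downstream by the Prodigy recipe
```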

Using both tools, we were able to seamlessly build a custom labelling instance.

Our custom labelling front end, making use of Prodigy and LangChain

Instead of combing through hundreds of job ads to manually identify company description sentences, we simply had to correct the LLM-predicted ones.

The lighter cognitive load allowed for faster labelling and more labels to train our smaller model downstream. As a result, we were able to label 450 job adverts for company descriptions.

2. Training a model

Once we generated training data, we wanted to choose an appropriate model architecture to train.

Given the many different ways companies in different sectors describe themselves, we ultimately opted to fine-tune a classification head of a BERT transformer model to adequately capture lexical nuance. In particular, we used jobbert-base-cased, a BERT model continuously pre-trained on millions of sentences from online English job postings.

The model was trained on 486 manually labelled company description sentences (some job adverts had more than one company description sentence) and 1,000 non-company description sentences of fewer than 250 characters (this character limit was chosen to accommodate poor sentence-splitting cases). On a held-out test set of 147 sentences, it achieved the following metrics:

The code to train the model is openly available here. Alternatively, the model itself is hosted on the Hugging Face Hub and can be accessed there.
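For example, the fine-tuned classifier can be loaded with the transformers pipeline API. Note that the model ID and label names below are illustrative; the exact details are on the model card:

```python
from transformers import pipeline

# Illustrative hub ID; see the model card for the exact ID and label names
classifier = pipeline(
    "text-classification",
    model="nestauk/jobbert-base-cased-compdecs",
)

# Returns a label and confidence score for the sentence
classifier("We are an award-winning renewable energy supplier based in the UK.")
```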

Once we had a model to extract company descriptions from job adverts, our next step was to map them to SIC codes.

STEP 2: Mapping company descriptions to SIC codes

A key issue with SIC data is that the codes themselves don’t have descriptions. For example, the SIC code “R” is simply named “arts, entertainment and recreation”. Although there is certainly signal in the SIC names themselves, they are short, often contextless and phrased quite differently from how companies describe themselves.

To combat this, we synthetically generated SIC company descriptions to map onto instead of using the SIC names themselves.

We did this using the Python library LangChain to prompt OpenAI’s gpt-3.5-turbo to generate company descriptions for each SIC name. A subset of the SIC company descriptions was then manually reviewed to ensure the descriptions appeared appropriate given their SIC names. This yielded a synthetic dataset of SIC codes, SIC names and SIC company descriptions.

An example SIC company description for the code arts, entertainment and recreation
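For illustration, the generation step might look like the following sketch, again using the 2023-era LangChain API (the prompt is a stand-in, not our exact wording):

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

def generate_sic_description(sic_name: str) -> str:
    """Generate a plausible company description for a SIC name."""
    prompt = (
        "Write a short description of a typical company whose main "
        f"economic activity is: {sic_name}."
    )
    return llm([HumanMessage(content=prompt)]).content

generate_sic_description("arts, entertainment and recreation")
```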

With our synthetic dataset in tow, we converted both the job advert company descriptions and the SIC company descriptions into numerical representations, and calculated the cosine similarity between the vectors to identify the most semantically similar SIC codes. We used the pretrained sentence transformers model “all-MiniLM-L6-v2” to embed the sentences and created a queryable vector database of SIC company descriptions using FAISS.
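A condensed sketch of that retrieval setup (the SIC descriptions below are invented placeholders, not our generated dataset):

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder SIC company descriptions; the real corpus is LLM-generated per SIC name
sic_descriptions = [
    "A company that organises live music events and festivals.",
    "A company that manufactures food products for retail.",
]

# Normalise the embeddings so inner product equals cosine similarity
sic_embeddings = model.encode(sic_descriptions, normalize_embeddings=True)
index = faiss.IndexFlatIP(sic_embeddings.shape[1])
index.add(sic_embeddings)

# Query with a company description extracted from a job advert
query = model.encode(
    ["We put on gigs and outdoor festivals across the UK."],
    normalize_embeddings=True,
)
similarities, ids = index.search(query, k=2)  # top-2 most similar SIC descriptions
```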

A match was made if the cosine similarity between the job advert company description and the closest SIC company description was greater than 0.5. If the similarity was below 0.5, the top SIC company description matches were considered: if a majority SIC code existed among them with an average similarity above 0.3, this also yielded a match. Otherwise, a SIC code was not returned.
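In code, the matching rules might look something like this sketch (our reading of the thresholds, not the exact implementation):

```python
from collections import Counter

def assign_sic(candidates: list[tuple[str, float]]) -> str | None:
    """Apply the matching rules to (sic_code, similarity) pairs, best first.

    Accept the top match outright if its similarity exceeds 0.5; otherwise
    accept a majority SIC code among the top candidates if its average
    similarity exceeds 0.3.
    """
    top_code, top_sim = candidates[0]
    if top_sim > 0.5:
        return top_code

    counts = Counter(code for code, _ in candidates)
    majority_code, n = counts.most_common(1)[0]
    if n > len(candidates) / 2:
        sims = [sim for code, sim in candidates if code == majority_code]
        if sum(sims) / len(sims) > 0.3:
            return majority_code
    return None

assign_sic([("90", 0.42), ("90", 0.38), ("93", 0.31)])  # -> "90"
```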

To evaluate this mapping approach, we labelled the quality (“bad”, “ok” or “good”) of 335 job adverts matched to SIC codes. Overall, we found that 71% of matches were either “ok” or “good”.

Qualitatively, the approach appears to do well when the company description is explicit about the industry and contains multiple sentences. It does less well when the company description is vague (eg, “we are a company committed to the future”) or when a term has multiple meanings (eg, “Juice is recruiting for a manufacturing company”, which was mapped to a food-related SIC code).

SIC mapping evaluation charts

Future Steps

We’ll continue posting methodological updates, so watch this space if you’d like to learn about our approach to extracting occupation-level and skill-level greenness data from job adverts.

If you have any questions or ideas for how to improve our methodology, feel free to respond to this blog.

The full codebase is open and can be found here. Code relevant to SIC mapping can be found here.
