Measuring Workforce Greenness via Job Adverts: Part 2, Occupation

How we link green occupation data to our dataset of job adverts.

Published in Data science at Nesta, Jan 15, 2024

In our first instalment of a series of posts about green jobs, we explored how at Nesta we are measuring the greenness of the labour market using job adverts. Our greenness measures cut across three components:

Occupation
Industry
Skills

Our previous post discussed how we used Large Language Models (LLMs), fine-tuning and synthetic data to measure green industries. In this second instalment, we discuss how we calculated the greenness of the occupation component of each job advert.

This process involves standardising the job title given in the job advert to a Standard Occupational Classification (SOC) code, and then joining this SOC code to datasets of occupation greenness measures.

Although this post will focus mostly on the standardising task (as this is the hardest bit), we will also mention some of the green occupational datasets used, and show some results.

Assigning Standard Occupational Classification codes to job adverts

Job titles in job adverts are noisy and often contain spelling mistakes. For example, “Assistant Nurse”, “Nursing assistant” and “Assistent Nurse — London” could all be describing the same occupation.

Therefore, before we can join job titles to occupation greenness datasets, we need to assign each of our job adverts a standardised occupation. The UK’s standard way of doing this is to use Standard Occupational Classification (SOC) codes. SOC codes were designed by the Office for National Statistics (ONS), and the most recent version was released in 2020. Each occupation has a 4-digit SOC code and a more granular 6-digit extended SOC code.

The ONS SOC dataset is downloadable from the ONS website and gives the SOC codes of 29,902 job titles. A simplified sample of this dataset is given below:

To standardise each of our job adverts to a SOC code, we use a 5-step process to find the SOC code of the most semantically similar job title from this list of 29,902 job titles. These steps are outlined below.

Figure: Overview of the steps used to map a job title to a Standard Occupational Classification (SOC) code.

Step 1: We clean the input job title. This cleaning removes words that describe the job conditions rather than the job itself, e.g. common place names or words such as “part-time”.
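As a rough illustration of what this kind of cleaning might look like, here is a minimal sketch. The function name `clean_job_title` and both word lists are hypothetical and only stand in for whatever lists are used in practice; the real cleaning rules are not described in this post.

```python
import re

# Hypothetical word lists for illustration only; the real lists are not given in this post.
PLACE_NAMES = {"london", "manchester", "leeds"}
CONDITION_WORDS = {"part-time", "full-time", "temporary", "permanent"}
REMOVE = PLACE_NAMES | CONDITION_WORDS


def clean_job_title(title: str) -> str:
    """Lowercase a job title and drop place names and job-condition words."""
    text = title.lower()
    # Treat long dashes and common punctuation as separators,
    # but keep in-word hyphens so that "part-time" stays one token.
    text = re.sub(r"[\u2013\u2014,()/|]", " ", text)
    tokens = [t for t in text.split() if t != "-" and t not in REMOVE]
    return " ".join(tokens)


print(clean_job_title("Assistent Nurse \u2014 London"))
print(clean_job_title("Part-time Nursing Assistant (Leeds)"))
```

Note that spelling mistakes such as “Assistent” are deliberately left alone here; in the approach described in this post they are absorbed later by the semantic similarity step rather than corrected during cleaning.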

Step 2: We process the full ONS SOC dataset to create a mapping from unique job titles to SOC information. (For those familiar with this dataset: we combine the “INDEXOCC NATURAL WORD ORDER”, “ADD” and “IND” columns of this dataset to create unique job titles.)

A small sample of this is:

{
  "data scientist": ("2433/02", "2433", "2425"),
  "data analyst computing": ("2134/99", "2134", "2136"),
  "data analyst": ("3544/00", "3544", "3539"),
  "analyst developer": ("2134/03", "2134", "2136"),
  "meter assembler": ("8149/00", "8149", "8139"),
  "motor assembler electric": ("8141/00", "8141", "8131"),
  "motor assembler engineering": ("8142/02", "8142", "8132")
}

Each entry gives the unique job title followed by the SOC 2020 6-digit code, the SOC 2020 4-digit code, and finally the SOC 2010 code (which will come in handy later on!).
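A mapping like this could be built along the following lines. This is a sketch under assumptions: only the three title column names come from the ONS dataset as described above, while the SOC code column names (`soc_2020_ext`, `soc_2020`, `soc_2010`), the function name and the way the example rows are split across “ADD” and “IND” are made up for illustration.

```python
# Illustrative rows. The split of each title across the three columns is an
# assumption, and the three SOC column names are hypothetical.
rows = [
    {"INDEXOCC NATURAL WORD ORDER": "data analyst", "ADD": "computing", "IND": "",
     "soc_2020_ext": "2134/99", "soc_2020": "2134", "soc_2010": "2136"},
    {"INDEXOCC NATURAL WORD ORDER": "data analyst", "ADD": "", "IND": "",
     "soc_2020_ext": "3544/00", "soc_2020": "3544", "soc_2010": "3539"},
    {"INDEXOCC NATURAL WORD ORDER": "motor assembler", "ADD": "", "IND": "engineering",
     "soc_2020_ext": "8142/02", "soc_2020": "8142", "soc_2010": "8132"},
]


def build_title_lookup(rows):
    """Map each combined job title to (6-digit 2020, 4-digit 2020, 2010) codes."""
    lookup = {}
    for row in rows:
        parts = [row["INDEXOCC NATURAL WORD ORDER"], row["ADD"], row["IND"]]
        # Join the non-empty parts so qualifiers disambiguate otherwise identical titles.
        title = " ".join(p.strip().lower() for p in parts if p and p.strip())
        # If two rows produce the same title, keep the first one seen.
        lookup.setdefault(
            title, (row["soc_2020_ext"], row["soc_2020"], row["soc_2010"])
        )
    return lookup


title_lookup = build_title_lookup(rows)
print(title_lookup["data analyst computing"])
```

Combining the columns matters because the bare title “data analyst” and the qualified title “data analyst computing” point to different SOC codes.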

Step 3: We embed these unique ONS job titles and the input job title using the all-MiniLM-L6-v2 Sentence Transformers pre-trained model.

Step 4: We then calculate the cosine similarity between the embedded input job title and all the embedded ONS job titles. In our running example, these values are:

* For the purposes of simplifying this running example this value has been changed to 0.50 from 0.47.
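To make Step 4 concrete, the sketch below computes cosine similarity in plain Python. The short vectors are toy stand-ins, not real model output (all-MiniLM-L6-v2 produces 384-dimensional embeddings), and the names `cosine_similarity`, `ons_embeddings` and `query_embedding` are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
ons_embeddings = {
    "nursing assistant": [0.9, 0.1, 0.0],
    "data analyst": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # pretend embedding of a cleaned input job title

scores = {
    title: cosine_similarity(query_embedding, vector)
    for title, vector in ons_embeddings.items()
}
best_title = max(scores, key=scores.get)
print(best_title, round(scores[best_title], 3))
```

With tens of thousands of ONS job titles, a vectorised matrix operation would normally replace this loop, but the quantity being computed is the same.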

Step 5: Finally, we take the SOC information for the ONS job title with the highest similarity, as long as that similarity is over a certain threshold (our default is 0.67). If no single job title has a high enough similarity, we use a consensus approach at the SOC 2020 4-digit level (by default, this requires more than 3 matches with a similarity score over 0.5). With the default values, the final matches for each input job title would be:

Thus, this algorithm matches inconsistent or unclean job titles to standardised codes; “Assistant Nurse”, “Nursing assistant”, “Assistent Nurse — London” all get mapped to the SOC code “6131/99 — Nursing auxiliaries and assistants n.e.c.”.
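The selection logic of Step 5 might be sketched as below. The exact consensus rule is not spelled out in this post, so this is one possible reading: among the matches with similarity over 0.5, the most frequent SOC 2020 4-digit code is accepted if more than 3 matches share it. The function name, parameter names and the extension codes in the example data are illustrative assumptions.

```python
from collections import Counter


def select_soc(scored_matches, sim_threshold=0.67, consensus_sim=0.5, consensus_min=3):
    """Pick a SOC code from a list of (similarity, (soc6, soc4, soc2010)) tuples.

    Returns a dict with the matched level and code, or None if nothing qualifies.
    """
    if not scored_matches:
        return None

    # 1. Accept the single best match if it clears the main threshold.
    best_sim, best_codes = max(scored_matches, key=lambda match: match[0])
    if best_sim > sim_threshold:
        return {"level": "6-digit", "soc": best_codes[0]}

    # 2. Otherwise fall back to a consensus at the 4-digit level
    #    (assumed rule: more than `consensus_min` matches share a 4-digit code).
    candidates = [codes[1] for sim, codes in scored_matches if sim > consensus_sim]
    if candidates:
        soc4, count = Counter(candidates).most_common(1)[0]
        if count > consensus_min:
            return {"level": "4-digit", "soc": soc4}
    return None


# Illustrative matches; the extension codes here are made up for the example.
strong = [(0.81, ("2134/03", "2134", "2136")), (0.55, ("3544/00", "3544", "3539"))]
weak_but_consistent = [
    (0.60, ("2134/03", "2134", "2136")),
    (0.58, ("2134/99", "2134", "2136")),
    (0.55, ("2134/01", "2134", "2136")),
    (0.52, ("2134/02", "2134", "2136")),
]
print(select_soc(strong))
print(select_soc(weak_but_consistent))
```

The fallback trades granularity for coverage: a job title that is not close enough to any single ONS title can still be assigned a 4-digit code when its near matches agree.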

Thresholds and evaluation

There are over 400,000 unique job titles in our dataset. The most common is “Care Assistant”, with over 62,000 job adverts, but 84% of the unique job titles occur in only one job advert (e.g. ‘Casual Bilingual Assistant Cantonese’). We want our algorithm to perform particularly well on the most common job titles in our dataset, but acknowledge that these are also likely to be the most generic and standardised job titles. Therefore, to balance the most common job titles against the long tail of infrequent job titles, we evaluated our SOC matching algorithm on two datasets:

The most common 300 job titles in our job advert dataset (which account for 22% of the job adverts).
A randomly selected sample of 200 job titles from our job advert dataset.
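Two evaluation sets of this kind could be drawn as in the sketch below. It assumes the random sample is taken over unique job titles (the post does not say whether sampling was over titles or over adverts), and the function name and toy data are hypothetical.

```python
import random
from collections import Counter


def build_evaluation_sets(job_titles, n_common=300, n_random=200, seed=0):
    """Return (most common titles, random sample of unique titles, advert coverage)."""
    counts = Counter(job_titles)
    most_common = [title for title, _ in counts.most_common(n_common)]

    # Assumption: sample uniformly over unique titles, so rare titles are well represented.
    unique_titles = sorted(counts)
    rng = random.Random(seed)
    random_sample = rng.sample(unique_titles, min(n_random, len(unique_titles)))

    # Share of all adverts accounted for by the most common titles.
    coverage = sum(counts[title] for title in most_common) / len(job_titles)
    return most_common, random_sample, coverage


# Toy data: one title per job advert.
adverts = ["care assistant"] * 5 + ["nurse"] * 3 + ["chef"]
common, sample, coverage = build_evaluation_sets(adverts, n_common=2, n_random=2)
print(common, sample, round(coverage, 2))
```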

Setting a threshold similarity score

As mentioned in the steps to our algorithm, we match to a 6-digit SOC code if the similarity score is over a certain threshold. To find the default value for this threshold, we used the evaluation datasets.

We found the most likely 6-digit SOC match for every job title in the evaluation datasets and manually judged whether each match was poor, OK or excellent. We couldn’t always make this judgement due to a lack of domain knowledge, but of the 459 job titles we could confidently quality check, 72% were excellent matches to a 6-digit SOC, 16% were OK matches and 12% were poor matches.

Figure: Match quality and cosine similarity scores of matches to 6-digit SOC codes.

The similarity scores were generally quite high even when the match quality was poor. However, since most of our matches were OK or excellent, we opted for a relatively low threshold value of 0.67. With this threshold value we estimate that 92% of the predicted 6-digit SOC matches are likely to be correct.
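An estimate like this can be read off the manually labelled evaluation data: keep only the matches above the threshold and count how many were judged OK or excellent. The sketch below shows the calculation on made-up labels; the function name and the example data are hypothetical and do not reproduce the figures above.

```python
def share_acceptable_above(evaluated, threshold):
    """Share of matches above `threshold` that were labelled OK or excellent.

    `evaluated` is a list of (similarity, label) pairs, with label in
    {"excellent", "ok", "poor"}. Returns None if no match clears the threshold.
    """
    kept = [label for similarity, label in evaluated if similarity > threshold]
    if not kept:
        return None
    return sum(label != "poor" for label in kept) / len(kept)


# Made-up labelled matches for illustration.
evaluated = [
    (0.95, "excellent"),
    (0.80, "ok"),
    (0.70, "poor"),
    (0.69, "excellent"),
    (0.60, "poor"),
]
print(share_acceptable_above(evaluated, threshold=0.67))
```

Repeating the calculation over a range of thresholds shows the trade-off directly: a higher threshold raises the share of acceptable matches but leaves fewer job titles with a 6-digit match.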

Evaluation

With our similarity score threshold of 0.67 we could then properly evaluate the full SOC mapping pipeline. For this, we also manually coded the quality of any matches to 4-digit SOC.

Of the 300 most common job titles:

90% were mapped to 6-digit SOC;
4% were mapped to 4-digit SOC; and
7% couldn’t be mapped to any suitable SOC.

For the 200 randomly sampled job titles, 56% were mapped to 6-digit SOC, 5% to 4-digit SOC and 36% were not mapped.

In the most common job titles we find the quality of the SOC match is excellent or OK 92% of the time for those job titles matched to 6-digit SOC, and 100% when matched to 4-digit SOC. For the random sample of job titles, the quality of the SOC match is excellent or OK 89% of the time for those job titles matched to 6-digit SOC, and 90% when matched to 4-digit SOC.

Figure: Evaluation results when mapping job titles to SOC codes.

Some examples of different quality matches are given in the table below.

Occupation greenness data

Finally, having estimated SOC codes for most of our job adverts, we were able to link them to datasets on occupational greenness.

A core dataset on the greenness of occupations was created by the US’s Occupational Information Network (O*NET). In 2009, O*NET curated a list of green occupations by reviewing publications and other sources on workplace topics relevant to the green economy [1]. Through this research, and building on work they’d already completed, 215 occupations were classified into three occupational categories: green increased demand (64 occupations), green enhanced skills (60 occupations), and green new and emerging (91 occupations). Although this work is tied to the US economy (and therefore the US SOC coding system) and is quite out of date, it is still widely used in understanding the UK’s green economy. For example, GreenSOC, developed by the Warwick Institute for Employment Research, is influenced by it [2], and the Greater London Authority (GLA) used it in an analysis of London-based green jobs [3].

Data provided by the GLA allows us to map UK SOC (2010) codes to the O*NET US SOC taxonomy, and therefore to find the green occupational category for each UK SOC code. However, this method is slightly flawed since there isn’t a one-to-one mapping between SOC 2020 codes and SOC 2010 codes. We also use a dataset of green timeshares provided by the ONS [4]; this is also based on O*NET, and gives estimates of the fraction of time spent doing green tasks per SOC 2010 code. Finally, we use a dataset of green topics provided by more recent work by O*NET [5].

Thus for the SOC code “2433/02 — Actuaries, economists and statisticians” our green measures include:

O*NET green category: Green New & Emerging
O*NET green binary (Green / Not Green): Green
Fraction of time spent doing green tasks (ONS): 12.5%
Number of O*NET green topics: 39
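The linking step can be pictured as a chain of lookups: SOC 2020 extension code to SOC 2010 code, then SOC 2010 code to each green dataset. In the sketch below the table and function names are hypothetical, and keying all three green tables directly on SOC 2010 is a simplifying assumption (in practice the O*NET data is reached via a crosswalk to US SOC codes). The values are the ones quoted above for “2433/02”, whose SOC 2010 code “2425” comes from the sample in Step 2.

```python
# SOC 2020 extension code -> SOC 2010 code (from the Step 2 sample mapping).
SOC_2020_TO_2010 = {"2433/02": "2425"}

# Hypothetical lookup tables, keyed on SOC 2010 for simplicity.
ONET_GREEN_CATEGORY = {"2425": "Green New & Emerging"}
ONS_GREEN_TIMESHARE_PCT = {"2425": 12.5}
ONET_GREEN_TOPIC_COUNT = {"2425": 39}


def green_measures(soc_2020_ext):
    """Collect the occupation greenness measures for a SOC 2020 extension code."""
    soc_2010 = SOC_2020_TO_2010.get(soc_2020_ext)
    category = ONET_GREEN_CATEGORY.get(soc_2010)
    return {
        "onet_green_category": category,
        # Binary flag: green if the occupation falls in any O*NET green category.
        "onet_green_binary": "Green" if category else "Not Green",
        "green_timeshare_pct": ONS_GREEN_TIMESHARE_PCT.get(soc_2010),
        "n_green_topics": ONET_GREEN_TOPIC_COUNT.get(soc_2010),
    }


print(green_measures("2433/02"))
```

Because every lookup passes through the SOC 2010 code, any SOC 2020 code without a clean SOC 2010 equivalent will inherit approximate or missing green measures, which is the flaw noted above.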

Analysis of green occupations

Running our algorithm on a sample of 1 million job adverts allows us to investigate the greenness of occupations in the UK. Some broad findings include:

SOC codes could be assigned to 82% of the job adverts.
The most common occupation in our dataset is “Bookkeepers, payroll managers and wage clerks” (SOC 4122/99), which makes up 2.4% of the jobs in our dataset. The job titles ‘Accounts Assistant’, ‘Management Accountant’, ‘Assistant Accountant’, ‘Assistant Management Accountant’, and ‘Accountant’ make up over 50% of the job titles assigned to this SOC code. This occupation has 0% of time spent on green tasks, and is therefore one of the least green occupations.
The occupation with the highest green timeshare is “Production managers and directors in manufacturing” (SOC 1121/00) with 66.7% of the time spent on green tasks. In our dataset 0.5% of our job adverts are for this occupation, where the job titles ‘Service Manager’, ‘Production Manager’, ‘Product Owner’, ‘Engineering Manager’, and ‘Production Supervisor’ make up around 35% of the job titles assigned to this SOC code.
The occupation with the second highest green timeshare is “Sustainability officers” (SOC 2152/05) with 62.5% of time spent on green tasks. 0.06% of our job adverts are for this occupation, where the job titles ‘Sustainability Consultant’, ‘Sustainability Manager’, ‘Senior Sustainability Consultant’, ‘Graduate Sustainability Consultant’, and ‘Head of Sustainability’ make up around 30% of the job titles assigned to this SOC code. There are also 55 O*NET green topics associated with this occupation, including ‘Air quality’, ‘Energy efficiency’, ‘Green recreation’, ‘Sustainability’, and ‘Water resources’.
The region with the highest average green timeshare of 7.3% is Leicester, Rutland and Northamptonshire.
The region with the lowest average green timeshare of 4.9% is Lancashire.