Categorisation in Open Data Scotland

Karen Jewell
Jan 7, 2023 · 4 min read


In yesterday’s refresh of Open Data Scotland, there is a minor update to the way we categorise datasets. It’s not a new feature, just a change to the way we’ve been doing things. Previously, we categorised datasets based on publisher-provided categories; now we categorise them based on what is given in the Titles and Descriptions of each dataset. In theory, this should give us more accurate categorisation because:

  1. Not all datasets come with tags to begin with, so this way we’re reaching more of the “unknowns”
  2. We’re using the keywords and phrases naturally used to describe the dataset, rather than what has been assigned to them in the context of the original corpus
  3. We’re removing accidental associations carried over from the original taxonomies which may not apply in the context of Open Data Scotland.

There is a mildly funny story which comes with this change. I’d already cooked this decision up as far back as our first milestone in 2021 Q4. I had it all planned out, and I knew exactly what I needed to do. I knew it in such great detail that somewhere along the line I’d convinced myself this was already how we categorised datasets. Until, to my horror, it was pointed out to me that this was not, in fact, the case. So naturally, this went to the top of my Christmas Sprint just so I wouldn’t lose my mind.

[Screenshot: the Slack conversation in which Karen finds out she has imagined the categorisation function in her head]
The thread that started to unravel

But even with that detailed plan in place, I’d soon find out it wasn’t going to work anyway. So there are some modifications to the original plan.

The original plan included tokenising all the text into individual words and potentially stemming them down to root words, so we could do better matching. But in practice there are phrases, or combinations of words, that are more meaningful together than they are separately, so it did not make sense to break them all up.

Examples of these are “green belt” and “intensive care” where together they have more meaning in the context of natural environments and healthcare than “green”, “belt”, “intensive”, or “care” would ever do separately.
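For illustration, here’s a tiny Python sketch (with made-up example text) of why tokenising first would lose those phrases, while simple substring matching keeps them:

# Hypothetical illustration: splitting into tokens destroys multi-word phrases
text = "proposed green belt boundary changes"

tokens = text.split()
print("green belt" in tokens)  # False, the phrase no longer exists as a unit
print("green belt" in text)    # True, substring matching sees the whole phrase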

In the end, the gist of it is:

// Examples of keywords and phrases mapped to a category.
// If any of these keywords/phrases exist in a dataset's Title or Description,
// the dataset is assigned that Category.

"Education": [
    "primary schools",
    "education",
    "educational",
    "library",
    "school meals",
    "schools",
    "school",
    "nurseries",
    "playgroups",
    "pupil",
    "student",
    "early years",
    "early learning",
    "childcare"
],
"Law and Licensing": [
    "law",
    "licensing",
    "regulation",
    "regulations",
    "licence",
    "licenses",
    "permit",
    "permits",
    "police",
    "court",
    "courts",
    "tribunal",
    "tribunals",
    "policing",
    "crime"
]

This approach means a keyword can appear multiple times in the same text, which could be an indicator of match strength, but we ignore this and only consider the presence or absence of a keyword.

// An example of a dataset's keywords used to attach categories
{
    'Council and Government': ['council'],
    'Food and Environment': ['allotment', 'land', 'environment'],
    'Housing and Estates': ['allotment']
}
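For the curious, here’s a minimal sketch of what that matching could look like in Python. The function and variable names are illustrative, not necessarily what’s in the repo:

# Sketch: assign categories based on the presence of keywords/phrases
# in a dataset's Title and Description (illustrative, not the repo's exact code)
def categorise(title, description, mapping):
    text = f"{title} {description}".lower()
    matches = {}
    for category, keywords in mapping.items():
        found = [kw for kw in keywords if kw in text]
        if found:
            matches[category] = found
    return matches

Run over a dataset titled “Allotment sites” with a council-related description, it would return a dict shaped like the example above.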

It’s not a perfect process, but it works better than it did before. This will work for as long as we keep curating that keyword/phrase-to-category association. There is a new onus to keep the mapping small, light, succinct and effective. Keywords and phrases should ideally be distinctive and unique to their categories. It may take us a few iterations to get it feeling right.
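One quick way to keep ourselves honest on that distinctiveness point is to flag keywords that sit under more than one category. Again, a sketch rather than code from the repo:

from collections import defaultdict

# Sketch: find keywords/phrases mapped to more than one category
def overlapping_keywords(mapping):
    seen = defaultdict(list)
    for category, keywords in mapping.items():
        for kw in keywords:
            seen[kw].append(category)
    return {kw: cats for kw, cats in seen.items() if len(cats) > 1}

In the example above, “allotment” would be flagged because it appears under both “Food and Environment” and “Housing and Estates”.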

The next step to improve on this approach is probably to take away that onus of curating a mapping. It’s what we’ve been dubbing “smarter categorisation”, and here’s where ticket #172 comes in if anybody would love to take up the challenge. Or if you’d like a more beginner-friendly task, there’s #214, which is still an improvement in its own right.

Yet perhaps the most exciting thing to come out of this exercise is the series of checks now devised to help us monitor the quality of categorisation from here on out. But it’s still sitting in my local environment, because it has sparked bigger thinking about where we should house these data quality checks which are not code tests. That’s a post for another day.

These checks help us evaluate whether the categorisation system is fit for purpose (a rough sketch of them follows this list) by looking at:

  • The number of categories assigned to a dataset. We currently have 16 categories in Open Data Scotland, and if a dataset were assigned to all of them, categorisation would be rather meaningless. At the moment, 75% of our datasets have 3 or fewer categories, with the maximum at 9 categories.
  • The volume of datasets in each category. This might help us identify which categories are poorly keyworded, but it might also just reflect the nature of the datasets we have, especially as we know we are missing publishers and are sparse in some topics.
  • The prevalence of keywords used to identify each category, but also the keywords and phrases never used.
  • And then we look at the raw text of our uncategorised datasets, so we can manually pick out new interesting keywords to update the keyword-category mappings with.
[Screenshots: effect of the change in categorisation from before (left) and after (right), taken 5 and 7 Jan 2023]
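As a rough sketch of what those checks might compute, assuming the categorised output is a list of dicts shaped like the earlier example, one per dataset (names here are mine, not the repo’s):

from collections import Counter

# Sketch: basic quality stats over categorised datasets (illustrative only)
def categorisation_stats(categorised, mapping):
    # categorised: one dict per dataset, e.g. {'Education': ['school'], ...}
    categories_per_dataset = Counter(len(cats) for cats in categorised)
    datasets_per_category = Counter(cat for cats in categorised for cat in cats)
    used = {kw for cats in categorised for kws in cats.values() for kw in kws}
    never_used = {kw for kws in mapping.values() for kw in kws} - used
    return categories_per_dataset, datasets_per_category, never_used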

I imagine we could run these checks every 3–6 months to keep on top of catalogue quality. Because although we see dataset updates happening every week, newly added datasets are few and far between, and the biggest movements come when we add a new publisher.

To see the code activity behind this change, view the_od_bods repo issue #211 and PR #212.
