Identifying missing metadata by putting people into the data pipeline

Published in

Twinkl Educational Publishers

4 min read · Jun 16, 2022

Learn about how we’ve utilised the value stored in our data assets to enrich our resource metadata with behaviourally relevant metadata

This article was originally published by Oscar South, Data Scientist at Twinkl, at https://www.twinkl.co.uk/blog/behaviour-driven-metadata

Behaviourally driven metadata allows us to provide the most relevant search results for our subscribers

Here at Twinkl, we put people first — this is encapsulated throughout all our values, nowhere more strongly than here:

Go above and beyond

People over process. Exceeding expectations.

Helpfulness. Excellence. Amazing others.

From my vantage point on the data team, this resonates with a highly tweetable slogan that I heard at a tech conference in 2014 and which has stuck with me since then: “Data is people in disguise”. This philosophy is demonstrated beautifully in a project I’ve worked on recently, so let’s talk about it!

Missing Metadata

Our search system is our primary medium of communication with our subscribers: they express what they need as best they can, and we do our best to satisfy that need, in a brutally streamlined dialogue of “search” -> [“results”]. As part of a current push to optimise the quality of our search systems, I’ve been algorithmically identifying relevant metadata terms (internally used words or phrases that encapsulate the nature of each resource) that may be missing from our internal resource metadata.

After all, it doesn’t matter how smart your search systems are if the term your user searches for doesn’t match anything in the metadata of the potential results (actually, this gives me an idea... maybe that’ll show up in a future blog post).

I like to simplify how I think about a project down to the inputs and outputs. In this case, the inputs are our search log and internal resource metadata, and the output is a list of individual missing metadata terms associated with different resources. I’m going to skip the technical details and process here and draw attention to the fact that both of the inputs I’m working with are a direct expression of human intention:

The search log, inputted by our subscribers, is an expression of their personal requirements.
Our resource metadata, inputted by my colleagues here at Twinkl, is an expression of their interpretation of the value which that particular resource provides to our subscribers.

Figure: An arbitrarily simple graph query representing connections between searches and downloads through users.
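As a rough illustration of the kind of connection described above, the sketch below links searches to downloads through the users who made both. It uses plain Python rather than a graph database, and the record layout, identifiers and sample values are all assumptions made for illustration; it is not Twinkl's actual schema or query.

```python
from collections import defaultdict

# Hypothetical sample records (not real data):
# searches are (user_id, search_term), downloads are (user_id, resource_id).
searches = [
    ("u1", "fractions worksheet"),
    ("u2", "fractions worksheet"),
    ("u2", "phonics"),
]
downloads = [
    ("u1", "res_10"),
    ("u2", "res_10"),
    ("u2", "res_20"),
]


def link_searches_to_downloads(searches, downloads):
    """Count, per (search term, resource) pair, the users who did both."""
    # Index downloads by user so each search can be joined through its user.
    resources_by_user = defaultdict(set)
    for user_id, resource_id in downloads:
        resources_by_user[user_id].add(resource_id)

    # Collect the distinct users connecting each search term to each resource.
    users_by_pair = defaultdict(set)
    for user_id, term in searches:
        for resource_id in resources_by_user.get(user_id, ()):
            users_by_pair[(term, resource_id)].add(user_id)

    return {pair: len(users) for pair, users in users_by_pair.items()}


pair_counts = link_searches_to_downloads(searches, downloads)
```

This simplified join ignores timing, so a search is linked to every resource the same user ever downloaded. A real pipeline would probably restrict the link to downloads that follow the search within a session.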

In this case, as a data scientist, I’m essentially acting as a mediator or translator between these two input perspectives. In many cases the subscriber’s need and my Twinkl colleague’s interpretation communicate perfectly and the search returns exactly what the subscriber wants; in many other cases they do not. This is the nature of human communication. My job here is to identify and generalise these positive communication pathways across our entire library of resources, maximising the probability of a positive communication. In my opinion, this is one of the best parts of the job: it is a profession that requires analytical skills in equal measure to creativity, flexibility and the ability to look past the maths to see problems for what they are.

In the end, the methodology was nothing more technical than some text cleaning, aggregations of frequencies, a sprinkling of context from an LDA topic model I’d previously developed for Twinkl during an earlier project (I might speak about this more in a future blog post) and an open channel of communication with my colleagues in the organisation who will action the initial results of this work. I have to thank our brilliant team leaders here for helping to facilitate this with finely refined tact and finesse!
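To make the text cleaning and frequency aggregation steps a little more concrete, here is a minimal sketch of how candidate missing terms might be surfaced. It is an illustration under stated assumptions, not the pipeline used at Twinkl: the function names, the `min_count` threshold, the input shapes and the sample data are all hypothetical, and the topic-model context mentioned above is left out entirely.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def candidate_missing_terms(search_download_pairs, metadata, min_count=2):
    """Return {resource_id: [terms]} for frequent searches absent from metadata.

    search_download_pairs: iterable of (search_term, resource_id) pairs.
    metadata: dict mapping resource_id to its existing metadata terms.
    min_count: hypothetical frequency threshold (an assumption).
    """
    # Aggregate how often each cleaned search term leads to each resource.
    counts = Counter(
        (clean(term), resource_id) for term, resource_id in search_download_pairs
    )
    # Clean the existing metadata the same way so comparisons are like for like.
    existing = {rid: {clean(t) for t in terms} for rid, terms in metadata.items()}

    candidates = {}
    for (term, rid), n in counts.items():
        if term and n >= min_count and term not in existing.get(rid, set()):
            candidates.setdefault(rid, []).append(term)
    return {rid: sorted(terms) for rid, terms in candidates.items()}


# Hypothetical sample data (not real search logs or metadata).
pairs = [
    ("Fractions Worksheet!", "res_10"),
    ("fractions  worksheet", "res_10"),
    ("Year 3 maths", "res_10"),
    ("year 3 maths", "res_10"),
    ("phonics", "res_20"),
]
metadata = {"res_10": ["Year 3 Maths"], "res_20": []}

result = candidate_missing_terms(pairs, metadata)
```

With this sample data, “fractions worksheet” is flagged for `res_10` because it appears twice and is not in that resource's metadata. “year 3 maths” is skipped because it is already present, and “phonics” falls below the threshold. In the spirit of the article, the output would be a list of suggestions for colleagues to review rather than something applied automatically.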

If you like the sound of what we do here, then you’ll be happy to know that our Data Scientist team is currently hiring.

Check out some more articles from other members of the data team.


Written by Twinkl Data Team