Career Guesser

Twinkl Data Team
Twinkl Educational Publishers
5 min read · Jul 1, 2022

Learn about how we apply probabilistic machine learning to provide the best new user journey.

This article was published by Matthew Thornton, Data Scientist at Twinkl.

At Twinkl, we know that every user is unique. Every person we strive to help teach has their own preferences, their own needs, and their own style. One of the big ways that the Data team helps to fulfil Twinkl’s mission to “help those who teach” is by ensuring that each Twinkl user receives an experience that is as unique as they are.

Many aspects of the Twinkl website are fine-tuned, tested, and continually monitored to ensure that we provide a smooth user experience, and to ensure that every single user is able to make the most of the resources and tools that we provide.

For example, once a user creates an account for the first time, we invite them to specify their career. Do they teach infants? Do they teach teenagers? Did they join Twinkl as a parent of learners? Are they a librarian or senior management or a trainee teacher? We can then use this information to help customise their Twinkl experience: providing better resource recommendations, for example, or promoting specific tools which they might find useful.

But — and here’s the kicker — this requires the user to be signed in. Otherwise, how can we know which of the possible customisations to provide?

One recent project which I have been working on is to help with this so-called “cold start problem”. The central question is this:

How can Twinkl provide useful recommendations and customisations to users before they even create an account?

If we can crack this then we can help more people know the value that Twinkl can offer. And for subscribed users, we can provide a smoother and more consistent overall experience.

What do we know about a user before they have signed in? Well, for one, we know their country. Did they visit twinkl.co.uk or twinkl.com or twinkl.com.au or any of the other country-specific domain variations? Already we can start to customise for these different markets, immediately directing the user to country- and curriculum-specific resources.

But what if we could also recommend the right age-range of resources? If we can take a guess at the user’s career before they sign in, then we can start to further tweak our recommendations.

(some light maths incoming — but nothing too strenuous, I promise!)

It turns out that what we want to estimate is the following

P(career | available information)

which is known as a ‘conditional probability’. This might sound intimidating, but it has a very simple interpretation. In fact, we make these kinds of intuitive judgments in our day-to-day lives all the time! I saw that it was cloudy earlier, so how likely is it to be raining currently? The symbols above are just a shorthand for making a probabilistic assessment based on already known information. Here they mean “what is the probability that the user has a specific career, given all of the available information?” In our case, in addition to the country, a key clue which we have is which page of the Twinkl website the user has visited.

Let’s start to make this concrete with some examples.

A user who looks at the following resource

might be likely to teach very young children, while for somebody who takes a look at

it’s more difficult to make a good guess, because this type of resource has such broad appeal.

So our conditional probability looks like

P(career | resource page viewed)

or in other words, once we know the resource page which a user viewed, what is the probability that they are in a particular career?
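As a rough illustration, one simple way to approximate this probability is to count, for each resource page, how often it was viewed by signed-in users of each career, and then normalise the counts. This is a minimal sketch under that assumption, not necessarily how the production model is built. The page names, career labels and view counts are invented for the example.

```python
from collections import Counter, defaultdict


def conditional_career_probabilities(views):
    """Estimate P(career | page) by counting (page, career) pairs."""
    counts = defaultdict(Counter)
    for page, career in views:
        counts[page][career] += 1

    probabilities = {}
    for page, career_counts in counts.items():
        total = sum(career_counts.values())
        # Normalise so that each page's probabilities sum to 1.
        probabilities[page] = {c: n / total for c, n in career_counts.items()}
    return probabilities


# Hypothetical view log from signed-in users: (page, stated career).
VIEWS = (
    [("phonics-mat", "early_years")] * 8
    + [("phonics-mat", "primary")] * 2
    + [("colouring-sheet", "early_years")] * 3
    + [("colouring-sheet", "primary")] * 4
    + [("colouring-sheet", "parent")] * 3
)

PAGE_PROBS = conditional_career_probabilities(VIEWS)
for page, distribution in PAGE_PROBS.items():
    print(page, distribution)
```

In this toy log the first page is dominated by one career, while the second is spread across all three.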

Our first resource has the following distribution of conditional probabilities,

[Chart: conditional probability of each career for the first resource]

so a user who views it is very likely to teach early-years, while our second resource looks like,

[Chart: conditional probability of each career for the second resource]

confirming our intuition that we can’t really make a good guess in this case.

Pulling all of this together, to customise the site for a signed-out user, we can look at the pages which they have viewed, calculate the conditional probabilities for each career, and guess the career which has the highest probability.
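The guessing step can be sketched in a few lines. The article does not say how evidence from several viewed pages is combined, so the version below assumes the simplest option: add up the per-page probabilities for each career and pick the largest total. The function name, page names and probability values are hypothetical.

```python
# Hypothetical per-page distributions P(career | page).
PAGE_PROBS = {
    "phonics-mat": {"early_years": 0.8, "primary": 0.15, "parent": 0.05},
    "colouring-sheet": {"early_years": 0.3, "primary": 0.4, "parent": 0.3},
}


def guess_career(pages_viewed, page_probs):
    """Return the career with the highest summed probability, or None."""
    totals = {}
    for page in pages_viewed:
        # Pages we have no distribution for contribute nothing.
        for career, p in page_probs.get(page, {}).items():
            totals[career] = totals.get(career, 0.0) + p
    if not totals:
        return None  # No evidence, so make no guess.
    return max(totals, key=totals.get)


print(guess_career(["phonics-mat", "colouring-sheet"], PAGE_PROBS))
print(guess_career(["some-unknown-page"], PAGE_PROBS))
```

Summing is only one possible choice. Multiplying the probabilities, as a naive Bayes model would, is another, but it needs care when a career has zero probability on some page.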

If we do this, and then compare our guess to what the user explicitly tells us after they create an account, we get it right about 50% of the time. Not bad at all!

But we can do better.

By keeping only resources which are strongly preferred by a single career type, we can push our guessing accuracy up to 80% or even higher! (Example 1 above told us a lot about the people who viewed it, while example 2 told us almost nothing.) In other words, we keep only the probability distributions with the highest information content (lowest entropy). This increases our accuracy but decreases the number of users for whom we can make a guess. 80%+ is not bad for such a simple model!
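To illustrate the entropy filter, here is a minimal sketch that computes the Shannon entropy (in bits) of each page's career distribution and keeps only the pages below a cut-off. The cut-off value, the function names and the example distributions are assumptions made for the example. The article does not state which threshold was used.

```python
import math

# Hypothetical per-page distributions P(career | page).
PAGE_PROBS = {
    "phonics-mat": {"early_years": 0.8, "primary": 0.15, "parent": 0.05},
    "colouring-sheet": {"early_years": 0.3, "primary": 0.4, "parent": 0.3},
}


def entropy_bits(distribution):
    """Shannon entropy in bits; lower means more concentrated on one career."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def informative_pages(page_probs, max_entropy_bits):
    """Keep only pages whose distribution has entropy at or below the cut-off."""
    return {
        page: dist
        for page, dist in page_probs.items()
        if entropy_bits(dist) <= max_entropy_bits
    }


for page, dist in PAGE_PROBS.items():
    print(page, round(entropy_bits(dist), 2))

print(list(informative_pages(PAGE_PROBS, max_entropy_bits=1.0)))
```

With these made-up numbers the concentrated page falls below 1 bit and is kept, while the broad-appeal page sits close to the three-career maximum of log2(3) ≈ 1.58 bits and is dropped.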

What are the next steps in the project? Well, we currently have a slightly tricky choice to make. Should we guess for a smaller number of users with very high accuracy, or for a larger number of users with slightly lower accuracy? Unfortunately this is not the sort of question which can be answered from the information currently available to us. Instead, we’ll need to deploy our model onto the website and start making guesses for real users. By checking how they respond to the displayed customisations, and comparing this response with that of users whose career is known, we can start to find the combination of accuracy and on-site customisation which will give them the best possible experience of Twinkl.
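The trade-off between accuracy and the number of users we can guess for can be made visible offline by sweeping the entropy cut-off over users whose career is already known. The sketch below reports coverage (the share of users who receive a guess) and accuracy (the share of guesses that match the stated career). All data, names and thresholds are hypothetical, and it assumes per-page probabilities are summed across the pages a user viewed.

```python
import math

# Hypothetical per-page distributions P(career | page).
PAGE_PROBS = {
    "phonics-mat": {"early_years": 0.8, "primary": 0.15, "parent": 0.05},
    "colouring-sheet": {"early_years": 0.3, "primary": 0.4, "parent": 0.3},
}

# Hypothetical labelled users: (pages viewed before sign-up, stated career).
USERS = [
    (["phonics-mat"], "early_years"),
    (["colouring-sheet"], "parent"),
    (["colouring-sheet"], "primary"),
    (["phonics-mat", "colouring-sheet"], "early_years"),
]


def entropy_bits(distribution):
    """Shannon entropy in bits of a career distribution."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def evaluate(users, page_probs, max_entropy_bits):
    """Return (coverage, accuracy) when only low-entropy pages may be used."""
    guessed = correct = 0
    for pages, stated_career in users:
        totals = {}
        for page in pages:
            dist = page_probs.get(page)
            if dist is None or entropy_bits(dist) > max_entropy_bits:
                continue  # Unknown or uninformative page: ignore it.
            for career, p in dist.items():
                totals[career] = totals.get(career, 0.0) + p
        if not totals:
            continue  # No usable evidence, so this user gets no guess.
        guessed += 1
        if max(totals, key=totals.get) == stated_career:
            correct += 1
    coverage = guessed / len(users) if users else 0.0
    accuracy = correct / guessed if guessed else 0.0
    return coverage, accuracy


for threshold in (1.0, 2.0):
    print(threshold, evaluate(USERS, PAGE_PROBS, threshold))
```

On this toy data the strict cut-off guesses for half of the users and gets all of those right, while the loose cut-off guesses for everyone at a lower accuracy. An offline sweep like this only shows the shape of the trade-off. Which point is best for users still has to be tested on the live site.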

If you like the sound of what we do here, then you’ll be happy to know that our Data Scientist team is currently hiring.

Check out some more articles from other members of the data team.
