How to use time order in machine learning recommendations for education

Machine learning recommendations connect users with content they want or need. For example, Amazon recommends below books for the Lord of the Rings.

Time order does not matter much in product recommendations offered by Amazon or Netflix. You can first read either the Lord of the Rings or the Game of Thrones before the other.

For educational content the order is more important. Knewton has an explicit prerequisite (knowledge) graph that states that you learn concept X before concept Y. For example, you first exercise single digit additions 3+4before going to more complex additions like 13+4.

In this blog post I explain how to extend simple collaborative filtering (CF) recommendation algorithms to use time order. The focus is in education domain where the time order of content also tells about:

  • the average learning paths taken by pupils, or
  • perhaps even the prerequisite graph of concepts.

The content of this blog post comes from data science work done for adaptive learning platform Bingel of VANIN. Many thanks for VANIN for letting to share some examples in this blog post!

Simple collaborative filtering algorithms

Let’s first introduce simple yet powerful collaborative filtering (CF) algorithms of type:

  • users who viewed X also bought Y.
  • users who bought X also bought Y.

Implicit item-to-item cosine similarity is one algorithm to formalise these. We define cosine similarity, so that we can later compare to an algorithm that has time order included. In it your data is implicit “I like” data such as pairs (user, content). Then for all content pairs (X, Y) you calculate:

count(X, Y):
number of users who bought both X and Y.
number of users who bough X.
similarity(X, Y):
count(X, Y) / sqrt( count(X) * count(Y) ).

For X you recommend Y that has high value for similarity(X, Y). I omit some technical details, like how to handle well content with little data.

Including time order in collaborative filtering

The simple CF algorithms inspire time ordered relations for educational content:

follows(X => Y): 
probability that exercise Y follows exercise X.
precedes(X => Y): 
probability that exercise X precedes exercise block Y.
prerequisite(X of Y): 
balanced approach of the above two, so both Y should be done after X and X has been done before Y. Let’s say
prerequisite(X of Y) = sqrt(follows(X => Y) * precedes(X => Y)).

Above relations are bit fuzzy so we’ll later formalise them. Before that let’s look at an example from a real-world dataset!

Example: Bingel learning path for “Read clock” exercise

Bingel is an adaptive learning and exercising platform for elementary education used by over 600000 pupils in Belgium, Finland, and Sweden. In the past 5 years more than 1 billion exercises have been made on Bingel.

In 2017 we did a study on the time ordering of exercises with Bingel data. The study inferred the time ordered relations from pairs of data (pupil made exercise, teacher given homework exercise) . Note how the exercise on the right (to the future of the left side) has to be given by the teacher as home work, but on the left there can be any exercise done by the pupil.

Below images have top-5 exercises for different time order relations for math exercise “Read clock” in chapter 2 of 3rd grade book in primary school.

In this single example, the top list of the prerequisite relation had more relevant exercises than the precedes relation.

Formalisation of time ordered collaborative filtering

You can do formalisation in many ways and what follows is just one of them, but its the one that worked for us.

The input data is a big set of records

(user, exercise, timestamp),

which we transform to ordered exercise pairs

(user, X, Y, t),

where X is an exercise done before exercise Y by the user, and the time in between in days is t. We make (user, X, Y)unique by filtering to the first time an exercise is done by a user.

Let’s look at an example. Below image has two pupils, Alice and Bob, and the exercises that they made over three days.

The input records are:

(user,  exercise, day)
(Alice, X, Monday)
(Alice, Y, Tuesday)
(Alice, Z, Wednesday)
(Bob, Z, Monday)
(Bob, X, Tuesday)
(Bob, Y, Wednesday)

And the exercise pairs are:

(exercise before, exercise after, time in between)
(X, Y, 1d)
(X, Z, 2d)
(Y, Z, 1d)
(Z, X, 1d)
(Z, Y, 2d)
(X, Y, 1d)

We next calculate intermediate values:

count(X = > Y):
number of times the exercise pair (X, Y) occurs.
number of times X occurs at the left side of exercise pair (X, _).
number of times Y occurs at the right side of exercise pair (_, Y).

And with the intermediate values we get the wanted relations:

follows(X => Y):
count(X => Y) / count_before(X)
precedes(X => Y):
count(X => Y) / count_after(Y)
prerequisite( X of Y ):
sqrt ( follows(A => B) * precedes(A => B) )

In our example:

follows(X => Y) = 2/3
follows(X => Z) = 1/3

Including time distance between exercises

Above we had super-simple before/after relation which did not take into account whether there was one day or 100 days in between exercises. However, you get better results if you handle time distance better.

One easy way is to attach a time weight to an exercise pair(X, Y, t) . For example, time_weight(t) = exp(-t/20) . Then modify count(X=>Y) to be the sum of the time weights for the exercise pairs(X, Y) instead of counting such pairs. Do the same for all count functions.

There are other ways to weight exercise pairs and what kind of weighting works depends on your data. The way given here worked for data collected over one school year.

For the math junkies only: is the usual cosine similarity connected to the prerequisite-relation?

The formula for the prerequisite relation above looks much like the item-to-item cosine similarity rule. Let’s look closer.

similarity(X, Y):
= count(X, Y) / sqrt( count(X) * count(Y) )
= cosine(vec(X), vec(Y)),

where vec(X) is a 0/1 vector where coordinate vec(X)_i is 1 if user i bought item X . On the other hand:

prequisite( X of Y)
= sqrt( follows(X => Y) * precedes(X => Y) )
= count(X => Y) / sqrt( count_before(X) * count_after(Y) )
= cosine(vec_before(X), vec_after(Y)).

Here vec_before(X)_i is 1 if X occurs at left side of the item pairi , and vec_after(X)_i is 1 if X occurs at the right side of the pair i .

So both similarity(X, Y) and prerequisite(X of Y) are cosines of two 0/1- vectors. In the former the coordinates correspond to users, but in the latter the coordinates correspond to pairs of items. We can bridge the definitions further: for similarity(X, Y) make each pair of exercises done by the same user two new “virtual” users (X, Y) and (Y, X). Then similarity(X, Y) is the same as prerequisite(X of Y) which does have a more intuitive definition through probabilities. Cool?

Related work

I’m happy to hear about similar work done elsewhere. We collected some related work to this document.