Using NLP and graphs for educational video recommendations

Published in

DataKind Bengaluru

5 min read · Feb 14, 2022

Introduction

Pratham is one of the largest and most successful NGOs working on improving the quality of education in India. This engagement with DataKind Bengaluru focused on the Pratham Open School, which hosts free educational content in English and various Indian languages. The videos are categorised according to language, subject, and topic. The ask was for a mechanism to recommend further videos to learners based on the video they were currently watching.

Data

The team worked with video transcripts, with the possibility of also including material from textbooks. For the initial stages of the project, only content-based approaches were considered. Working with user details and browsing behaviour sounded really interesting, but the team could not work with that data.

Approach

The team reasoned that if two videos were similar in some way, one could be surfaced as a recommendation while the other was being watched, much like on YouTube. This led to a search for the sources of similarity between videos and their transcripts that could lead a user from one video to the next most similar one. The team explored the following ways of answering this:

Existing categorisation: The first approach was to use the existing categorisation of videos on the site as the source of recommendations. E.g. show the rest of the videos in the same sub-category and add next and previous buttons for navigation.
Curation: Another approach was to define the relationships between videos manually. Once a relationship was defined between two videos, one could be shown as a recommendation whenever the other was being watched.
Clustering/Topic Modelling: The third approach was to cluster the videos based on the words and phrases they contain. The resulting groups would provide recommendations based on content.
Embeddings: The videos also come with transcripts. These transcripts could be converted into embeddings, for instance using doc2vec. The recommendations would then be the videos with the lowest cosine distance to the one being watched.
“Entities” detected in a transcript: Finally, the last approach was to relate videos according to the ‘entities’ being talked about in them.
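To make the embeddings idea concrete, here is a minimal sketch of ranking videos by cosine distance. It is not the team's code: a plain word-count vector stands in for a doc2vec embedding so the example needs no external libraries, and the video names, transcripts, and function names are all made up for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Word-count vector for a transcript (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty transcript as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def nearest_videos(current, transcripts, k=2):
    """Return the k other videos with the lowest cosine distance to `current`."""
    vectors = {vid: bag_of_words(text) for vid, text in transcripts.items()}
    others = [vid for vid in vectors if vid != current]
    others.sort(key=lambda vid: cosine_distance(vectors[current], vectors[vid]))
    return others[:k]

# Hypothetical transcripts, for illustration only.
transcripts = {
    "fractions_intro": "a fraction shows parts of a whole number",
    "fractions_add": "to add a fraction to another fraction match the whole parts",
    "photosynthesis": "plants use sunlight and water to make food",
}

print(nearest_videos("fractions_intro", transcripts, k=1))
```

With real embeddings only `bag_of_words` would change; the ranking step stays the same.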

Solution

The team chose to work on the last approach, “Entity recognition and graphs”. Named Entity Recognition (NER) was used to identify the entities being talked about in each video. The team could then build a network in which each video is connected to the entities found in it. In such a graph, videos connected to each other through shared entities would be considered similar and therefore become recommendations for one another.
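A small sketch of the video-entity graph described above, under some assumptions: the entities are taken as already extracted (the NER step itself is not shown), and the video names, entity names, and the `build_graph` and `recommend` helpers are hypothetical. Videos are ranked here by how many entities they share, which is one plausible reading of "connected through entities" and not necessarily the team's scoring.

```python
from collections import defaultdict

def build_graph(video_entities):
    """Undirected bipartite graph: video nodes on one side, entity nodes on the other."""
    graph = defaultdict(set)
    for video, entities in video_entities.items():
        for entity in entities:
            graph[("video", video)].add(("entity", entity))
            graph[("entity", entity)].add(("video", video))
    return graph

def recommend(graph, video, k=3):
    """Rank other videos by the number of entities they share with `video`."""
    shared = defaultdict(int)
    for entity_node in graph.get(("video", video), ()):
        for _kind, other in graph[entity_node]:  # neighbours of an entity are videos
            if other != video:
                shared[other] += 1
    # Most shared entities first; ties broken alphabetically for stable output.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _count in ranked[:k]]

# Hypothetical NER output, for illustration only.
video_entities = {
    "solar_system": {"Sun", "Earth", "Mars"},
    "seasons": {"Sun", "Earth"},
    "moon_phases": {"Earth", "Moon"},
    "freedom_struggle": {"Mahatma Gandhi"},
}

graph = build_graph(video_entities)
print(recommend(graph, "solar_system"))
```

A video that shares no entity with any other, like `freedom_struggle` here, gets no recommendations, which is one place where expert curation of the graph can help.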

At this stage, there was a lot of human expertise that could help the team create a better graph. Automated methods will always miss some things and add others inappropriately. If there were a way to let education experts contribute to the graph, the team could combine the best of humans and machines for a superior graph.

Use of Obsidian to curate better quality content

Here is where piggy-backing on the features of the note-taking application Obsidian made sense. Obsidian works on a local folder of plain text markdown files, which it calls a vault. This meant that if the team made a markdown file for each node in the graph, anyone could open the folder in Obsidian, at which point all the features of Obsidian could be put to use in pruning, editing, and enhancing the graph. Since the team could read the resulting files back into a graph, it gained the ability to incorporate expert inputs pretty easily.
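The round trip between a graph and a vault folder could look roughly like the sketch below. It assumes one markdown file per node, with neighbours written as Obsidian `[[wikilinks]]`, and node names that are valid file names; the file layout and the `write_vault` and `read_vault` helpers are illustrative assumptions, not the team's actual format.

```python
import re
import tempfile
from pathlib import Path

# Captures the target of [[Note]], [[Note|alias]] and [[Note#heading]] links.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def write_vault(graph, folder):
    """Write one markdown file per node, listing its neighbours as wikilinks."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for node, neighbours in graph.items():
        lines = [f"- [[{name}]]" for name in sorted(neighbours)]
        (folder / f"{node}.md").write_text("\n".join(lines) + "\n", encoding="utf-8")

def read_vault(folder):
    """Rebuild {node: set of linked nodes} from the markdown files in a folder."""
    graph = {}
    for path in sorted(Path(folder).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        graph[path.stem] = {target.strip() for target in WIKILINK.findall(text)}
    return graph

# Hypothetical graph, for illustration only.
graph = {
    "Video - Solar System": {"Sun", "Earth"},
    "Sun": {"Video - Solar System"},
    "Earth": {"Video - Solar System"},
}

with tempfile.TemporaryDirectory() as vault:
    write_vault(graph, vault)
    # Simulate an expert adding a link by hand in Obsidian.
    note = Path(vault) / "Earth.md"
    note.write_text(note.read_text(encoding="utf-8") + "- [[Moon]]\n", encoding="utf-8")
    edited = read_vault(vault)

print(sorted(edited["Earth"]))
```

Because the files are plain text, any edit an expert makes in Obsidian shows up the next time the folder is read back.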

Recommendations can then be computed from the paths that link one video node to other video nodes in the graph.
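One simple way to turn paths into recommendations is a breadth-first search from the current video, ranking other videos by how few hops away they are. The sketch below assumes this hop-count ranking and a made-up `v:` prefix to mark video nodes; neither comes from the original project, and the graph is invented for illustration.

```python
from collections import deque

def video_distances(graph, start, is_video):
    """Breadth-first search; returns {video: hops} for every reachable video."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return {n: d for n, d in hops.items() if n != start and is_video(n)}

def recommend_by_path(graph, start, is_video, k=3):
    """Closest videos first; ties broken alphabetically."""
    distances = video_distances(graph, start, is_video)
    return sorted(distances, key=lambda n: (distances[n], n))[:k]

# Hypothetical graph; the Earth-Moon link stands for an expert-added edge.
graph = {
    "v:solar_system": {"Sun", "Earth"},
    "v:seasons": {"Sun"},
    "v:tides": {"Moon"},
    "Sun": {"v:solar_system", "v:seasons"},
    "Earth": {"v:solar_system", "Moon"},
    "Moon": {"Earth", "v:tides"},
}

def is_video(node):
    return node.startswith("v:")

print(recommend_by_path(graph, "v:solar_system", is_video))
```

Here `v:tides` is only reachable because of the hand-added link between Earth and Moon, so curated edges directly widen what can be recommended.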

Future

The work that the team has done is currently limited to English-medium content. There is scope to use similar approaches with Indic NLP toolkits as well.
These approaches can be validated, and will become more useful, as more transcript data becomes available.
Other approaches mentioned in the Approach section can be picked up once the easier, obvious enhancements using the first two approaches are made.

If you would like to contribute to and improve this work, head over to the GitHub repository where the team's work is open source.

Team

Rithwik is a data scientist with interests in complexity, simulations, and decision-making. He often draws from design, science, and philosophy, and is strangely turned off by Deep Learning (TM). He likes to read and write, both words and code. (https://rtwk.org)

Soumya Ranjan currently works at Development Seed and previously worked as a lead data scientist at Gramener. His interests and passions have always been Open Source, Data, AI & EdTech. He likes to be called a storyteller who uses data & AI.

We also thank the Pratham Education Team who partnered with us and helped us through this journey.

DataKind is always looking for volunteers to make social impact. You can join us on Slack to find out what’s happening. If you are interested in joining the Bangalore chapter’s core team, check out how you can get involved.

We also have a new beta version website which you can check out for more details.