Transfer Learning to predict student performance

Paul Amoruso
Machine Intelligence and Deep Learning
12 min read · Apr 12, 2022

This report is based on the Transfer Learning from Deep Neural Networks for Predicting Student Performance research paper.
EEL 6812 course project report.

The Paper's Purpose:
The paper “Transfer Learning from Deep Neural Networks for Predicting Student Performance” proposes taking a machine learning model built for one course and applying it to another, similar course. Transfer models are useful in scenarios where you have enough training data for one task but not for a similar one. Thus, the researchers predicted student performance using a model trained on performance data from a previous course.

The Importance of This Research:
With the growing popularity of Learning Management Systems (LMSs) such as Moodle and Canvas at universities, student performance data and digital footprints are at the disposal of school faculty. Educational data mining is easier than ever before and is a growing field of study. Much of the educational data being gathered comes from LMSs and can be used to help students. However, transferring models between courses can be significantly harder due to differences in features between two similar courses. The main objective of the research is therefore to build a model that generalizes from Moodle data, improving predictive performance on students in the target course by using knowledge acquired from the source course. If educational institutions can predict student performance with high accuracy, they can put necessary preventative measures in place, including targeted tutoring and remediation, to lower the probability of poor student performance.

The Differences In The Learning Process:
As seen in Figure 1, the traditional approach is to learn from scratch, with the training and testing datasets derived from the same course. In the transfer learning process, you take the model built for Course A and apply it to Course B. However, this can be difficult: the model may not generalize to data coming from another course, and the two tasks may have different class labels.

Figure 1. Traditional learning process vs transfer learning process.

The transfer approach is similar to what humans do: we take knowledge from one task and apply it to a similar task in order to speed up learning. The transfer learning process therefore aims to transfer the knowledge from Course A to Course B while improving predictive performance, rather than building a whole new model with the traditional approach. This works because the researchers make sure the datasets share common characteristics.

This description can be seen in the original paper

Looking at the definitions, there are three types of transfer learning settings: inductive, transductive, and unsupervised transfer learning. In inductive transfer learning, the target task is different from (but related to) the source task, regardless of whether the domains are the same. In transductive transfer learning, the source and target tasks are the same while the domains differ. Finally, in unsupervised transfer learning, the tasks are different and neither dataset contains labels; it is often intended for clustering and dimensionality reduction tasks.

Example of inductive transfer learning: Imagine a team of robots that has been taught to keep a basketball away from an opposing team; the robots learn to stay a certain distance from their opponents. Now suppose the team must learn another game, one where it is necessary to play offensively and score against the opposing team. The knowledge learned from the first game can be transferred to the second. For instance, a robot faced with the second task may observe that it is better to shoot the ball than to pass it when the basketball hoop is very close, which it can learn by comparing its own distance to the hoop with a teammate's distance to the hoop.
Example of transductive transfer learning: Now imagine the same team of robots playing the same keep-away game, but in a different setting, say on a different court with a differently sized ball. The task is identical, but the domain (the data the robots observe) has changed, so the knowledge learned in the original setting can be carried over to the new one.
Example of unsupervised transfer learning: Unsupervised transfer learning applies unsupervised machine learning in both the source and target domains. For instance, industrial applications, such as those supervising large industrial systems, involve learning tasks that are unsupervised. This is because an enormous number of parts and mechanisms interact with one another, making the number of possible errors countless. At the same time, these systems are designed to be reliable, so it is nearly impossible to collect a labeled dataset of all possible failure instances. Therefore, the unsupervised model is trained to recognize what the operating conditions of a well-functioning system look like, so that it can raise notifications when the test data is sufficiently different from the data seen in training.

Their Methodology:
The overarching goal of the researchers' methodology is to answer whether transfer learning models can be used with educational data mining (EDM) to predict student performance. They set out to answer two specific questions: can “the weights of a deep learning model trained on a specific course be used as the starting point for a model of another related course?” and “will the pre-trained model reduce the training effort for the deep model of the second course?” To answer these questions, the researchers initialized a deep learning network using the pre-tuned weights of a similar course.

This research looked at five core undergraduate courses from two undergraduate programs in Greece. The courses included Physical Chemistry I, Analytical Chemistry Laboratory, and Physics III. Each course in the Moodle LMS contained plenty of student data to train from. The following table, provided by the original paper, displays the gender and target class distribution for each of the five courses.

The researchers built the datasets from six different resource types: course pages, resources, URLs, forums, folders, and assignments. In Table 2 you can see that course C1 has one forum, seven pages, 17 resources, eight assignments, and two folders. Out of the six resource types, three were required in every course, so each course is guaranteed to have at least those features. In collecting the data, they recorded the number of times each student viewed the forum and the number of times a student accessed the other resources, giving them a total of two counters for each student.

C1, C2, and C3 were Physical Chemistry and Physics courses, while C4 and C5 were laboratory courses.

A quick disclaimer: these courses differed in instructor and semester. Also note that the table does not imply the same students appear in each course.

The network they trained has an input layer, two hidden dense layers, and an output layer. The input layer has one unit per input feature of the dataset; the first hidden layer has 12 hidden units and the second has eight. The dense layers use the ReLU activation function, and the output layer consists of a single neuron with the sigmoid activation function and the binary cross-entropy loss function for predicting whether the student will pass the class.
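The architecture described above can be sketched in Keras (the library the authors used). Note that the optimizer is not named in this summary, so Adam is an assumption on my part:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features):
    """Deep network from the paper: 12 and 8 ReLU units, sigmoid output."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(12, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    # Binary cross-entropy is specified by the paper; the Adam
    # optimizer is an assumption, as none is named in this summary.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

`n_features` varies per course pair, since the input layer must match the number of shared features in the rebuilt datasets.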

The procedure took place in three phases. The first phase pairs up all possible unique course combinations (ten pairs in total), then rebuilds the datasets so that the two courses in each pair share the same set of features (since, for example, C1 and C2 have different numbers of pages), as shown in Table 3.
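One simple way to give two courses an identical feature set is to keep only the columns they share. The column names below are invented for illustration (the paper counts accesses per resource type):

```python
import pandas as pd

# Hypothetical view-count features for two courses; the column names
# are made up for illustration.
c1 = pd.DataFrame({"forum_views": [3, 0], "page_views": [10, 4],
                   "folder_views": [1, 2]})
c2 = pd.DataFrame({"forum_views": [5, 1], "page_views": [7, 9],
                   "url_views": [2, 0]})

# Keep only the features both courses share, so the paired datasets
# have identical input dimensions for the transferred network.
common = sorted(set(c1.columns) & set(c2.columns))
c1_aligned, c2_aligned = c1[common], c2[common]
```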

Table from the Paper

The second phase trains the two supporting deep networks. One model was trained on the source course (Ci) to extract its adjusted weights, while the other was trained on the target course (Cj) to provide the baseline evaluation. The models were trained for 150 epochs, and a 10-fold cross-validation resampling procedure was adopted to evaluate the overall performance of the deep network models.
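The resampling setup can be sketched with scikit-learn. The data here is random stand-in data, and the actual model fitting on each split is omitted:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.random((50, 8))      # stand-in feature matrix
y = rng.integers(0, 2, 50)   # stand-in pass/fail labels

# 10-fold cross-validation as in the paper; in the real experiment the
# deep network would be fit for 150 epochs on each training split.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]
```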

Figure 2: The Three Phases

The third phase is the most fundamental, as it implements the strategy of transferring the learned model to the second course. The model of the target course was fitted from scratch, but its network weights were first initialized from the source course model (built in the second phase). The pre-trained model Ci,j was then further tuned by running it for an increasing number of epochs, starting from zero and going up in steps of 10.
Note the code was written using the Keras library in Python.
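A minimal sketch of the weight-transfer step, using randomly generated stand-in data. The feature count, optimizer, and shortened epoch budget here are my own choices, not the paper's exact values:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features):
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(12, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Stand-in data for the source course Ci and target course Cj.
rng = np.random.default_rng(0)
X_src, y_src = rng.random((80, 6)), rng.integers(0, 2, 80).astype("float32")
X_tgt, y_tgt = rng.random((40, 6)), rng.integers(0, 2, 40).astype("float32")

# Phase 2 (abridged): train the source model to obtain its weights.
source = build_model(6)
source.fit(X_src, y_src, epochs=5, verbose=0)

# Phase 3: initialize the target model from the source weights, then
# fine-tune for an increasing number of epochs (0, 10, 20, ...).
transfer = build_model(6)
transfer.set_weights(source.get_weights())
for step in (0, 10, 20):
    if step:
        transfer.fit(X_tgt, y_tgt, epochs=10, verbose=0)
    loss, acc = transfer.evaluate(X_tgt, y_tgt, verbose=0)
```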

Researchers Pseudo code

Their Results:
The following table shows the average accuracy results for the 10 pairs, each pair consisting of two experiments, for a total of 20 evaluations: the five courses form 10 combinations, and in each pair the two courses swap the roles of source and target. In Table 4, the bold values mark the cases where the transfer model produced better results than the baseline. In general, the model Ci,j benefited from using the source course Ci weights, since the predictive performance of the transfer learning deep network was better than the baseline Cj.

To verify whether the transfer model's improvement was statistically significant, they compared the accuracy results of the baseline deep network with those of the transfer method at each number of epochs. Using a one-tailed paired t-test with α = 0.05, they concluded that the differences were significant (p-value less than or equal to 0.05) at every number of epochs except zero. Moreover, Table 5 shows that the p-value gradually decreased as the number of epochs increased, for example from 10 to 100.
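The significance test can be reproduced with SciPy's paired t-test. The accuracy values below are made-up placeholders for illustration, not the numbers from the paper's tables:

```python
from scipy import stats

# Made-up paired accuracy scores; each position compares the baseline
# and transfer models on the same dataset.
baseline = [0.71, 0.68, 0.74, 0.70, 0.69, 0.72, 0.66, 0.73, 0.70, 0.68]
transfer = [0.75, 0.70, 0.77, 0.74, 0.71, 0.76, 0.69, 0.75, 0.73, 0.72]

# One-tailed paired t-test at alpha = 0.05, testing whether the
# transfer model's accuracy is greater than the baseline's.
t_stat, p_two_sided = stats.ttest_rel(transfer, baseline)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
significant = p_one_sided <= 0.05
```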

Circling back to the first question, whether a deep learning model trained on one course can serve as the starting point for another: at zero epochs (using only the weights estimated by the previous model), 10 out of the 20 tests do better than the baseline model. With 10 epochs of further tuning, 16 out of the 20 datasets do better than the baseline. Consistent with this, the t-test shows that the zero-epoch difference is not significant, as its p-value is larger than α = 0.05. Thus, the pre-trained weights can indeed serve as a starting point.

As for whether the pre-trained model reduces the training effort for the second course, the answer appears to be yes: at 100 epochs the accuracy is highest, with 18 out of 20 tests beating the baseline, just before overfitting sets in around 150 epochs, and this is better than training from scratch for 150 epochs. This is also reflected in the p-value, which rises from 0.0002 back up to 0.0012 as overfitting occurs around 150 epochs.

Their findings:
The results show that, with prior knowledge from a previous course's dataset, the model predicts fairly well on a related course. Across the 20 experiments, half of the datasets improved using only the pre-trained weights from the previous course, and 16 out of 20 improved in accuracy with further tuning of 10 to 40 epochs. This means that fine-tuning showed greater benefits than training from randomized weights, allowing higher accuracy with fewer passes. However, it should be noted that this does not work 100 percent of the time: the transfer learning model did not achieve better results than the baseline when course 5 was the source and course 4 was the target.

How this connects with my personal research:
My present research is in the realm of educational data mining for my Master's, with the hope of applying unsupervised learning for my PhD once enough data has been collected. On the front burner, we are using API requests to gather questions (seen in the image below) and student statistics on assessment performance. This data is collected and formatted as matrices in CSV files for further analysis and skill tagging. As of now, the ongoing approach is data mining and tagging questions to assess student skills across multiple courses and potentially across universities. Unsupervised learning has not yet been applied to the data.

API requests on Canvas
Canvas API support website
Moodle API Support website

Since my research relies heavily on educational data mining, with a future goal of machine learning, this paper was ideal for gaining knowledge of similar studies in this niche.
My current research.

My attempt at recreating a similar model approach:
In my attempt at recreating something similar to the authors' methodology, I first made a .CSV file with 10 columns representing five quizzes and five tests, where the pass label is 1 if the average grade is greater than 7, and 0 otherwise.
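A sketch of how such a file can be generated. The column names, student count, and grade range are my own choices for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Ten columns: five quiz grades and five test grades. Grades are drawn
# between 4 and 10 (an arbitrary choice) so that both labels occur.
cols = [f"quiz_{i}" for i in range(1, 6)] + [f"test_{i}" for i in range(1, 6)]
grades = pd.DataFrame(rng.uniform(4, 10, size=(100, 10)), columns=cols)

# Label: 1 if the average grade is greater than 7, else 0.
grades["pass"] = (grades[cols].mean(axis=1) > 7).astype(int)
grades.to_csv("random_grades.csv", index=False)
```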

How to make random grade
If the average grade is passing

I found that when a model trained on one dataset was evaluated on a dataset missing a quiz, the initial accuracy was lower, but with just a little further fitting the accuracy exceeded the baseline, as seen in the research.

The CSV with random grades.
Training the source course
The evaluation with the original testing dataset
With the original model but new dataset
Training with 50 epochs more
The evaluation with 50 more epochs

My opinions:
Although I am unaware of all the intricacies of the researchers' program that collects and forms the dataset, I believe there are some inconsistencies not addressed in the paper, such as what happens if someone accesses a page only once because they downloaded a document locally, or if someone continuously refreshes a page. Another component not considered is how one would fine-tune the model before enough data has been gathered for the target course. In practice, you want to use a model to predict which students need additional tutoring or assistance, so a trustworthy model that can be transferred without any additional fine-tuning would be necessary.

Conclusion:
Implementing predictions in the field of education is an intriguing idea, but to establish its effectiveness at a larger scale, the machine learning model needs to be tested on many more courses, and potentially extended to regression tasks such as predicting grades. Overall, transfer models bridge the gap toward the way humans transfer knowledge from one skill to a similar one. Further practical implementations of end products using this methodology are needed to advance its performance.

References:

  1. Tsiakmaki, Maria, et al. “Transfer learning from deep neural networks for predicting student performance.” Applied Sciences 10.6 (2020): 2145.
  2. Boyer, S.A. Transfer Learning for Predictive Models in MOOCs. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2016.
  3. https://canvas.instructure.com/doc/api/
  4. https://docs.moodle.org/dev/Web_service_API_functions
  5. Mallawaarachchi, Vijini. “Inductive vs. Transductive Learning.” Medium, Towards Data Science, 11 May 2020, https://towardsdatascience.com/inductive-vs-transductive-learning-e608e786f7d.
  6. Vilalta, Ricardo, Christophe Giraud-Carrier, Pavel Brazdil, and Carlos Soares. “Inductive Transfer.” SpringerLink. Retrieved April 27, 2022, from https://link.springer.com/referenceworkentry/10.1007/978-0-387-30164-8_401#springerlink-search
  7. https://dl.acm.org/doi/pdf/10.1145/3453146
  8. Michau, G., & Fink, O. (2021, January 29). Unsupervised transfer learning for ANOMALY DETECTION: Application to complementary operating condition transfer. Knowledge-Based Systems. Retrieved April 27, 2022, from https://www.sciencedirect.com/science/article/pii/S0950705121000794

Github link:
https://github.com/AwesomePaul100/EEL-6812-Trasnfer-learning
