Machine Learning Application: Predicting Students’ Academic Performance

Deep-diving into Educational Data Mining to train and evaluate machine learning models that predict students’ academic performance, using supervised simple linear regression.

Usman Aftab Khan
CodeX
9 min readJul 7, 2021


Training a Machine Learning Model.

Introduction

A. Background — What is Machine Learning?

Machine learning is the study of computer algorithms that improve automatically through experience and the use of data. It is a subset of Artificial Intelligence, based on the idea that a system can not only learn from data, but also identify hidden trends and patterns, and then make decisions with little human intervention. In other words, machine learning is a method of data analysis that helps practitioners and management make better-informed decisions based on the available data and the desired outcomes.

There are many types of machine learning algorithms. The most common categories are Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

Types of Machine Learning and their subsequent branches.
  1. Supervised Learning is a method where the model is trained using labeled data. Each training example is annotated with a corresponding target value, so the model learns a mapping it can apply to predict correct values for new, unseen data.
  2. Unsupervised Learning is a method where the model is trained using unlabeled data. The machine groups examples by itself by detecting characteristics in the data. Because no manual labels are involved, the results are harder to validate and the final predictions are more error-prone.
  3. Reinforcement Learning is a method where the model uses observations gathered from interaction with the environment to take actions that maximize reward or minimize risk. There is no labeled data, but the model receives feedback about which actions were correct and which were incorrect. Guided by the quality of this feedback, the machine gradually amends its behavior and eventually arrives at the correct result.

Application

A. Use Case — Objective

Research in the educational field involving machine learning techniques has grown rapidly in recent years. A new term, “Educational Data Mining,” has come into existence: the application of data mining techniques in an educational setting, aiming to discover hidden trends and patterns in students’ performance.

This project aims to develop a prediction model for students’ academic performance based on machine learning techniques. The resultant model can be used to identify any student’s performance for a particular subject.

The task is to predict the marks that a student is expected to score, based on the number of hours studied.

B. Description of Data

The data is a CSV (comma-separated values) file that has been provided for this analysis.

A snippet of the CSV file.

The raw data was cleaned, modified, and formatted for easier interpretation. As seen above, there are two columns, Hours and Scores, with 25 values in each. Viewing this dataset, and aided by general perception, our working hypothesis is that there is a positive correlation between the two variables. Let’s proceed and determine whether this hypothesis is correct.

C. Libraries to be Used

Python libraries are sets of useful functions that eliminate the need to write code from scratch. As of today, there are reportedly over 137,000 Python libraries. However, we will only be utilizing four for our task: Pandas, NumPy, Matplotlib, and Scikit-Learn.

  • Pandas is an open-source software library for the Python programming language, built for high-performance data manipulation and analysis. In particular, it offers powerful data structures and operations for manipulating numerical tables and time series. Python with Pandas is used in a variety of fields, including finance, economics, and other commercial domains.
  • NumPy is a library adding support for large, multi-dimensional arrays and matrices and a collection of routines for processing those arrays. NumPy also provides a large collection of high-level mathematical functions and logical operations to perform on these arrays.
  • Matplotlib is a plotting library and one of the most popular Python packages used for data visualization. It provides an object-oriented API for making 2D plots from data in arrays.
  • Scikit-learn is a machine learning library used to build models as it has tons of tools used for predictive modeling and analysis. It features various algorithms like support vector machines, random forests, and k-neighbors.

This project gives Data Scientists the opportunity to apply their knowledge of data science and work through the full process: defining the business problem, eliciting requirements, retrieving and utilizing raw data from external sources, parsing and cleaning the data, and performing analytical assessment with machine learning algorithms and tools. The evaluation from the final analysis leads to a conclusion that can then be leveraged by stakeholders, including but not limited to academic counselors, professors, and parents. As this project has many aspects to consider, it is open for discussion and targeted toward entrepreneurs and stakeholders.

Building a Prediction Model

A. Analytical Approach

Supervised Machine Learning will be applied to predict and analyze a student’s marks. For this task, we begin by approaching the problem with a technique called the “simple linear regression model.” It is a statistical model commonly used to estimate the relationship between two quantitative variables, one dependent variable and one independent variable, using a line. This algorithm is fast and efficient for small and medium-sized datasets and is useful for quickly discovering insights from labeled data.

Our two quantitative variables are:

  1. the percentage of marks scored by each student on a particular subject.
  2. the number of hours studied by each student on a particular subject.

B. Data Analysis

I. Importing Libraries & Loading Data

We will import the libraries involved. Please note that Scikit-Learn will be imported later on.

A snippet of libraries being imported.

The next step is to load the given data into the Python interpreter I used on Jovian, to proceed with training the model. Pandas is used to load the CSV file and print a confirmation when the data has loaded successfully.

Note: I saved the file on my system in the same directory as this interpreter.

A snippet of data being loaded into Python Interpreter. A confirmation message is received at the end.
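The loading step might look like the following sketch. In the article the data comes from a CSV file; the filename below and the small inline sample (used here so the snippet runs on its own) are assumptions for illustration, not the article’s actual file:

```python
import pandas as pd

# In the notebook the data is loaded from a CSV file in the working
# directory; the filename here is an assumption:
# df = pd.read_csv("student_scores.csv")

# A small hypothetical sample with the same two-column layout:
df = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
    "Scores": [21, 47, 27, 75, 30],
})
print("Data imported successfully")
print(df.shape)
```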

II. Visualizing Data & Gaining Insights

Before proceeding further, we will check a summary of the technical information of our data. The info() function prints a concise summary of a DataFrame, including but not limited to:

  • index type
  • column dtypes
  • non-null values
  • memory usage
A snippet allowing the information regarding our DataFrame to be seen using the info() function.
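This check can be sketched as follows; the sample values are hypothetical stand-ins for the actual dataset:

```python
import pandas as pd

# Hypothetical sample in the same shape as the article's data
df = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2],
    "Scores": [21, 47, 27],
})

# Prints index type, column dtypes, non-null counts, and memory usage
df.info()
```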

Based on the information given above, we can reiterate that there are two columns called Hours and Scores, and there are a total of 25 values in each column. Thus, it can be concluded that there are 25 elements in the data being fed to the machine learning model.

The type of data (dtype) in the Hours column is float, while the dtype in the Scores column is integer. For later steps, both columns should have the same dtype.

A snippet showing the information of the data frame where both dtypes are the same.
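One way to align the dtypes is to cast Scores to float with Pandas’ astype(); a minimal sketch, again on a hypothetical sample:

```python
import pandas as pd

df = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2],
    "Scores": [21, 47, 27],
})

# Cast Scores from integer to float so both columns share a dtype
df["Scores"] = df["Scores"].astype(float)
df.info()
```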

Upon successful import and conversion of both columns’ data types to be the same, the data can be previewed using the head() function.

A snippet of code previewing the data. A confirmation message is received at the end.

Notice how the head() function previews only the top five rows by default. This can be customized by passing the desired number of rows between the parentheses.

A snippet of the dataset’s first ten elements is shown.
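The default and customized previews can be sketched like this (hypothetical sample values):

```python
import pandas as pd

df = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2],
    "Scores": [21, 47, 27, 75, 30, 20, 88],
})

print(df.head())    # first five rows by default
print(df.head(10))  # pass n to preview more rows (only 7 exist here)
```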

Being a Data Scientist requires a combination of skills. These can be divided into three groups: technical skills, functional skills, and soft skills. In particular, functional skills include having a good sense of numbers. One should be able to analyze and translate what the numbers are saying, which requires a firm grounding in statistics and room for interpretation. Fortunately, the describe() function provides a set of important values for further statistical analysis.

A snippet of code showing the statistics of the dataset.
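A minimal sketch of that statistical summary, on the same kind of hypothetical sample:

```python
import pandas as pd

df = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
    "Scores": [21, 47, 27, 75, 30],
})

# count, mean, std, min, quartiles, and max for each numeric column
stats = df.describe()
print(stats)
```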

III. Plotting the Data

The next phase is to plot the distribution of scores. The data points are plotted on a 2-D graph to visualize the dataset and check whether any relationship between the variables can be identified. The plot is created using the following script:

A snippet of code led to the plotting of a 2-D graph to identify relationships in the dataset.
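A plotting sketch along these lines, using Matplotlib with a non-interactive backend so it also runs outside a notebook (axis labels and sample values are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
    "Scores": [21, 47, 27, 75, 30],
})

fig, ax = plt.subplots()
ax.scatter(df["Hours"], df["Scores"])
ax.set_title("Hours Studied vs Percentage Score")
ax.set_xlabel("Hours Studied")
ax.set_ylabel("Percentage Score")
fig.savefig("hours_vs_scores.png")
```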

From the graph above, it is evident that there is a positive linear relationship between the two variables, i.e., the percentage score increases with the number of hours studied.

IV. Prepare Data for Machine Learning Algorithm

A deluge of data is present in a host of different formats, structures, and sources. A crucial part of a Data Scientist’s job is to prepare this data by cleaning, organizing, and optimizing for use by end-users. End-users include business stakeholders, analysts, and programmers. The “prepared” data is then used to interpret the results and relay information for the management to make better-informed decisions.

A snippet of code indicating there are no null values.
A snippet of code where the head() function is used to check the dataset.

Now that we are certain our dataset is free of null values, the next step is to divide the data into attributes (inputs) and labels (outputs).

A snippet of code where the data is divided into attributes & labels.
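A common way to do this split with Pandas indexing; a sketch on a hypothetical sample:

```python
import pandas as pd

df = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
    "Scores": [21, 47, 27, 75, 30],
})

X = df.iloc[:, :-1].values  # attributes (inputs): Hours, as a 2-D array
y = df.iloc[:, 1].values    # labels (outputs): Scores, as a 1-D array
print(X.shape, y.shape)
```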

Now that our attributes and labels are in place, the next step is to split this data into training and test sets. This is done by using Scikit-Learn’s built-in train_test_split() method.

A snippet of code where the data is being split into two sets: training & testing.
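The split might look like the following; the 80/20 ratio and random_state value are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([[2.5], [5.1], [3.2], [8.5], [3.5]])  # hypothetical sample
y = np.array([21, 47, 27, 75, 30])

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```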

Upon successful splitting of the data into training and testing sets, it is finally time to train the algorithm. As mentioned above, a simple linear regression model is to be used.

C. Training the Machine Learning Algorithm

A snippet of code showing the training of our algorithm. A confirmation message is received at the end.
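Training a simple linear regression model with Scikit-Learn can be sketched as follows (hypothetical training sample):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[2.5], [5.1], [3.2], [8.5]])  # hypothetical sample
y_train = np.array([21, 47, 27, 75])

model = LinearRegression()
model.fit(X_train, y_train)  # fits slope (coef_) and intercept_
print("Training complete.")
print("Slope:", model.coef_, "Intercept:", model.intercept_)
```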

Since the training is done, we will plot the regression line.

A snippet of the 2-D graph showing a positive correlation between the two variables.

As evidenced by the regression line on the 2-D graph, our two variables have a positive correlation. This supports our earlier hypothesis, which can now be accepted as true.

D. Making Predictions

Now that our algorithm has been trained, it is time to make predictions.

A snippet of the predictions made by our ML model.
A snippet of the comparison between actual scores and predicted scores made by our ML model.
A snippet where all three parameters are taken into account.
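The prediction-and-comparison step might be sketched like this, again on a hypothetical sample rather than the article’s dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

X_train = np.array([[2.5], [5.1], [3.2], [8.5]])  # hypothetical sample
y_train = np.array([21, 47, 27, 75])
X_test = np.array([[3.5]])
y_test = np.array([30])

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Side-by-side view of actual vs. predicted scores
comparison = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
print(comparison)
```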

According to the results, our machine learning model suggests that our two variables are directly proportional to each other. Some overestimation and underestimation in the predicted scores can be seen.

E. Testing with Custom Data

A snippet of code where a random number was given for prediction.

As can be seen, for 9.25 hours of study, the predicted score is 93.692 out of 100 (rounded to three decimal places).
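Querying the model for a custom number of hours can be sketched as follows. Because this sketch trains on a small hypothetical sample rather than the article’s dataset, its output will differ from the 93.692 reported above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[2.5], [5.1], [3.2], [8.5], [3.5]])  # hypothetical
y_train = np.array([21, 47, 27, 75, 30])

model = LinearRegression().fit(X_train, y_train)

hours = 9.25
predicted = model.predict(np.array([[hours]]))  # expects a 2-D input
print(f"Hours studied: {hours}")
print(f"Predicted score: {predicted[0]:.3f}")
```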

F. Evaluating the Machine Learning Algorithm

The concluding step is to evaluate the performance of our algorithm. This final step is important for comparing the performance of different algorithms on a particular dataset. Many metrics can be used; here, we will use the mean squared error.

A snippet of code evaluating the performance of our algorithm.
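The metric computation can be sketched with Scikit-Learn’s mean_squared_error; the actual and predicted values below are hypothetical:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_test = np.array([30.0])   # hypothetical actual score
y_pred = np.array([28.5])   # hypothetical predicted score

# Average of squared differences between actual and predicted values
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
```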

Results & Discussion

The project’s main goal was to determine whether a relationship between the two quantitative measures existed, and if so, to develop a prediction model for students’ academic performance. Looking back at the findings, our machine learning model has produced sufficient results to accept our hypothesis as true: there is a positive correlation between the number of hours studied for a subject and the score achieved on that subject’s test.

How can the stakeholders capitalize on our findings?

  • Lecturers can identify at-risk students and take early action to improve their performance. Students scoring at or below the median can be placed under the care of subject experts, with the aim of achieving a good result in the subject.
  • Because predictions and interventions happen early, better results can be expected in final exams. A student’s academic performance is predicted independently from their previous record, allowing stakeholders, i.e., parents and lecturers, to focus on students who lack the dedication to sit through the hours needed to learn a subject.
  • Reputable companies with ties to the institution can search for students according to their requirements. Students who have performed well academically can be roped in by lecturers to guide struggling students.
  • Systematic approaches can be taken to improve performance over time. An agile environment with Scrum methodologies can be put into practice to increase productivity. Clusters of students can be formed, each containing at least one student with an above-average academic record.

Conclusion

Deciding on and dedicating the best practices and environment to uplift a student’s academic portfolio can be a challenging, uphill task due to many uncertainties. However, the abundance of data in this day and age, thanks to the digitization of society, along with advanced machine learning algorithms, has made it easier to gain meaningful insights into a topic of our choice and its related entities. This helps everyone, from stakeholders and entrepreneurs to business owners, make informed decisions backed by research and facts.

Thank you,

Usman Aftab Khan

Note: Everything in this article is documented in my GitHub repository. Please do pay a visit if you’re interested in deciphering the full code.
