Machine Learning algorithm for student grade prediction and visualization using decision tree

Raghu Bayya

Published in Analytics Vidhya, Dec 21, 2019

This article is a continuation of part 1.

Here we use a decision tree to see how the number of hours a student is absent from a course classifies that student's grade.

For any machine learning model, it is really important to prepare the dataset. If you haven't cleaned and preprocessed your dataset, your model will not work.

The initial step is to import the packages (libraries).

From the top: the pandas library is used to import, manage and store the dataset as a dataframe; NumPy handles the numerical operations; matplotlib visualizes the data as bar charts and other graphs; and seaborn is a good fit for more polished visualizations. The machine learning operations on the dataset use sub-packages of the sklearn library.

Import and Read Dataset

The dataset stdset.csv is read with pandas (imported as pd) and saved in the variable stdrecord, so you can now access it anywhere in your Python Jupyter notebook.
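As a minimal sketch of this step: the real stdset.csv and its columns are not shown in this article, so the example below reads a small in-memory stand-in. The column names absence_hours and Finalgrades and all the values are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for stdset.csv; the real file's columns and values
# are not shown in the article, so these are illustrative only.
csv_text = """absence_hours,Finalgrades
2,Pass
10,Redo
25,Retake
4,Pass
"""

# With the real file on disk this would be: stdrecord = pd.read_csv("stdset.csv")
stdrecord = pd.read_csv(io.StringIO(csv_text))

print(stdrecord.head())   # first rows of the dataframe
print(stdrecord.shape)    # (number of rows, number of columns)
```

Calling head() and checking shape right after loading is a quick way to confirm the file was parsed the way you expect.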

What if! Missing Values

A machine learning algorithm may not give the expected results when the dataset has missing values, so it is really important to fill them in. Deleting every row that has a missing value is not a smart idea, because it throws away data and can easily cause problems. The most common solution is to take the mean of the column and replace each missing value with that mean.

The sklearn package, which contains powerful machine learning models, also provides preprocessing tools for data cleaning, imported as:

from sklearn import preprocessing
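The mean replacement described above can be sketched as follows. In current sklearn releases the imputer is SimpleImputer in sklearn.impute rather than in sklearn.preprocessing, so that is what this sketch uses; the column name absence_hours and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value (NaN).
stdrecord = pd.DataFrame({"absence_hours": [2.0, np.nan, 10.0, 4.0]})

# strategy="mean" replaces each NaN with the mean of the column's known values.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(stdrecord[["absence_hours"]])

# fit_transform returns a 2-D array; flatten it back into the column.
stdrecord["absence_hours"] = filled.ravel()

print(stdrecord)
```

For a single column, stdrecord["absence_hours"].fillna(stdrecord["absence_hours"].mean()) in pandas gives the same result; the imputer is handy when the same fill must later be applied to new data.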

What if! Categorical Data

With numerical values we can find the mean, fill in missing values and perform other mathematical operations. Machine learning models generally read and understand only numerical values, but you can’t take the mean of “pass”, “redo” and “retake”. We can encode the categorical values into numeric values by using Label Encoder or One Hot Encoder (sklearn library) or dummies (pandas library).

Apply the encoder to a single column wherever it is necessary, passing that column's labels to the encoder.

Label Encoder converts the categorical values in the Finalgrades column of our dataset to numbers within that single column: “Pass” becomes 0, “Redo” becomes 1 and “Retake” becomes 2.
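A minimal sketch of label encoding, using a made-up Finalgrades column. LabelEncoder assigns the integers in sorted order of the labels, which is why “Pass”, “Redo” and “Retake” map to 0, 1 and 2. The Finalgrades_encoded column name is an assumption, used here so the original labels stay visible next to the codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical grades; the real dataset is not shown in the article.
stdrecord = pd.DataFrame({"Finalgrades": ["Pass", "Retake", "Redo", "Pass"]})

encoder = LabelEncoder()
# fit_transform learns the sorted set of labels and returns their integer codes.
stdrecord["Finalgrades_encoded"] = encoder.fit_transform(stdrecord["Finalgrades"])

# classes_ holds the labels in the order of their codes: index 0, 1, 2.
print(list(encoder.classes_))
print(stdrecord)
```

To get the labels back from predicted codes, encoder.inverse_transform can be applied to an array of integers.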

Pandas dummies are similar to Label Encoder, but each category is split into its own separate column. One Hot Encoder from the sklearn library works similarly for categorical values.

Using pandas, we concatenate the dummy columns to the actual dataset to see the difference and to use them in the machine learning model.
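The dummy encoding and concatenation can be sketched like this. The data, the absence_hours column and the grade prefix are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; the real dataset is not shown in the article.
stdrecord = pd.DataFrame(
    {
        "absence_hours": [2, 25, 10],
        "Finalgrades": ["Pass", "Retake", "Redo"],
    }
)

# One indicator column per category: grade_Pass, grade_Redo, grade_Retake.
dummies = pd.get_dummies(stdrecord["Finalgrades"], prefix="grade")

# axis=1 places the dummy columns side by side with the original columns.
combined = pd.concat([stdrecord, dummies], axis=1)

print(combined)
```

Each row has exactly one indicator set, in the column that matches its original grade.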

Splitting the Dataset into Training, Validation and Test Sets

Training Model

The available dataset is split into training and test sets using sklearn.model_selection.train_test_split.

Before applying a machine learning algorithm, we must always split the data first. This is the best way to get a reliable estimate of your model's performance. After splitting your data, you should not use your test dataset until you are ready to select your final model.

The training set is used to fit and tune your model. The test set is put aside to evaluate your model on data it has never seen, to see how it would work in practice.

In addition, we can split a validation set off from our training set. It is used to validate the model during the training process and gives us the information we need to adjust hyperparameters.

Splitting data for model training, validation and testing.
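One way to get all three sets is to call train_test_split twice: first to hold out the test set, then to carve a validation set out of what remains. This is a sketch on made-up data; the 60/20/20 proportions, the random_state values and the use of stratify are assumptions, not choices taken from the article's notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, one feature, three grade classes (0, 1, 2).
X = np.arange(100).reshape(-1, 1)
y = np.repeat([0, 1, 2], [34, 33, 33])

# Step 1: hold out 20% of all samples as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: take 25% of the remaining 80 samples (20 samples) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

print(len(X_train), len(X_val), len(X_test))
```

Passing stratify keeps the class proportions roughly equal across the splits, which matters when one grade is much rarer than the others.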

Machine Learning Algorithm

After the data is split into training and test sets, it is time to fit the model to the training data using DecisionTreeClassifier().

Once the model is fit, we can predict the results on the test dataset using dtfit.predict(x_test).
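A minimal sketch of fitting and predicting. The names dtfit, x_test and y_pred follow the text above; the synthetic data and the rule that turns absence hours into grades are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: hours of absence for 200 students.
rng = np.random.default_rng(0)
absence_hours = rng.integers(0, 40, size=200)

# Made-up rule for illustration: 0 = Pass, 1 = Redo, 2 = Retake.
grades = np.where(absence_hours < 10, 0, np.where(absence_hours < 25, 1, 2))

# sklearn expects a 2-D feature array, hence the reshape to one column.
X = absence_hours.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(
    X, grades, test_size=0.2, random_state=0
)

# Fit the tree on the training data only.
dtfit = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# Predict grades for the held-out test data.
y_pred = dtfit.predict(x_test)

print(y_pred[:10])
```

Setting random_state on the classifier makes the fitted tree reproducible between runs.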

Performance Measure

Evaluation

In model evaluation we measure the performance of the algorithm using metrics.accuracy_score(y_test, y_pred), which compares the true labels in y_test with the predictions in y_pred.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Confusion Matrix

Classification Report
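The three measures named above can be computed as in the following sketch. The y_test and y_pred values are made up for illustration, and the target names assume the 0/1/2 encoding of “Pass”, “Redo” and “Retake” used earlier.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions (0 = Pass, 1 = Redo, 2 = Retake).
y_test = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 0, 1, 2, 2, 2, 1, 0]

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# Per-class precision, recall and F1 score as formatted text.
report = classification_report(
    y_test, y_pred, target_names=["Pass", "Redo", "Retake"]
)

print(accuracy)
print(cm)
print(report)
```

Here 6 of the 8 predictions are correct, so the accuracy is 0.75, and the off-diagonal entries of the confusion matrix show that the mistakes are between “Redo” and “Retake”.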

Visualizing Decision Tree
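The original figure for this section is not reproduced here. As one possible way to inspect a fitted tree, the sketch below prints its rules as text with sklearn.tree.export_text; sklearn.tree.plot_tree would draw the same tree graphically with matplotlib. The data and the absence_hours feature name are assumptions for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: hours of absence and the resulting grade.
X = [[2], [4], [8], [12], [15], [20], [28], [35]]
y = ["Pass", "Pass", "Pass", "Redo", "Redo", "Redo", "Retake", "Retake"]

dtfit = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text renders the learned splits as indented if/else style rules.
tree_text = export_text(dtfit, feature_names=["absence_hours"])

print(tree_text)
```

With three cleanly separated grades the tree needs only two splits on absence_hours, so the printed rules are short enough to read at a glance.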

Optimizing Decision Tree
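The article's own optimization code is not shown here. A common approach, sketched below under that caveat, is to limit the depth of the tree with max_depth and to try a different split criterion such as "entropy", then compare the results on a validation set. The synthetic noisy data and the specific parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: absence hours with about 10% of grades randomly reassigned.
rng = np.random.default_rng(1)
absence_hours = rng.integers(0, 40, size=300)
clean = np.where(absence_hours < 10, 0, np.where(absence_hours < 25, 1, 2))
noisy = rng.random(300) < 0.1
grades = np.where(noisy, rng.integers(0, 3, size=300), clean)

X = absence_hours.reshape(-1, 1)
x_train, x_val, y_train, y_val = train_test_split(
    X, grades, test_size=0.2, random_state=1
)

# Compare an unrestricted tree with one limited to depth 3.
results = {}
for max_depth in (None, 3):
    model = DecisionTreeClassifier(
        criterion="entropy", max_depth=max_depth, random_state=1
    )
    model.fit(x_train, y_train)
    val_accuracy = accuracy_score(y_val, model.predict(x_val))
    results[max_depth] = (model.get_depth(), val_accuracy)

for max_depth, (depth, acc) in results.items():
    print(f"max_depth={max_depth}: depth={depth}, validation accuracy={acc:.2f}")
```

A shallower tree is easier to read and less likely to memorize noise in the training data; whether it also scores better depends on the dataset, so the comparison should be made on validation data rather than on the test set.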

Part 2 : Github/JupyterNotebook source

In the next article, part 3, we will use a neural network built with Keras and TensorFlow to predict student grades from the number of hours of absence in the course and the consultations, both of which influence student performance.

About the Author: Raghu Bayya, Data Scientist, ML/Deep Learning.

Expert in Big Data
