Machine Learning algorithm for student grade prediction and visualization using decision tree

Raghu Bayya
Published in Analytics Vidhya
4 min read · Dec 21, 2019
Photo by Mikael Kristenson on Unsplash

This article is a continuation of part 1.

We use a decision tree to see how the number of hours a student is absent from a course classifies that student's grade.

For any machine learning model, it is really important to prepare the dataset. If you haven't cleaned and preprocessed your dataset, your model will not work properly.

The initial step is to import the packages (libraries).

source: Jupyter Notebook

From the top: the pandas library is used to import, manage and store the dataset as a DataFrame; NumPy is used for numerical operations; matplotlib is used to visualize the data as bar charts and other graphs; and seaborn is a good fit for more polished visualizations. The machine learning operations on the dataset use subpackages of the sklearn library.

Import and Read Dataset

source : Jupyter Notebook

The dataset stdset.csv has been read with pandas (imported as pd) and saved in the variable stdrecord, so you can now access it anywhere in your Python Jupyter notebook.
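A minimal sketch of this step follows. The real stdset.csv is not included here, so a tiny in-memory stand-in is used instead, and its column names (Absences, Consultations, Finalgrades) are assumptions rather than the article's actual schema.

```python
import io

import pandas as pd

# Hypothetical stand-in for stdset.csv so the sketch runs on its own.
# The column names and values below are assumptions.
sample_csv = io.StringIO(
    "Absences,Consultations,Finalgrades\n"
    "2,5,Pass\n"
    "14,1,Retake\n"
    "8,,Redo\n"
    "0,6,Pass\n"
)

# With the real file this would be: stdrecord = pd.read_csv("stdset.csv")
stdrecord = pd.read_csv(sample_csv)

print(stdrecord.head())          # first rows of the DataFrame
print(stdrecord.shape)           # (number of rows, number of columns)
print(stdrecord.isnull().sum())  # count of missing values per column
```

Checking the shape and the missing-value counts right after loading is a quick way to see whether the cleaning steps in the next sections are needed.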

What if! Missing Values

A machine learning algorithm trained on a dataset with missing values may not give the expected results, so it is really important to fill in the missing values. Deleting every row that contains a missing value is rarely a smart idea, because it can throw away a lot of useful data. The most common solution is to take the mean of the column and replace the missing values with that mean.

The sklearn package, which contains powerful machine learning models, also provides tools for preprocessing and data cleaning, imported as:

from sklearn import preprocessing
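The mean replacement described above can be sketched as follows. The notebook's own imputation code is not shown, so this example assumes SimpleImputer from sklearn.impute, and the small DataFrame with its column names is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with one missing value each.
stdrecord = pd.DataFrame({
    "Absences": [2.0, 14.0, np.nan, 0.0],
    "Consultations": [5.0, 1.0, 3.0, np.nan],
})

# Replace every NaN with the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = pd.DataFrame(
    imputer.fit_transform(stdrecord),
    columns=stdrecord.columns,
)

print(filled)
```

For a quick one-off, stdrecord.fillna(stdrecord.mean()) in pandas gives the same result; the imputer is handy when the same column means must later be applied to new data.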

What if! Categorical Data

With numerical values we can find the mean, fill in missing values and perform other mathematical operations. Most machine learning algorithms can only read and understand numerical values, but you can't take the mean of “pass”, “redo”, “retake”. We can encode the categorical values as numeric values by using LabelEncoder or OneHotEncoder (sklearn library) or get_dummies (pandas library).

Apply the encoder to each column where it is necessary, passing that column's name to the encoder.

source : Jupyter Notebook

LabelEncoder converts the categorical values “Pass”, “Redo” and “Retake” into 0, 1 and 2 respectively, within the single Finalgrades column of our dataset.
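Here is a small sketch of that encoding. The four sample rows are invented, and the result is written to a new, hypothetical column (Finalgrades_encoded) so the original labels stay visible for comparison.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Finalgrades column.
stdrecord = pd.DataFrame({"Finalgrades": ["Pass", "Retake", "Redo", "Pass"]})

encoder = LabelEncoder()
# fit_transform learns the sorted set of labels and maps each one to an integer.
stdrecord["Finalgrades_encoded"] = encoder.fit_transform(stdrecord["Finalgrades"])

print(list(encoder.classes_))  # labels in the order of their integer codes
print(stdrecord)
```

LabelEncoder assigns codes in sorted label order, which is why “Pass”, “Redo” and “Retake” end up as 0, 1 and 2.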

pandas get_dummies is similar to LabelEncoder, but it splits each category into its own separate column. OneHotEncoder from the sklearn library works similarly for categorical values.

source : Jupyter Notebook

Using pandas, concatenate the dummy columns to the actual dataset to see the difference and to prepare the data for the machine learning model.

source : Jupyter Notebook
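The two pandas steps, creating the dummy columns and concatenating them, might look like the sketch below. The sample rows, the Absences column and the "Grade" prefix are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data.
stdrecord = pd.DataFrame({
    "Absences": [2, 14, 8, 0],
    "Finalgrades": ["Pass", "Retake", "Redo", "Pass"],
})

# One indicator column per category: Grade_Pass, Grade_Redo, Grade_Retake.
dummies = pd.get_dummies(stdrecord["Finalgrades"], prefix="Grade")

# Place the dummy columns next to the original columns to compare them.
combined = pd.concat([stdrecord, dummies], axis=1)

print(combined)
```

Unlike label encoding, the dummy columns do not imply any order between the grades, which matters for models that treat numbers as magnitudes.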

Splitting the dataset into Training set, Validation set and Test set

Training Model

Split the available dataset into training and test sets using sklearn.model_selection.train_test_split.

source : Jupyter Notebook

Before training a machine learning algorithm, always split the data first; this is the best way to get a reliable estimate of your model's performance. After splitting your data, you should not use your test dataset until you're ready to select your final model.

The training set is used to fit and tune your model. The test set is put aside to evaluate your model and see how it would work on data it has never seen.

In addition, we can split a validation set off from our training set. It is used to validate the model during the training process and gives us the information we need to adjust hyperparameters.

Splitting data for model Training, validation and Testing.
Data set split ratio
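As a rough sketch of a three-way split, train_test_split can be called twice: once to set the test set aside, and once more to carve a validation set out of the remaining training data. The synthetic data and the 60/20/20 ratio are assumptions, not the article's actual numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: hours of absence and an encoded grade (0, 1, 2).
rng = np.random.RandomState(0)
X = rng.randint(0, 30, size=(100, 1))
y = np.where(X[:, 0] < 10, 0, np.where(X[:, 0] < 20, 1, 2))

# First split: hold out 20% of all rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: take 25% of the remaining 80% (20% overall) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing random_state makes the split reproducible, so repeated runs of the notebook evaluate on the same rows.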

Machine Learning Algorithm

After the data is split into training and test sets, it is time to fit the model to the training data using DecisionTreeClassifier().

source : Jupyter Notebook

Once the model is fit, we can predict the results on the test dataset using dtfit.predict(x_test).
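Fitting and predicting can be sketched as below. The names dtfit and x_test come from the text; the synthetic absences and grades, and the thresholds used to generate them, are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: each absence value 0..29 appears four times.
absences = np.arange(0, 30).repeat(4).reshape(-1, 1)
# Hypothetical rule: <10 hours -> 0 (Pass), <20 -> 1 (Redo), else 2 (Retake).
grades = np.where(absences[:, 0] < 10, 0, np.where(absences[:, 0] < 20, 1, 2))

x_train, x_test, y_train, y_test = train_test_split(
    absences, grades, test_size=0.25, random_state=0
)

# Fit the tree on the training data only.
dtfit = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# Predict grades for the held-out test rows.
y_pred = dtfit.predict(x_test)

print(y_pred[:10])
print(y_test[:10])
```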

Performance Measure

Evaluation

In the model evaluation we measure the performance of the algorithm using metrics.accuracy_score(y_test, y_pred), which compares the true labels in y_test with the predictions in y_pred.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

source : Jupyter Notebook
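To make the accuracy calculation concrete, here is a tiny sketch with hand-written labels. The values are invented and assume the encoding 0 = Pass, 1 = Redo, 2 = Retake.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 0 = Pass, 1 = Redo, 2 = Retake.
y_test = [0, 0, 1, 1, 2, 2, 0, 1]  # true grades
y_pred = [0, 0, 1, 2, 2, 2, 0, 0]  # model predictions

# Accuracy is the fraction of predictions that match the true labels.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")  # 6 of 8 correct -> 0.75
```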

Confusion Matrix

source : Jupyter Notebook
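A confusion matrix breaks the same comparison down per class. The sketch below uses invented labels with the assumed encoding 0 = Pass, 1 = Redo, 2 = Retake.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = Pass, 1 = Redo, 2 = Retake.
y_test = [0, 0, 1, 1, 2, 2, 0, 1]  # true grades
y_pred = [0, 0, 1, 2, 2, 2, 0, 0]  # model predictions

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Row 0 (Pass):   3 predicted Pass
# Row 1 (Redo):   1 predicted Pass, 1 Redo, 1 Retake
# Row 2 (Retake): 2 predicted Retake
```

The diagonal holds the correct predictions, so the off-diagonal cells show which grades the model confuses with each other.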

Classification Report

source : Jupyter Notebook
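The classification report summarizes precision, recall and F1-score for each class. This sketch again uses invented labels, and the target_names list assumes the encoding 0 = Pass, 1 = Redo, 2 = Retake.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 0 = Pass, 1 = Redo, 2 = Retake.
y_test = [0, 0, 1, 1, 2, 2, 0, 1]  # true grades
y_pred = [0, 0, 1, 2, 2, 2, 0, 0]  # model predictions

# target_names replaces the integer codes with readable labels in the report.
report = classification_report(
    y_test, y_pred, target_names=["Pass", "Redo", "Retake"]
)
print(report)
```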

Visualizing Decision Tree

Source: Jupyter Notebook
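The notebook's plotting code is not shown in the text, so as a stand-in this sketch prints the fitted tree as text rules with export_text from sklearn.tree. The synthetic data and the feature name "Absences" are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: one row per absence value 0..29.
absences = np.arange(0, 30).reshape(-1, 1)
# Hypothetical rule: <10 hours -> 0 (Pass), <20 -> 1 (Redo), else 2 (Retake).
grades = np.where(absences[:, 0] < 10, 0, np.where(absences[:, 0] < 20, 1, 2))

dtfit = DecisionTreeClassifier(random_state=0).fit(absences, grades)

# Render the learned splits as indented if/else style rules.
tree_rules = export_text(dtfit, feature_names=["Absences"])
print(tree_rules)
```

For a graphical version, sklearn.tree also provides plot_tree (drawn with matplotlib) and export_graphviz.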

Optimizing Decision Tree

source: Jupyter Notebook
source : Jupyter Notebook
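One common way to optimize a decision tree is to limit its depth and compare the results. The sketch below is an assumption about what such tuning could look like, not the notebook's actual code: it uses synthetic, slightly noisy data, criterion="entropy", and cross-validation so that the test set stays untouched.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise so that deeper trees can overfit.
rng = np.random.RandomState(0)
absences = rng.randint(0, 30, size=(300, 1))
grades = np.where(absences[:, 0] < 10, 0, np.where(absences[:, 0] < 20, 1, 2))
noise = rng.rand(300) < 0.1
grades[noise] = rng.randint(0, 3, size=noise.sum())

# Compare a few depth limits; None lets the tree grow without a limit.
scores = {}
for depth in (1, 2, 3, None):
    model = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    # Mean accuracy over 5 cross-validation folds.
    scores[depth] = cross_val_score(model, absences, grades, cv=5).mean()

for depth, score in scores.items():
    print(f"max_depth={depth}: {score:.3f}")

best_depth = max(scores, key=scores.get)
print("Best max_depth:", best_depth)
```

A shallower tree is also easier to read when visualized, which is a practical reason to prefer a small max_depth when the scores are close.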

Part 2 : Github/JupyterNotebook source

In the next article, part 3, we will use a neural network to predict student grades from the number of hours of absence in a course and the consultations, both of which influence student performance, using Keras and TensorFlow.

About Author : Raghu Bayya, Data Scientist ML/Deep Learning.

Expert in Big Data
