A Primer On Machine Learning and Data Analysis
Hi, I’m Shaythuram, a Singaporean ReactJS developer with a strong interest in Machine Learning, Data Analysis and all things tech. This is my first step into the world of tech blogging, and I have decided to use Medium as a way to document my learning journey in the field of ML/AI and Data Analytics.
This article serves as a guide to solving this problem I found on Kaggle. I hope you enjoy reading it as much as I enjoyed making it.
The original code for this was written in Google Colab, so certain nuances may be specific to that environment. Nonetheless, the code should work perfectly fine elsewhere. Do feel free to drop me an email at shaythuram@gmail.com for any help, clarifications and, most importantly, improvements. The full code for this can be found here along with the dataset.
After uploading your dataset into the working directory, we can import all the modules we’ll be using.
import os
import urllib
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from pandas.plotting import scatter_matrix
from google.colab import files
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings("ignore")
os.listdir()
The os.listdir() call is there just to confirm that the dataset is in our working directory.
data_location = 'test_scores.csv'  # location of our data/CSV file

def load_test_scores(data=data_location):
    # function to load the data into a pandas DataFrame
    df = pd.read_csv(data)
    return df
df = load_test_scores()
n_student refers to the number of students in a given student’s class. We can rename this column to n_students_in_class.
df.rename(columns={"n_student": "n_students_in_class"}, inplace=True)
The inplace=True parameter modifies the current df directly; without it, rename would return a new DataFrame that we would have to assign to a variable.
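For illustration, here is what the non-inplace version would look like (a minimal sketch; the renamed_df name is just an example):
# without inplace=True, rename returns a new DataFrame and leaves df untouched
renamed_df = df.rename(columns={"n_student": "n_students_in_class"})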
%matplotlib inline  # this line just makes our plots render inline, so they are easily viewable
df.hist(bins=50, figsize=(20, 15))
plt.show()
The above code helps us visualise our non-categorical data, namely n_students_in_class along with the pretest and posttest scores. The posttest scores are what we want to predict and will be referred to as the label from here on out.
The next step is to find out which of these features are most related to our label, the posttest scores. To do this we will use a metric called the Pearson Correlation Coefficient (PCC). Simply put, this gives us a value between -1 and +1, with +1 indicating that a feature is very positively correlated with our target feature and -1 meaning it is very negatively correlated. The sign simply tells us the direction of the slope. PCCs of -0.7 and +0.7 are both an indication of correlations worth looking into, whereas features with a PCC of -0.1 or +0.1 would be considered low-correlation features. The code to do this is found below.
corr_matrix = df.corr()
corr_matrix['posttest'].sort_values(ascending=False)
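As a quick illustration of how the sign and magnitude of the PCC behave, here is a toy sketch with made-up arrays that are not part of our dataset:
toy_x = np.array([1, 2, 3, 4, 5])
np.corrcoef(toy_x, 2 * toy_x)[0, 1]        #  1.0 -> perfectly positively correlated
np.corrcoef(toy_x, -2 * toy_x)[0, 1]       # -1.0 -> perfectly negatively correlated
np.corrcoef(toy_x, [5, 1, 4, 2, 3])[0, 1]  # about -0.3 -> weak correlation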
Apart from the PCC, we can produce a scatter matrix of all our features to unearth some other relationships or interesting facts about our data.
%matplotlib inline
pd.plotting.scatter_matrix(df, figsize=(20,15) )
plt.show()
From the PCC and the plot above we can deduce that the pretest and posttest results are highly correlated. The PCC should also have revealed a moderate negative correlation between n_students_in_class and the label; however, this is hard to deduce from the scatterplot alone.
One thing to take note of is that the n_students_in_class and classroom features essentially reveal the same information. This is because, for any given datapoint with, say, classroom A and n_students_in_class B, the pairing is fixed (classroom A will always have B students in it).
Thus, it would be fine to drop one of these features. We shall drop the n_students_in_class column from our DataFrame.
df.drop('n_students_in_class', inplace=True , axis=1)
axis=1 refers to columns, while axis=0 refers to rows, i.e. individual datapoints.
Now we are going to have to split our dataset into a train set and a test set. Based on the Pearson Correlation Coefficient, the pretest scores are a very strong indicator of the posttest scores. Hence we shall use Stratified Shuffling to split our dataset; that way our model will be trained on data with a distribution similar to that of the full dataset.
To do this we can segment the pretest scores into bins or thresholds. The bins will be 10 marks apart with the lower limit containing any scores less than 30.
Each bar of the histogram tells us how many students had pretest scores within the given range. We will preserve this distribution of pretest scores in our train and test sets.
model_this_distribution = pd.cut(df['pretest'],
                                 bins=[0, 30., 40., 50., 60., 70., 80., 90., np.inf],
                                 labels=['<30', '<40', '<50', '<60', '<70', '<80', '<90', '<100'])

%matplotlib inline
model_this_distribution.hist(grid=True, bins=20, rwidth=0.9)
plt.show()
Now let’s split our dataset into the train and test sets using the distribution above.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train, test in split.split(df, model_this_distribution):
    strat_train_set = df.loc[train]
    strat_test_set = df.loc[test]
Data Handling and Transformation
Earlier we spoke about how lucky we were to have a complete dataset with no missing or null values. However, in the event that we do have missing values, scikit-learn provides us with simple tools to fill them in. One of these is the SimpleImputer. It can use the 'mean' or 'median' strategy to fill in missing numerical data, or the 'most_frequent' strategy (the mode), which also works for categorical data.
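As a minimal sketch of how the SimpleImputer behaves on its own (toy numbers, not our dataset):
toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
imputer = SimpleImputer(strategy="median")
imputer.fit_transform(toy)  # the NaN is replaced with the median of the column, 2.0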
For categorical data, most Machine Learning algorithms prefer numerical input over text inputs. To address this, we can use OneHotEncoding.
Columns to encode :
1. gender
2. school_type
3. teaching_method
4. lunch (whether the student qualifies for reduced/free lunch or not)
5. school_setting
For items 1–4, the values are binary, so we can drop one of the two columns produced when each of these four columns is encoded. We do not need to do this for the school_setting column, which has three categories.
OneHotEncoder_Instance = OneHotEncoder()

df_Encoded = OneHotEncoder_Instance.fit_transform(
    df[['gender', 'school_type', 'teaching_method', 'lunch', 'school_setting']])

df_Encoded = (pd.DataFrame.sparse.from_spmatrix(df_Encoded)
                .drop(axis=1, columns=[0, 2, 4, 6])
                .rename(columns={1: 'gender', 3: 'school_type', 5: 'teaching_method',
                                 7: 'lunch', 8: 'Rural', 9: 'Suburban', 10: 'Urban'}))
The code above works well. We shall now incorporate the column-dropping step into a custom transformer.
class HotEncodingCleaner(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X):
        df = pd.DataFrame.sparse.from_spmatrix(X)
        df.drop(axis=1, columns=[0, 2, 4, 6], inplace=True)
        return df
By inheriting from BaseEstimator and TransformerMixin we get a ready-made transformer that comes with the fit_transform method. All we need to do is define what the transform method does, which here is dropping the dummy columns. The fit method should always return self.
We could have dropped the columns in a plain function applied to our DataFrame after encoding, but writing it as a transformer lets us plug it into the Pipeline feature that scikit-learn provides.
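As an aside, recent versions of scikit-learn (0.23 and later) can handle the binary-column case directly through the drop parameter of OneHotEncoder, which would make the custom cleaner unnecessary. This is an alternative to the approach above, not what this article’s pipeline uses:
# drop='if_binary' keeps a single column for two-category features
# (gender, school_type, teaching_method, lunch) while still fully
# expanding school_setting into its three categories
alternative_encoder = OneHotEncoder(drop='if_binary')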
We also have to encode the school and classroom features, since we need to feed our algorithm numerical values only. However, these features contain many unique values, so it would be a massive waste of space to use OneHotEncoding. Instead we will use OrdinalEncoding, which simply assigns each unique value a unique integer and replaces the data in those columns with the corresponding integers.
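Here is a minimal sketch of what the OrdinalEncoder does, using toy values rather than the real classroom labels:
toy_classrooms = [['A'], ['C'], ['A'], ['B']]
OrdinalEncoder().fit_transform(toy_classrooms)
# array([[0.], [2.], [0.], [1.]]) -- each category is mapped to a single integer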
categorical_attributes = ['gender' , 'school_type' , 'teaching_method' , 'lunch' , 'school_setting' ]
Ordinal_encoding_attributes = ['school' , 'classroom']
num_attributes = ['pretest', 'posttest']

cat_pipeline = Pipeline([("cat", OneHotEncoder()),
                         ("custom", HotEncodingCleaner())])

Ordinal_pipeline = Pipeline([("ordinal", OrdinalEncoder())])

full_pipeline = ColumnTransformer([
    ("cat_pipeline", cat_pipeline, categorical_attributes),
    ("Ordinal_pipeline", Ordinal_pipeline, Ordinal_encoding_attributes),
    ("imputer", SimpleImputer(strategy="median"), num_attributes),
], remainder='passthrough')
We created two pipelines that handled two different groups of columns, since one set of columns needed to be OneHotEncoded and cleaned while the other set (Ordinal_encoding_attributes) just needed to undergo Ordinal Encoding. We then merged these two into one pipeline using the ColumnTransformer.
We also added ('imputer', SimpleImputer(strategy='median'), num_attributes). This handles any missing data in our pretest and posttest columns. We used the median strategy, but you could use the 'mean' or 'most_frequent' strategy too.
remainder='passthrough' is important, as it ensures that columns that haven’t been ‘touched’, i.e. columns that are not in num_attributes, categorical_attributes or Ordinal_encoding_attributes, are still kept in the DataFrame. By default this value is 'drop'.
Now we shall apply this full pipeline to our strat_train_set and then make the transformed DataFrame more intuitive. We will also separate the input data from the labels.
strat_train_treated = full_pipeline.fit_transform(strat_train_set)

strat_train_treated_df = pd.DataFrame(strat_train_treated,
                                      columns=['gender', 'school_type', 'teaching_method', 'lunch',
                                               'Rural', 'Suburban', 'Urban', 'school', 'classroom',
                                               'pretest', 'posttest', 'student_id'])

strat_train_treated_prepared = strat_train_treated_df.drop(columns=['posttest', 'student_id'])
strat_train_treated_labels = strat_train_treated_df.loc[:, ['posttest']]
strat_train_treated_df.loc[:, ['posttest']] picks out the posttest column; we want to isolate these values as they are the ‘correct values’ for the test scores, i.e. the labels.
For this problem we will be using the RandomForestRegressor. This is a widely used algorithm; more details can be found here.
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(strat_train_treated_prepared, strat_train_treated_labels)

predictions_train_set = forest_reg.predict(strat_train_treated_prepared)
forestreg_rmse_train = np.sqrt(mean_squared_error(strat_train_treated_labels, predictions_train_set))
print('The RMSE on the training set is {}'.format(forestreg_rmse_train))
We can also see how this model performs on the Test Set.
strat_test_treated = full_pipeline.transform(strat_test_set)

strat_test_treated_df = pd.DataFrame(strat_test_treated,
                                     columns=['gender', 'school_type', 'teaching_method', 'lunch',
                                              'Rural', 'Suburban', 'Urban', 'school', 'classroom',
                                              'pretest', 'posttest', 'student_id'])

strat_test_treated_prepared = strat_test_treated_df.drop(columns=['student_id', 'posttest'])
strat_test_treated_labels = strat_test_treated_df.loc[:, ['posttest']]
Here we do not call fit_transform on our strat_test_set; this is because we want our pipeline to have been fitted on the training set only, and to just apply those transformations to the test set.
NEVER FIT THE PIPELINE OR THE MODEL TO THE TEST SET !!!
predictions_test_set = forest_reg.predict(strat_test_treated_prepared)
forestreg_rmse_test = np.sqrt(mean_squared_error(strat_test_treated_labels, predictions_test_set))
print('The RMSE on the test set is {}'.format(forestreg_rmse_test))
RMSE alone is not a good enough metric to evaluate our model, so we can also use K-fold Cross-Validation.
Evaluating using K-fold Cross-Validation
In K-fold Cross-Validation we randomly split the training set into K folds of roughly equal size (K is set via the cv parameter of cross_val_score). With K=10, we then train and evaluate our forest_reg 10 times: each time we pick 9 of the folds to train on and test on the fold we left out.
For K-fold Cross-Validation, Scikit-Learn expects a utility function (greater is better) instead of a cost function (smaller is better). RMSE is a cost function, so the cross-validator reports its negative; hence scoring="neg_root_mean_squared_error". As shown below, there is a simple workaround: flip the sign back.
def scores_breakdown(scores_Linear, name=''):
    # A simple function to summarise our K-fold Cross-Validation scores
    scores_mean = scores_Linear.mean()
    scores_std = scores_Linear.std()
    print('Mean and Standard deviation of the k-fold scores of the {} set '
          'is {} and {} respectively'.format(name, scores_mean, scores_std))

cross_val_score_train = cross_val_score(forest_reg, strat_train_treated_prepared, strat_train_treated_labels,
                                        scoring="neg_root_mean_squared_error", cv=10)
cross_val_score_train = -cross_val_score_train  # workaround for the negative values problem
scores_breakdown(cross_val_score_train, name='train')

cross_val_score_test = cross_val_score(forest_reg, strat_test_treated_prepared, strat_test_treated_labels,
                                       scoring="neg_root_mean_squared_error", cv=10)
cross_val_score_test = -cross_val_score_test  # workaround for the negative values problem
scores_breakdown(cross_val_score_test, name='test')
While carrying out the steps above, I noticed that the mean k-fold score on our training set isn’t that great.
On a better note, the difference between the train and test sets’ mean k-fold scores isn’t too large, which means the algorithm hasn’t overfitted to our training set.
Fine Tuning
We could manually fiddle with the hyperparameters of the algorithm to give us a better model and thus better predictions.
For the RandomForestRegressor, these are the hyperparameters:
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=42, verbose=0, warm_start=False)
Changing each of these hyperparameters manually would take a lot of time; instead, we can use GridSearchCV to explore combinations of the hyperparameters above, using K-fold Cross-Validation to evaluate the model for each set of hyperparameters.
param_grid = [
    # try 24 (2×3×4) combinations of hyperparameters
    {'bootstrap': [False, True], 'n_estimators': [30, 45, 50], 'max_features': [2, 4, 6, 8]},
    # then try 12 (2×2×3) more combinations
    {'bootstrap': [False, True], 'n_estimators': [3, 10], 'max_features': [8, 11, 7]},
]

forest_reg_GCSV = RandomForestRegressor(random_state=42)

# train across 5 folds, that's a total of 36*5=180 rounds of training
RF_grid_search = GridSearchCV(forest_reg_GCSV, param_grid, cv=5,
                              scoring='neg_root_mean_squared_error',
                              return_train_score=True)

RF_grid_search.fit(strat_train_treated_prepared, strat_train_treated_labels)
We can get the best combination of parameters, the best estimator and the best score from our RF_grid_search as shown below.
print('Best Parameters are: {}, \nthe best estimator is: {} \nand the best score is: {}'.format(RF_grid_search.best_params_ , RF_grid_search.best_estimator_ , -RF_grid_search.best_score_) )
This approach is fine when we are exploring a handful of combinations for our hyperparameters. To cast a wider net, we can use RandomizedSearchCV instead.
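A randomized search is not part of this article’s pipeline, but a minimal sketch of what it could look like is shown below; the distributions and n_iter chosen here are purely illustrative:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# sample hyperparameter values from distributions instead of a fixed grid
param_distributions = {'n_estimators': randint(10, 200),
                       'max_features': randint(2, 11)}
RF_random_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                      param_distributions, n_iter=20, cv=5,
                                      scoring='neg_root_mean_squared_error',
                                      random_state=42)
RF_random_search.fit(strat_train_treated_prepared, strat_train_treated_labels)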
Analysing the model and its errors
The RandomForestRegressor can tell us how important each feature was in making accurate predictions.
We can visualize the relative importance of each feature in the decision making process using the steps below.
relative_feature_importance = RF_grid_search.best_estimator_.feature_importances_.tolist()

feature_dict = dict(zip(strat_train_treated_prepared.columns, relative_feature_importance))
feature_dict = sorted(feature_dict.items(), key=lambda x: x[1], reverse=True)  # sorted by importance

feature_dict_df = pd.DataFrame.from_dict(feature_dict)
feature_dict_df.rename(inplace=True, columns={0: 'features', 1: 'feature_importance'})

%matplotlib inline
feature_dict_df.plot.bar(x='features', y='feature_importance', figsize=(15, 15))
plt.show()
The first few lines of code extract the relative feature importances from the RandomForestRegressor instance and create a dictionary mapping each column to its corresponding importance. We then sort this dictionary and plot the values as a bar chart.
To close this off, we shall now try our best model on our train and test set and see how it performs.
best_model = RF_grid_search.best_estimator_

cross_val_score_train = cross_val_score(best_model, strat_train_treated_prepared, strat_train_treated_labels,
                                        scoring="neg_root_mean_squared_error", cv=10)
cross_val_score_train = -cross_val_score_train
scores_breakdown(cross_val_score_train, name='train')

cross_val_score_test = cross_val_score(best_model, strat_test_treated_prepared, strat_test_treated_labels,
                                       scoring="neg_root_mean_squared_error", cv=10)
cross_val_score_test = -cross_val_score_test
scores_breakdown(cross_val_score_test, name='test')
Let’s visualize the discrepancy between our predicted and actual scores on both the train and test set.
Let’s use our best_model to make some predictions on our training and test sets.
best_model = RF_grid_search.best_estimator_
predictions_train_set = best_model.predict(strat_train_treated_prepared)
train_labels = strat_train_treated_labels.iloc[:, 0].astype('int')

predictions_test_set = best_model.predict(strat_test_treated_prepared)
test_labels = strat_test_treated_labels.iloc[:, 0].astype('int')
strat_train_treated_labels.iloc[:,0] returns a Series of objects, the objects being our labels. However, for plotting these values we need them to be integers, hence the .astype('int') call to typecast the Series values to the integer type.
Below we have a custom function that helps to plot both our predicted scores and the labels (actual or ‘correct’ posttest scores) as a line graph.
def plot_comparison(predictions, labels, split_plots=False):
    # plot the predicted and actual posttest scores as line graphs
    data_dict_predictedvactual = {'predicted': predictions,
                                  'actual': labels}
    predictedvactual = pd.DataFrame.from_dict(data_dict_predictedvactual)
    predictedvactual.plot.line(figsize=(20, 15), subplots=split_plots)
    return plt.show()

%matplotlib inline
plot_comparison(predictions_test_set, test_labels)
Here we have a function that takes 3 inputs: predictions (the predictions made by the model), labels, and a boolean variable called split_plots. By default, split_plots is set to False. Personally, I feel that viewing the predicted and actual values on the same graph is a good visualisation of how good or bad our model is, but you can have them on two separate graphs by setting this variable to True, as shown below.
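For example, to view the training-set predictions and labels on separate subplots instead (using the predictions_train_set and train_labels variables defined earlier):
plot_comparison(predictions_train_set, train_labels, split_plots=True)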
This concludes the article. I hope you enjoyed following along; the link to my GitHub repo can be found here. It contains the full script as .ipynb and .py files along with the dataset.
Do feel free to email me at Shaythuram@gmail.com for any clarifications or feedback. Here’s my LinkedIn too !!