13.08.2018 Experiment Reports
Some experiments I ran in a Jupyter Notebook while approaching the dataset I was given for my internship. This gave me a much better understanding of how different classifiers behave on the data.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm, tree, metrics
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
#from sklearn.tree import DecisionTreeClassifier
Read_Data:
spreadsheet_file_path = "/Users/ameliebuc/Documents/byond_internship/ImBlanced-Classification.csv"
data = pd.read_csv(spreadsheet_file_path, encoding="utf-8")
data.describe()
Load_Data:
# Set your prediction target and the features you'll need to predict instances
y = data.Label
data_features = ['b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm']
# Select the data_features columns of data as X
X = data[data_features]
# How imbalanced is the data?
imb_count = data.Label.value_counts()
print("The data has the following classes & counts: \n{}".format(imb_count))
Data_Split: (random_state = 90 worked best compared with the values 0, 20, 100, 190)
train_X, val_X, train_y, val_y = train_test_split(X,y,test_size=0.2,random_state=90)
The following are reports of the different ways I modelled the data:
# 01
Method
Decision Tree Classifier, random_state = 1
No accounting for data imbalance
Metrics
Accuracy: 0.92775206
Precision: 0.14414414
Recall: 0.15920398
Observations
Very fast runtime but low performance. Accuracy is high because the model predicts mostly 0s, and most of the dataset consists of 0s.
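A minimal sketch of this baseline run. The internship CSV is not available here, so a synthetic imbalanced dataset (12 features, a rare positive class) stands in for it; the variable names mirror the split above.

```python
# Experiment 01 sketch: baseline decision tree, no imbalance handling
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree, metrics

rng = np.random.RandomState(90)
X = rng.randn(1000, 12)                    # stand-in for columns b..m
y = (rng.rand(1000) < 0.07).astype(int)    # ~7% positives: imbalanced

train_X, val_X, train_y, val_y = train_test_split(
    X, y, test_size=0.2, random_state=90)

model = tree.DecisionTreeClassifier(random_state=1)
model.fit(train_X, train_y)
predicted = model.predict(val_X)

# Accuracy is inflated by the dominant 0 class; precision/recall expose this
print("Accuracy:", metrics.accuracy_score(val_y, predicted))
print("Precision:", metrics.precision_score(val_y, predicted, zero_division=0))
print("Recall:", metrics.recall_score(val_y, predicted, zero_division=0))
```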
# 02
Method
Decision Tree Classifier, random_state = 1
SMOTE, random_state = 12, to oversample the minority class and account for the imbalance
Metrics
Accuracy: 0.91789092
Precision: 0.17554859
Recall: 0.27860697
Observations
- Increasing SMOTE's random_state to 25 lowered precision, but the effect was nearly negligible
- Lowering SMOTE's random_state to 5 lowered precision and increased recall, again nearly negligibly
- Increasing the DTC's random_state to 10 lowered precision and increased recall, again nearly negligibly
- Lowering the DTC's random_state to 0 increased recall to 0.29353234
- This sensitivity to random_state is what you would expect from a model that overfits/underfits
- The model is not flexible enough
# 03
Method
Add artificial noise.
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
svm.LinearSVC
Metrics
Accuracy:~90%
Precision: ~10%
Recall: ~4%
Observations
Extremely slow runtime and, as mathematically expected, the worst results. Adding more noise changes little:
Average precision-recall score: 0.0374
Average precision-recall score: 0.0412
Average precision-recall score: 0.0462
Average precision-recall score: 0.0481
Average precision-recall score: 0.0408
Average precision-recall score: 0.0450
Average precision-recall score: 0.0410
Average precision-recall score: 0.0395
Average precision-recall score: 0.0411
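A runnable sketch of the noise experiment on synthetic stand-in data. The noise width is scaled down from 200× to 20× the feature count so it finishes quickly; the scoring uses average precision, matching the "average precision-recall score" lines above.

```python
# Experiment 03 sketch: pad features with Gaussian noise, score a LinearSVC
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

rng = np.random.RandomState(0)
X = rng.randn(1000, 12)                    # synthetic stand-in data
y = (rng.rand(1000) < 0.07).astype(int)

# Append noise columns (the original run used 200 * n_features)
n_samples, n_features = X.shape
X_noisy = np.c_[X, rng.randn(n_samples, 20 * n_features)]

train_X, val_X, train_y, val_y = train_test_split(
    X_noisy, y, test_size=0.2, random_state=90)

clf = LinearSVC(random_state=0).fit(train_X, train_y)
scores = clf.decision_function(val_X)
print("Average precision-recall score: {:.4f}".format(
    average_precision_score(val_y, scores)))
```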
# 04
Method
Simple oversampling — copied rows labelled 1 numerous times.
Metrics
Accuracy: 0.97441441
Precision: 0.85238624
Recall: 0.98841699
Observations
- Data NOT split for this experiment (experimental fault), so precision/recall scores reflect predictions on training data rather than separate test data
- SMOTE is a cleaner method
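A sketch of simple oversampling done without the fault noted above: split first, then duplicate minority rows in the training portion only. The column names and the 10× duplication factor are illustrative, not the original values.

```python
# Experiment 04 sketch: oversample by copying minority-class rows
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree, metrics

rng = np.random.RandomState(90)
df = pd.DataFrame(rng.randn(1000, 3), columns=["b", "c", "d"])
df["Label"] = (rng.rand(1000) < 0.07).astype(int)

# Split BEFORE oversampling, so copies never leak into validation
train, val = train_test_split(df, test_size=0.2, random_state=90)

minority = train[train.Label == 1]
train_over = pd.concat([train] + [minority] * 10, ignore_index=True)

model = tree.DecisionTreeClassifier(random_state=1)
model.fit(train_over.drop(columns="Label"), train_over.Label)
predicted = model.predict(val.drop(columns="Label"))
print("Recall:", metrics.recall_score(val.Label, predicted, zero_division=0))
```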
# 05
Method
RandomForestClassifier, n_estimators = 200, random_state = 3
SMOTE, random_state = 12
Code
sm = SMOTE(random_state=12)
x_train_res, y_train_res = sm.fit_resample(train_X, train_y)  # fit_sample in older imblearn versions
model = RandomForestClassifier(n_estimators=200, criterion="gini", random_state=3)
model.fit(x_train_res, y_train_res)
predicted = model.predict(val_X)
Metrics
Accuracy: 0.94666935
Precision: 0.27142857
Recall: 0.18905473
Observations
- Increasing n_estimators increases runtime significantly and does not increase performance significantly
- Decreasing the RFC's random_state decreases precision
- Increasing the RFC's random_state decreases precision and recall negligibly
- Increasing/decreasing SMOTE's random_state is also insignificant