13.08.2018 Experiment Reports

Amélie B. · Published in Amélie's Blog · Aug 14, 2018

Some experiments I ran in a Jupyter Notebook while approaching the dataset I was given for my internship. They gave me a much better understanding of how different classifiers work with the data.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm, tree, metrics
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
#from sklearn.tree import DecisionTreeClassifier

Read_Data:

spreadsheet_file_path = "/Users/ameliebuc/Documents/byond_internship/ImBlanced-Classification.csv"
data = pd.read_csv(spreadsheet_file_path, encoding='utf-8')
data.describe()

Load_Data:

# Set the prediction target and the features you'll use to predict instances
y = data.Label
data_features = ['b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm']

# Store the columns listed in data_features from the DataFrame data in X
X = data[data_features]

# How imbalanced is the data?
imb_count = data.Label.value_counts()
print("The data has the following classes & counts: \n{}".format(imb_count))

Data Split: (random_state = 90 worked best out of the random values 0, 20, 100, 190 I also tried)

train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=90)

The following are reports on the different ways I changed the code to model the data:

# 01

Method

Decision Tree Classifier, random_state = 1

No accounting for data imbalance
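Code

My notes for this run only record the settings, so the following is a minimal sketch of what it presumably looked like, reusing the train/validation split from above (everything beyond random_state = 1 is an assumption):

# Plain decision tree; the class imbalance is left untouched
model = tree.DecisionTreeClassifier(random_state=1)
model.fit(train_X, train_y)
predicted = model.predict(val_X)
# Accuracy/precision/recall from sklearn.metrics, as in the code under #05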

Metrics

Accuracy: 0.92775206

Precision: 0.14414414

Recall: 0.15920398

Observations

Very fast runtime but low performance. Accuracy is high because the model predicts mostly 0s, and most of the dataset consists of 0s.

# 02

Method

Decision Tree Classifier, random_state = 1

SMOTE, random_state = 12, to oversample the minority class
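Code

A sketch under the same assumptions as #01, with SMOTE applied to the training split only:

# Generate synthetic minority-class samples for the training data
sm = SMOTE(random_state=12)
X_res, y_res = sm.fit_resample(train_X, train_y)
model = tree.DecisionTreeClassifier(random_state=1)
model.fit(X_res, y_res)
predicted = model.predict(val_X)  # the validation data stays untouched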

Metrics

Accuracy: 0.91789092

Precision: 0.17554859

Recall: 0.27860697

Observations

  • Increasing SMOTE's random_state to 25 lowered precision, but the effect was nearly negligible
  • Lowering SMOTE's random_state to 5 lowered precision and increased recall, but the effects were nearly negligible
  • Increasing the DTC's random_state to 10 lowered precision and increased recall, but the effects were nearly negligible
  • Lowering the DTC's random_state to 0 increased recall to 0.29353234
  • All of these random_state observations are expected with overfitting/underfitting
  • The model is not flexible enough

# 03

Method

Add artificial noise to the features, then classify with a linear SVM (svm.LinearSVC).

random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
# Append 200 columns of Gaussian noise per original feature
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
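My notes stop here; a sketch of how the run presumably continued, re-splitting the noisy X (the split and the classifier's random_state = 0 are assumptions) and scoring with average_precision_score, which matches the "Average precision-recall score" lines below:

# Re-split the data now that the noise columns are appended
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=90)
model = svm.LinearSVC(random_state=0)
model.fit(train_X, train_y)
average_precision = metrics.average_precision_score(val_y, model.decision_function(val_X))
print("Average precision-recall score: {0:0.4f}".format(average_precision))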

Metrics

Accuracy: ~90%

Precision: ~10%

Recall: ~4%

Observations

Extremely slow runtime and, as mathematically expected, the worst results. Adding more noise changes little:
Average precision-recall score: 0.0374
Average precision-recall score: 0.0412
Average precision-recall score: 0.0462
Average precision-recall score: 0.0481
Average precision-recall score: 0.0408
Average precision-recall score: 0.0450
Average precision-recall score: 0.0410
Average precision-recall score: 0.0395
Average precision-recall score: 0.0411

# 04

Method

Simple oversampling: copied the rows labelled 1 numerous times.
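Code

No code survives in my notes; a sketch of the naive oversampling, assuming the minority class is labelled 1 (the repetition count n_copies is illustrative):

# Duplicate the minority rows n_copies times and append them to the data
minority = data[data.Label == 1]
n_copies = 10
data_over = pd.concat([data] + [minority] * n_copies, ignore_index=True)
X_over = data_over[data_features]
y_over = data_over.Label
model = tree.DecisionTreeClassifier(random_state=1)
model.fit(X_over, y_over)
predicted = model.predict(X_over)  # no split: predicting on the training data itself (see Observations)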

Metrics

Accuracy: 0.97441441

Precision: 0.85238624

Recall: 0.98841699

Observations

  • The data was NOT split for this experiment (an experimental fault), so the precision/recall scores reflect predictions on the training data rather than on separate test data
  • SMOTE is a cleaner method

# 05

Method

RandomForestClassifier, n_estimators = 200, random_state = 3

SMOTE, random_state = 12

Code

sm = SMOTE(random_state=12)
x_train_res, y_train_res = sm.fit_resample(train_X, train_y)  # fit_sample in older imblearn versions
model = RandomForestClassifier(n_estimators=200, criterion="gini", random_state=3)
model.fit(x_train_res, y_train_res)
predicted = model.predict(val_X)
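The metrics below were presumably computed on the validation predictions with sklearn.metrics; a minimal sketch:

print("Accuracy: {}".format(metrics.accuracy_score(val_y, predicted)))
print("Precision: {}".format(metrics.precision_score(val_y, predicted)))
print("Recall: {}".format(metrics.recall_score(val_y, predicted)))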

Metrics

Accuracy: 0.94666935

Precision: 0.27142857

Recall: 0.18905473

Observations

  • Increasing n_estimators increases runtime significantly without a comparable gain in performance
  • Decreasing the RFC's random_state decreases precision
  • Increasing the RFC's random_state decreases precision and recall negligibly
  • Increasing/decreasing SMOTE's random_state is also insignificant
