APPLYING RANDOM FOREST (CLASSIFICATION) — MACHINE LEARNING ALGORITHM FROM SCRATCH WITH REAL DATASETS

Abilash R
9 min read · Jul 31, 2018

1. Understanding the datasets

Data Set Information:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Attribute Information:

X1: Age of patient at time of operation (numerical)
X2: Patient's year of operation (year minus 1900, numerical)
X3: Number of positive axillary nodes detected (numerical)
Y: Survival status (class attribute)
   1 = the patient survived 5 years or longer
   2 = the patient died within 5 years

2. Importing Datasets

In [1]:

import numpy as np
import pandas as pd
df = pd.read_csv("survival.csv")
print(df)
X1 X2 X3 Y
0 30 64 1 1
1 30 62 3 1
2 30 65 0 1
3 31 59 2 1
4 31 65 4 1
5 33 58 10 1
6 33 60 0 1
7 34 59 0 2
8 34 66 9 2
9 34 58 30 1
10 34 60 1 1
11 34 61 10 1
12 34 67 7 1
13 34 60 0 1
14 35 64 13 1
15 35 63 0 1
16 36 60 1 1
17 36 69 0 1
18 37 60 0 1
19 37 63 0 1
20 37 58 0 1
21 37 59 6 1
22 37 60 15 1
23 37 63 0 1
24 38 69 21 2
25 38 59 2 1
26 38 60 0 1
27 38 60 0 1
28 38 62 3 1
29 38 64 1 1
.. .. .. .. ..
276 67 66 0 1
277 67 61 0 1
278 67 65 0 1
279 68 67 0 1
280 68 68 0 1
281 69 67 8 2
282 69 60 0 1
283 69 65 0 1
284 69 66 0 1
285 70 58 0 2
286 70 58 4 2
287 70 66 14 1
288 70 67 0 1
289 70 68 0 1
290 70 59 8 1
291 70 63 0 1
292 71 68 2 1
293 72 63 0 2
294 72 58 0 1
295 72 64 0 1
296 72 67 3 1
297 73 62 0 1
298 73 68 0 1
299 74 65 3 2
300 74 63 0 1
301 75 62 1 1
302 76 67 0 1
303 77 65 3 1
304 78 65 1 2
305 83 58 2 2
[306 rows x 4 columns]
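
Before training, it can help to check how the two survival classes are balanced. A minimal sketch, assuming only the first ten rows printed above (the `sample` DataFrame is an illustrative stand-in, not the full `survival.csv`):

```python
import pandas as pd

# First ten rows of the dataset, copied from the printed output above.
rows = [
    (30, 64, 1, 1), (30, 62, 3, 1), (30, 65, 0, 1), (31, 59, 2, 1),
    (31, 65, 4, 1), (33, 58, 10, 1), (33, 60, 0, 1), (34, 59, 0, 2),
    (34, 66, 9, 2), (34, 58, 30, 1),
]
sample = pd.DataFrame(rows, columns=["X1", "X2", "X3", "Y"])

# Count how many rows fall into each survival class (1 or 2).
class_counts = sample["Y"].value_counts().sort_index()
print(class_counts)

# Basic summary statistics for the three input columns.
print(sample[["X1", "X2", "X3"]].describe())
```

Running the same `value_counts` call on the full `df` should show how strongly class 1 outnumbers class 2, which matters when reading the accuracy figures later on.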

3. Splitting the data into input and output for training

In [2]:

# Use all 306 rows for training: three feature columns as input, Y as output
X_train = df[['X1', 'X2', 'X3']].values   # shape (306, 3)
y_train = df[['Y']].values                # shape (306, 1)
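
The cell above uses all 306 rows for training, so no rows are left over for testing. One common alternative, not used in this post, is to hold out part of the data with scikit-learn's `train_test_split`. A sketch on randomly generated stand-in arrays with the same shape as `X_train` and `y_train` (the feature values and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 306 rows and 3 features, like the post's arrays.
# The feature values are random, NOT the real survival data.
rng = np.random.default_rng(0)
X = rng.integers(0, 80, size=(306, 3))
# Class sizes (225 and 81) taken from the support column of the final report.
y = np.array([1] * 225 + [2] * 81)

# Hold out 20% of the rows; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_tr.shape, X_te.shape)
```

With a split like this, the model would be fitted on `X_tr` and scored on `X_te`, which it has never seen.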

In [3]:

print("Training data - Input")
print(X_train)
print("\n\nTraining data - Output")
print(y_train)
Training data - Input
[[30 64 1]
[30 62 3]
[30 65 0]
[31 59 2]
[31 65 4]
[33 58 10]
[33 60 0]
[34 59 0]
[34 66 9]
[34 58 30]
[34 60 1]
[34 61 10]
[34 67 7]
[34 60 0]
[35 64 13]
[35 63 0]
[36 60 1]
[36 69 0]
[37 60 0]
[37 63 0]
[37 58 0]
[37 59 6]
[37 60 15]
[37 63 0]
[38 69 21]
[38 59 2]
[38 60 0]
[38 60 0]
[38 62 3]
[38 64 1]
[38 66 0]
[38 66 11]
[38 60 1]
[38 67 5]
[39 66 0]
[39 63 0]
[39 67 0]
[39 58 0]
[39 59 2]
[39 63 4]
[40 58 2]
[40 58 0]
[40 65 0]
[41 60 23]
[41 64 0]
[41 67 0]
[41 58 0]
[41 59 8]
[41 59 0]
[41 64 0]
[41 69 8]
[41 65 0]
[41 65 0]
[42 69 1]
[42 59 0]
[42 58 0]
[42 60 1]
[42 59 2]
[42 61 4]
[42 62 20]
[42 65 0]
[42 63 1]
[43 58 52]
[43 59 2]
[43 64 0]
[43 64 0]
[43 63 14]
[43 64 2]
[43 64 3]
[43 60 0]
[43 63 2]
[43 65 0]
[43 66 4]
[44 64 6]
[44 58 9]
[44 63 19]
[44 61 0]
[44 63 1]
[44 61 0]
[44 67 16]
[45 65 6]
[45 66 0]
[45 67 1]
[45 60 0]
[45 67 0]
[45 59 14]
[45 64 0]
[45 68 0]
[45 67 1]
[46 58 2]
[46 69 3]
[46 62 5]
[46 65 20]
[46 62 0]
[46 58 3]
[46 63 0]
[47 63 23]
[47 62 0]
[47 65 0]
[47 61 0]
[47 63 6]
[47 66 0]
[47 67 0]
[47 58 3]
[47 60 4]
[47 68 4]
[47 66 12]
[48 58 11]
[48 58 11]
[48 67 7]
[48 61 8]
[48 62 2]
[48 64 0]
[48 66 0]
[49 63 0]
[49 64 10]
[49 61 1]
[49 62 0]
[49 66 0]
[49 60 1]
[49 62 1]
[49 63 3]
[49 61 0]
[49 67 1]
[50 63 13]
[50 64 0]
[50 59 0]
[50 61 6]
[50 61 0]
[50 63 1]
[50 58 1]
[50 59 2]
[50 61 0]
[50 64 0]
[50 65 4]
[50 66 1]
[51 59 13]
[51 59 3]
[51 64 7]
[51 59 1]
[51 65 0]
[51 66 1]
[52 69 3]
[52 59 2]
[52 62 3]
[52 66 4]
[52 61 0]
[52 63 4]
[52 69 0]
[52 60 4]
[52 60 5]
[52 62 0]
[52 62 1]
[52 64 0]
[52 65 0]
[52 68 0]
[53 58 4]
[53 65 1]
[53 59 3]
[53 60 9]
[53 63 24]
[53 65 12]
[53 58 1]
[53 60 1]
[53 60 2]
[53 61 1]
[53 63 0]
[54 60 11]
[54 65 23]
[54 65 5]
[54 68 7]
[54 59 7]
[54 60 3]
[54 66 0]
[54 67 46]
[54 62 0]
[54 69 7]
[54 63 19]
[54 58 1]
[54 62 0]
[55 63 6]
[55 68 15]
[55 58 1]
[55 58 0]
[55 58 1]
[55 66 18]
[55 66 0]
[55 69 3]
[55 69 22]
[55 67 1]
[56 65 9]
[56 66 3]
[56 60 0]
[56 66 2]
[56 66 1]
[56 67 0]
[56 60 0]
[57 61 5]
[57 62 14]
[57 64 1]
[57 64 9]
[57 69 0]
[57 61 0]
[57 62 0]
[57 63 0]
[57 64 0]
[57 64 0]
[57 67 0]
[58 59 0]
[58 60 3]
[58 61 1]
[58 67 0]
[58 58 0]
[58 58 3]
[58 61 2]
[59 62 35]
[59 60 0]
[59 63 0]
[59 64 1]
[59 64 4]
[59 64 0]
[59 64 7]
[59 67 3]
[60 59 17]
[60 65 0]
[60 61 1]
[60 67 2]
[60 61 25]
[60 64 0]
[61 62 5]
[61 65 0]
[61 68 1]
[61 59 0]
[61 59 0]
[61 64 0]
[61 65 8]
[61 68 0]
[61 59 0]
[62 59 13]
[62 58 0]
[62 65 19]
[62 62 6]
[62 66 0]
[62 66 0]
[62 58 0]
[63 60 1]
[63 61 0]
[63 62 0]
[63 63 0]
[63 63 0]
[63 66 0]
[63 61 9]
[63 61 28]
[64 58 0]
[64 65 22]
[64 66 0]
[64 61 0]
[64 68 0]
[65 58 0]
[65 61 2]
[65 62 22]
[65 66 15]
[65 58 0]
[65 64 0]
[65 67 0]
[65 59 2]
[65 64 0]
[65 67 1]
[66 58 0]
[66 61 13]
[66 58 0]
[66 58 1]
[66 68 0]
[67 64 8]
[67 63 1]
[67 66 0]
[67 66 0]
[67 61 0]
[67 65 0]
[68 67 0]
[68 68 0]
[69 67 8]
[69 60 0]
[69 65 0]
[69 66 0]
[70 58 0]
[70 58 4]
[70 66 14]
[70 67 0]
[70 68 0]
[70 59 8]
[70 63 0]
[71 68 2]
[72 63 0]
[72 58 0]
[72 64 0]
[72 67 3]
[73 62 0]
[73 68 0]
[74 65 3]
[74 63 0]
[75 62 1]
[76 67 0]
[77 65 3]
[78 65 1]
[83 58 2]]
Training data - Output
[[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[2]
[1]
[1]
[1]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[1]
[1]
[1]
[2]
[2]
[1]
[1]
[1]
[1]
[1]
[1]
[2]
[1]
[1]
[1]
[1]
[1]
[2]
[1]
[1]
[1]
[1]
[2]
[2]]

4. Implementing RANDOM FOREST CLASSIFICATION

In [4]:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
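
`RandomForestClassifier` hides the ensemble logic. As a rough illustration of the idea behind it (and not scikit-learn's actual implementation, which averages class probabilities instead of counting hard votes), the sketch below trains several decision trees on bootstrap samples and combines them by majority vote. The data is synthetic, and `fit_forest` and `predict_forest` are hypothetical helper names:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 rows, 3 integer features, labels 1 or 2.
X = rng.integers(0, 80, size=(200, 3))
y = np.where(X[:, 2] > 40, 2, 1)

def fit_forest(X, y, n_trees=10):
    """Fit n_trees decision trees, each on its own bootstrap sample."""
    trees = []
    for i in range(n_trees):
        # Bootstrap: draw len(X) row indices with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        # max_features="sqrt" makes each split consider a random feature subset.
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Combine the trees by majority vote (ties go to the smaller label)."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_rows)
    return np.array([np.bincount(col).argmax() for col in votes.T])

trees = fit_forest(X, y)
preds = predict_forest(trees, X)
print("training accuracy:", (preds == y).mean())
```

The two sources of randomness, row sampling and feature sampling, are what make the individual trees differ from each other.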

5. Fitting the datasets

In [5]:

clf.fit(X_train,y_train.ravel())

Out[5]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
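
The printed parameters show `random_state=None`, so each run builds a slightly different forest, and `n_estimators=10`, which was the default in the scikit-learn version used here (later releases changed the default to 100). A hedged sketch of pinning both explicitly so that results are repeatable, on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the labeling rule is arbitrary and for illustration only.
rng = np.random.default_rng(1)
X = rng.integers(0, 80, size=(100, 3))
y = np.where(X[:, 0] + X[:, 2] > 80, 2, 1)

# Two forests with the same settings and the same seed.
clf_a = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
clf_b = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# With a fixed random_state the two forests give identical probabilities.
same = np.array_equal(clf_a.predict_proba(X), clf_b.predict_proba(X))
print(same)
```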

6. Predicting Sample Data

In [6]:

print(clf.predict([[83,58,2]]))
[2]
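
`predict` returns only the winning class. When the confidence of a prediction matters, `predict_proba` returns the forest's averaged probability for each class. The sketch below fits a small forest on the first ten rows printed in section 2 only, so its numbers will differ from the model trained on all 306 rows:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# First ten rows of the dataset, copied from the printed output in section 2.
X_small = np.array([
    [30, 64, 1], [30, 62, 3], [30, 65, 0], [31, 59, 2], [31, 65, 4],
    [33, 58, 10], [33, 60, 0], [34, 59, 0], [34, 66, 9], [34, 58, 30],
])
y_small = np.array([1, 1, 1, 1, 1, 1, 1, 2, 2, 1])

clf_small = RandomForestClassifier(n_estimators=10, random_state=0)
clf_small.fit(X_small, y_small)

# One row per sample, one column per class, in the order of clf_small.classes_.
proba = clf_small.predict_proba([[83, 58, 2]])
print(clf_small.classes_)
print(proba)
```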

7. Predicting on the training data

In [7]:

# Predict on the training data so the results can be evaluated
y_pred= clf.predict(X_train)
print(y_pred)
[1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 2
2 2 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2 2 2 1
1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 2 1 1
1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1
1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2
2 2 2 2 1 1 1 1 1 2 2 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 2 1 1
1 1 1 2 1 1 1 1 2 2]
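
Predicting on the same rows the model was trained on tends to give optimistic scores, because a forest can largely memorize its training set. A common complement, not part of the original notebook, is k-fold cross-validation. A sketch with `cross_val_score` on synthetic stand-in data (the arrays and the labeling rule are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with the same shape as the post's training arrays.
rng = np.random.default_rng(2)
X = rng.integers(0, 80, size=(306, 3))
y = np.where(X[:, 2] > 55, 2, 1)

clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Five folds: each fold is scored by a model trained on the other four.
scores = cross_val_score(clf, X, y, cv=5)
print(scores)
print("mean accuracy:", scores.mean())
```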

8. Report Generation

In [8]:

from sklearn.metrics import classification_report
report = classification_report(y_train, y_pred)
print(report)
             precision    recall  f1-score   support

          1       0.97      0.98      0.98       225
          2       0.95      0.91      0.93        81

avg / total       0.96      0.96      0.96       306
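
The `avg / total` row is a support-weighted average of the per-class rows, and each f1-score is the harmonic mean of precision and recall. The arithmetic can be checked by hand from the rounded values in the report (recomputing other cells the same way may be off by 0.01, because the inputs are already rounded):

```python
# Per-class values copied from the report: (precision, recall, support)
report_rows = {1: (0.97, 0.98, 225), 2: (0.95, 0.91, 81)}

# Total support is the number of rows evaluated.
total = sum(s for _, _, s in report_rows.values())

# Weighted averages: each class contributes in proportion to its support.
avg_precision = sum(p * s for p, _, s in report_rows.values()) / total
avg_recall = sum(r * s for _, r, s in report_rows.values()) / total

# F1 for class 2: harmonic mean of its precision and recall.
p2, r2, _ = report_rows[2]
f1_class2 = 2 * p2 * r2 / (p2 + r2)

print(total, round(avg_precision, 2), round(avg_recall, 2), round(f1_class2, 2))
# prints: 306 0.96 0.96 0.93
```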
