Feature Engineering — 1 | The Silent Killers: Outliers!

We got a runner!
  1. Data entry or experimental measurement error
  1. Interquartile Range (IQR) Method
   sub                     values
0  IQR                  17.000000
1  Upper Bound          66.500000
2  Lower Bound          -1.500000
3  Sum outliers          9.000000
4  percentage outliers   1.171875
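For reference, the numbers above follow the usual 1.5 × IQR rule: an upper bound of 66.5 and a lower bound of -1.5 with an IQR of 17 correspond to Q1 = 24 and Q3 = 41. The snippet below is a minimal sketch of that calculation, not the article's original code. The `iqr_bounds` helper and the `ages` array are hypothetical and only illustrate the steps.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (iqr, lower, upper) for the k * IQR outlier rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr

# Hypothetical sample, NOT taken from the article's dataset
ages = np.array([21, 22, 24, 25, 29, 33, 41, 45, 52, 81])

iqr, lower, upper = iqr_bounds(ages)

# Anything outside [lower, upper] is treated as an outlier
outliers = ages[(ages < lower) | (ages > upper)]
pct = 100 * outliers.size / ages.size

print(f"IQR={iqr}, lower={lower}, upper={upper}")
print(f"outliers={outliers.tolist()}, percentage={pct}")
```

The multiplier `k=1.5` is the conventional choice. A larger value such as 3 flags only the more extreme points.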
  • 68% of the data points lie within ±1 standard deviation of the mean,
  • 95% of the data points lie within ±2 standard deviations of the mean,
  • 99.7% of the data points lie within ±3 standard deviations of the mean.
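These percentages hold for normally distributed data, and they are what motivates the common "flag anything beyond 3 standard deviations" rule. The sketch below checks the shares on a simulated normal sample and then applies a z-score cutoff. The `zscore_outliers` helper, the simulated sample, and the threshold of 3 are assumptions made for illustration, not something taken from the article's code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated normal data (hypothetical mean and spread)
sample = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Standardize, then measure the share of points within k standard deviations
z = (sample - sample.mean()) / sample.std()
shares = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} std: {share:.3f}")

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    values = np.asarray(values, dtype=float)
    scores = (values - values.mean()) / values.std()
    return values[np.abs(scores) > threshold]

# Twenty identical readings plus one extreme value
flagged = zscore_outliers([10.0] * 20 + [100.0])
print(flagged)
```

Note that the mean and standard deviation are themselves pulled by extreme values, so on small or heavily skewed samples this rule can miss outliers that the IQR method would catch.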
  1. Given a dataset, a random sub-sample of the data is selected and assigned to a binary tree.
  2. Branching starts by selecting a random feature from the set of all N features. A random threshold is then chosen for that feature, anywhere between its minimum and maximum values.
  3. If a data point's value is less than the threshold, it goes to the left branch; otherwise it goes to the right. The node is thus split into left and right branches.
  4. Steps 2 and 3 are repeated recursively until each data point is completely isolated or the maximum depth (if defined) is reached.
  5. The above steps are repeated to build an ensemble of random binary trees.[7]
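The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function names `path_length` and `mean_path_length`, the default sample size of 256, and the depth cap of 8 are assumptions chosen for the example. Instead of storing trees, it follows one point down a freshly built random tree and counts the splits needed to isolate it. Anomalies tend to be isolated after fewer splits.

```python
import numpy as np

def path_length(point, X, rng, depth=0, max_depth=8):
    """Count the random splits needed to isolate `point` within X."""
    # Stop when the point is isolated or the depth cap is reached (step 4)
    if len(X) <= 1 or depth >= max_depth:
        return depth
    # Step 2: pick a random feature, then a random threshold in its range
    feature = rng.integers(X.shape[1])
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:
        # Simplification: stop instead of trying another feature
        return depth
    threshold = rng.uniform(lo, hi)
    # Step 3: keep only the branch that the point falls into
    if point[feature] < threshold:
        subset = X[X[:, feature] < threshold]
    else:
        subset = X[X[:, feature] >= threshold]
    return path_length(point, subset, rng, depth + 1, max_depth)

def mean_path_length(point, X, n_trees=100, sample_size=256, seed=0):
    """Average path length over many random trees (steps 1 and 5)."""
    rng = np.random.default_rng(seed)
    lengths = []
    for _ in range(n_trees):
        # Step 1: each tree sees a random sub-sample of the data
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        lengths.append(path_length(point, X[idx], rng))
    return float(np.mean(lengths))

# Hypothetical data: one tight cluster plus a single far-away point
data_rng = np.random.default_rng(1)
cluster = data_rng.normal(0.0, 1.0, size=(200, 2))
far_point = np.array([8.0, 8.0])
X = np.vstack([cluster, far_point])

outlier_len = mean_path_length(far_point, X)
inlier_len = mean_path_length(np.array([0.0, 0.0]), X)
print(f"outlier: {outlier_len:.2f} splits, inlier: {inlier_len:.2f} splits")
```

The full algorithm converts the average path length into a normalized anomaly score, which this sketch leaves out.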
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import IsolationForest
data = pd.read_csv('hitters.csv')
{'behaviour': 'deprecated', 'bootstrap': False, 'contamination': 'auto', 'max_features': 1.0, 'max_samples': 'auto', 'n_estimators': 100, 'n_jobs': None, 'random_state': None, 'verbose': 0, 'warm_start': False}
data[["CRuns", "anomaly_score"]][data["anomaly_score"]==-1]
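The snippet above filters on an `anomaly_score` column, but the step that creates it is not shown. One plausible way to get there is sketched below. It assumes the column holds the labels returned by `IsolationForest.fit_predict` (-1 for anomalies, 1 for inliers), and it uses synthetic numbers in place of `hitters.csv`, so the `CRuns` values here are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-in for the CRuns column (NOT the real hitters data):
# 300 typical values plus two extreme ones
cruns = np.concatenate([rng.normal(350, 100, size=300), [2000.0, 2200.0]])
df = pd.DataFrame({"CRuns": cruns})

# random_state is fixed only to make the sketch reproducible
model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)

# fit_predict returns -1 for anomalies and 1 for inliers
df["anomaly_score"] = model.fit_predict(df[["CRuns"]])

# Same filter as in the article, written with .loc
flagged = df.loc[df["anomaly_score"] == -1, ["CRuns", "anomaly_score"]]
print(flagged.sort_values("CRuns", ascending=False).head())
```

Despite the column name, `fit_predict` returns labels rather than scores. If a continuous score is needed, `model.decision_function(df[["CRuns"]])` or `model.score_samples(df[["CRuns"]])` provides one, with lower values indicating more anomalous points.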




Buğra Alp Nas


