Feature Engineering — 1 | The Silent Killers: Outliers!

We got a runner!
  1. A data entry or experimental measurement error
  1. Interquartile Range (IQR) Method
                   sub     values
0                  IQR  17.000000
1          Upper Bound  66.500000
2          Lower Bound  -1.500000
3         Sum outliers   9.000000
4  percentage outliers   1.171875
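As a rough sketch of how a summary like the one above can be computed: the bounds in the table imply Q1 = 24 and Q3 = 41 (since 41 + 1.5 × 17 = 66.5 and 24 - 1.5 × 17 = -1.5). The helper names `iqr_bounds` and `iqr_outlier_mask` and the toy `ages` array are made up for illustration. They are not the original dataset, only a small sample chosen so that the quartiles come out the same.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (iqr, lower_bound, upper_bound) for the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask that is True where a value falls outside the bounds."""
    values = np.asarray(values, dtype=float)
    _, lower, upper = iqr_bounds(values, k)
    return (values < lower) | (values > upper)

# Hypothetical toy sample (not the article's data)
ages = np.array([21, 24, 24, 29, 33, 41, 41, 45, 81])

iqr, lower, upper = iqr_bounds(ages)
mask = iqr_outlier_mask(ages)

print("IQR:", iqr)
print("Upper Bound:", upper)
print("Lower Bound:", lower)
print("Sum outliers:", int(mask.sum()))
print("percentage outliers:", 100 * mask.mean())
```

The multiplier `k=1.5` is the conventional choice; a larger value such as 3 flags only the more extreme points.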
  • About 68% of the data points lie within ±1 standard deviation of the mean,
  • About 95% of the data points lie within ±2 standard deviations of the mean,
  • About 99.7% of the data points lie within ±3 standard deviations of the mean.
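These percentages hold for normally distributed data, and they are easy to check empirically. The sketch below draws a synthetic normal sample (the mean of 50, standard deviation of 10 and sample size are arbitrary assumptions), converts it to z-scores, and measures the share of points inside each band. Points with |z| > 3 are the ones a z-score rule would treat as outliers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, normally distributed sample (parameters are arbitrary)
sample = rng.normal(loc=50.0, scale=10.0, size=100_000)

# z-score: distance from the mean in units of standard deviation
z = (sample - sample.mean()) / sample.std()

# Share of points within +/- k standard deviations of the mean
shares = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within +/-{k} sd: {share:.2%}")

# Candidate outliers under the 3-sigma rule
outliers = sample[np.abs(z) > 3]
print("points beyond 3 sd:", outliers.size)
```

On skewed data the shares can differ noticeably from 68/95/99.7, which is one reason the IQR method is often preferred there.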
  1. Given a dataset, a random sub-sample of the data is selected and assigned to a binary tree.
  2. Branching starts by selecting a random feature from the set of all N features, and then a random threshold for it (any value between the minimum and maximum of the selected feature).
  3. If a data point's value for that feature is less than the threshold, it goes to the left branch; otherwise it goes to the right. The node is thus split into left and right branches.
  4. Steps 2 and 3 are repeated recursively until every data point is completely isolated or the maximum depth (if defined) is reached.
  5. The steps above are repeated to build many such random binary trees.[7] Points that get isolated after only a few splits, on average across the trees, are the likely outliers.
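To make the steps above concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: it only follows the branch that the query point falls into, and it leaves out the path-length adjustment and score normalization of the full algorithm. The function names, `max_depth=10`, `sample_size=64` and the toy data are all assumptions made for illustration.

```python
import random

def isolation_depth(point, sample, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` within `sample`.

    `sample` is a list of equal-length feature tuples.
    """
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))            # step 2: random feature
    lo = min(row[feature] for row in sample)
    hi = max(row[feature] for row in sample)
    if lo == hi:                                   # nothing left to split on
        return depth
    threshold = rng.uniform(lo, hi)                # step 2: random threshold
    # step 3: keep only the branch the point falls into
    if point[feature] < threshold:
        branch = [row for row in sample if row[feature] < threshold]
    else:
        branch = [row for row in sample if row[feature] >= threshold]
    return isolation_depth(point, branch, rng, depth + 1, max_depth)  # step 4

def mean_depth(point, data, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth over many random trees (step 5)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))  # step 1
        total += isolation_depth(point, sample, rng)
    return total / n_trees

# Toy data: one tight cluster plus a single far-away point
gen = random.Random(42)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(500)]
outlier = (8.0, 8.0)
data.append(outlier)

inlier_depth = mean_depth((0.0, 0.0), data)
outlier_depth = mean_depth(outlier, data)
print("mean depth, cluster centre:", inlier_depth)
print("mean depth, far point:", outlier_depth)
```

The far-away point needs fewer splits on average than a point in the middle of the cluster, which is exactly the signal Isolation Forest turns into an anomaly score.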
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import IsolationForest
data = pd.read_csv('hitters.csv')
data.head(10)
# Box plot of career runs; points beyond the whiskers are outlier candidates
sns.boxplot(x=data["CRuns"])
model=IsolationForest()
model.fit(data[['CRuns']])
print(model.get_params())
{'behaviour': 'deprecated', 'bootstrap': False, 'contamination': 'auto', 'max_features': 1.0, 'max_samples': 'auto', 'n_estimators': 100, 'n_jobs': None, 'random_state': None, 'verbose': 0, 'warm_start': False}
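# predict() labels each row: -1 for an anomaly, 1 for a normal point
data["anomaly_score"] = model.predict(data[["CRuns"]])
data.loc[data["anomaly_score"] == -1, ["CRuns", "anomaly_score"]]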
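If you do not have hitters.csv at hand, the same workflow can be tried on synthetic data. The sketch below is only an illustration: the `toy` frame, the two planted extreme values, `contamination=0.01` and `random_state=0` are all assumptions, not settings from the run above. It also adds `decision_function`, which returns a continuous score (lower means more anomalous) next to the -1/1 labels.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for a CRuns-like column: 200 typical values
# plus two planted extreme values at row positions 200 and 201
values = np.concatenate([rng.normal(300, 50, 200), [2000.0, 2200.0]])
toy = pd.DataFrame({"CRuns": values})

# contamination is the expected share of outliers (an assumption here)
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)

# fit_predict() returns -1 for anomalies and 1 for normal points
toy["anomaly_score"] = model.fit_predict(toy[["CRuns"]])

# decision_function() returns a continuous score; lower = more anomalous
toy["raw_score"] = model.decision_function(toy[["CRuns"]])

flagged = toy[toy["anomaly_score"] == -1]
print(flagged.sort_values("raw_score"))
```

Setting `random_state` makes the result repeatable between runs; with the default `None`, the flagged rows near the threshold can change from run to run.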

Buğra Alp Nas
