Feature Engineering & Data Pre-Processing: Outliers

Sedefozcan
Jul 30, 2022


“If your data is bad, your machine learning tools are useless”

Thomas C. Redman

Data quality is crucial in data science. If the data is bad, we cannot expect good results, so whatever problem we tackle, we need a good data set. Feature engineering and data pre-processing are how we bring the raw data up to that potential.

“The world’s most valuable resource is no longer oil, but data.”

The Economist

We will examine the following.

  • Outliers
  • Missing Values
  • Encoding
  • Feature Scaling
  • Feature Extraction
  • Feature Interactions
  • End-to-End Application

Let’s talk about feature engineering.

“Applied machine learning is basically feature engineering.”

Andrew Ng

What is feature engineering?

It is working on existing features or producing new variables from the raw data (the latter is called feature extraction).

What is data pre-processing?

It is the process of preparing the raw data before modeling begins.

Feature engineering is one of the steps of data pre-processing. Not only in machine learning but in many data science applications, data pre-processing steps are necessary, so learning them will solve many problems. Furthermore, data pre-processing is the most time-consuming stage of any application: roughly 80 % of the job is preparing the data and only 20 % is modeling.

Outliers

An outlier is a data point that differs significantly from other observations.

What do outliers cause?

Consider a simple regression. With no outliers, the relationship between x and y is modeled by a horizontal line. If we add 3 outliers to the data set, the fitted line changes noticeably. Linear models are especially sensitive to outliers; tree-based models are largely robust to them.
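The effect can be sketched numerically (a minimal illustration with made-up data, not the article's example): fit a least-squares line to a flat series, then again after appending three outliers, and compare the slopes.

```python
import numpy as np

# Flat relationship: y is constant, so the least-squares line is horizontal.
x = np.arange(20, dtype=float)
y = np.full(20, 5.0)
slope_clean, _ = np.polyfit(x, y, deg=1)

# Append 3 outliers at the right-hand end.
x_out = np.append(x, [21.0, 22.0, 23.0])
y_out = np.append(y, [50.0, 55.0, 60.0])
slope_out, _ = np.polyfit(x_out, y_out, deg=1)

print(slope_clean)  # essentially 0: the fitted line is horizontal
print(slope_out)    # clearly positive: three points tilted the whole fit
```

A decision tree fitted to the same data would simply split the three extreme points into their own leaf and leave the predictions for the remaining points untouched, which is why tree models are far less sensitive.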

How can we decide that a value is an outlier?

The things that need to be considered are:

  1. Business Information
  2. Standard Deviation Approach: One may define the outliers as the values outside the interval (mean - 2*std, mean + 2*std) (2 or 3 standard deviations may work, depending on the case).
  3. Z-score Approach: The relevant variable is standardized so that its mean becomes 0 and its standard deviation 1. The values with z-scores outside (-2.5, 2.5) are outliers.
  4. Box plot (Interquartile range- IQR) Method (one variable)
  5. LOF (multi-variable)
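The z-score approach in item 3 fits in a few lines. A minimal sketch on a toy series of my own (the threshold of 2.5 follows the text above):

```python
import pandas as pd

# Toy ages: nineteen ordinary values plus one extreme value (90).
s = pd.Series([22, 25, 27, 24, 26, 23, 25, 28, 21, 26,
               24, 25, 27, 23, 26, 24, 25, 26, 24, 90])

z = (s - s.mean()) / s.std()   # standardize: mean 0, std 1
outliers = s[z.abs() > 2.5]    # values outside (-2.5, 2.5)
print(outliers.tolist())       # [90]
```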

The most commonly used one is the box plot method. First, we sort the data in ascending order and determine the quartiles Q1 (25 %), Q2 (50 %, the median), and Q3 (75 %). The distance between Q1 and Q3 is the interquartile range (IQR). The outliers are the data values lying outside the interval (Q1 - 1.5*IQR, Q3 + 1.5*IQR).

Indeed, the critical point in determining outliers is deciding on the threshold for our task. We prefer the box plot method, which is the most widely used in the literature. If a variable cannot take negative values, the lower limit is generally ignored and only the upper limit matters. For example, for an age variable we already know the lower bound is 0, so it is enough to decide on the upper bound.

Catching Outliers

We use two data sets. One of them is relatively small while the other is big.

import pandas as pd

def load():
    data = pd.read_csv("path of csv")
    return data

df = load()

Graph Technique

We draw the box plot.

import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x=df["Age"])
plt.show()

It seems that there are outliers. We want to get these values programmatically, too.

# determining quartiles
q1 = df["Age"].quantile(0.25)  # Q1
q3 = df["Age"].quantile(0.75)  # Q3
iqr = q3 - q1                  # interquartile range
up = q3 + 1.5 * iqr            # upper limit
low = q1 - 1.5 * iqr           # lower limit

There are some values above the upper limit, but there is no value smaller than the lower limit.
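With the thresholds in hand, boolean indexing returns the offending rows. A minimal sketch on a toy frame (the real df comes from the CSV loaded above):

```python
import pandas as pd

# Toy stand-in for the data set loaded above.
df = pd.DataFrame({"Age": [22, 38, 26, 35, 80, 2, 27, 70]})

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
up = q3 + 1.5 * iqr
low = q1 - 1.5 * iqr

# Rows whose Age falls outside (low, up):
print(df[(df["Age"] > up) | (df["Age"] < low)])
```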

But what if we have lots of variables? It would be great to write a function doing all the processes functionally.

def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

outlier_thresholds(df, "Age")
-----------------------------
Out[28]: (-6.6875, 64.8125)

At this point, note that the quantiles may be changed. Sometimes the 5th and 95th percentiles are used instead. It depends on the problem at hand.

Are there any outliers?

def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > up_limit) |
                 (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False

check_outlier(df, "Age")
-------------------------
Out[30]: True

In this function, we use the outlier_thresholds function. Recall that the quantiles used may differ from data to data. So if we want to set them explicitly, we would add two parameters, q1 and q3, to the check_outlier function and pass them through.
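A sketch of that extension (same logic as the functions above, with q1/q3 passed through; the toy data is mine, for illustration only):

```python
import pandas as pd

def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    iqr = quartile3 - quartile1
    return quartile1 - 1.5 * iqr, quartile3 + 1.5 * iqr

def check_outlier(dataframe, col_name, q1=0.25, q3=0.75):
    # q1/q3 are forwarded so the caller controls the width of the band.
    low_limit, up_limit = outlier_thresholds(dataframe, col_name, q1, q3)
    return bool(((dataframe[col_name] > up_limit) |
                 (dataframe[col_name] < low_limit)).any())

df = pd.DataFrame({"Age": [22, 38, 26, 35, 80, 2, 27, 70]})
print(check_outlier(df, "Age"))                    # True with the quartiles
print(check_outlier(df, "Age", q1=0.05, q3=0.95))  # False with a wider band
```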

What if there are hundreds of variables? How can we test all of them? For this, we create another function that separates all the variables into lists according to their types.

def grab_col_names(dataframe, cat_th=10, car_th=20):
    """
    Gives the names of the categorical, numerical and cardinal variables.
    Note: There could be categorical variables that look like numerical
    variables.

    Parameters
    ------
    dataframe: dataframe
        the dataframe whose variable names will be taken
    cat_th: int, optional
        class-count threshold for variables that are numerical but categorical
    car_th: int, optional
        class-count threshold for variables that are categorical but cardinal

    Returns
    ------
    cat_cols: list
        list of categorical variables
    num_cols: list
        list of numerical variables
    cat_but_car: list
        list of cardinal variables

    Examples
    ------
        import seaborn as sns
        df = sns.load_dataset("iris")
        print(grab_col_names(df))

    Notes
    ------
        cat_cols + num_cols + cat_but_car = total number of variables
        num_but_cat is contained in cat_cols.
    """

    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if
                dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if
                   dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if
                   dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # num_cols
    num_cols = [col for col in dataframe.columns if
                dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f"cat_cols: {len(cat_cols)}")
    print(f"num_cols: {len(num_cols)}")
    print(f"cat_but_car: {len(cat_but_car)}")
    print(f"num_but_cat: {len(num_but_cat)}")
    return cat_cols, num_cols, cat_but_car

Let’s apply this function to the other data set:

def load_application_train():
    data = pd.read_csv("path of csv")
    return data

df = load_application_train()
df.head()
cat_cols, num_cols, cat_but_car = grab_col_names(df)
--------------------------------------------------
Observations: 307511
Variables: 122
cat_cols: 54
num_cols: 67
cat_but_car: 1
num_but_cat: 39

Now we can write loops that check all the variables.

for col in num_cols:
    print(col, check_outlier(df, col))

Grabbing Outliers

In the previous part, we classified the variable types and checked whether there are any outliers or not. Now we create a function that prints the outlying rows and, optionally, returns their indices.

def grab_outliers(dataframe, col_name, index=False):
    low, up = outlier_thresholds(dataframe, col_name)

    outliers = dataframe[(dataframe[col_name] < low) |
                         (dataframe[col_name] > up)]
    if outliers.shape[0] > 10:
        print(outliers.head())
    else:
        print(outliers)

    if index:
        return outliers.index

Solving Outlier Problem

Removing

One may choose to remove all rows containing outliers from the data.

def remove_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    df_without_outliers = dataframe[~((dataframe[col_name] < low_limit) |
                                      (dataframe[col_name] > up_limit))]
    return df_without_outliers

Here the problem is that even if only one cell in a row is an outlier, we remove the entire row. This may cost us a considerable part of the data, which is generally not preferred. Instead, we can re-assign the outliers in the following way.

Re-assignment with thresholds

def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[dataframe[variable] < low_limit, variable] = low_limit
    dataframe.loc[dataframe[variable] > up_limit, variable] = up_limit
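The same capping can also be expressed with pandas' built-in Series.clip, which is equivalent to the threshold re-assignment above (toy data of my own, for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 38, 26, 35, 80, 2, 27, 70]})

q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
low, up = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip caps every value outside [low, up] at the nearest limit.
df["Age"] = df["Age"].clip(lower=low, upper=up)
print(df["Age"].max() <= up)  # True: nothing exceeds the upper limit now
```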

Recap

df = load()
outlier_thresholds(df, "Age")         # determining the thresholds
check_outlier(df, "Age")              # checking for outliers
grab_outliers(df, "Age", index=True)  # listing the outliers

remove_outlier(df, "Age").shape       # removing the outlier rows
replace_with_thresholds(df, "Age")    # re-assigning with the thresholds
check_outlier(df, "Age")
