Feature Engineering & Data Pre-Processing: Outliers

Sedefozcan
Jul 30, 2022


“If your data is bad, your machine learning tools are useless”

Thomas C. Redman

Data quality is crucial in data science. If the data is bad, we cannot expect good results, so whatever problem we tackle, we need a good data set. Feature engineering and data pre-processing are how we bring the raw data up to that potential.

“The world’s most valuable resource is no longer oil, but data.”

The Economist

We will examine the following.

  • Outliers
  • Missing Values
  • Encoding
  • Feature Scaling
  • Feature Extraction
  • Feature Interactions
  • End-to-End Application

Let’s talk about feature engineering.

“Applied machine learning is basically feature engineering.”

Andrew Ng

What is feature engineering?

It is working on existing features or producing new variables from the raw data (the latter is called feature extraction).

What is data pre-processing?

It is the process of preparing the raw data before modeling begins.

Feature engineering is one of the steps of data pre-processing. Not only in machine learning but in many data science applications, data pre-processing steps are necessary, so learning them will solve many problems. Furthermore, data pre-processing is the most time-consuming stage of any application: roughly 80 % of the job is preparing the data and only 20 % is modeling.

Outliers

An outlier is a data point that differs significantly from other observations.

What do outliers cause?

Consider a simple regression. With no outliers, the relationship between x and y is modeled by a horizontal line. If we add 3 outliers to the data set, the fitted line changes noticeably. Linear models are especially sensitive to outliers; tree-based models are largely robust to them.
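The effect can be sketched numerically (a minimal illustration with made-up data, not the article's example): fit a least-squares line to a flat series, then again after appending three outliers, and compare the slopes.

```python
import numpy as np

# Flat relationship: y is constant, so the least-squares line is horizontal.
x = np.arange(20, dtype=float)
y = np.full(20, 5.0)
slope_clean, _ = np.polyfit(x, y, deg=1)

# Append 3 outliers at the right-hand end.
x_out = np.append(x, [21.0, 22.0, 23.0])
y_out = np.append(y, [50.0, 55.0, 60.0])
slope_out, _ = np.polyfit(x_out, y_out, deg=1)

print(slope_clean)  # essentially 0: the fitted line is horizontal
print(slope_out)    # clearly positive: three points tilted the whole fit
```

A decision tree fitted to the same data would simply split the three extreme points into their own leaf and leave the predictions for the remaining points untouched, which is why tree models are far less sensitive.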

How can we decide that a value is an outlier?

The things that need to be considered are:

  1. Business Information
  2. Standard Deviation Approach: One may define the outliers as the values outside the interval (mean - 2*std, mean + 2*std) (2 or 3 standard deviations may work, depending on the case).
  3. Z-score Approach: The relevant variable is standardized so that its mean becomes 0 and its standard deviation 1. The values with z-scores outside (-2.5, 2.5) are outliers.
  4. Box plot (Interquartile range- IQR) Method (one variable)
  5. LOF (multi-variable)
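The z-score approach in item 3 fits in a few lines. A minimal sketch on a toy series of my own (the threshold of 2.5 follows the text above):

```python
import pandas as pd

# Toy ages: nineteen ordinary values plus one extreme value (90).
s = pd.Series([22, 25, 27, 24, 26, 23, 25, 28, 21, 26,
               24, 25, 27, 23, 26, 24, 25, 26, 24, 90])

z = (s - s.mean()) / s.std()   # standardize: mean 0, std 1
outliers = s[z.abs() > 2.5]    # values outside (-2.5, 2.5)
print(outliers.tolist())       # [90]
```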

The most commonly used one is the box plot method. First, we sort the data in ascending order and determine the quartiles Q1 (25 %), Q2 (50 %, the median), and Q3 (75 %). The distance between Q1 and Q3 is the interquartile range (IQR). The outliers are the data values lying outside the interval (Q1 - 1.5*IQR, Q3 + 1.5*IQR).

Indeed, the critical point in determining outliers is deciding on the threshold for our task. We prefer the box plot method, which is the most widely used in the literature. If a variable cannot take negative values, the lower limit is generally ignored and only the upper limit matters. For example, for an age variable we already know the lower bound is 0, so it is enough to decide on the upper bound.

Catching Outliers

We use two data sets. One of them is relatively small while the other is big.

import pandas as pd

def load():
    data = pd.read_csv("path of csv")
    return data

df = load()

Graph Technique

We draw the box plot.

import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x=df["Age"])
plt.show()

It seems that there are outliers. We want to get these values programmatically, too.

# determining quartiles
q1 = df["Age"].quantile(0.25)  # Q1
q3 = df["Age"].quantile(0.75)  # Q3
iqr = q3 - q1                  # interquartile range
up = q3 + 1.5 * iqr            # upper limit
low = q1 - 1.5 * iqr           # lower limit

There are some values above the upper limit, but there is no value smaller than the lower limit.
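With the thresholds in hand, boolean indexing returns the offending rows. A minimal sketch on a toy frame (the real df comes from the CSV loaded above):

```python
import pandas as pd

# Toy stand-in for the data set loaded above.
df = pd.DataFrame({"Age": [22, 38, 26, 35, 80, 2, 27, 70]})

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
up = q3 + 1.5 * iqr
low = q1 - 1.5 * iqr

# Rows whose Age falls outside (low, up):
print(df[(df["Age"] > up) | (df["Age"] < low)])
```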

But what if we have lots of variables? It would be great to write a function doing all the processes functionally.

def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

outlier_thresholds(df, "Age")
-----------------------------
Out[28]: (-6.6875, 64.8125)

At this point, note that the quantiles may be changed. Sometimes the 5th and 95th percentiles are used instead. It depends on the problem at hand.

Are there any outliers?

def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > up_limit) |
                 (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False

check_outlier(df, "Age")
-------------------------
Out[30]: True

In this function, we use the outlier_thresholds function. Recall that the quantiles used may differ from data to data. So if we want to set them explicitly, we would add two parameters, q1 and q3, to the check_outlier function and pass them through.
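A sketch of that extension (same logic as the functions above, with q1/q3 passed through; the toy data is mine, for illustration only):

```python
import pandas as pd

def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    iqr = quartile3 - quartile1
    return quartile1 - 1.5 * iqr, quartile3 + 1.5 * iqr

def check_outlier(dataframe, col_name, q1=0.25, q3=0.75):
    # q1/q3 are forwarded so the caller controls the width of the band.
    low_limit, up_limit = outlier_thresholds(dataframe, col_name, q1, q3)
    return bool(((dataframe[col_name] > up_limit) |
                 (dataframe[col_name] < low_limit)).any())

df = pd.DataFrame({"Age": [22, 38, 26, 35, 80, 2, 27, 70]})
print(check_outlier(df, "Age"))                    # True with the quartiles
print(check_outlier(df, "Age", q1=0.05, q3=0.95))  # False with a wider band
```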

What if there are hundreds of variables? How can we test all of them? For this, we create another function that separates all the variables into lists according to their types.

def grab_col_names(dataframe, cat_th=10, car_th=20):
    """
    Gives the names of the categorical, numerical and cardinal variables.
    Note: There could be categorical variables that look like numerical
    variables.

    Parameters
    ------
    dataframe: dataframe
        the dataframe whose variable names will be taken
    cat_th: int, optional
        class-count threshold for variables that are numerical but categorical
    car_th: int, optional
        class-count threshold for variables that are categorical but cardinal

    Returns
    ------
    cat_cols: list
        list of categorical variables
    num_cols: list
        list of numerical variables
    cat_but_car: list
        list of cardinal variables

    Examples
    ------
        import seaborn as sns
        df = sns.load_dataset("iris")
        print(grab_col_names(df))

    Notes
    ------
        cat_cols + num_cols + cat_but_car = total number of variables
        num_but_cat is contained in cat_cols.
    """

    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if
                dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if
                   dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if
                   dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # num_cols
    num_cols = [col for col in dataframe.columns if
                dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f"cat_cols: {len(cat_cols)}")
    print(f"num_cols: {len(num_cols)}")
    print(f"cat_but_car: {len(cat_but_car)}")
    print(f"num_but_cat: {len(num_but_cat)}")
    return cat_cols, num_cols, cat_but_car

Let’s apply this function to the other data set:

def load_application_train():
    data = pd.read_csv("path of csv")
    return data

df = load_application_train()
df.head()
cat_cols, num_cols, cat_but_car = grab_col_names(df)
--------------------------------------------------
Observations: 307511
Variables: 122
cat_cols: 54
num_cols: 67
cat_but_car: 1
num_but_cat: 39

Now we can write loops that check all the variables.

for col in num_cols:
    print(col, check_outlier(df, col))

Grabbing Outliers

In the previous part, we classified the variable types and checked whether there are any outliers or not. Now we create a function that prints the outlying rows and, optionally, returns their indices.

def grab_outliers(dataframe, col_name, index=False):
    low, up = outlier_thresholds(dataframe, col_name)

    outliers = dataframe[(dataframe[col_name] < low) |
                         (dataframe[col_name] > up)]
    if outliers.shape[0] > 10:
        print(outliers.head())
    else:
        print(outliers)

    if index:
        return outliers.index

Solving Outlier Problem

Removing

One may choose to remove all rows containing outliers from the data.

def remove_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    df_without_outliers = dataframe[~((dataframe[col_name] < low_limit) |
                                      (dataframe[col_name] > up_limit))]
    return df_without_outliers

Here the problem is that even if only one cell in a row is an outlier, we remove the entire row. This may cost us a considerable part of the data, which is generally not preferred. Instead, we can re-assign the outliers in the following way.

Re-assignment with thresholds

def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[dataframe[variable] < low_limit, variable] = low_limit
    dataframe.loc[dataframe[variable] > up_limit, variable] = up_limit
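The same capping can also be expressed with pandas' built-in Series.clip, which is equivalent to the threshold re-assignment above (toy data of my own, for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 38, 26, 35, 80, 2, 27, 70]})

q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
low, up = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip caps every value outside [low, up] at the nearest limit.
df["Age"] = df["Age"].clip(lower=low, upper=up)
print(df["Age"].max() <= up)  # True: nothing exceeds the upper limit now
```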

Recap

df = load()
outlier_thresholds(df, "Age")         # determining the thresholds
check_outlier(df, "Age")              # checking for outliers
grab_outliers(df, "Age", index=True)  # listing the outliers

remove_outlier(df, "Age").shape       # removing the outlier rows
replace_with_thresholds(df, "Age")    # re-assigning with the thresholds
check_outlier(df, "Age")
