
Week 4: feature engineering

A dataset to focus on data cleaning and feature prep!

Letícia Gerola
May 22, 2020

Fourth week of Pyrentena! I came across a great video lesson on feature prep and data cleaning: short, straightforward, and yet very well explained! It's available on YouTube. Inspired by what I learned from it, I decided this week's dataset would be focused on feature engineering. It was nice to see how all this data preparation could really impact the results of my model.

“Census Income” is the dataset of the day, with features such as gender, race, age, native country, etc. It's available at the UCI Machine Learning Repository, which already proposes a question for anyone who decides to work with it: predict whether income exceeds $50K/yr based on census data.

After importing pandas and numpy and reading the .csv file, the data looked like this:

Image 1: head()
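For reference, the loading step looked roughly like this; the file name is a placeholder, and the column names follow the UCI documentation and the features used later in this post:

import pandas as pd
import numpy as np

# Column names per the UCI documentation; 'adult.csv' is a placeholder file name
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
           'income']
df = pd.read_csv('adult.csv', names=columns, skipinitialspace=True)
df.head()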

Lots of columns, and I needed them all to be numeric, so my first step was to convert my target variable income to a 0/1 binary variable and split the data into two dataframes.

# Transform the column into 0 or 1
# Assign outcome as 0 if income <=50K and as 1 if income >50K
df['income'] = [0 if x == '<=50K' else 1 for x in df['income']]

# Split into 2 dataframes: one with all features (X), one with the target variable (y)
X = df.drop('income', axis=1)
y = df.income

Basic Data Cleaning

Time to transform all categorical features into numerical ones (so my model can learn from them). To accomplish that, I used pandas' get_dummies method to turn these categories into binary 0s and 1s. I applied it to the ‘education’ feature to test it out:

# Use get_dummies in pandas to transform these categories into 0 & 1
print(pd.get_dummies(X['education']).head())

# output
   10th  11th  12th  1st-4th  5th-6th  7th-8th  9th  ?  Assoc-acdm  Assoc-voc  \
0     0     0     0        0        0        0    0  0           0          0
1     0     0     0        0        0        0    0  0           0          0
2     0     0     0        0        0        0    0  0           0          0
3     0     1     0        0        0        0    0  0           0          0
4     0     0     0        0        0        0    0  0           0          0

   Bachelors  Doctorate  HS-grad  Masters  Preschool  Prof-school  \
0          1          0        0        0          0            0
1          1          0        0        0          0            0
2          0          0        1        0          0            0
3          0          0        0        0          0            0
4          1          0        0        0          0            0

   Some-college
0             0
1             0
2             0
3             0
4             0

It worked! But I still had lots of categorical features; let's take a look at how many:

# Use a for loop to go through the columns and print the ones with type == 'object'
for col_name in X.columns:
    if X[col_name].dtypes == 'object':
        unique_cat = len(X[col_name].unique())
        print(f"Feature '{col_name}' has {unique_cat} unique categories")

# output
Feature 'workclass' has 8 unique categories
Feature 'education' has 17 unique categories
Feature 'marital_status' has 7 unique categories
Feature 'occupation' has 15 unique categories
Feature 'relationship' has 6 unique categories
Feature 'race' has 6 unique categories
Feature 'sex' has 3 unique categories
Feature 'native_country' has 40 unique categories

Yes, that was a lot. It didn't seem smart to call get_dummies by hand for every categorical column I had, so I learned from the YouTube video how to build a function that loops through them and transforms the values properly and much faster.

# Create a list of features to 'dummy' so you don't have to repeat the same thing for every variable
todummy_list = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']

# Function to dummy all the categorical variables used for modeling, using a for loop
def dummy_df(df, todummy_list):
    for x in todummy_list:
        dummies = pd.get_dummies(df[x], prefix=x, dummy_na=False)
        df = df.drop(x, axis=1)
        df = pd.concat([df, dummies], axis=1)
    return df
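The call itself isn't shown in the snippet above, but presumably the function is then applied to X like this:

# Replace each categorical column in X with its dummy columns
X = dummy_df(X, todummy_list)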

Cool! With all of our features as numeric types, it was time to handle the missing values.

# How much of the data is missing?
X.isnull().sum().sort_values(ascending=False).head()

# output
fnlwgt                 107
education_num           57
age                     48
education_Doctorate      0
education_7th-8th        0
dtype: int64

Again, quite a bit of missing data. I decided not to drop those rows, since they could have some impact on the model. My next strategy was to replace the missing values with either the mean or the median. To do that, I imported SimpleImputer from scikit-learn:

# Imputing with the median
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(X)
X = pd.DataFrame(data=imp.transform(X), columns=X.columns)

I don't have a recipe for when to go either way; that has been my biggest struggle so far: now that I know all of these techniques, when do I use what?! I read, research, and experiment to see what works best, so I learn as I go.
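One rough heuristic I've come across (not from the video, so take it as a suggestion rather than a rule): the median is the safer choice when a feature is heavily skewed or full of outliers, while the mean is fine for roughly symmetric distributions. A quick check could look like this:

# Hypothetical check: strongly skewed columns are usually better imputed with the median
for col in ['age', 'fnlwgt', 'education_num']:
    print(f'{col}: skew = {X[col].skew():.2f}')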

Data Exploration

Now that there aren't any missing values and everything is numeric, it's time to explore our data and check for outliers that could distort our prediction model. I learned in the YouTube class to use the Tukey IQR method to detect them: it flags values that fall more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile (the values describe() reports at 25% and 75%). Let's see how it works:

# Function to find outliers using Tukey IQR
def find_outliers_tukey(x):
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    outlier_values = list(x[outlier_indices])
    return outlier_indices, outlier_values

I am in love with this function, so glad I learned it. There are other ways to detect outliers, such as kernel density estimation, but this is the one I understood best, so I'm going with it. Now let's call it and see what it tells us:

tukey_indices, tukey_values = find_outliers_tukey(X['age'])
print(np.sort(tukey_values))

# output
[76. 76. 76. 76. 76. 76. 76. 76. 76. 77. 77. 77. 77. 77. 78. 78. 79. 79. 79. 80. 80. 80. 81. 81. 81. 81. 82. 88. 90. 90. 90. 90. 90. 90. 90.]

According to Tukey, our outliers for ‘age’ are people between 76 and 90 years old. We can apply this to other columns and decide whether to drop those values, drop only some of them, or keep them. In my case, I decided to keep them.
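If I had decided to drop them instead, a minimal sketch (not done in this post) would be:

# Hypothetical: remove the rows flagged as outliers from both features and target
X_no_outliers = X.drop(index=tukey_indices)
y_no_outliers = y.drop(index=tukey_indices)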

Feature Engineering

This, for me, is the most abstract part; I still have to study to really understand how it works. We basically have two options when it comes to feature engineering: increase dimensionality (add features, hopefully making the model more precise) or decrease dimensionality (remove features to reduce noise and overfitting). If you are familiar with the dataset, adding new features isn't such a random thing, and sometimes you can even do it by hand. For instance, you have the total amount of something and you want the average of that something to be a column in your data, so your model can use the average as a feature.
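As a concrete (hypothetical) example with this dataset: capital_gain and capital_loss could be combined by hand into a single net figure for the model to use:

# Hypothetical hand-made feature: net capital movement per person
X['net_capital'] = X['capital_gain'] - X['capital_loss']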

In my case, I had no idea what features I could create; I only knew I had to increase the dimensionality, since I only had 15 features. Luckily, there are functions to help us do that! Polynomial features was the technique used in the YouTube class, and it's a really cool method: it creates two-way interactions for all features. Let's see how to do it:

# Use PolynomialFeatures in sklearn.preprocessing to create two-way interactions for all features
from itertools import combinations
from sklearn.preprocessing import PolynomialFeatures

# Function to add interactions
def add_interactions(df):
    # Get feature names
    combos = list(combinations(list(df.columns), 2))
    colnames = list(df.columns) + ['_'.join(x) for x in combos]

    # Find interactions
    poly = PolynomialFeatures(interaction_only=True, include_bias=False)
    df = poly.fit_transform(df)
    df = pd.DataFrame(df)
    df.columns = colnames

    # Remove interaction terms where all values are 0
    noint_indices = [i for i, x in enumerate(list((df == 0).all())) if x]
    df = df.drop(df.columns[noint_indices], axis=1)

    return df

It seems a little abstract now, but once you call the function and print out the results, it's easier to make sense of the relationships:

X = add_interactions(X)
print(X.head())

# output
   race_White_native_country_United-States  sex_Female_native_country_Other  \
0                                       1.0                              0.0
1                                       1.0                              0.0
2                                       1.0                              0.0
3                                       0.0                              0.0
4                                       0.0                              1.0

   sex_Female_native_country_United-States  sex_Male_native_country_Other  \
0                                       0.0                            0.0
1                                       0.0                            0.0
2                                       0.0                            0.0
3                                       0.0                            0.0
4                                       0.0                            0.0

I printed out only a small part of the output, because it generates quite a few interactions, but it's enough for us to see: race_White combined with native_country_United-States, sex_Female combined with native_country_Other (and loads of others). By identifying these, the function establishes relationships that the model can take into consideration when predicting values.

Feature Selection and Model Building

Finally! It took us a while to get here, but I learned so much about how to clean and prepare my data. Plus, you're going to see the positive effect on our model!

# Use train_test_split in sklearn.model_selection to split the data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=1)

With the large set of features we generated with polynomial features, now it's time to do some feature selection so we don't cause overfitting (or slow down my computer). To do that, scikit-learn has a feature selector we can call to help out:

from sklearn.feature_selection import SelectKBest

select = SelectKBest(k=15)
selected_features = select.fit(X_train, y_train)
indices_selected = selected_features.get_support(indices=True)
colnames_selected = [X.columns[i] for i in indices_selected]

X_train_selected = X_train[colnames_selected]
X_test_selected = X_test[colnames_selected]

The k parameter in SelectKBest(k=15) defines the number of features we want to keep. There is no rule for exactly what this number should be, so I tried different values of k and evaluated the results. With k at 20, the score was around 0.71, not bad. But with k at 10, it jumped to 0.85! It became clear that the best range for k was between 10 and 20, so I settled on 15.
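That trial and error over k can be scripted; here's a minimal sketch of the idea (this exact loop isn't in the class, and the k values are just examples):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical sweep: fit SelectKBest for a few values of k and compare AUC
for k in [10, 15, 20]:
    sel = SelectKBest(k=k).fit(X_train, y_train)
    cols = [X.columns[i] for i in sel.get_support(indices=True)]
    model = LogisticRegression(max_iter=1000).fit(X_train[cols], y_train)
    y_hat = model.predict_proba(X_test[cols])[:, 1]
    print(f'k={k}: AUC = {roc_auc_score(y_test, y_hat):.3f}')

Let's see how it worked with k=15 and take a look at the selected features!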

# List of selected features
print(colnames_selected)

# output
['marital_status_Married-civ-spouse', 'relationship_Husband', 'age_education_num', 'age_marital_status_Married-civ-spouse', 'age_relationship_Husband', 'education_num_marital_status_Married-civ-spouse', 'education_num_relationship_Husband', 'hours_per_week_marital_status_Married-civ-spouse', 'hours_per_week_relationship_Husband', 'marital_status_Married-civ-spouse_relationship_Husband', 'marital_status_Married-civ-spouse_race_White', 'marital_status_Married-civ-spouse_sex_Male', 'marital_status_Married-civ-spouse_native_country_United-States ', 'relationship_Husband_sex_Male', 'relationship_Husband_native_country_United-States ']

Nicely done. Time to fit our model with a Logistic Regression algorithm and measure its performance.

Measuring Performance

To measure the performance, my YouTube class chose the Area Under the Receiver Operating Characteristic curve (ROC AUC) to evaluate the model. I had never heard of it before, so I did some research and found out when it is useful:

When to use AUC:

  • when it comes to a classification problem
  • when we need to check or visualize the performance of a classification problem (including multi-class ones)

Seems like exactly what I need: classifying incomes as >50K or <=50K is a classification problem (a binary one, in this case), so ROC AUC fits well. Since I've been into software engineering best practices lately, I built a function that I can reuse in other projects:

# Function to build the model and measure its performance
# using Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def find_model_perf(X_train, y_train, X_test, y_test):
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_hat = [x[1] for x in model.predict_proba(X_test)]
    auc = roc_auc_score(y_test, y_hat)
    return auc

Calling the function to see the result:

# Find the performance of the model using the preprocessed data
auc_processed = find_model_perf(X_train_selected, y_train, X_test_selected, y_test)
print(auc_processed)

# output
0.8821631097523177

0.88! I was so happy with this result. Since the class was focused on feature engineering, we also learned how to compare this result (with processed and cleaned data) against the result using unprocessed data. Let's compare:

Build model using unprocessed data for comparison

We started by dropping the rows with missing values and removing the non-numeric columns, so the model doesn't raise an error. After doing that, we had the following dataset:

# Take a look again at what the unprocessed feature set looks like
print(X_unprocessed.head())

# output
    age    fnlwgt  education_num  capital_gain  capital_loss  hours_per_week
0  39.0   77516.0           13.0          2174             0              40
1  50.0   83311.0           13.0             0             0              13
2  38.0  215646.0            9.0             0             0              40
4  28.0  338409.0           13.0             0             0              40
5  37.0  284582.0           14.0             0             0              40

Age, fnlwgt, education, capital gain, capital loss, hours per week: those were the numeric features the model was going to use.
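The construction of X_unprocessed and y_unprocessed isn't shown above; a likely sketch, assuming we start again from the original dataframe, is:

# Drop rows with missing values and keep only the numeric columns from the raw data
df_unprocessed = df.dropna()
y_unprocessed = df_unprocessed['income']
X_unprocessed = df_unprocessed.drop('income', axis=1).select_dtypes(include=[np.number])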

# Split unprocessed data into train and test sets
# Build model and assess performance
X_train_unprocessed, X_test_unprocessed, y_train, y_test = train_test_split(
    X_unprocessed, y_unprocessed, train_size=0.70, random_state=1)

auc_unprocessed = find_model_perf(X_train_unprocessed, y_train, X_test_unprocessed, y_test)
print(auc_unprocessed)

# output
0.6119711042311662

0.61, quite a drop. Let's organize these results in a nice print so we have them on record:

# Compare model performance
print(f'AUC of model with data preprocessing: {auc_processed}')
print(f'AUC of model without data preprocessing: {auc_unprocessed}')

per_improve = ((auc_processed - auc_unprocessed) / auc_unprocessed) * 100
print(f'Model improvement from preprocessing: {per_improve}%')

The output:

AUC of model with data preprocessing: 0.8821631097523177
AUC of model without data preprocessing: 0.6119711042311662
Model improvement from preprocessing: 44.15110511804314%

By doing data cleaning, feature prep, feature engineering, and a bit of hyperparameter tuning, we improved our model by more than 44%! More work, better results! This is the kind of thing that sets winners apart in Kaggle competitions. And it's also great practice!

You can check out April Chen's full class on feature engineering in this YouTube video. The notebook with the complete code is available on my GitHub.


Letícia Gerola

Data scientist and journalist. Author of the data science blog ‘Joguei os Dados’.