Feature Engineering for Kaggle Competition

Guang X · Published in Analytics Vidhya · Jan 1, 2020 · 8 min read

0. Introduction

According to Wikipedia:

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive.

Feature engineering is an important step in most Kaggle competitions. This article discusses commonly used feature engineering techniques for four different feature types: numerical, categorical, temporal and spatial data. These techniques have helped me achieve the Kaggle Competition Expert title.

1. Feature types

Before we start feature engineering, we should understand the different feature types. In a typical Kaggle competition or data science project, we usually deal with the following types of features:

- Numeric (Interval or Ratio)

- Categorical (Ordinal or Nominal)

- Temporal (Date and Time)

- Spatial (Coordinates and location)

Numerical features measure the magnitude of a quantity, such as bank balance, temperature, age, height, weight, population, and so on.

There are two kinds of numeric features: interval and ratio. If a numeric value has a meaningful (non-arbitrary) zero point, it is a ratio value; otherwise it is an interval value. For example, the zero point of the Celsius temperature scale is arbitrary and not meaningful, so Celsius temperature is an interval value: we cannot say that 20°C is twice 10°C because the values are not ratios. Bank balance, on the other hand, has a meaningful zero point and is a ratio value, so we can say that $200 is double $100.

Categorical values are also called discrete variables. They describe the group or category a subject belongs to, such as sex, product type, or the name of a city; these are called nominal variables. In contrast, the other kind of categorical variable is the ordinal variable. Ordinal variables are ranked categorical variables, such as a preference score or a customer rating.

Different types of variables require different feature engineering techniques, which are introduced in the following sections.

More reading: https://en.m.wikipedia.org/wiki/Statistical_data_type

2. Numeric Features

The feature engineering approach to use depends on the machine learning model. For most tree-based models (such as regression trees, random forest, XGBoost, LightGBM), we hardly need any feature engineering on numeric values, because tree-based models make predictions based on comparisons between values rather than on the magnitude of the numeric values.

Non-tree-based models, such as linear regression, support vector machines, and neural networks, make predictions based on the magnitude of numerical values, which means these models are sensitive to the scale of the input values.

To solve this issue, scalers such as MinMaxScaler or StandardScaler from the Python scikit-learn package can be used to scale numeric variables into a standardised range.
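As a minimal sketch with scikit-learn, assuming a small hypothetical DataFrame of numeric columns, scaling could look like this:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric columns
df = pd.DataFrame({"age": [23, 35, 41, 58],
                   "balance": [100.0, 2500.0, 40.0, 13000.0]})

# MinMaxScaler maps each column into the [0, 1] range
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# StandardScaler centres each column to mean 0 and unit variance
df_standard = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
```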

Besides, outliers in numerical features can cause problems for non-tree-based models. A straightforward solution is to simply remove them from the dataset. Other solutions are winsorization and ranking: winsorization clips the numerical values at a lower and an upper bound, while ranking replaces numeric values with their relative rank in the dataset.
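A rough illustration of both ideas with pandas, using a made-up series that contains one obvious outlier:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 200])  # 200 is an outlier

# Winsorization: clip values to the 1st and 99th percentiles
lower, upper = values.quantile([0.01, 0.99])
winsorized = values.clip(lower, upper)

# Ranking: replace each value with its rank within the dataset
ranked = values.rank()
```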

Moreover, numeric features with skewed distributions can also cause problems for non-tree-based models. For example, some numerical variables, such as monetary values, are log-normally distributed. In these cases we can apply a log transformation to obtain an approximately normally distributed variable. Depending on the variable distribution, other transformation techniques, such as raising to the power of 0.5, can also be used.
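For instance, a short sketch on hypothetical price values, using NumPy's log1p (log of 1 + x, which also tolerates zeros) and a power-0.5 transformation:

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 25.0, 100.0, 10000.0])  # heavily right-skewed

log_prices = np.log1p(prices)   # log(1 + x) transformation
sqrt_prices = np.sqrt(prices)   # power-0.5 transformation
```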

Besides, many feature generation methods can be applied to numeric variables, mostly based on prior knowledge or data analysis. For example, given a dataset that describes prices of goods in a supermarket, we can extract the fractional part of each price (such as .99 from 4.99), which probably affects people's perception of prices.
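As an illustration on hypothetical prices, extracting the fractional part with pandas might look like:

```python
import pandas as pd

prices = pd.Series([4.99, 12.50, 3.00, 7.95])

# Fractional part of each price, e.g. 0.99 from 4.99
# (rounded to avoid floating-point noise)
fractional_part = (prices % 1).round(2)
```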

More reading: https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b

3. Categorical Features

Categorical features are among the most important features in data science projects. Because categorical variables are not numerical and cannot be processed directly by most machine learning models, we have to encode them and represent them in a numerical way.

There are many encoding methods for categorical features. The most well known is one-hot encoding. It creates N new columns for the N categories of a particular column, and the one or zero values in these newly created columns indicate whether the category is present in that row. However, in practice, one-hot encoding does not always work very well because it can cause a collinearity problem. Moreover, when the cardinality of a categorical column is very high, one-hot encoding generates a very sparse dataset, which is not good for machine learning models.

collinearity problem: https://www.algosome.com/articles/dummy-variable-trap-regression.html
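A minimal one-hot encoding sketch with pandas, using a hypothetical city column; drop_first drops one dummy column to sidestep the dummy-variable trap mentioned above:

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "London", "Tokyo"]})

# One column per category; drop_first=True removes one column
# to avoid the collinearity (dummy-variable trap) problem
one_hot = pd.get_dummies(df["city"], prefix="city", drop_first=True)
```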

One alternative approach is numerical (label) encoding, which simply numbers the categories in a column with positive integers. For example, we can encode the three categories A/B/C as 1/2/3. It does not increase the data size and works very well with tree-based models, and it is the recommended encoding method for the LightGBM model.

https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html
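A small sketch of numerical encoding using pandas category codes, on a hypothetical grade column:

```python
import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "C", "B", "A"]})

# Map each category to an integer code (0-based); add 1 for positive integers
df["grade_encoded"] = df["grade"].astype("category").cat.codes + 1
```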

However, numeric encoding does not work well with non-tree-based models. One solution is binary encoding. Binary encoding converts the positive integers from numeric encoding into binary form, a sequence of zeros and ones, and then creates several new columns to store that binary sequence. It works like a compressed version of one-hot encoding, producing a denser dataset and avoiding the collinearity problem.
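Libraries such as category_encoders offer a ready-made binary encoder; a minimal hand-rolled sketch of the same idea, on a hypothetical product column, could look like this:

```python
import pandas as pd

df = pd.DataFrame({"product": ["A", "B", "C", "D", "A"]})

# Integer-encode first (1-based), then spell each integer out in binary,
# one new column per bit
codes = (df["product"].astype("category").cat.codes + 1).to_numpy()
n_bits = int(codes.max()).bit_length()
for bit in range(n_bits):
    df[f"product_bin_{bit}"] = (codes >> bit) & 1
```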

Several feature generation methods are also applicable to categorical features. One is frequency encoding: it calculates the frequency of each label in a categorical variable and creates a new column holding the corresponding frequency. For example, if a categorical variable contains 5 labels A, 3 labels B and 2 labels C, then A/B/C can be encoded as 0.5/0.3/0.2, respectively. The logic behind this is that in some cases what matters is the popularity of a label, not the label itself.
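A quick frequency-encoding sketch with pandas, reproducing the 5/3/2 example above:

```python
import pandas as pd

df = pd.DataFrame({"label": list("AAAAABBBCC")})  # 5 A, 3 B, 2 C

# Relative frequency of each label, mapped back onto the rows
freq = df["label"].value_counts(normalize=True)
df["label_freq"] = df["label"].map(freq)  # A -> 0.5, B -> 0.3, C -> 0.2
```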

Another similar encoding method is mean encoding. It calculates the mean value of the prediction target for each label in a categorical variable, which can then be used as a new feature. Besides the mean, other aggregation statistics such as the median or mode can also be used for target encoding. One important thing to note is that the mean should be calculated from the training data only, after the training/validation split; otherwise the model can foresee the validation data, causing data leakage.
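A hedged sketch of mean (target) encoding with pandas, on hypothetical train/validation frames; the statistics are computed on the training data only, and unseen categories fall back to the global mean:

```python
import pandas as pd

train = pd.DataFrame({"city": ["A", "A", "B", "B", "C"],
                      "target": [1, 0, 1, 1, 0]})
valid = pd.DataFrame({"city": ["A", "B", "C", "D"]})

# Mean of the target per category, computed on training data only
target_mean = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

train["city_target_enc"] = train["city"].map(target_mean)
# Categories unseen in training fall back to the global mean
valid["city_target_enc"] = valid["city"].map(target_mean).fillna(global_mean)
```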

The last thing for categorical variables is feature crossing. Many non-tree-based models do not take interactions among input variables into account, which means we need to create features that represent those interactions ourselves. Usually we can simply concatenate two or more categorical variables into a single one. However, which features to cross largely depends on your understanding of the dataset and the problem.
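As a small illustration with hypothetical columns, a feature cross can be as simple as string concatenation:

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "London"],
                   "product": ["book", "book", "phone"]})

# Cross two categorical columns by concatenating their values
df["city_x_product"] = df["city"] + "_" + df["product"]
```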

More reading: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931

4. Temporal Features

Broadly speaking, a temporal feature is any feature that changes with time, such as a stock price or temperature, and the moving average is one of the most basic and popular methods to generate temporal features. This research area is called time series analysis, which is a big topic that will not be covered in this short article.
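For instance, a simple moving-average sketch with pandas, on a hypothetical daily price series:

```python
import pandas as pd

prices = pd.Series([10, 12, 11, 13, 15, 14],
                   index=pd.date_range("2020-01-01", periods=6, freq="D"))

# 3-day moving average as a simple temporal feature
moving_avg = prices.rolling(window=3).mean()
```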

In a narrower sense, a temporal feature is a timestamp in the dataset, such as a date-time, date, or time. A common way to process these timestamps is to extract temporal attributes from them, such as year, month, day, hour, minute and second. Attributes such as day of week, day of year and season can also help in some cases. If we know the country or region where the dataset comes from, we can flag each day as a holiday or not, which is especially helpful for some commercial case studies.
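A short sketch of extracting such attributes with pandas datetime accessors, on hypothetical timestamps:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2020-01-01 08:30:00", "2020-06-15 17:45:00"])})

df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek   # Monday = 0
df["dayofyear"] = df["timestamp"].dt.dayofyear
```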

Besides temporal attributes, differences between dates can also be important features. For example, the number of days from the current date to the next holiday can be an important feature for customer behaviour prediction, and the time elapsed since the last important event can be related to its effect on the study subject.
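As an illustration, days until the next holiday could be computed like this, with a made-up holiday list:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-12-20", "2020-12-23", "2020-12-26"]))
holidays = pd.to_datetime(["2020-12-25", "2021-01-01"])  # hypothetical holidays

# Days from each date to the next upcoming holiday
days_to_next_holiday = dates.apply(
    lambda d: min((h - d).days for h in holidays if h >= d))
```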

5. Spatial Data

Spatial data refers to any data that contains spatial information such as a map, a GPS location or trajectory, or even a satellite image. Again, this is another big topic called spatial information science that cannot be covered by this short article. However, there are still some important points to know when handling spatial data.

One important property of spatial data is that it describes locations on a sphere, so we should not apply simple equations from plane geometry. For example, when calculating the distance between two locations described by longitude and latitude, sqrt((lat1-lat2)² + (lon1-lon2)²) is simply incorrect. A better choice is the great-circle distance formula.

https://en.m.wikipedia.org/wiki/Great-circle_distance
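A hedged sketch of the great-circle distance (haversine formula) in NumPy; the coordinates below are only illustrative:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # Earth radius ~6371 km

# Approximate distance London -> Paris (roughly 344 km)
print(haversine_km(51.5074, -0.1278, 48.8566, 2.3522))
```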

Another property to take into consideration is spatial correlation, summarised by Tobler's first law of geography: "everything is related to everything else, but near things are more related than distant things." When predicting a subject with spatial information, we can usually use information about its neighbourhood as references or model training samples. For example, the price of a house can be closely related to the prices of nearby houses, and the distance between the house and its nearest train station can also be an important factor.

https://en.m.wikipedia.org/wiki/Tobler%27s_first_law_of_geography
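As one possible sketch, the distance to the nearest station can be computed with scikit-learn's BallTree and the haversine metric; the coordinates below are approximate and purely illustrative:

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical (lat, lon) coordinates in degrees
stations = np.array([[51.5308, -0.1238],    # e.g. a central station
                     [51.5154, -0.1755]])   # e.g. another station
houses = np.array([[51.5200, -0.1300],
                   [51.5100, -0.1900]])

# BallTree with the haversine metric expects coordinates in radians
tree = BallTree(np.radians(stations), metric="haversine")
dist_rad, _ = tree.query(np.radians(houses), k=1)
dist_km = dist_rad[:, 0] * 6371.0  # convert great-circle distance to km
```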

6. Missing values

The last thing to pay attention to is missing values. Tree-based models usually handle missing values very well as long as we keep 'nan' or 'null' values in the dataset. However, this does not work for non-tree-based models. One solution is to add an extra column indicating whether the value in a particular column is missing or not, although this increases the data size and may bring other problems.

Another approach is to fill null values with the mean or median of the corresponding column. We can even build another machine learning model to predict the missing values. However, these reconstructed values are probably inaccurate and may decrease model performance. When only a few records contain missing values, dropping those noisy records can be a better solution.
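A minimal sketch of both ideas with pandas, on a hypothetical income column: an indicator for missingness plus a median fill:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [35000.0, np.nan, 52000.0, np.nan, 61000.0]})

# Indicator column marking which rows were originally missing
df["income_missing"] = df["income"].isna().astype(int)

# Fill missing values with the column median
df["income"] = df["income"].fillna(df["income"].median())
```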

7. Wrap up

To summarise, this article has discussed feature engineering for four different feature types: numeric, categorical, temporal and spatial.

Ideally, numeric variables should be scaled towards a standard normal distribution for non-tree-based models, while tree-based models need little or no such processing.

Categorical variables are best numerically encoded for tree-based models and binary encoded for non-tree-based models.

Temporal and spatial variables are very special data types and require particular domain knowledge for feature processing.

As we have discussed, many feature engineering methods depend on the model you are using. Tree-based models usually require less feature engineering, which is part of why tree-based models (such as XGBoost and LightGBM) are so popular in Kaggle competitions.

Last but not least, the methods mentioned above are just commonly used feature engineering approaches. In a real machine learning project, feature engineering largely depends on the researcher's knowledge of the problem itself.
