Feature Engineering

Srikannan Balakrishnan · Published in Analytics Vidhya · Jul 20, 2020

According to a survey reported in Forbes, data scientists spend 80% of their time on data preparation. This shows how important feature engineering is in data science. Here are some valuable quotes about feature engineering and its importance:

Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering — Prof. Andrew Ng.

The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering — Luca Massaron

What is Feature Engineering?

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in an improved model accuracy on unseen data.

Basically, all machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with certain characteristics to work properly.

Having and engineering good features will allow us to most accurately represent the underlying structure of the data and therefore create the best model. Features can be engineered by decomposing or splitting features, from external data sources, or aggregating or combining features to create new features.

Goal of Feature Engineering

Feature engineering can be classified into two use cases:

To get the best prediction accuracy: modelling for prediction accuracy is the default when the goal is a production system.

To decode and explain inherent properties: when the model should be easy to interpret, so that one can gain better knowledge of the problem.

Feature Engineering is an Art

The data is different for every problem, so there is no fixed recipe. How can we decompose or aggregate raw data to better describe the underlying problem?

Tabular data is described in terms of observations or instances (rows) that are made up of variables or attributes (columns). An attribute could be a feature.

The idea of a feature, separate from an attribute, makes more sense in the context of a problem. A feature is an attribute that is useful or meaningful to your problem. It is an important part of an observation for learning about the structure of the problem that is being modelled.

In computer vision, an image is an observation, but a feature could be a line in the image. In natural language processing, a document or a tweet could be an observation, and a phrase or word count could be a feature. In speech recognition, an utterance could be an observation, but a feature might be a single word.

Let’s see some examples of Feature Engineering on Numerical, Categorical & Text columns.

Decompose a Date-Time

A date-time contains a lot of information that can be difficult for a model to take advantage of in its native form (e.g. 2014-09-20 20:45:40).

For example, we may suspect that there is a relationship between the time of day and other attributes.

We could create a new numerical feature called Hour of Day that might help a regression model.

We could create a new ordinal feature called Part of Day with four values (Morning, Midday, Afternoon, Night), using whatever hour boundaries you think are relevant. This might be useful for a decision tree.

We can use similar approaches to pick out time of week relationships, time of month relationships and various structures of seasonality across a year.
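As a rough sketch, here is how such a decomposition could look with pandas; the DataFrame, column names and hour boundaries are illustrative assumptions, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical sample data: a single timestamp column
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2014-09-20 20:45:40", "2014-09-21 08:10:05", "2014-09-21 13:30:00"
])})

# Numerical feature: hour of day
df["hour_of_day"] = df["timestamp"].dt.hour

# Ordinal feature: part of day, with illustrative hour boundaries
df["part_of_day"] = pd.cut(
    df["hour_of_day"],
    bins=[0, 6, 12, 18, 24],
    labels=["Night", "Morning", "Midday", "Afternoon"],
    right=False,   # intervals [0,6), [6,12), [12,18), [18,24)
)

# Other temporal structure: day of week, month, quarter (rough seasonality)
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month
df["quarter"] = df["timestamp"].dt.quarter
print(df)
```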

Percentages

Often when dealing with continuous numeric attributes like proportions or percentages, we may not need the raw values at their full precision. Hence it often makes sense to round these high-precision percentages into integers. These integers can then be used directly as raw values or even as categorical (discrete, class-based) features. Let's try applying this concept to a dummy dataset depicting store items and their popularity percentages.

[Figures: Items & Popularity Percentage; After Transformation of Percentage Scales]

Based on the above outputs, we can see that two forms of rounding were applied. The features now depict the item popularities both on a scale of 1-10 and on a scale of 1-100. You can use these values either as numerical or as categorical features, depending on the scenario and problem.
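A minimal sketch of this rounding idea with pandas and NumPy, assuming a made-up items table (the item names and percentages are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical store items with high-precision popularity percentages
items = pd.DataFrame({
    "item": ["apple", "bread", "milk", "soap"],
    "popularity_pct": [17.8923, 54.2201, 89.4712, 3.0419],
})

# Round to an integer on the 1-100 scale
items["popularity_scale_100"] = np.round(items["popularity_pct"]).astype(int)

# Round to a coarser ~10-point scale (can also be treated as a categorical feature)
items["popularity_scale_10"] = np.round(items["popularity_pct"] / 10).astype(int)

print(items)
```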

Decompose Categorical Attributes

Imagine we have a categorical attribute, like “Item Color” that can be Red, Blue or Unknown.

Unknown may be special, but to a model it looks like just another color choice. It might be beneficial to expose this information more explicitly.

You could create a new binary feature called “Has Color” and assign it a value of “1” when an item has a color and “0” when the color is unknown.

Going a step further, you could create a binary feature for each value that Item Color has. This would be three binary attributes: Is Red, Is Blue and Is Unknown.

These additional features could be used instead of the Item Color feature (if you wanted to try a simpler linear model) or in addition to it (if you wanted to get more out of something like a decision tree).

Here we can see how a City field would be converted into features using the one-hot encoding method.
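Since the City example refers to an image, here is a comparable sketch using the Item Color attribute discussed above; pd.get_dummies does the one-hot encoding, and the column names are illustrative:

```python
import pandas as pd

# Hypothetical items with a color that may be unknown
df = pd.DataFrame({"item_color": ["Red", "Blue", "Unknown", "Red"]})

# Binary flag: does the item have a known color?
df["has_color"] = (df["item_color"] != "Unknown").astype(int)

# One binary column per category (Is Red, Is Blue, Is Unknown)
one_hot = pd.get_dummies(df["item_color"], prefix="is").astype(int)
df = pd.concat([df, one_hot], axis=1)
print(df)
```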

Binning

A problem with raw, continuous numeric features is that the distribution of their values is often skewed: some values occur quite frequently while others are quite rare. Besides this, the range of values can vary widely within a feature. For instance, view counts of some music videos could be abnormally large while others are small.

Binning, also known as quantization, transforms continuous numeric features into discrete ones (categories). These discrete values can be thought of as the categories or bins into which the raw, continuous numeric values are grouped. Each bin represents a specific degree of intensity, and hence a specific range of continuous numeric values falls into it.

[Figures: Example of Age Binning; Histogram after Binning]
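A small sketch of binning with pandas, using invented ages; both fixed-width bins (pd.cut) and quantile bins (pd.qcut) are shown, and the bin boundaries and labels are arbitrary choices:

```python
import pandas as pd

# Hypothetical ages to bin into discrete groups
ages = pd.DataFrame({"age": [3, 17, 25, 41, 58, 72, 89]})

# Fixed-width bins with readable labels
ages["age_group"] = pd.cut(
    ages["age"],
    bins=[0, 12, 18, 35, 60, 120],
    labels=["child", "teen", "young_adult", "adult", "senior"],
)

# Quantile-based bins (equal-sized groups) are an alternative for skewed data
ages["age_quartile"] = pd.qcut(ages["age"], q=4, labels=False)
print(ages)
```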

Grouping

In most datasets, every instance is represented by a row, where every column holds a different feature of the instance. This kind of data is called "tidy". Datasets such as transactions rarely fit this definition, because an instance spreads across multiple rows. In such cases we group the data by instance, so that every instance is represented by only one row. The key point of group-by operations is choosing the aggregation function for each feature (see the sketch after the list below).

► For numerical features, average and sum functions are usually convenient options, whereas for categorical features it is more complicated.

► Numerical columns are grouped using sum and mean in most cases; which is preferable depends on the meaning of the feature. For example, if you want ratio columns, you can take the average of binary columns; in the same example, the sum gives the total count instead.

► For non-numerical features, the first option is to select the label with the highest frequency.
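The sketch referenced above: a hypothetical transaction table aggregated to one row per customer with pandas groupby, using mean/sum for numeric columns and the most frequent label for a categorical one (all names and values are invented):

```python
import pandas as pd

# Hypothetical transaction-level data: several rows per customer
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "amount":      [20.0, 35.5, 12.0, 99.9, 15.0, 7.5],
    "is_online":   [1, 0, 1, 1, 1, 0],
    "category":    ["food", "food", "toys", "tech", "tech", "food"],
})

# One row per customer: sum/mean for numeric, most frequent label for categorical
per_customer = tx.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    online_ratio=("is_online", "mean"),   # mean of a binary column gives a ratio
    online_count=("is_online", "sum"),    # sum gives the total count
    top_category=("category", lambda s: s.mode().iloc[0]),
)
print(per_customer)
```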

Reframe Numerical Quantities

Our data is very likely to contain quantities, which can be reframed to better expose relevant structures. This may be a transform into a new unit or the decomposition of a rate into time and amount components.

For example, we may have Item Weight in grams, with a value like 6289. We could create a new feature with this quantity in kilograms as 6.289, or rounded kilograms like 6. If the domain is shipping data, perhaps kilograms is a sufficient, or even more useful (less noisy), precision for Item Weight.

The Item Weight could be split into two features: Item Weight Kilograms and Item Weight Remainder Grams, with example values of 6 and 289 respectively.

There may be domain knowledge that items with a weight above 4 kg incur a higher taxation rate. That magic domain number could be used to create a new binary feature Item_Above_4kg, with a value of "1" for our example of 6289 grams.

We may also have a quantity stored as a rate or an aggregate quantity for an interval. For example, Number of Customer Purchases aggregated over a year.

In this case, we may want to go back to the data collection step and create new features in addition to this aggregate, to expose more temporal structure in the purchases, such as seasonality. For example, the following new features could be created: Purchases_Summer, Purchases_Fall, Purchases_Winter and Purchases_Spring.

[Figures: Sample Sales & Sale Dates; After Reframing Quantities]
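A brief sketch of these reframing ideas with pandas, using the 6289-gram example; the column names and the 4 kg threshold follow the text above, everything else is illustrative:

```python
import pandas as pd

# Hypothetical item weights recorded in grams
df = pd.DataFrame({"item_weight_g": [6289, 3120, 15500]})

# New unit: kilograms (exact and rounded)
df["item_weight_kg"] = df["item_weight_g"] / 1000
df["item_weight_kg_rounded"] = df["item_weight_g"] // 1000

# Decompose into whole kilograms and the remainder in grams
df["weight_kg_part"] = df["item_weight_g"] // 1000
df["weight_g_remainder"] = df["item_weight_g"] % 1000

# Domain-driven binary flag, e.g. items above 4 kg incur a higher tax rate
df["item_above_4kg"] = (df["item_weight_g"] > 4000).astype(int)
print(df)
```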

Feature Selection

Curse of Dimensionality: the problem of having too many features. More features make the model more expressive, but not all features are relevant. The higher the dimensionality, the higher the chance of spurious features.

Feature selection is one way to deal with high dimensionality. It lets us select the features that have the greatest impact on our target variable. For example, if our business problem is to predict orders, then we should select only the relevant variables/fields that carry useful signal for that prediction, instead of using all of them.

Notable approaches for Feature Selection:

► Supervised approaches

► Unsupervised approaches

► Regularisation

Features are allocated scores and can then be ranked by their scores. Those features with the highest scores can be selected.

Feature importance scores can also provide you with information that you can use to extract or construct new features, similar but different to those that have been estimated to be useful.

A feature may be important if it is highly correlated with the dependent variable (the thing being predicted).
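As a simple illustration of scoring features by their correlation with the target, here is a pandas sketch on an invented orders dataset (all column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical dataset: rank features by absolute correlation with the target
df = pd.DataFrame({
    "orders":      [10, 14, 9, 20, 25, 30],
    "ad_spend":    [1.0, 1.5, 0.8, 2.5, 3.1, 3.9],
    "site_visits": [100, 120, 95, 180, 210, 260],
    "day_of_week": [1, 2, 3, 4, 5, 6],
})

target = "orders"
scores = df.drop(columns=target).corrwith(df[target]).abs().sort_values(ascending=False)
print(scores)  # highest-scoring features are candidates to keep
```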

Automated Feature Selection

Supervised approaches:

Filter approach: compute some measure that estimates each feature's ability to discriminate between classes. Typically, a feature weight is measured and the best n features are selected → supervised ranked feature selection.

[Figure: Features with high scores]
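A minimal filter-approach sketch with scikit-learn: SelectKBest scores each feature against the class labels (here with an ANOVA F-test on the Iris data, chosen only for illustration) and keeps the best n:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Filter approach: score each feature against the class labels, keep the best n
X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 best features
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # per-feature discrimination scores
print(selector.get_support())  # boolean mask of the selected features
```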

Wrapper approach: search through the space of possible feature subsets; each candidate subset is tried out with a learning algorithm.

Wrapper approach, general algorithm:

  1. Select an initial feature subset
  2. Try the subset with a learner
  3. Modify the feature subset
  4. Rerun the learner
  5. Measure the difference
  6. Go to step 2

Advantages: considers combinations of features; can ignore redundant or irrelevant features.

Disadvantages: computationally intensive. There are two basic strategies for (i) initial subset selection and (ii) modification of the subset: forward selection and backward elimination.
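A hand-rolled sketch of the wrapper idea as greedy forward selection, using scikit-learn cross-validation as the "measure the difference" step; the learner, dataset and stopping rule are illustrative choices, not the only options:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Wrapper approach (greedy forward selection): start from an empty subset and
# repeatedly add the feature that most improves cross-validated accuracy.
X, y = load_iris(return_X_y=True)
learner = LogisticRegression(max_iter=1000)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0

while remaining:
    trial_scores = {
        f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    best_f, score = max(trial_scores.items(), key=lambda kv: kv[1])
    if score <= best_score:   # stop when no feature improves the score
        break
    selected.append(best_f)
    remaining.remove(best_f)
    best_score = score

print("selected feature indices:", selected, "cv accuracy:", round(best_score, 3))
```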

Unsupervised approach: unsupervised ranked feature selection uses a scoring function to rank the features according to their importance, and then simply keeps the top 5% (or 10%, 25%, …). For textual data, for example, the score can be the frequency of words within a reference corpus.
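A tiny sketch of unsupervised ranking for text: count word frequencies over a made-up reference corpus and keep only the top fraction as features:

```python
from collections import Counter

# Unsupervised ranking for textual data: score words by frequency in a
# (hypothetical) reference corpus and keep only the top fraction as features.
corpus = [
    "feature engineering improves model accuracy",
    "good features improve accuracy more than the algorithm",
    "feature selection removes irrelevant features",
]

counts = Counter(word for doc in corpus for word in doc.lower().split())
ranked = [w for w, _ in counts.most_common()]

top_fraction = 0.25   # e.g. keep the top 25% of words
keep = ranked[: max(1, int(len(ranked) * top_fraction))]
print(keep)
```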


Feature Engineering for Text Mining

Bigram features: when working with single words as features, the sequence information is often lost, but this could potentially be a source of information → introduce a new feature as a combination of two adjacent words.

N-grams: bigrams can be extended to more than two words → n-grams. They can also be extended to allow gaps between words (skip n-grams).
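A short sketch of bigram features with scikit-learn's CountVectorizer; ngram_range=(1, 2) produces unigrams plus bigrams, and the example documents are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bigram features: pairs of adjacent words preserve some sequence information
docs = [
    "feature engineering is an art",
    "applied machine learning is feature engineering",
]

vectorizer = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```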

Instead of processing whole texts, we can split them into single words and look for the ones with the most occurrences.

For example, we may have access to a database of some Human Resources department. One of the fields there may be the academic title. We can find entries like Bachelor of Engineering, Master of Science and Doctor of Philosophy, and many more. What we can extract from these are the words bachelor, master and doctor, without the specific field. Together with "no title", this gives, say, a four-level categorical feature for education level.

A similar example is a full name with a title. In such a field, we can find phrases like Mr. XXXX, Mrs. XXXX, and Miss YYYY. We can extract the titles Mr., Mrs. or Miss, which indicate gender and marital status. As discussed, there are plenty of ways to use textual data without using the full power of NLP, which can be very computationally expensive.
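A rough sketch of these extractions with pandas string methods; the field names, sample values and regular expressions are assumptions for illustration, not a general-purpose parser:

```python
import pandas as pd

# Hypothetical HR-style fields: extract coarse categories from free text
people = pd.DataFrame({
    "full_name": ["Mr. John Smith", "Mrs. Jane Doe", "Miss Ann Lee"],
    "academic_title": ["Bachelor of Engineering", "Master of Science", None],
})

# Honorific (Mr./Mrs./Miss) hints at gender and marital status
people["honorific"] = people["full_name"].str.extract(
    r"^(Mr\.|Mrs\.|Miss)", expand=False
)

# Education level: keep only the degree word, drop the specific field
people["education_level"] = (
    people["academic_title"]
    .str.extract(r"^(Bachelor|Master|Doctor)", expand=False)
    .fillna("no title")
)
print(people)
```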

References:

Here are some generally relevant and interesting resources:

· https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

· Feature Engineering and Selection (PDF), CS 294: Practical Machine Learning, Berkeley

· Feature Engineering (PDF), Knowledge Discovery and Data Mining 1, by Roman Kern, Knowledge Technologies Institute

· Feature Engineering Studio, Course Lecture Slides and Materials, Columbia

· Feature Engineering (PDF), Leon Bottou, Princeton
