What is Feature Engineering?
Feature engineering is the process of creating new features from the existing features in a data set. The new features are often more relevant to the prediction task than the original ones, and can therefore help a machine learning model achieve better results.
Sometimes the new features are created by applying simple arithmetic operations, such as calculating ratios or sums of the original features. In other cases, specific domain knowledge about the data set is required to come up with informative features.
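As a minimal sketch of the arithmetic case, the snippet below builds one ratio feature and one sum feature with pandas. The data is a made-up toy table, and the column names (rooms, bedrooms, first_floor_sqft, second_floor_sqft, bedroom_ratio, total_sqft) are hypothetical names chosen for illustration only.

```python
import pandas as pd

# Toy data with made-up values; the column names are hypothetical
# and do not come from any real data set.
df = pd.DataFrame({
    "rooms": [6, 8, 4],
    "bedrooms": [3, 2, 1],
    "first_floor_sqft": [800, 1200, 600],
    "second_floor_sqft": [400, 0, 300],
})

# Ratio feature: the share of rooms that are bedrooms.
df["bedroom_ratio"] = df["bedrooms"] / df["rooms"]

# Sum feature: total floor area across both floors.
df["total_sqft"] = df["first_floor_sqft"] + df["second_floor_sqft"]

print(df[["bedroom_ratio", "total_sqft"]])
```

Neither new column adds information that was not already in the table, but each one expresses a relationship between raw columns directly, which some models may find easier to use than the raw columns alone.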
Feature Engineering Example
To demonstrate feature engineering, we will use the California housing data set available in scikit-learn. The objective is to predict the median house value of a given district in California from features of that district, such as the median income or the average number of rooms per household.
First, we fetch the data set:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names
In order to explore the data set, let’s merge the features and the labels into one DataFrame:
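import numpy as np
import pandas as pd
# Append the target as an extra column to the right of the feature matrix
mat = np.column_stack((X, y))
df = pd.DataFrame(mat, columns=np.append(feature_names, 'MedianValue'))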
df.head()