A brief introduction to Feature Extraction

One useful tool you should try when extracting features from time series

Jonte Dancker
6 min read · Nov 8, 2022
Photo by Jake Hills on Unsplash

Are you using your raw time series data for your supervised or unsupervised learning models, and is the data set too large to be managed efficiently? Or are your learning models rather slow?

Then you might want to have a closer look into feature extraction. In this article I will give you a brief introduction to the aim, benefits, and challenges of feature extraction. Lastly, I will show you how you can easily derive features from time series data using the tsfresh package in Python.

The aim of feature extraction is to derive features from the original data set that are informative, relevant, and not redundant. With this, feature extraction is related to dimensionality reduction. For example, we can extract the trend, seasonality or kurtosis of a time series or transform raw data into numerical values.

Compared to using the original data set directly, feature extraction has several benefits. Using features instead of the original data improves data handling and speeds up our machine learning methods, as we work with a smaller data set. Moreover, we often get better results, as we reduce the chance of over-fitting and enable a more efficient model construction.

However, selecting and combining features is time-consuming and tedious. We need to identify possible features, extract them, and then keep only the ones we need. The last step is very important, as every feature we use should help to improve our model.

To do so, we need a deep understanding of the data set. Hence, feature extraction is an iterative task that combines exploratory data analysis and feature engineering. It can also be helpful to have domain expertise, as this lets us identify relevant features more easily. Lastly, we need a validation strategy to check how effective each feature is in improving our model. Such a strategy should be more sophisticated and robust than just plotting and comparing the results. For example, we could build a baseline model and then add one feature after the other, checking each time whether the new model improves on the baseline. If so, we keep the feature; otherwise, we remove it.
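A minimal sketch of such a validation loop could look as follows. Here I assume a scikit-learn style regression model and a cross-validated score; the model choice and the lists of baseline and candidate features are placeholders, not part of the original example.

import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def forward_select(X: pd.DataFrame, y, baseline_features, candidate_features):
    # Score the baseline model first, then add one candidate feature at a time
    # and keep it only if the cross-validated score improves.
    selected = list(baseline_features)
    best_score = cross_val_score(Ridge(), X[selected], y, cv=5).mean()
    for feature in candidate_features:
        trial = selected + [feature]
        score = cross_val_score(Ridge(), X[trial], y, cv=5).mean()
        if score > best_score:
            selected, best_score = trial, score
    return selected, best_score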

Extracting features of time series

Before diving into the feature extraction of time series, I briefly want to point out some time series features you should keep in mind. Besides the often used features, such as trend, seasonality, kurtosis, and skewness, they are a good starting point for detecting patterns.

Date time features: These can be derived from the timestamp of each observation in the time series. For example, we can extract the day, hour, weekday, holiday, etc. These features can be easily extracted using Python’s pandas datetime functionality.
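As a small, hypothetical example, such calendar features can be derived from a datetime index with pandas; the column name P and the values are made up:

import pandas as pd

# Made-up 15 min load values with a datetime index
df = pd.DataFrame(
    {"P": [0.52, 0.48, 0.50]},
    index=pd.to_datetime(["2020-05-25 00:00", "2020-05-25 00:15", "2020-05-25 00:30"]),
)

# Derive calendar features from the timestamps
df["hour"] = df.index.hour
df["weekday"] = df.index.weekday          # Monday = 0
df["month"] = df.index.month
df["is_weekend"] = df.index.weekday >= 5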

Lag and window features: Sometimes values are influenced by values at prior time steps. Hence, by detecting such patterns we can improve our models. Although it can be difficult to determine such features, we can use the autocorrelation to find the right lag between values that influence each other.
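A minimal sketch of lag and rolling-window features with pandas could look like this; the column name P and the lag of 96 steps (one day at a 15-minute resolution) are assumptions for illustration:

import numpy as np
import pandas as pd

# Made-up load profile at 15 min resolution
index = pd.date_range("2020-05-25", periods=4 * 24 * 7, freq="15min")
df = pd.DataFrame({"P": np.random.rand(len(index))}, index=index)

# Lag and rolling-window features
df["P_lag_1"] = df["P"].shift(1)                          # previous time step
df["P_lag_96"] = df["P"].shift(96)                        # same time one day earlier
df["P_roll_mean_4"] = df["P"].rolling(window=4).mean()    # one-hour rolling mean

# The autocorrelation helps to find lags at which values influence each other
print(df["P"].autocorr(lag=96))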

We can extract the features using functions from pandas or numpy. Or, we can make our life a bit easier and automate the feature extraction by using tsfresh.

Automating feature extraction

tsfresh is a Python package that automatically calculates several hundred time series characteristics/features. These include simple features, such as the minimum or median, but also more complex ones, such as correlation. The package also provides methods that evaluate the relevance/importance of each feature. We can install tsfresh by running pip install tsfresh.

I will show you how to use tsfresh based on an electricity load profile time series. My DataFrame contains the active power value P for each day at a specific time. As you can see, the electricity load profile has a 15-minute resolution and covers the days between late May 2020 and late June 2021.

DataFrame of electricity load profiles in 15 min resolution.
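To make the expected data format explicit, here is a made-up DataFrame in the same long format: one row per observation, with the day in the date column, the time of day in the time column, and the active power in P. The values are placeholders; the real data covers roughly a year at 15-minute resolution.

import pandas as pd

timeseries = pd.DataFrame({
    "date": ["2020-05-25"] * 3 + ["2020-05-26"] * 3,
    "time": ["00:00", "00:15", "00:30"] * 2,
    "P": [0.52, 0.48, 0.50, 0.61, 0.58, 0.57],
})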

To extract features from our time series, we need to run the extract_features() function and pass the time series data as a DataFrame. Moreover, we need to tell tsfresh how our data is formatted.

As we can have multiple time series in the DataFrame, we need to indicate which entity each observation belongs to by passing the column name to the column_id parameter. For example, in my case the entities are the days, as I want to extract the features of the daily electricity load profiles. If I wanted to extract features that describe the load at each time step, the entities would be the time steps.

As we want to extract the features of a time series, it is important that the values are sorted by time. To ensure that the time series are sorted, we pass the column indicating the time steps to the column_sort parameter.

The function returns a DataFrame that contains all extracted features for each entity.

from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute

extracted_features = extract_features(
    timeseries, column_id="date", column_sort="time", impute_function=impute
)

As not all features can always be calculated, some features will contain NaN values. We can handle them by passing the impute_function parameter, as in the snippet above, or by running:

from tsfresh.utilities.dataframe_functions import impute

impute(extracted_features)

If you do not want to extract all possible features, because you have already identified which features are relevant or you want to fine-tune your feature extraction, you can pass a dictionary with the relevant features to the default_fc_parameters parameter. For example, for my electricity load profile time series I only wanted to extract features such as the median, mean, minimum, maximum, and kurtosis.

fc_parameters = {
    'sum_values': None,
    'median': None,
    'mean': None,
    'standard_deviation': None,
    'variation_coefficient': None,
    'variance': None,
    'skewness': None,
    'kurtosis': None,
    'last_location_of_maximum': None,
    'first_location_of_maximum': None,
    'last_location_of_minimum': None,
    'first_location_of_minimum': None,
    'maximum': None,
    'minimum': None,
    'number_peaks': [{'n': 1}, {'n': 3}, {'n': 5}, {'n': 10}],
}

extracted_features = extract_features(
    timeseries,
    column_id="date",
    column_sort="time",
    impute_function=impute,
    default_fc_parameters=fc_parameters,
)

To see which features you can choose from, you can have a look at tsfresh‘s ComprehensiveFCParameters by running:

from tsfresh.feature_extraction import ComprehensiveFCParameters

fc_parameters = ComprehensiveFCParameters()

Automating feature selection

After we have automatically extracted features, we need to decide which ones are relevant. Here, tsfresh can also help us if we have a target vector. For this, tsfresh evaluates the significance of each feature for predicting the target. For example, in my case a target vector could be the cluster labels obtained from clustering the daily electricity load profiles.
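As a rough sketch, such a target vector could be obtained by clustering the daily load profiles, for instance with scikit-learn's KMeans; the pivoted column names and the number of clusters are assumptions, not part of the original example.

import pandas as pd
from sklearn.cluster import KMeans

# One row per day, one column per time step
daily_profiles = timeseries.pivot(index="date", columns="time", values="P")

# Cluster the daily profiles and use the labels as target vector
# (n_clusters=3 is an assumption and requires at least three days of data)
target = pd.Series(
    KMeans(n_clusters=3, n_init=10).fit_predict(daily_profiles),
    index=daily_profiles.index,
)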

We can filter the features by using tsfresh‘s select_features() function and passing the target vector as a pandas Series or numpy array:

from tsfresh import select_features

features_filtered = select_features(extracted_features, target)

Instead of extracting and filtering the features in two steps, tsfresh also allows us to do the feature extraction and selection in a single step. For this, we just use the extract_relevant_features function:

from tsfresh import extract_relevant_features

extract_relevant_features(timeseries, y, column_id='date', column_sort='time')

Conclusion

Feature extraction can help you to improve your machine learning models, as we reduce the size of the data set by using informative and non-redundant features. However, feature extraction, and feature selection in particular, can be very tedious and time-consuming, as we need to identify possible features and decide whether they are relevant or not. Thus, in this article I have shown you a Python package that can help you find relevant features of your time series.

If you liked the article and/or have any comment I am happy to hear it.
