# Feature Engineering Based on Sensory Data

---

When you collect sensory data, it is possible that you will face very different kinds of data. For example, you could have gyroscope readings but also some Twitter posts. It is not easy to combine this information, but it is possible to extract useful features from the dataset in order to maximize the predictive performance of the final model. Different options are available for this, and we will consider features that use the notion of time, both in the time domain and the frequency domain, as well as features for unstructured data.

# Time Domain

Imagine that we have a training set in the form of a time series. Suppose the target is to predict whether the person who generated the data in the example table from [1] is tired or not. You will quickly see that it is quite difficult to make this prediction based only on the data at one specific time point. For example, when the heart rate is at 120, the label is sometimes tired and sometimes not; the same holds for the activity level High. But if we consider a history of several time points, more informative clues can be derived that help us to predict the target: two consecutive heart rates of 80 or above could be a good predictor.

## Numerical Data

Let us consider the heart rate again. We can define a window size, which expresses the number of prior instances or discrete time points that we consider in order to extract more useful information. Of course, the right window size depends on the data and the type of measurement, and should be set based on domain knowledge or rigorous experimentation [2]. To summarise the data in a window, there are many possibilities: one could consider the mean, median, minimum, maximum, standard deviation, slope, or any other measure deemed appropriate [3, 4].
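As a sketch, such window summaries can be computed by sliding a window over the series. The heart-rate values and the window size of 3 below are made-up examples, not data from the book:

```python
import numpy as np

# Hypothetical heart-rate measurements at consecutive time points.
heart_rate = np.array([75, 80, 82, 120, 110, 90, 78, 85], dtype=float)
window = 3  # number of time points per window; set via domain knowledge

def window_features(series, window):
    """Summarise each full window of `window` consecutive points."""
    feats = []
    for t in range(window - 1, len(series)):
        w = series[t - window + 1 : t + 1]
        feats.append({
            "mean": w.mean(),
            "min": w.min(),
            "max": w.max(),
            "std": w.std(),
            # slope of a least-squares line fitted through the window
            "slope": np.polyfit(np.arange(window), w, 1)[0],
        })
    return feats

features = window_features(heart_rate, window)
print(features[0])  # summary of the first full window [75, 80, 82]
```

The first windowed instance appears at time point `window - 1`, so the derived features cover `len(series) - window + 1` instances.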

## Categorical Data

Looking at the previous example, the activity type might also be a very good predictor when considering previous values. For example, running twice in the last three time points results in being tired. However, this relationship is impossible to capture with the numerical summaries explained before. First of all, we need to identify which combinations of values are useful.

Following the approach proposed by [5], we focus on finding temporal patterns in the values of categorical attributes that occur sufficiently frequently. Temporal relationships inspired by Allen [6] are taken into account: we focus on values that occur in succession (one before the other, denoted b) or at the same time point, i.e. that co-occur (denoted c). Example temporal patterns from the previous dataset are "Activity level = high (c) Activity type = running" and "Activity type = inactive (b) Activity type = running". The co-occurrence relationship is most valuable when combined with the before relationship: for co-occurrences at the current time point, learning algorithms can already identify the predictive power of the two attributes together, whereas co-occurrences before the current time point add new information.

We do not select all the patterns present in the data, only those that have enough support, where the support of a pattern is how often it occurs compared to the number of time points in our dataset. A minimal support threshold θ is the basis for generating patterns. The notion of support comes with a nice property that helps us reduce the search space: we only need to consider k-patterns that extend the (k − 1)-patterns, and the only way to extend them is with 1-patterns that fulfil the minimum support threshold. This is because the support of a new pattern can never be greater than the support of the least supported subpattern it includes.
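A minimal sketch of this support-based pattern search, using a made-up activity sequence, a support threshold of 0.3, and a before-relationship restricted to directly succeeding time points for simplicity:

```python
from collections import Counter

# Hypothetical sequence of activity types at consecutive time points.
activities = ["inactive", "running", "running", "inactive", "running",
              "running", "inactive", "inactive", "running", "running"]
theta = 0.3  # minimum support threshold
n = len(activities)

# 1-patterns: single values whose support reaches theta.
counts1 = Counter(activities)
frequent1 = {v for v, c in counts1.items() if c / n >= theta}

# 2-patterns "a (b) b": a occurs directly before b. We only extend
# frequent 1-patterns, because a superpattern can never be more
# frequent than its least supported subpattern.
counts2 = Counter(
    (activities[t - 1], activities[t])
    for t in range(1, n)
    if activities[t - 1] in frequent1 and activities[t] in frequent1
)
frequent2 = {p: c / n for p, c in counts2.items() if c / n >= theta}
print(frequent2)
```

In this toy sequence, "inactive (b) running" and "running (b) running" survive the threshold, while "running (b) inactive" does not.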

## Mixed Data

It is also possible to derive categorical values from the numerical data, which can in turn be used by applying the above algorithm for categorical data. Two cases are possible: either ranges are known that can be used to identify categorical values (like low, normal, and high blood pressure), or there is only numerical data without an interpretation of what the values mean in the specific context (for example weight: it is difficult to say whether 90 kg is a healthy weight if you do not know how tall the person is).
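For the first case, the known ranges translate directly into bin edges; for the second, an uninformed fallback such as equal-frequency binning can be used. A sketch with made-up blood-pressure readings (the cut-offs 100 and 130 are purely illustrative):

```python
import numpy as np

# Hypothetical systolic blood pressure readings.
bp = np.array([95, 118, 132, 145, 110, 160])

# Case 1: ranges with a known interpretation map straight to labels.
labels = ["low", "normal", "high"]
known = [labels[i] for i in np.digitize(bp, bins=[100, 130])]

# Case 2: no interpretation available, so fall back to equal-frequency
# binning: each derived category covers roughly a third of the instances.
quantile_edges = np.quantile(bp, [1 / 3, 2 / 3])
unknown = [f"bin_{i}" for i in np.digitize(bp, bins=quantile_edges)]
print(known, unknown)
```

The resulting categorical series can then be fed into the temporal pattern mining described above.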

# Frequency Domain

What if we want to recognise if the user is running? While we could use a historic window of the accelerometer measurements and, for instance, the variance we observe there, it might be more natural to look at the periodicity in the measurements of the accelerometer. How can we capture this periodicity? Using a Fourier transformation.

## Fourier Transformations

The idea of a Fourier transformation [7] is that any sequence of measurements can be represented by a combination of sinusoid functions with different frequencies. A typical example is sound, where waves at different frequencies combine into the sound we hear. Consider f0 as the base frequency of our window of λ + 1 data points:

$$f_0 = \frac{2\pi}{\lambda + 1}$$

Multiples of this base frequency will be used, i.e. k · f0, where k is a natural number: the higher the value of k, the higher the frequency of the signal. To get from k to a frequency in Hertz we need to know how many data points represent a second (called Nsec):

$$\text{frequency}(k) = \frac{k \cdot N_{sec}}{\lambda + 1} \ \text{Hz}$$

since k is the number of periods of the sinusoid over our λ + 1 samples, while (λ + 1)/Nsec is the number of seconds they span. For each frequency we need to specify an amplitude, denoted as a(k). We use the frequencies {0 · f0, . . . , λ · f0}, i.e. λ + 1 frequencies starting at 0. The value of the sinusoid with frequency k · f0 at time point n is represented as:

$$e^{i \cdot k \cdot f_0 \cdot n}$$

which is a sinusoid function with the specified frequency k · f0. We can now express the value of our measurement at time point t in our window as:

$$x_t = \sum_{k=0}^{\lambda} a(k) \, e^{i \cdot k \cdot f_0 \cdot t}$$

Now we have fixed the frequencies of the sinusoid functions, but we do not yet have the amplitudes (a(0), . . . , a(λ)) that match the data in our window. The Fast Fourier Transform (FFT) computes them in an efficient way.
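As a sketch with NumPy's FFT, on a synthetic signal whose dominant frequency we know in advance (the sampling rate Nsec and the window size are arbitrary choices for illustration):

```python
import numpy as np

Nsec = 10      # data points per second (assumed sampling rate)
lam = 39       # the window holds lam + 1 = 40 samples
t = np.arange(lam + 1)

# Synthetic accelerometer-like signal: an offset plus a 2 Hz oscillation.
signal = 1.0 + 0.5 * np.cos(2 * np.pi * 2 * t / Nsec)

# The FFT yields one complex amplitude a(k) per frequency k * f0
# (dividing by the window length matches the reconstruction formula).
a = np.fft.fft(signal) / (lam + 1)

# Convert index k to Hertz: k periods over (lam + 1) / Nsec seconds.
freqs_hz = np.arange(lam + 1) * Nsec / (lam + 1)

# The strongest non-zero frequency should be the 2 Hz component
# (only the first half of the spectrum is unique for a real signal).
k_max = np.argmax(np.abs(a[1 : (lam + 1) // 2])) + 1
print(freqs_hz[k_max])
```

Here `a[0]` recovers the constant offset of the signal, and the peak amplitude sits at exactly 2 Hz, as constructed.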

Of course, the amplitudes are not the only features we can derive in the frequency domain. An amplitude is associated with each of the relevant frequencies in the selected time window, and it takes a unique value for each window we consider.

Depending on the setting for λ, this may result in a lot of very specific additional features. Luckily, several ways to aggregate them have been proposed.

The frequency with the highest amplitude gives an indication of the most important frequency in the window under consideration.

A second option is to compute the frequency weighted signal average. This metric provides information on the average frequency observed in the window (weighted by the amplitudes) and might shed a bit more light on the entire spectrum of frequencies.

The power spectral entropy can also be computed. The resulting value expresses how much information is contained within the signal: it indicates whether one or a few discrete frequencies stand out from all the others.
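The three aggregations above can be sketched as follows; the test signals are synthetic, and the exact entropy formulation (Shannon entropy of the normalised power spectrum) is one common choice among several:

```python
import numpy as np

def frequency_features(window_signal, Nsec):
    """Aggregate frequency-domain features for one time window (a sketch)."""
    n = len(window_signal)
    amp = np.abs(np.fft.fft(window_signal))[: n // 2]   # one-sided amplitudes
    freqs = np.fft.fftfreq(n, d=1.0 / Nsec)[: n // 2]   # frequencies in Hz

    # 1. Frequency with the highest amplitude (ignoring the 0 Hz offset).
    max_freq = freqs[1:][np.argmax(amp[1:])]

    # 2. Frequency weighted signal average.
    weighted_avg = np.sum(freqs * amp) / np.sum(amp)

    # 3. Power spectral entropy: normalise the power spectrum to a
    #    probability distribution and take its Shannon entropy.
    psd = amp ** 2
    p = psd / np.sum(psd)
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return max_freq, weighted_avg, entropy

t = np.arange(100)
pure = np.sin(2 * np.pi * 5 * t / 50)        # single 5 Hz tone at Nsec = 50
noisy = np.random.default_rng(0).normal(size=100)

f_pure = frequency_features(pure, Nsec=50)
f_noise = frequency_features(noisy, Nsec=50)
print(f_pure[0], f_pure[2] < f_noise[2])
```

A pure tone concentrates its power in a single bin, so its spectral entropy is far lower than that of noise spread over the whole spectrum.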

# Features for Unstructured Data

A lot of unstructured data is collected that can be used in machine learning approaches. Just think of all the text messages people send each other or all the Facebook posts we generate. We will focus on some simple approaches from natural language processing (NLP) that do not look at the semantics of the text.

## Pre-processing Text Data

In order to create attributes directly from words, or to apply other approaches to extract attributes, a number of basic pre-processing steps are needed:

1. Tokenization: identify sentences and words within sentences.
2. Lower case: change the uppercase letters to lowercase.
3. Stemming: identify the stem of each word to reduce words to their stem and map all different variations of, for example, verbs to a single term.
4. Stop word removal: remove known stop words as they are not likely to be predictive.
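The four steps above can be sketched in plain Python. The tiny stop-word list and the crude suffix-stripping stemmer are simplifications for illustration; in practice one would use a real stemmer (e.g. Porter's) and a full stop-word list:

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "i", "am", "is", "are", "were", "to", "and"}

def naive_stem(word):
    # Crude suffix stripping as a stand-in for a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    sentences = re.split(r"[.!?]+\s*", text.strip())                    # 1. sentences
    tokens = [re.findall(r"[a-zA-Z']+", s.lower()) for s in sentences]  # 1. words, 2. lower case
    return [
        [naive_stem(w) for w in sent if w not in STOP_WORDS]            # 3. stem, 4. stop words
        for sent in tokens if sent
    ]

print(preprocess("I am running to the park. The runs were tiring!"))
```

Note how crude the stemmer is: "running" becomes "runn" while "runs" becomes "run", which a proper stemmer would map to the same term.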

## Bag of Words

Now we are ready to define attributes for the simplest case, the so-called n-grams of words. Here n represents the number of consecutive words we consider as a single unit or attribute: a unigram considers single words, a bigram pairs of words, a trigram combinations of three words, and so on. We look at these combinations in each of our sentences. The approach is called bag of words because we just count the number of occurrences of words irrespective of their order of occurrence.

The value of an attribute is the number of occurrences of the n-gram in the text associated with the instance. Alternatively, we can replace the counts by binary values that only indicate the presence of the n-gram.
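A minimal bag-of-words sketch over an already-tokenised text (the tokens are made up):

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens, n=1, binary=False):
    counts = Counter(ngrams(tokens, n))
    if binary:
        return {g: 1 for g in counts}   # presence instead of frequency
    return dict(counts)

tokens = ["run", "fast", "run", "slow", "run", "fast"]
print(bag_of_words(tokens, n=1))
print(bag_of_words(tokens, n=2))
```

Each distinct n-gram becomes one attribute, with either its count or a 0/1 presence flag as the value.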

## TF-IDF

An alternative approach is to use the TF-IDF (for Term Frequency Inverse Document Frequency, see [8]) score as the value of an instance i for the n-grams we have identified. This takes into account how unique the n-gram is across the different pieces of text we see over all instances.
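One common TF-IDF formulation multiplies the term count in a document by the logarithm of the inverse document frequency; many weighting and smoothing variants exist. A sketch on made-up tokenised texts:

```python
import math
from collections import Counter

# Hypothetical tokenised texts, one per instance.
docs = [["run", "fast", "run"],
        ["walk", "slow"],
        ["run", "walk", "walk"]]

n_docs = len(docs)
# Document frequency: in how many texts each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

print(tf_idf(docs[0]))
```

Note that "fast" (appearing in a single text) ends up with a higher weight than "run", even though "run" occurs twice in the instance: uniqueness across instances counterbalances raw frequency.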

## Topic Modeling

The raw usage of n-grams results in a fine-grained and large set of attributes. An alternative is to use an algorithm that extracts more high-level topics from the set of texts available in our dataset. Topics are specified by a set of words with associated weights wji. A common way to find the topics is Latent Dirichlet Allocation (LDA), cf. [9]. I am not going to explain the method in detail here; the details can be found in the cited paper.

# Conclusion

Now, thanks to this chapter, you should know how to engineer new features from sensory data. If you want to read more, I definitely suggest [1], a very interesting and well-written book; the idea of this post comes from one of its chapters. If you want to practice, the book comes with exercises and Python code. Enjoy! More posts about the book are coming.

# References

[1] Hoogendoorn, M., Funk, B.: Machine Learning for the Quantified Self: On the Art of Learning from Sensory Data. Springer (2017)

[2] Gu, F., Kealy, A., Khoshelham, K., Shang, J.: User-independent motion state recognition using smartphone sensors. Sensors 15(12), 30636–30652 (2015)

[3] Lara, O.D., Labrador, M.A.: A survey on human activity recognition using wearable sensors. IEEE Commun. Surv. Tutor. 15(3), 1192–1209 (2013). doi:10.1109/SURV.2012.110112.00192

[4] Reyes-Ortiz, J.L., Oneto, L., Samà, A., Parra, X., Anguita, D.: Transition-aware human activity recognition using smartphones. Neurocomputing 171, 754–767 (2016). doi:10.1016/j.neucom.2015.07.085

[5] Batal, I., Valizadegan, H., Cooper, G.F., Hauskrecht, M.: A temporal pattern mining approach for classifying electronic health record data. ACM Trans. Intell. Syst. Technol. (TIST) 4(4), 63 (2013)

[6] Allen, J.F.: Maintaining knowledge about temporal intervals. Commun. ACM 26(11), 832–843 (1983)

[7] Bracewell, R.: The Fourier Transform and Its Applications (1965)

[8] Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)

[9] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

---

PhD Candidate in Artificial Intelligence @ Vrije Universiteit Amsterdam. https://www.alessandrozonta.ml/