Data Science Case Study: Classification in IoT

Don’t let your health and youth go to waste. Also, that’s me squatting!

Introduction

We are going to study the Daily Sports and Activities data set from the UCI Machine Learning Repository. The goal of this work is to train a classifier to predict which activities users are engaging in based on sensor data collected from devices attached to all four limbs and the torso. This will be accomplished by engineering features from the sensor data and training machine learning classifiers on them.

We will be referencing the work done by machine learning researchers from these two articles:

Human Activity Classification

Recognizing Daily and Sports Activities

Science is Peer Reviewed

Check out the Jupyter Notebook for this work.


Motivation

The Internet of Things (IoT) is a growing space in tech that seeks to attach electronic monitors on cars, home appliances and, yes, even (especially) people. IoT wearables are becoming increasingly popular with users, companies, and cities. The promise of IoT is the smarter delivery of energy to the grid, smarter traffic control, real-time fitness feedback, and much more. Unsurprisingly, startups are seeking to capitalize on the promise of IoT.

The promise of a smarter city.

Fitbit has become synonymous with fitness wearables. It is popular with a diverse range of people, from the marathon runner keeping track of their heart rate to the casual user simply wanting to increase their number of daily steps.

Fitbit watch

Spire.io has the goal of using the biometric data collected from their wearable to track not just heart rate and duration of activities, but also the user’s breathing rate in order to increase mindfulness. Meditation has spread throughout western society in a big way. The physical and psychological health benefits of meditation continue to be demonstrated by neuroscience. Spire.io will surely be joined by other startups that seek to deliver technology to the growing number of users that are seeking greater preventive care of their bodies and minds.

Spire wearable

Comfy has leveraged IoT and machine learning to intelligently monitor and regulate workplace comfort. Their devices and analytics adjust the temperature of work spaces automatically and have been seen to reduce employee complaints and boost productivity.

Intelligent climate control

The rapidly growing popularity of wearables and other monitors demands that data scientists be able to analyze the signal data that these devices produce. With the requisite skills, data scientists can provide actionable insight for marketing and product teams as well as build data-driven products that will increase user engagement and make all of our lives a lot easier.


About the Data

8 users all participate in the same 19 activities. Each of the 5 devices (4 limbs and 1 torso) has 9 sensors (x,y,z accelerometers, x,y,z gyroscopes, and x,y,z magnetometers). The data is collected in 5-second segments sampled at 25 Hz, for a total of 5 minutes per activity per user.

The 19 activities are:

sitting (A1), 
standing (A2), 
lying on back and on right side (A3 and A4), 
ascending and descending stairs (A5 and A6), 
standing in an elevator still (A7) 
and moving around in an elevator (A8), 
walking in a parking lot (A9), 
walking on a treadmill with a speed of 4 km/h (in flat and 15 deg inclined positions) (A10 and A11), 
running on a treadmill with a speed of 8 km/h (A12), 
exercising on a stepper (A13), 
exercising on a cross trainer (A14), 
cycling on an exercise bike in horizontal and vertical positions (A15 and A16), 
rowing (A17), 
jumping (A18), 
and playing basketball (A19).

Data structure:

19 activities (a) (in the order given above)
8 users (p)
60 segments (s)
5 units on torso (T), right arm (RA), left arm (LA), right leg (RL), left leg (LL)
9 sensors on each unit (x,y,z accelerometers, x,y,z gyroscopes, x,y,z magnetometers)
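
To make this layout concrete, here is a minimal sketch of how a single segment could be read into an array. It assumes each segment is stored as comma-separated text with one row per time step (25 Hz x 5 s = 125 rows) and one column per sensor channel (5 units x 9 sensors = 45 columns). The column naming scheme and the helper names `column_names` and `load_segment` are illustrative assumptions, and the data below is a synthetic stand-in rather than a real file.

```python
import io
import numpy as np

UNITS = ["T", "RA", "LA", "RL", "LL"]   # torso, right/left arm, right/left leg
SENSORS = ["acc", "gyro", "mag"]        # accelerometer, gyroscope, magnetometer
AXES = ["x", "y", "z"]

def column_names():
    """Return 45 illustrative names such as 'T_xacc', ordered unit -> sensor -> axis."""
    return [f"{u}_{a}{s}" for u in UNITS for s in SENSORS for a in AXES]

def load_segment(source):
    """Load one 5-second segment and check it has 125 rows and 45 columns."""
    segment = np.loadtxt(source, delimiter=",")
    if segment.shape != (125, 45):
        raise ValueError(f"unexpected segment shape {segment.shape}")
    return segment

# Synthetic stand-in for a segment file (assumed layout, made-up values).
fake_file = io.StringIO("\n".join(",".join(["0.5"] * 45) for _ in range(125)))
segment = load_segment(fake_file)
names = column_names()
```

With real data, `source` would be the path to one segment file; the shape check is there to catch a mismatch with the assumed layout early.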


Data Inspection | Exploratory Data Analysis

Let’s dive right in!

For simplicity, let’s load a single segment and see what the data looks like for a person walking in a parking lot.

We can see from the Left Leg and Torso Acceleration plots that the person must be walking at a regular pace. This is evident from the fact that the spacing between the peaks is about constant. If someone were walking at an irregular pace (i.e. a slow-fast-slow progression) then we’d expect to see a change of frequency (more on frequency later).

The acceleration of the device in all three spatial dimensions is periodic, centered around a time invariant mean.

For the curious, the vertical dimension is the X direction and the Z direction points away from the device, parallel to the ground. For more information on the orientation of the dimensions and devices, refer to Recognizing Daily and Sports Activities.

The above pair plot shows the pairwise relationships between the X, Y, and Z dimensions of the person’s acceleration. The diagonal plots show that the signal distributions are approximately Gaussian. We can also see that the distributions are centered close to each other in the bottom triangle. The top triangle shows the relationship between each pair of dimensions as a scatter plot.

These observations are important. Since the signals are approximately normal, we can use this fact to our advantage during the feature engineering phase (more on that later).


Feature Engineering

Feature Engineering is clever transformations.

We are going to build on the successful research from both papers and adopt their approach to feature engineering.

We are going to append new features to each segment. The new features are the mean, variance, skewness, and kurtosis of each signal’s distribution (since the signals are approximately normal, as we saw earlier, these first four statistical moments summarize them well), the first ten values of the autocorrelation sequence, and the maximum five peaks of the discrete Fourier transform of a segment with the corresponding frequencies.

We’ll normalize each feature to the range [0,1], then flatten each 5 second segment into a single row with 1140 features. Such a large number of features will introduce the Curse of Dimensionality and reduce the performance of most classifiers. So we’ll reduce the dimensions by applying Principal Component Analysis (PCA).

Each flattened row will then be a single sample (row) in the resulting data matrix that the classifier will ultimately train and test on.

Steps

1. Get the 19 additional features for each of the original 45 features.
2. Normalize all features between [0,1]
3. Reduce dimensions of each segment
4. Stack the segments to build a data set for each person

1. Extract 19 additional features and transform the format of the dataset

Let’s examine the engineered features in turn.

Mean, Variance, Skewness, and Kurtosis

We saw that the distributions of the signals are approximately Normal. This means that the first four statistical moments give a compact summary of each 5 second segment. By including the four moments, we are helping our models better learn the characteristics of each unique activity.
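
As a rough sketch of this step, the four moments can be computed column by column with NumPy alone. The function name `four_moments` is our own, the kurtosis here is the excess kurtosis (so a Gaussian scores 0), and the segment is synthetic; none of this is taken from the original notebook.

```python
import numpy as np

def four_moments(segment):
    """Mean, variance, skewness and excess kurtosis of each column."""
    mean = segment.mean(axis=0)
    centered = segment - mean
    var = (centered ** 2).mean(axis=0)                    # population variance
    std = np.sqrt(var)
    skew = (centered ** 3).mean(axis=0) / std ** 3
    kurt = (centered ** 4).mean(axis=0) / std ** 4 - 3.0  # 0 for a Gaussian
    return np.vstack([mean, var, skew, kurt])

# Synthetic segment: 125 time steps by 45 sensor channels.
rng = np.random.default_rng(0)
segment = rng.normal(loc=1.0, scale=2.0, size=(125, 45))
moments = four_moments(segment)   # shape (4, 45): one row per moment
```

A perfectly constant channel has zero variance, so its skewness and kurtosis come out as NaN and would need special handling.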

Autocorrelation

In some time series tasks, such as in ARIMA, it is desirable to minimize autocorrelation so as to transform the series into a stationary state. We can see in the plot below that after two steps in the lag we have statistically insignificant autocorrelation in the series that we saw earlier.

For our purposes, we want to extract the first 10 points from the autocorrelation for each sample and treat each of those 10 points as a new feature. Why would we want to do this? The idea is that each physical activity will have a unique sequence of autocorrelation. So we want to capture this uniqueness to help our model learn the difference between activities.

Check out the next autocorrelation plot of a different person that is jumping. We can see that this activity has no statistically significant autocorrelation (aside from the perfect autocorrelation at a lag of zero). We see that the autocorrelation sequence for jumping differs from the one for walking.

This is the intuition and justification for creating new features using the first 10 points from the autocorrelation plot. Ultimately, the validity of this, or any engineered feature, will be determined by the performance of models.
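
One possible way to compute these ten values for a single signal is sketched below. It is an assumption on our part that the sequence starts at lag 0 (which is always 1 after normalization); the function name `autocorr_features` and the test signal are illustrative.

```python
import numpy as np

def autocorr_features(signal, n_lags=10):
    """Return the normalized autocorrelation at lags 0 .. n_lags-1."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                          # remove the mean first
    full = np.correlate(x, x, mode="full")    # lags -(N-1) .. (N-1)
    acf = full[x.size - 1:]                   # keep lags 0 .. N-1
    return acf[:n_lags] / acf[0]              # lag 0 becomes exactly 1

# Synthetic periodic signal: 5 seconds sampled at 25 Hz.
t = np.arange(125) / 25.0
walking_like = np.sin(2 * np.pi * 2.0 * t)
features = autocorr_features(walking_like)
```

Applied to each of the 45 channels of a segment, this yields ten new features per channel.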

Maximum five peaks of the Discrete Fourier Transform

The Fourier Transform maps a signal back and forth between the time and frequency space. A signal can be expressed as a linear combination of sinusoidal functions, sine and cosine.
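
For reference, one common convention for the continuous transform pair, with t for time and omega for angular frequency, is written out here. Where the factor of 1/(2 pi) goes differs between textbooks, so the normalization shown is an assumption.

```latex
F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt
\qquad
f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega)\, e^{i\omega t}\, d\omega
```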

The first equation transforms a signal from time space (t) to frequency space (omega). The second equation is the inverse transformation.

The equations show the continuous transformations. In practice, coding packages like Python’s SciPy will either calculate the discrete case or perform a numerical approximation on the continuous case.

The following image shows how a signal can be decomposed into its constituent sinusoidal curves, identifying the frequency of each curve and, finally, representing the original time series as a frequency series.

Pretty cool, huh?

Below we have plots of the Torso Acceleration in the Y Dim for the Walking series of a single person. The first plot shows what the time series signal looks like and the second plot shows what the corresponding frequency signal looks like.

For our purposes, we are going to extract the 5 maximum peaks and create features for each of those values in each of our samples. Why are we doing this? Think back to the Fourier Transform image above: the curves with the largest amplitudes are responsible for the macro-oscillations, while the numerous small-amplitude curves are responsible for the micro-oscillations. These macro-oscillations are responsible for the general shape of the curve. Each activity will have a different general shape for its signal. By capturing these influential frequencies, our machine learning models will be better able to distinguish between activities.
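
Here is a small sketch of that extraction using NumPy’s FFT routines. As a simplification, it takes the five largest-magnitude frequency bins rather than searching for true local maxima, and it removes the mean so that the zero-frequency term does not dominate. The function name `top_dft_peaks` and the test signal are our own.

```python
import numpy as np

def top_dft_peaks(signal, fs=25.0, n_peaks=5):
    """Return the n_peaks largest DFT magnitudes and their frequencies in Hz."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                 # drop the DC component
    magnitude = np.abs(np.fft.rfft(x))               # one-sided spectrum
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)      # bin frequencies in Hz
    order = np.argsort(magnitude)[::-1][:n_peaks]    # largest magnitudes first
    return magnitude[order], freqs[order]

# Synthetic signal: a strong 2 Hz component plus a weak 5 Hz component.
t = np.arange(125) / 25.0
signal = 3.0 * np.sin(2 * np.pi * 2.0 * t) + 0.5 * np.sin(2 * np.pi * 5.0 * t)
peaks, peak_freqs = top_dft_peaks(signal)
```

For this synthetic signal the two largest peaks land at 2 Hz and 5 Hz, which is the kind of dominant-frequency fingerprint we hope differs from one activity to another.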

2. Normalize all features

All features are rescaled between the values of zero and one.
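
A minimal sketch of min-max rescaling, column by column, might look like the following. The guard for constant columns and the name `min_max_scale` are our additions.

```python
import numpy as np

def min_max_scale(features):
    """Rescale every column to the range [0, 1]."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0      # constant columns map to 0 instead of NaN
    return (features - lo) / span

X = np.array([[1.0, 10.0, 7.0],
              [2.0, 20.0, 7.0],
              [4.0, 40.0, 7.0]])
X_scaled = min_max_scale(X)
```

When a separate test set is involved, the minimum and maximum would normally be taken from the training data only and then reused on the test data.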

3. Reduce dimensions of each segment

The top plot shows the explained variance of all 1140 features. We can see that explained variance rapidly drops to near zero.

The bottom plot shows that after the 40th dimension the explained variance hardly changes. The goal here is to reduce the number of dimensions while keeping as much of the explained variance as we can; it’s a balancing act. Both research papers show that they reduced the number of dimensions to 30 and received excellent results. So we’ll follow their work and reduce our data set’s features to 30 as well.

If we were to create and follow our own heuristic for determining how many features to keep, we might choose to eliminate all but the minimum number of features that explain 90% of the variance.

We are going to take the first 30 principal component vectors.
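
The sketch below shows one way to do this with a singular value decomposition in plain NumPy, along with the 90% heuristic mentioned above. It runs on synthetic data with 50 columns rather than the real 1140, and the helper names `pca_fit` and `n_components_for` are hypothetical.

```python
import numpy as np

def pca_fit(X):
    """Return the column means, principal axes (rows) and explained variance ratios."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return mean, vt, explained

def n_components_for(explained, threshold=0.90):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    return int(np.searchsorted(np.cumsum(explained), threshold) + 1)

# Synthetic data whose columns have steadily shrinking variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50)) * np.linspace(5.0, 0.1, 50)

mean, components, explained = pca_fit(X)
X_reduced = (X - mean) @ components[:30].T   # project onto the first 30 axes
k90 = n_components_for(explained)            # alternative: the 90% heuristic
```

A library implementation of PCA would serve equally well; the point is only that the projection keeps the directions with the most variance.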


Modeling and Predictions

Finally, on to the sexy part! We will explore 2 approaches to predicting the user’s activities.

Approach 1

We will create train and test sets that contain shuffled samples from each user. So the model will train on data from every user and predict the activities from every user in the test set.

Approach 2

We will include 7 users’ data as the training set and use the remaining user’s data as the test set. The goal here is to predict the activities of a user that the model has *never seen before.*

In each approach we will follow the same model building framework:

  1. Split data into train and holdout sets
  2. Optimize model hyper-parameters
  3. Cross validate the model’s performance by analyzing learning curves

Models

The machine learning models used in this analysis were Logistic Regression (LR), Support Vector Machines (SVM), and Random Forest (RF). For brevity, we’ll be focusing on the LR and SVM.


Modeling Approach 1

  1. Create train and test sets that contain shuffled samples from each user.
  2. Train model on data from every user and predict the activities from every user in the test set.
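
A shuffled split like this can be sketched with index arrays. The 25% test fraction, the seed, and the function name `shuffled_split` are assumptions made for illustration.

```python
import numpy as np

def shuffled_split(n_samples, test_fraction=0.25, seed=0):
    """Return shuffled, non-overlapping train and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

# 19 activities x 8 users x 60 segments = 9120 samples in total.
n_samples = 19 * 8 * 60
train_idx, test_idx = shuffled_split(n_samples)
```

Because the shuffle ignores who produced each segment, every user is very likely to appear on both sides of the split.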

Grid Search | Optimized Memory Management

A naive grid search implementation will read a copy of the dataset from disk into memory for each unique hyper-parameter combination, drastically increasing the time it takes to run a grid search.

An even more naive grid search implementation will only use a single core to train models sequentially. Keep in mind that fitting one model is a completely independent task from fitting other models. So this task is often referred to as Embarrassingly Parallel in the Data Engineering community.

The following grid search implementation uses the ipyparallel package to create a local cluster in order to run multiple simultaneous model fits, as many as there are cores available.

This grid search implementation also takes advantage of NumPy’s memory mapping capabilities. Instead of reading a copy of the dataset from disk each time a model is fitted, we will map a read-only version of the data to memory where every single core can reference it for fitting models.

The combination of parallelization and memory mapping greatly shortens the grid search process.
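
The pattern can be sketched as follows. This is not the ipyparallel implementation referred to above: a thread pool from the standard library stands in for the cluster, a toy ridge regression scored on its own training data stands in for the real classifiers, and the data is synthetic. What it does show is each fit opening the same file as a read-only memory map instead of loading its own copy.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def fit_one(params, x_path, y_path):
    """Fit one toy ridge model on memory-mapped data and return its training MSE."""
    X = np.load(x_path, mmap_mode="r")   # read-only view backed by the file
    y = np.load(y_path, mmap_mode="r")
    gram = X.T @ X + params["alpha"] * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, X.T @ y)
    mse = float(np.mean((X @ weights - y) ** 2))
    return params, mse

# Synthetic data written to disk once.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
y = X @ rng.normal(size=30) + 0.1 * rng.normal(size=500)
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "X.npy")
y_path = os.path.join(tmp_dir, "y.npy")
np.save(x_path, X)
np.save(y_path, y)

# Each hyper-parameter combination is an independent task.
grid = [{"alpha": a} for a in (0.01, 1.0, 100.0)]
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(lambda p: fit_one(p, x_path, y_path), grid))

best_params, best_mse = min(results, key=lambda r: r[1])
```

In a real search the score would come from a validation fold rather than the training data, and the workers would be separate processes or cluster engines.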


Cross Validation | Learning Curves

Before we dive into what the plots are telling us about our model, let’s make sure we understand how these plots were generated.

Generating the Learning Curves

A learning curve is plotted for each of the four metrics that we’ll be using to evaluate the performance of our models: accuracy, precision, recall, and the f1 score. Each point plotted on these graphs is a metric score that was generated by the following cross validation process.

First the data is split into a train and holdout set. The train set is further split into k folds and each fold is iteratively used as either part of the training set or as the validation set in order to train the model. Once the model is trained, it is used to predict values for the training and holdout sets. The blue curves represent the predictions made on the training set and the green curves represent the predictions made on the holdout set (which we also refer to here as the test set).
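
To illustrate how such points come about, here is a stripped-down sketch that trains on growing slices of the training set and scores accuracy on both the slice and the holdout set. It leaves out the k-fold step for brevity, uses a toy nearest-centroid classifier instead of LR or SVM, and runs on synthetic three-class data; all names are our own.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy classifier: remember the mean point of each class."""
    classes = np.unique(y)
    return classes, np.vstack([X[y == c].mean(axis=0) for c in classes])

def predict(model, X):
    """Assign each sample to the class with the nearest centroid."""
    classes, centroids = model
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def learning_curve_points(X_train, y_train, X_hold, y_hold, sizes):
    """Return (train size, train accuracy, holdout accuracy) for each size."""
    points = []
    for n in sizes:
        model = fit_centroids(X_train[:n], y_train[:n])
        train_acc = np.mean(predict(model, X_train[:n]) == y_train[:n])
        hold_acc = np.mean(predict(model, X_hold) == y_hold)
        points.append((n, float(train_acc), float(hold_acc)))
    return points

# Synthetic data: three well-separated classes in two dimensions.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=600)
X = rng.normal(size=(600, 2)) + 4.0 * y[:, None]
points = learning_curve_points(X[:450], y[:450], X[450:], y[450:],
                               sizes=[50, 150, 450])
```

Plotting the second and third entries of each tuple against the first gives the train and holdout curves for accuracy; the other metrics follow the same recipe.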

Analyzing the Learning Curves | Logistic Regression

Learning curves contain rich information about our model.

Let’s look at the accuracy learning curves. We can see that the test set score increases by about 5% when we increase the size of the training set from 1000 samples to 2000 samples. As we continue increasing the training set size, we see that the test accuracy doesn’t increase. This saturation of the test set accuracy represents the model’s Bias. The bias indicates that the model is not complex enough to learn from the data, so no matter how many training points it is trained on, it cannot increase its performance. This is also known as Underfitting.

The gap between the training and test curves indicates the amount of variance in the model’s predictions. Ideally, a model will have a very small gap between these two curves, indicating that the model can generalize well on unseen data. This is desirable because the alternative is a larger gap, which indicates that test scores are worse than training scores. This would indicate that the model is learning to only predict data that it has seen before instead of learning generalizable trends and patterns. This is known as Overfitting. We can see that Logistic Regression suffers from both Bias and Variance.

Lastly, we can see that all of the metrics for Logistic Regression never rise above 50%. If we were to randomly guess what class a sample belongs to, we’d be right about 5% of the time (since there are 19 activities). Although LR performs better than random, we want to do much better than 50% accuracy.


Analyzing the Learning Curves | SVM

The Support Vector Machine model performed substantially better than Logistic Regression. Take a look at the accuracy curve. It shows that the model was able to do a near perfect job at predicting the activity classification for the training set.

More importantly, the model is classifying activities from the test set at near 99% accuracy. The test curve shows that SVM’s performance increases as it is trained on larger datasets. The gap between the train and test curves may appear significant, but keep in mind that the difference between these two curves is about 0.01, or roughly one percentage point, which is a very small difference. We can conclude from these learning curves that SVM suffers from very small amounts of bias and variance. This is the type of performance that we desire in models that will be pushed into production.


Precision | Recall

So far we have been focusing on the accuracy metric, but what about precision and recall?

Precision tells us what percentage of classifications predicted to be positive are actually positive. For simplicity, let’s say we are dealing with a binary classification problem in which 100 samples are predicted to belong to the positive class. 90 out of 100 positive predictions actually belong to the positive class, in which case we label those predictions as True Positives (TP). On the other hand, 10 out of 100 positive predictions don’t actually belong to the positive class; they were negative samples incorrectly predicted to be positive, in which case we label those predictions as False Positives (FP).

Recall tells us how well the model can identify points that belong to the positive class. This may sound a lot like precision but it’s not. Recall compares TP with False Negatives (FN), whereas precision compares TP with FP. The distinction here is that for every sample that is falsely predicted to belong to the negative class, that is one less sample that the model can correctly identify as belonging to the positive class.

Recall is a measure of the failure in distinguishing between positive and negative classifications.

Precision is a measure of the failure to correctly predict positive classifications.

Lastly, the f1 score is the harmonic mean of precision and recall. The f1 score is used to get a measure of both types of failures.
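
As a worked example, take the 90 true positives and 10 false positives from above and, purely as an assumption for illustration, suppose the model also missed 30 positive samples (false negatives). The f1 score is computed here as the harmonic mean of precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and f1 from raw counts."""
    precision = tp / (tp + fp)      # share of positive predictions that are right
    recall = tp / (tp + fn)         # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 90 TP and 10 FP come from the example above; 30 FN is an assumed value.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Here precision is 0.90 and recall is 0.75, giving an f1 of roughly 0.82, which sits closer to the weaker of the two scores.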


Bringing it back to our case study, take a look at the precision curve for SVM. It is telling us that 99 out of 100 samples that are predicted to belong to the positive class do actually belong to the positive class. Now, because our data set has 19 classes, and not 2, the labels ‘positive’ and ‘negative’ class lose meaning. When more than 2 classifications are present, we can reinterpret the test set precision learning curve to mean 99 out of 100 classifications that are predicted to belong to a specific class do actually belong to that class.
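
One way to make the multi-class reading concrete is to compute precision separately for each class and then average. The sketch below uses a tiny made-up example with three classes; unweighted (macro) averaging is our assumption, since the averaging scheme behind the curves is not stated.

```python
import numpy as np

def per_class_precision(y_true, y_pred, n_classes):
    """Precision for each class; NaN for classes that were never predicted."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precision = np.full(n_classes, np.nan)
    for c in range(n_classes):
        predicted_c = y_pred == c
        if predicted_c.any():
            precision[c] = np.mean(y_true[predicted_c] == c)
    return precision

# Made-up labels for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0])
precision = per_class_precision(y_true, y_pred, n_classes=3)
macro_precision = float(np.nanmean(precision))
```

Each entry answers the question for one class: of the samples predicted to be this class, what fraction really were?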


Modeling Approach 2

  1. Create a training set comprised of 7 randomly chosen users and a test set comprised of the remaining user.
  2. Train model to predict which activities a previously unseen user is engaged in, not just for users that it has seen before.
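
A split by user can be sketched by masking on a user id per sample. The id layout (users numbered 1 to 8, each with 19 x 60 segments stacked in order) and the function name `leave_one_user_out` are assumptions for illustration.

```python
import numpy as np

def leave_one_user_out(user_ids, held_out_user):
    """Return train indices (all other users) and test indices (held-out user)."""
    user_ids = np.asarray(user_ids)
    test_mask = user_ids == held_out_user
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# One user id per sample: 19 activities x 60 segments for each of 8 users.
user_ids = np.repeat(np.arange(1, 9), 19 * 60)

rng = np.random.default_rng(0)
held_out = int(rng.choice(np.unique(user_ids)))   # randomly chosen test user
train_idx, test_idx = leave_one_user_out(user_ids, held_out)
```

Unlike the shuffled split, no segment from the held-out user can leak into training, so the test score reflects performance on a genuinely new person.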

Analyzing the Learning Curves | SVM

Whoa! What happened?

Remember that the training set contains 7 users and the test set contains the 8th user. The learning curves show a tremendous amount of overfitting. The training curves in blue represent the 7 users in the training set. The model can predict activities from users that it has seen already. However, the green curves tell us that the model is unable to generalize to new users.

These results are likely attributed to the feature engineering approach that we took. People are unique in how they walk, jump, walk up and down stairs, and so on. It is reasonable to conclude that we have succeeded in capturing the characteristic body movements from specific individuals but have fallen short of capturing a generalizable understanding of how these activities are performed in groups of people.

Depending on our purpose, we can arrive at the conclusion that we have succeeded or fallen short of our goals. If our goal is to build and dedicate a model for each individual, then we can conclude that this work is a smashing success!

On the other hand, if our goal is to build a model that learns what the walk signal or the jump signal looks like from any user, then we would have to admit that we have fallen short.


Conclusion

We have seen how an understanding of time series data and signal processing can lead to engineering features and building machine learning models that predict which activity users are engaged in with 99% accuracy.

Our approach proved successful in building a model that can predict activities from users that appear in both the training and test set. The model was able to learn which signals correspond to activities like walking or jumping for specific users. However, when users are limited to appearing in either the training or test set, we saw that the model is unable to acquire a generalized understanding of which signals correspond to specific activities, independent of the user.

This work can be directly applied to IoT startups like Fitbit and Spire. Both companies are collecting signal data from wearables. Classifying what type of activities their users are engaged in is valuable information that can be used to build data-products and drive marketing efforts.


About the Author

Alexander Barriga has an M.S. in Data Science from GalvanizeU (University of New Haven) and a B.A. in Physics from UC Berkeley. He currently works as a Data Science instructor at General Assembly in San Francisco.