Integrate the Fitbit stream with Machine Learning to predict user activity
What’s Streamr Marketplace?
Streamr Marketplace is a global data exchange platform offering easy integration APIs. In the Marketplace, any application can easily set up a stream to exchange or monetize data such as weather conditions, traffic status, or personal fitness data.
Their explainer video is here:
How to use the data on Streamr Marketplace
Our previous blog post introduced how to publish personal fitness data (including daily steps, heart rate, calories burned, etc.) on the Streamr Marketplace. While building an app to aggregate data was useful, the real question is what the data can be used for. This blog showcases how one could leverage Machine Learning (ML) models to predict what kind of activities data producers (users) were doing at various times of the day. Specifically, we first subscribed to the published fitness data from the Marketplace. Then we used our trained ML model to predict the activity users were taking part in. Using the model we can predict in real time whether a user was walking, working or playing a sport (in this case we trained the model to detect basketball patterns).
How to preprocess the recorded data?
Preprocessing data is a key step in ML modeling. Correct preprocessing can speed up the convergence of machine learning models and improve the final model accuracy. Preprocessing involves several steps, including data filtering, outlier handling and data normalization. The data measured by the device mainly includes time, number of steps and heart rate. Because the time is recorded as hour + minute + second, a format that is not convenient for modeling, the timestamps are first converted to a uniform hourly format.
If the number of minutes is greater than 30, the hour is incremented by one; if it is less than 30, the minutes are simply dropped and the hour is left unchanged.
In other words, the timestamp is rounded to the nearest hour. The recorded data contained no outliers or missing values; in general, an outlier would be removed, and a missing value would be filled with 0 or with the mean.
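As an illustrative sketch (the helper name, the midnight wrap-around, and the treatment of exactly 30 minutes are assumptions, not taken from the original code), the rounding rule could look like this:

def round_to_hour(hour, minute):
    # Minutes over 30 round the hour up; otherwise the minutes are dropped.
    # The % 24 handles the wrap-around at midnight (an assumption).
    return (hour + 1) % 24 if minute > 30 else hour

round_to_hour(14, 45)  # -> 15
round_to_hour(14, 10)  # -> 14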
Normalizing the data is the second step. Many learning algorithms (for example an SVM, support-vector machine, with an RBF kernel, or a linear model with l1 or l2 regularization) implicitly assume in their objective functions that all features have zero mean and variances of the same order of magnitude. If the variance of one feature is orders of magnitude larger than that of the others, it will dominate the learning algorithm, giving certain features too much weight and skewing the final results.
After the data is normalized, the optimization becomes smoother and it is easier to converge to the optimal solution. StandardScaler and MinMaxScaler are the standard methods for normalizing data. StandardScaler first centers the data by subtracting the mean and then divides the result by the standard deviation, so that the transformed data conforms to a standard normal distribution with mean 0 and standard deviation 1. MinMaxScaler scales each feature linearly to a given interval based on its minimum and maximum value (or scales the maximum absolute value of each feature to unit size), normalizing the data to [0, 1]. Since the data used in this article is positive and does not follow a normal distribution, MinMaxScaler normalization was adopted.
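With scikit-learn, this scaling looks roughly like the following sketch (the column values are made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Each row: [hour, steps, heart rate]
X = np.array([[8, 1200, 75],
              [13, 4500, 110],
              [22, 300, 62]], dtype=float)

scaler = MinMaxScaler()            # rescales every feature to the range [0, 1]
X_scaled = scaler.fit_transform(X)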
Among the many machine learning algorithms, the SVM performs notably well and has a solid theoretical foundation; it has been applied successfully in fields such as handwritten digit recognition and face recognition. The biggest difference between the SVM and other learning methods is that the SVM minimizes structural risk rather than the empirical risk minimized by traditional algorithms, so it can trade off model complexity against generalization ability. Moreover, in SVM-based training and prediction of a classification model, the classification result depends only on a few support vectors, which gives it a certain capacity to handle high-dimensional samples.
Therefore, this experiment used the SVM algorithm, which is well suited to small-sample classification problems. The essence of the SVM algorithm is to find a separating hyperplane in the solution space that yields the largest margin between the sample classes. The figure below is a schematic diagram of the SVM algorithm.
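For reference, the standard soft-margin SVM training problem can be written as follows (this formulation comes from the general SVM literature, not from the original post):

\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

Maximizing the margin corresponds to minimizing \lVert w \rVert, and the parameter C (set to 15 in the code below) controls the trade-off between a wide margin and misclassified training points.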
The SVM algorithm used in this blog comes from scikit-learn, a widely used Python library of machine learning algorithms. Here is part of the code:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One-vs-rest wrapper around an RBF-kernel SVM, so the binary SVM can separate multiple activity classes
clf = OneVsRestClassifier(SVC(C=15, gamma=0.28, probability=True))
In the modeling process, two-tenths of the data are randomly selected to form a test set, which is used to test the generalization ability of the model; the remaining eight-tenths form the training set. The model never sees the test set during training, which makes the validation results more objective, accurate and persuasive. The modeling steps for the classification system are as follows:
- Data partitioning: the data set is randomly divided into two parts, 80% of the data set is used as a training dataset, and 20% of the data set is used as a test dataset.
- Data processing: standardize the data to minimize the difference in eigenvalues between individuals.
- Model establishment: the machine learning algorithm is used to systematically model the training dataset, and then the test dataset data separated above is used to verify the accuracy of the classification model.
- Model analysis: adjust the key parameters of the classification algorithm and repeat the model establishment step until the system reaches a high and stable accuracy.
- Results: repeat the above procedure, selecting a different random test set each time, record the accuracy of each run, and finally compute the average accuracy of the classification system (see the sketch below).
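A minimal sketch of this procedure, assuming the preprocessed data is already available as a feature matrix X (hour, steps, heart rate) and a label vector y of activity codes; the function name, the seeding scheme and the number of repetitions are illustrative, not taken from the original code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def run_experiment(X, y, seed):
    # 80/20 split, with a different random test set on each repetition
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)

    # Fit the scaler on the training data only, then apply it to both sets
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # One-vs-rest SVM with the parameters used in the post
    clf = OneVsRestClassifier(SVC(C=15, gamma=0.28, probability=True))
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Repeat 20 times and average the resulting accuracies
accuracies = [run_experiment(X, y, seed) for seed in range(20)]
print("mean accuracy: %.2f%%" % (100 * np.mean(accuracies)))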
Final Results
Walking is represented by 0, sleeping by 1, working by 2, and playing basketball by 3. The following are the accuracies obtained from the 20 modeling runs:
92.65% 93.55% 89.43% 92.11% 91.04%
91.40% 91.04% 92.11% 90.14% 89.07%
91.04% 91.76% 91.22% 93.37% 89.94%
91.94% 93.73% 92.29% 93.19% 90.68%
Averaging these 20 repeated experiments gives a final classification accuracy of 91.59%. In summary, the classification accuracy obtained in this experiment was high, indicating that these features are strongly correlated with human activities and deserve further analysis. However, because the dataset and the number of fitness data sellers were limited, future work should broaden the sampled population, increase the total amount of data, and collect information beyond heart rate (such as PPG signals).
Source code
All the source code can be found on GitHub.