Predicting Human Behaviour Activity using Deep Learning (LSTM)

May 26, 2018

This is my very first article, so if I have made any errors or mistakes, please let me know. Any help will be highly appreciated.

1. Introduction

Anticipating future actions is a key component of intelligence, especially in real-time systems such as robots or autonomous cars. Predicting the behavior of human participants in strategic settings is an important problem in many domains. In this work, I present an alternative: a deep learning approach that automatically performs cognitive modeling without relying on hand-crafted expert knowledge.

Game theory provides a powerful framework for the design and analysis of multi-agent systems that involve strategic interactions. The behavioral game theory literature has developed a wide range of models for predicting human behavior in strategic settings by incorporating cognitive biases and limitations derived from observations of play and insights from cognitive psychology.

Deep learning has demonstrated much recent success in solving supervised learning problems in vision, speech and natural language processing. By contrast, there have been relatively few applications of deep learning to multiagent settings. A natural starting point in applying deep networks to a new domain is testing the performance of a regular feed-forward neural network. To apply such a model to a normal form game, we need to flatten the utility values into a single vector of length mn + nm and learn a function that maps to the m-simplex output via multiple hidden layers. Feed-forward networks can’t handle variable-size inputs, but we can temporarily set that problem aside by restricting ourselves to games with a fixed input size. I experimented with that approach and found that feed-forward networks often generalized poorly because the network overfitted the training data. One way of combating overfitting is to encourage invariance through data augmentation.
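To make the flattening step concrete, here is a minimal NumPy sketch. It assumes a two-player game with m row actions and n column actions, stored as two m x n payoff matrices. The function names `flatten_game` and `shuffle_column_actions` and the toy payoffs are my own illustration, not code from the referenced paper. The augmentation shown relabels the column player's actions, which should leave the row player's target distribution unchanged.

```python
import numpy as np


def flatten_game(row_payoffs, col_payoffs):
    """Flatten two m x n payoff matrices into one vector of length mn + nm."""
    row_payoffs = np.asarray(row_payoffs, dtype=float)
    col_payoffs = np.asarray(col_payoffs, dtype=float)
    if row_payoffs.shape != col_payoffs.shape:
        raise ValueError("both players need payoff matrices of the same shape")
    return np.concatenate([row_payoffs.ravel(), col_payoffs.ravel()])


def shuffle_column_actions(row_payoffs, col_payoffs, rng):
    """Return a copy of the game with the column player's actions reordered."""
    row_payoffs = np.asarray(row_payoffs, dtype=float)
    col_payoffs = np.asarray(col_payoffs, dtype=float)
    order = rng.permutation(row_payoffs.shape[1])
    return row_payoffs[:, order], col_payoffs[:, order]


# Toy 2 x 3 game (made-up payoffs).
row_u = [[3, 0, 1], [2, 2, 0]]
col_u = [[1, 4, 0], [0, 2, 5]]

features = flatten_game(row_u, col_u)
augmented = flatten_game(*shuffle_column_actions(row_u, col_u, np.random.default_rng(0)))
print(features.shape, augmented.shape)
```

Because the vector length depends on m and n, a network trained on this input only works for games of that one size, which is the restriction mentioned above.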

2. Let’s get into the problem

Predicting human action has a variety of applications, from human-robot collaboration and autonomous robot navigation to exploring abnormal situations in surveillance videos and activity-aware service algorithms for personal or health care purposes. As an example, in autonomous healthcare services, consider an agent monitoring a patient’s activities and trying to predict whether the patient is losing her/his balance. If the agent is capable of predicting the next action, it could determine whether s/he might fall and take an action to attempt to prevent it. The dataset for this problem can be downloaded from the dataset page listed in the references.

The dataset is the result of monitoring a 26-year-old man in a three-room apartment where 14 binary sensors were installed. These sensors were installed in locations such as doors, cupboards, refrigerators, freezers or toilets. Sensor data for 28 days was collected for a total of 2120 sensor events and 245 activity instances. The annotated activities were the following: “LeaveHouse”, “UseToilet”, “TakeShower”, “GoToBed”, “Prepare Breakfast”, “Prepare Dinner” and “Get Drink”. In this specific case, the sensors were mapped one to one to actions, resulting in the following set of actions: “UseDishwasher”, “OpenPansCupboard”, “ToiletFlush”, “UseHallBedroomDoor”, “OpenPlatesCupboard”, “OpenCupsCupboard”, “OpenFridge”, “UseMicrowave”, “UseHallBathroomDoor”, “UseWashingmachine”, “UseHallToiletDoor”, “OpenFreezer”, “OpenGroceriesCupboard” and “UseFrontdoor”.

For the training process, the dataset was split into a training set (80% of the dataset) and a validation set (20% of the dataset) of continuous days. To streamline the training process, I apply the sensor-to-action mappings offline. This allows us to train the deep neural model faster while still having the raw sensor data as the input. For training, n actions are used as the input to predict the next action. That is, each training example is a sequence of n actions and its label is the action that follows that sequence, which makes this a supervised learning problem.
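The construction of the training examples can be sketched roughly as follows. The helper names (`chronological_split`, `make_windows`), the toy action stream and the choice of n = 3 are assumptions for illustration only. The split here is made on a flat list of actions, whereas the actual split was made on whole days.

```python
# The 14 action names listed above; the ID order is arbitrary.
ACTIONS = [
    "UseDishwasher", "OpenPansCupboard", "ToiletFlush", "UseHallBedroomDoor",
    "OpenPlatesCupboard", "OpenCupsCupboard", "OpenFridge", "UseMicrowave",
    "UseHallBathroomDoor", "UseWashingmachine", "UseHallToiletDoor",
    "OpenFreezer", "OpenGroceriesCupboard", "UseFrontdoor",
]
ACTION_TO_ID = {name: idx for idx, name in enumerate(ACTIONS)}


def chronological_split(stream, train_fraction=0.8):
    """Split a time-ordered list without shuffling, so validation is 'later'."""
    cut = int(len(stream) * train_fraction)
    return stream[:cut], stream[cut:]


def make_windows(ids, n):
    """Slide a window of n actions over ids; the label is the action after it."""
    inputs, labels = [], []
    for start in range(len(ids) - n):
        inputs.append(list(ids[start:start + n]))
        labels.append(ids[start + n])
    return inputs, labels


# Made-up action stream, already mapped from sensors to actions.
stream = [
    "UseHallBedroomDoor", "UseHallToiletDoor", "ToiletFlush",
    "UseHallBathroomDoor", "OpenFridge", "OpenCupsCupboard", "UseMicrowave",
    "OpenPlatesCupboard", "UseDishwasher", "UseFrontdoor",
]
ids = [ACTION_TO_ID[name] for name in stream]
train_ids, val_ids = chronological_split(ids)
x_train, y_train = make_windows(train_ids, n=3)
print(x_train[0], "->", y_train[0])
```

With real data the validation part is long enough to produce windows of its own. In this toy stream it only holds two actions, so `make_windows(val_ids, n=3)` returns empty lists.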

3. Shifting to technical gear

3.1 Introduction to basic terms and Action Embeddings for Action Representation

There are two main monitoring approaches for automatic human behaviour and activity evaluation: vision-based and sensor-based monitoring. Sensor-based behaviour and activity evaluation is the most widely used solution, as vision-based approaches tend to generate privacy concerns among the users. Sensor-based approaches rely on emerging sensor network technologies for behaviour and activity monitoring. The sensor data they generate are mainly time series of state changes and/or various parameter values, which are usually processed through data fusion, probabilistic or statistical analysis methods and formal knowledge technologies for activity recognition.

To describe user behaviour properly, I have defined a series of concepts based on those proposed in earlier work: actions, activities and behaviours. Actions describe the simplest conscious movements, while behaviours describe the most complex conduct. I have extended that model by dividing behaviours into two different types, intra-activity behaviours and inter-activity behaviours. This allows us to better model different aspects of the user’s behaviour.

The algorithm presented in this article models the inter-activity behaviour, using actions to describe it. One of the characteristics of our algorithm is that it works on the action-space instead of the sensor-space. The advantage of working on the action-space is that different sensor types may detect the same action type, simplifying and reducing the hypothesis space. This is even more important when using semantic embeddings to represent these actions in the model, as the reduced number of actions produces more significant embedding representations.

Given a sequence of actions S(act)=[a1,a2,…,a(la)], where la is the sequence length and ai ∈ R^da is the action vector of the ith action in the sequence, we let Context(ai)=[ai−n,…,ai−1,ai+1,…,ai+n] be the context of ai, where 2n is the length of the context window. We let p(ai|Context(ai)) be the probability of the ith action given its context. The model used to create the embeddings is trained to maximize the log-likelihood of the sequence (logMLE):

La(MLE) = ∑_{ai ∈ S(act)} log p(ai | Context(ai))

In the model, I use the Word2Vec implementation in Gensim to calculate the embedding values for each action in the dataset. We represent each action with a vector of 50 float values, because of the small number of action instances compared with the number of words that are usually used in NLP tasks. Instead of providing the values directly to our model, I have included an embedding layer as the input to the model. In this layer, I store the procedural information on how to transform an action ID to its embedding. Adding this layer allows us to train it with the rest of the model and, in this way, fine-tune the embedding values to the current task, improving the general accuracy of the model.

3.2 LSTM Network for Behaviour Modelling

To create a probabilistic model for behaviour prediction, I have used a deep neural network architecture based on recurrent neural networks, specifically on LSTMs. In inter-activity behaviour modelling, the prediction of the activity label for an action depends on the actions registered before. The recurrent memory management of LSTMs allows us to model the problem considering those sequential dependencies. LSTMs are the central element of the proposed architecture, which can be divided into three different parts: the input module, the sequence modelling module and the predictive module.

The input module receives raw sensor data and maps it to actions using previously defined equivalences. These actions are then fed to the embedding layer.
The embedding layer receives the action IDs and transforms them into embeddings with semantic meaning. This layer is configured as trainable, that is, able to learn during the training process. The layer weights are initialized using the values obtained by using the Word2Vec algorithm. The action embeddings obtained in the input module are then processed by the sequence modelling module.

Finally, after the LSTM layer comes the predictive module, which uses the sequence models created by the LSTMs to predict the next action. This module is composed of densely connected layers with different types of activations.
The LSTM layer has a size of 512 units. It is followed by two blocks of densely connected layers, each with a size of 1024 units and rectified linear unit (ReLU) activations. After each ReLU activation, dropout regularization with a value of 0.8 is applied. Dropout prevents complex co-adaptations of the fully connected layers by ignoring randomly selected neurons during the training process, which reduces overfitting. Finally, a third fully connected layer with a softmax activation function produces the next-action predictions, as we want to select the most probable actions for a given sequence.
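Put together, the three modules can be sketched in Keras roughly as follows. This is my reading of the description above, not the original code: the input length of 5, the random matrix standing in for the Word2Vec vectors, and interpreting the dropout value of 0.8 as the Keras drop rate are all assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_ACTIONS = 14     # actions in the dataset
EMBEDDING_DIM = 50   # embedding size described above
SEQUENCE_LENGTH = 5  # assumed input length n

# Random placeholder for the Word2Vec vectors (illustration only).
pretrained = np.random.default_rng(0).normal(
    size=(NUM_ACTIONS, EMBEDDING_DIM)
).astype("float32")


def build_model(embedding_matrix, sequence_length=SEQUENCE_LENGTH):
    num_actions, embedding_dim = embedding_matrix.shape

    # Input module: action IDs mapped to trainable embeddings.
    action_ids = keras.Input(shape=(sequence_length,), dtype="int32")
    x = layers.Embedding(
        input_dim=num_actions,
        output_dim=embedding_dim,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=True,
    )(action_ids)

    # Sequence modelling module: one LSTM layer with 512 units.
    x = layers.LSTM(512)(x)

    # Predictive module: two ReLU blocks with dropout, then softmax.
    for _ in range(2):
        x = layers.Dense(1024, activation="relu")(x)
        x = layers.Dropout(0.8)(x)  # assumption: 0.8 is the fraction dropped
    next_action = layers.Dense(num_actions, activation="softmax")(x)

    model = keras.Model(action_ids, next_action)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",  # expects one-hot encoded labels
        metrics=["accuracy"],
    )
    return model


model = build_model(pretrained)
model.summary()
```

The softmax output has one probability per action, so the most probable next actions can be read off by sorting each output row.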

3.3 Evaluation of Model

This project was implemented using Keras, with TensorFlow as the back-end. Each experiment was trained for 1000 epochs with a batch size of 128, using categorical cross-entropy as the loss function and Adam as the optimizer. After the 1000 epochs, I selected the best model using the validation accuracy as the fitness metric. The action embeddings were calculated with the Word2Vec algorithm on the full training set extracted from the Kasteren dataset, and the embedding layer was configured as trainable. To validate the results of the architecture, I performed three types of experiments:

Architecture experiments: I evaluated different architectures, varying the number of LSTMs and fully connected dense layers. For the architecture experiments, I have tried different dropout values (with a dropout regularization after each fully connected layer with a ReLU activation), different numbers of LSTM layers, different types of LSTM layers (normal and bidirectional), different numbers of fully connected layers and different sizes of fully connected layers. We also compared using embeddings for the representation of the actions versus the more traditional approach of using one-hot vectors, in order to ascertain the improvements that the embeddings provide.
Sequence length experiments: I evaluated the effects of altering the input action sequence length. For these experiments, I varied the length of the input action sequence but kept the remaining settings fixed: a dropout regularization of 0.8, one LSTM layer with a size of 512, two fully connected layers with ReLU activation with a size of 1024 and one final fully connected layer with softmax activation.

Time experiments: I evaluated the effects of taking into account the timestamps of the input actions. I tried different ways of including the timestamps of the actions in the input sequence and analysed three different options. In the first configuration (T1), I used two parallel LSTM layers, one for the action embeddings and the other for the timestamps, concatenating the results of both layers before the fully connected layers. In the second configuration (T2), I concatenated the action embeddings and the timestamps before a single LSTM layer. In the third configuration (T3), the embeddings were connected to an LSTM layer, whose output was concatenated with the timestamps and sent to another LSTM layer. All the configurations used a dropout regularization of 0.8, an LSTM layer size of 512, two fully connected layers with ReLU activation with a size of 1024 and one final fully connected layer with softmax activation.
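The three timestamp configurations can be sketched with the Keras functional API as shown below. This is only my interpretation of T1, T2 and T3. In particular, I assume each timestamp is a single scalar per action (for example a normalised time of day), that the input length is 5, and that the first LSTM in T3 returns its full output sequence so it can be concatenated with the timestamps step by step.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_ACTIONS, EMBEDDING_DIM, SEQUENCE_LENGTH = 14, 50, 5  # length is assumed


def make_inputs():
    """Create the two inputs and the embedded action sequence."""
    actions = keras.Input(shape=(SEQUENCE_LENGTH,), dtype="int32", name="action_ids")
    # Assumption: one scalar timestamp feature per action.
    times = keras.Input(shape=(SEQUENCE_LENGTH, 1), name="timestamps")
    embedded = layers.Embedding(NUM_ACTIONS, EMBEDDING_DIM)(actions)
    return actions, times, embedded


def prediction_head(features):
    """Two ReLU blocks with dropout followed by the softmax layer."""
    x = features
    for _ in range(2):
        x = layers.Dense(1024, activation="relu")(x)
        x = layers.Dropout(0.8)(x)
    return layers.Dense(NUM_ACTIONS, activation="softmax")(x)


def build_t1():
    """T1: parallel LSTMs for actions and timestamps, merged before the head."""
    actions, times, embedded = make_inputs()
    merged = layers.Concatenate()([layers.LSTM(512)(embedded), layers.LSTM(512)(times)])
    return keras.Model([actions, times], prediction_head(merged))


def build_t2():
    """T2: embeddings and timestamps concatenated before a single LSTM."""
    actions, times, embedded = make_inputs()
    merged = layers.Concatenate(axis=-1)([embedded, times])
    return keras.Model([actions, times], prediction_head(layers.LSTM(512)(merged)))


def build_t3():
    """T3: LSTM over embeddings, output joined with timestamps, second LSTM."""
    actions, times, embedded = make_inputs()
    sequence = layers.LSTM(512, return_sequences=True)(embedded)
    merged = layers.Concatenate(axis=-1)([sequence, times])
    return keras.Model([actions, times], prediction_head(layers.LSTM(512)(merged)))


for build in (build_t1, build_t2, build_t3):
    print(build.__name__, build().output_shape)
```

All three variants share the same inputs and prediction head, so any difference in accuracy between them comes from where the timestamps enter the network.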

Metrics: To properly validate the predicting capabilities of the proposed architectures, I evaluated how they perform using the top-k accuracy. The top-k accuracy is a standard metric in different prediction and modelling tasks. The top-k accuracy (acc_at_k) is defined as

acc_at_k = (1/N) ∑_{i=1}^{N} 1[ai ∈ Cki]

where N is the number of evaluated examples, ai is the expected action and Cki is the set of the top k predicted actions for example i. 1[.]→{0,1} is the scoring function: when the condition inside the brackets is true, its value is 1; otherwise, its value is 0. In our case, if the ground-truth action is in the set of k predicted actions, the function value is 1. To evaluate this model, I provide the accuracy for k values of 1, 2, 3, 4 and 5.
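As a small worked example of the metric, the following NumPy sketch computes the top-k accuracy from a matrix of predicted probabilities. The function name `top_k_accuracy` and the toy numbers are mine, chosen only to make the calculation easy to follow.

```python
import numpy as np


def top_k_accuracy(probabilities, expected, k):
    """Fraction of examples whose expected action is among the k most probable."""
    probabilities = np.asarray(probabilities)
    expected = np.asarray(expected)
    # Indices of the k highest-scoring actions per example (the set Cki).
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = (top_k == expected[:, None]).any(axis=1)
    return float(hits.mean())


# Four made-up predictions over three actions.
probs = np.array([
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.20, 0.70],
    [0.40, 0.35, 0.25],
])
expected = np.array([0, 2, 2, 1])

print(top_k_accuracy(probs, expected, k=1))  # 0.5
print(top_k_accuracy(probs, expected, k=2))  # 1.0
```

With k = 1 only the first and third examples are hits. With k = 2 the second-best guess recovers the other two, which is why the accuracy grows with k.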

Summary of best results can be seen below:

4. Conclusion

In this project, I have proposed a multilevel conceptual model that describes the user behaviour using actions, activities, intra-activity behaviour and inter-activity behaviour. Using this conceptual model, I have presented a deep learning architecture based on LSTMs that models inter-activity behaviour. This architecture offers a probabilistic model that allows us to predict the user’s next actions and to identify anomalous user behaviours. I have evaluated several architectures, analysing how each one of them behaves for a different number of action predictions.

To make a human action prediction system useful for real-time applications such as robotics, we need to reduce the algorithm latency as much as possible and predict further into the future. The complete project will be updated soon here.

5. References

https://papers.nips.cc/paper/6509-deep-learning-for-predicting-human-strategic-behavior.pdf
http://www.mdpi.com/2076-3417/8/2/305/htm
https://arxiv.org/ftp/arxiv/papers/1709/1709.07894.pdf
https://github.com/aitoralmeida/c4a_behavior_recognition
https://sites.google.com/site/tim0306/datasets