How Deep Learning May Transform Location-Aware Computing
Location-awareness sits at the core of location-based services (LBS). However, accurately estimating the location of an object is often far from straightforward. The Global Positioning System (GPS), the primary enabler of outdoor location-aware computing, directly outputs geospatial coordinates, but its error can exceed the tolerance of some applications. In GPS-denied areas, location must be inferred indirectly from raw data provided by sensors such as inertial measurement units (IMUs) and cameras. Conventionally, this data, whether directly measured geospatial coordinates or indirectly inferred locations, has to pass through a laborious hand-crafted processing pipeline before it can be consumed by higher-level LBS. This article reviews two recent attempts at introducing deep learning models into location-aware computing, effectively reducing the expert involvement required.
Object Tracking via Partially-observable Stochastic Processes
This AAAI 2016 paper presents an end-to-end object-tracking approach: one end is the raw data collected from a simulated 2D laser scanner, and the other is the full environment state, including even occluded objects, as illustrated below:
The key issue behind this tracking problem is that the raw data captures only part of the environment due to occlusion. Such partially-observable stochastic processes are traditionally handled by Bayesian filtering (e.g. the Kalman filter), which requires hand-designed state representations and imposes assumptions on, or sampling from, the model distributions. The authors claim this paper presents the first end-to-end trainable solution, in which a robot agent learns the belief-state representation, as well as the corresponding predict and update operations, in an unsupervised manner, making it more effective and labour-saving than traditional approaches.
The tracking problem is framed as a generative model with a hidden Markov process h that captures the environment dynamics. Its appearance layer y encodes the locations of individual objects and is partially observed through a third layer, the sensor measurements x, as shown in the figure below:
The goal then becomes estimating the conditional distribution of y at time t given the history of inputs x. Note that y itself is not a Markov process, so methods such as hidden Markov models cannot be applied directly. Instead, the problem can be handled by recursive Bayesian estimation: recursively computing the conditional distribution of h at time t given the input history (the belief), and then estimating the target as the conditional distribution of y given the belief. The paper expresses this with two neural networks with weights W_F and W_P: the first maps the input history to the belief, and the second maps the belief to the location (y at time t). Chained together, the two networks effectively form a feed-forward recurrent neural network. The hidden belief representation is learned from raw data and acts as the network's memory, passed from one time step to the next. The filtering process is illustrated below:
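The classical predict/update recursion that these networks replace can be illustrated with a simple histogram (grid) Bayes filter. The 1D grid, motion model, and sensor model below are illustrative assumptions for exposition, not taken from the paper:

```python
import numpy as np

# Belief over a 1D grid of 10 cells, initially uniform.
belief = np.full(10, 0.1)

# Predict: the object moves one cell right with p=0.8, stays with p=0.2.
def predict(belief):
    return 0.8 * np.roll(belief, 1) + 0.2 * belief

# Update: fold in a noisy sensor reading of the object's cell.
def update(belief, measured_cell):
    likelihood = np.full(10, 0.05)     # small probability everywhere
    likelihood[measured_cell] = 0.9    # high probability at the reading
    posterior = likelihood * belief
    return posterior / posterior.sum() # normalise

for z in [3, 4, 5]:                    # three consecutive readings
    belief = update(predict(belief), z)

print(belief.argmax())  # → 5: the filter tracks the moving object
```

In a Kalman filter the belief is a Gaussian with hand-designed mean and covariance; the paper's contribution is to let a network learn both the belief representation and these two operations from data.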
More specifically, the authors use a four-layer feed-forward recurrent network with convolutional operations and a sigmoid activation at each layer. The architecture of the network is shown below:
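Structurally, one filtering step combines the previous belief with the current observation and then decodes the appearance y from the new belief. The sketch below uses dense layers and illustrative sizes in place of the paper's four convolutional layers, purely to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

OBS, BELIEF, OUT = 64, 32, 64   # illustrative sizes, not the paper's

# W_F: maps (previous belief, current observation) -> new belief.
W_F = rng.normal(0, 0.1, (BELIEF, BELIEF + OBS))
# W_P: maps belief -> predicted appearance y (occupancy probabilities).
W_P = rng.normal(0, 0.1, (OUT, BELIEF))

def step(belief, x):
    """One filtering step: update the belief, then decode y from it."""
    belief = sigmoid(W_F @ np.concatenate([belief, x]))
    y_hat = sigmoid(W_P @ belief)
    return belief, y_hat

belief = np.zeros(BELIEF)        # initial belief, the network's "memory"
for t in range(5):               # a short observation sequence
    x = rng.random(OBS)          # stand-in for flattened laser-scan data
    belief, y_hat = step(belief, x)

print(y_hat.shape)  # → (64,)
```

Unrolling `step` over time is what makes the chained pair a recurrent network: the belief vector is the only state carried between time steps.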
The model can be trained in the usual way, by minimising the negative log-likelihood of the target distribution. However, ground-truth data for y may not be accessible due to occlusion. The authors therefore propose to train the network to predict not only the next time step but also several steps (say n) further into the future, by dropping out all observations between the current step and n steps ahead (setting them to zero). This observation dropout must be applied both spatially and temporally across the whole dataset to avoid overfitting. It allows the network to be trained without ground-truth data, in an effectively unsupervised way.
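The temporal part of this trick can be sketched in a few lines; the sequence shape and the helper name are illustrative assumptions (the paper additionally drops out spatial regions of each scan):

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy observation sequence: T time steps of a flattened 2D scan.
T, D = 20, 16
obs = rng.random((T, D))

def dropout_observations(obs, t, n):
    """Zero out the n observations after time step t, so the network
    must predict them from its belief instead of copying the input."""
    masked = obs.copy()
    masked[t + 1 : t + 1 + n] = 0.0
    return masked

masked = dropout_observations(obs, t=5, n=3)
# The network is then trained to reconstruct obs[6:9] from the masked
# input, using its belief state as the only source of information.
```

Because the masked steps still exist in the raw data, the network's own future inputs serve as training targets, which is what removes the need for externally annotated ground truth.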
The training set consists of 10,000 sequences, each 2,000 time steps long, and the network is trained with 50,000 iterations of stochastic gradient descent. There are two significant findings. First, unsupervised training produced results almost identical to supervised learning, suggesting the observation dropout is effective. Second, the activations of the belief layer show representations adapted to different object movement patterns. The training progress can be seen below or in the video attached to the original paper.
However, probably because the work is the first of its kind, the authors do not provide any quantitative performance metric or comparison to existing work. The authors state that they are extending this work to more realistic data and more challenging robotics tasks.
Location Prediction based on Geospatial Trajectories
A very recent blog post by Launchpad.AI brings long short-term memory (LSTM) networks into transportation operations. For many industrial and outdoor applications, GPS and radio-frequency identification (RFID) tracking are now prevalent, since they can capture real-time position information with up to meter-level accuracy. How to exploit geospatial data to improve operational processes is, however, a less understood topic. The author proposes an automated geospatial anomaly detection system that evaluates whether a tracked object deviates from its expected trajectory. An LSTM network is used to learn from historical data and predict a look-ahead position. The proposed system is applied to a real dataset consisting of one month of trajectories from 28,000 taxis in Beijing.
In addition to the timestamp and each taxi's position (latitude and longitude), speed, orientation, and occupancy status are also regularised and factored into the sequential data, which is then processed by an LSTM network. The author also takes the identity of the tracked object into account: the unique driver ID is first passed through an embedding layer and then merged with the LSTM outputs. The learned semantics of the driver embeddings capture, to some degree, whether two taxis share similar movement patterns. After a dense fully-connected layer, the network outputs a one-minute look-ahead prediction in the form of latitude and longitude. The network is implemented in Keras, and its architecture is illustrated below:
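The data flow of this architecture can be sketched with a minimal numpy implementation. All layer sizes, the feature ordering, and the weight initialisation below are illustrative assumptions, not the post's actual Keras configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 5 input features per time step (lat, lon, speed,
# orientation, occupancy), a 16-unit LSTM, an 8-dim driver-ID embedding,
# and a 2-dim output (lat, lon one minute ahead).
D_IN, H, D_EMB, N_DRIVERS = 5, 16, 8, 100

W = rng.normal(0, 0.1, (4 * H, D_IN))  # input weights (i, f, o, g gates)
U = rng.normal(0, 0.1, (4 * H, H))     # recurrent weights
b = np.zeros(4 * H)
E = rng.normal(0, 0.1, (N_DRIVERS, D_EMB))  # driver-ID embedding table
W_out = rng.normal(0, 0.1, (2, H + D_EMB))  # dense layer after the merge

def lstm_step(x, h, c):
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g
    return o * np.tanh(c), c

def predict(sequence, driver_id):
    """Run the LSTM over a trajectory, merge the final state with the
    driver embedding, and emit a (lat, lon) look-ahead prediction."""
    h, c = np.zeros(H), np.zeros(H)
    for x in sequence:
        h, c = lstm_step(x, h, c)
    merged = np.concatenate([h, E[driver_id]])
    return W_out @ merged

trajectory = rng.random((30, D_IN))  # 30 time steps of regularised features
print(predict(trajectory, driver_id=42))  # untrained, so arbitrary values
```

Concatenating the embedding after the LSTM (rather than feeding the ID at every time step) keeps the recurrent part identity-agnostic while still letting the final dense layer condition its prediction on who is driving.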
The fully trained model is then evaluated on a held-out test dataset. According to the author, taxis travel 391 meters on average in one minute, meaning that a naive system using the current position as the expected location would incur an error of 391 meters. A model trained on only 5 taxis achieved an error of 2,076 meters, but as the number of taxis in the training set increases to 8,000, the error drops to 152 meters, well below the average distance travelled in one minute.
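These meter-level errors are great-circle distances between the predicted and actual coordinates. The post does not show its evaluation code; the standard haversine formula below is one way such distances are computed, with an invented example point pair in central Beijing:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    R = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

# Example: a prediction roughly 150 m east of the true position.
err = haversine_m(39.9042, 116.4074, 39.9042, 116.4092)
print(round(err))  # on the order of 150 meters
```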
Both cases demonstrate well how location-aware computing can benefit from deep learning: one infers position from raw sensor data, and the other detects operational anomalies directly from position data. Deep learning on sequential data is by now well established, yet its application to location-aware computing remains relatively unexplored, perhaps for the following reasons:
- Difficulty of evaluation. The cases above are more proof-of-concept than mature work, since neither proposes a fair, comparable metric for quantitative evaluation.
- Lack of reliably annotated datasets. The ground truth for location-aware computing (usually positions) is in general not accessible. In the case of the first paper, for instance, recording the ground-truth locations of all objects in a realistic environment may not be practical, rendering large-scale learning less feasible.
- Temporal complexity. Much as in speech recognition, the minimum length of a temporal data sequence that makes geospatial sense is arbitrary and highly context-dependent. In speech recognition and optical character recognition (OCR), this problem is currently handled by the connectionist temporal classification (CTC) loss; whether CTC can be extended to location-aware computing remains unexplored.
All in all, location-aware computing today makes very limited use of machine learning, while demanding an excessive amount of expert knowledge for data mining and interpretation. These two cases point to promising directions in which learning automatically from data can effectively improve current location-aware computing.
Author: Raymond Kwan | Editor: Has Wang | Localized by Synced Global Team: Xiang Chen