To support tree health and status monitoring, we use Graph Neural Networks to geo-localize trees in urban settings in an automated and efficient way.
Why monitor street trees?
Cities worldwide have launched efforts to combat the unprecedented rising temperatures and heatwaves caused by “urban heat islands”, which result from covering natural land with impervious surfaces (concrete, pavement, buildings, etc.). It turns out the solution is simple: plant more trees. Trees provide shade that covers the impervious surfaces and disperses radiation, while releasing small amounts of vapour from their leaves, creating a cooling effect. Another reason to monitor trees is to enable long-term tree health studies; have a look at some of our related works here and here.
Towards Large-Scale Tree Mortality Studies in Cities with Deep Learning & Street View Images
Automating Tree Health Monitoring from Images with Machine Learning
Geo-localizing trees manually through in-situ surveys by field crews or volunteers is a laborious task, and in large cities it becomes unfeasible. Fortunately, using deep learning, we can “crawl” through a city’s available imagery and perform this task at a large scale with few resources and little labour.
Our task is to detect, re-identify, and localize static objects (specifically trees) in urban settings using multiple sources and multiple views. Most existing methods require sequences of images with depth, rely on camera intrinsics and extrinsics, or perform the task in multiple stages. We utilise Graph Neural Networks (GNNs) to achieve this in a flexible and efficient manner.
Keep it simple, no extra sensors needed…
We rely only on geotagged images that are already available for general purposes, such as Street View and geotagged social-media imagery. We believe a special rig with sensors or depth cameras is unnecessary: the images’ metadata and geometry are sufficient to accomplish this task.
What do you mean by geometry?
Alongside the visual features, the images come with useful metadata, which usually includes the camera’s heading and geo-coordinates. Using geometry, we can assign geo-coordinates to pixels inside the image, and vice versa: find the pixels corresponding to given geo-coordinates (explained in detail in our paper).
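As an illustration, here is a minimal sketch of one direction of this mapping: projecting a geo-coordinate onto a pixel column of a 360° panorama. It assumes an equirectangular panorama whose centre column points along the camera heading; the function names are ours, not the paper’s actual projection functions.

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    x = math.sin(dlam) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return math.degrees(math.atan2(x, y)) % 360.0

def geo_to_pixel_column(cam_lat, cam_lon, cam_heading, target_lat, target_lon, pano_width):
    """Map a geo-coordinate to a horizontal pixel position in a 360-degree
    equirectangular panorama whose centre column points along cam_heading."""
    rel = (bearing_deg(cam_lat, cam_lon, target_lat, target_lon) - cam_heading + 180.0) % 360.0
    return rel / 360.0 * pano_width
```

For example, a target due east of a north-facing camera lands three quarters of the way across the panorama. The inverse direction (pixel to bearing) simply runs the last formula backwards.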
We created a web tool to demonstrate these functions in action. First, click anywhere you’d like on the street to grab the 4 panoramas closest to that spot. Then, as you move your mouse around, the tool takes the geo-coordinate under the cursor and projects it into pixels inside the 4 views.
These projection functions give us a rough location estimate, which helps us predict which trees in the different views correspond to each other, so that no instance of a tree is counted more than once.
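Where does that rough estimate come from? Given the bearings toward the same tree from two different panoramas, intersecting the two rays yields a coarse position. The sketch below is our own simplification (a flat-earth approximation around the cameras), not the geo-localization network from the paper:

```python
import math

def triangulate(cam1, brg1, cam2, brg2):
    """Intersect two bearing rays (degrees clockwise from north) cast from two
    camera positions (lat, lon), using a local flat-earth approximation."""
    R = 6_371_000.0  # mean Earth radius in metres
    lat0 = math.radians((cam1[0] + cam2[0]) / 2)

    def to_xy(lat, lon):
        # local east/north coordinates in metres
        return (math.radians(lon) * R * math.cos(lat0), math.radians(lat) * R)

    def to_latlon(x, y):
        return (math.degrees(y / R), math.degrees(x / (R * math.cos(lat0))))

    p1, p2 = to_xy(*cam1), to_xy(*cam2)
    # bearing measured clockwise from north -> direction vector (east, north)
    d1 = (math.sin(math.radians(brg1)), math.cos(math.radians(brg1)))
    d2 = (math.sin(math.radians(brg2)), math.cos(math.radians(brg2)))
    # solve p1 + t*d1 = p2 + s*d2 for t
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    t = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / denom
    return to_latlon(p1[0] + t * d1[0], p1[1] + t * d1[1])
```

Noisy headings make a single intersection unreliable, which is exactly why the method treats it only as a cue for correspondence rather than as the final location.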
Representing the scene as a graph.
In contrast to our previous works (Wegner et al., 2016; Nassar et al., 2019), we represent our data as graphs, with the nodes representing the trees and the edges between them depicting the correspondence between the different instances of the objects. The nodes carry the CNN features of the object, while the edges carry the ground-truth label ({0, 1}) indicating whether the pair is a match. This setup gives us the flexibility to take a variable number of images and targets as input.
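A minimal sketch of this construction (our own hypothetical helper, not the paper’s code): at training time, ground-truth object IDs are available, so edge labels fall out of comparing IDs across views.

```python
import itertools

def build_correspondence_graph(detections):
    """detections: list of (view_id, object_id, feature_vector) tuples.
    Returns the node features plus a dense set of cross-view edges, each
    labelled 1 if both endpoints are the same physical object, else 0."""
    nodes = [feat for _, _, feat in detections]
    edges, labels = [], []
    for (i, (vi, oi, _)), (j, (vj, oj, _)) in itertools.combinations(enumerate(detections), 2):
        if vi == vj:  # only connect detections from different views
            continue
        edges.append((i, j))
        labels.append(1 if oi == oj else 0)
    return nodes, edges, labels
```

Because the graph is built per batch, adding a fifth or sixth view just means more nodes and edges; nothing about the model has to change.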
This casts our problem as a “link prediction” task, which can be solved with Graph Neural Networks (GNNs). The literature on GNNs is extensive; check out here and here to learn more about what GNNs are and how they differ from CNNs.
Tutorial on Graph Neural Networks for Computer Vision and Beyond (Part 1)
An end-to-end method.
With this work, we strive for an end-to-end method that is not composed of multiple stages trained separately, each with its own parameters to tweak. So we created a method that matches corresponding objects in the scene to avoid double counting and, at the same time, provides their geolocation.
Our method works by following these steps:
- A batch of images from multiple views, together with the corresponding camera metadata, is passed through the backbone network (EfficientNet) and the multi-scale feature aggregator (BiFPN) of the object detector, which provide features at different levels.
- Anchors are then generated across the feature layers and passed through two sub-networks to provide classification and bounding box predictions. Based on the IoU of the ground truth with the anchors, we select the positive and negative anchors.
- The features of these anchors are used to generate a dense fully connected graph.
- The graph is then fed to a GNN to predict if the nodes should be matched by classifying the edge connecting them. In parallel, the regressed bounding boxes of the positive anchors are passed to the Geo-Localization Network to regress the geo-coordinate.
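The link-prediction step above can be caricatured in a few lines: one round of mean-neighbour message passing over the graph, followed by a scorer on each edge. This toy NumPy version (untrained random weights, our own simplification) illustrates only the data flow, not the actual GNN architecture:

```python
import numpy as np

def gnn_edge_scores(node_feats, edges, rng=None):
    """One round of mean-neighbour message passing followed by a linear edge
    scorer -- a toy stand-in for the GNN link-prediction head."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = node_feats.shape
    h = node_feats.astype(float).copy()
    # message passing: each node averages its neighbours' features
    agg = np.zeros_like(h)
    deg = np.zeros(n)
    for i, j in edges:
        agg[i] += h[j]; agg[j] += h[i]
        deg[i] += 1; deg[j] += 1
    deg = np.maximum(deg, 1)
    h = 0.5 * (h + agg / deg[:, None])
    # edge score: sigmoid of a linear function of the endpoint difference,
    # so similar endpoints and dissimilar endpoints get separable scores
    w = rng.normal(size=d)
    return np.array([1 / (1 + np.exp(-(w @ np.abs(h[i] - h[j])))) for i, j in edges])
```

In training, these per-edge scores would be pushed toward the {0, 1} ground-truth match labels carried by the edges, jointly with the detection and geo-localization losses.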
What the results look like.
Here are some example results for the different outputs: object detection along with re-identification, and geo-location prediction.
We present an end-to-end multi-view detector that re-identifies and geo-localizes static objects. The method integrates Graph Neural Networks, which add flexibility in re-identification, making it possible to accommodate any number of views while remaining computationally efficient. Furthermore, our approach is robust to occlusion, neighbouring objects of similar appearance, and severe changes in viewpoint. All of this is achieved using only RGB imagery along with its metadata.
For the interested reader, we refer to our full research paper here:
Nassar AS, D’Aronco S, Lefèvre S, Wegner JD. GeoGraph: Learning graph-based multi-view object detection with geometric cues end-to-end. arXiv preprint, March 2020.