How to Monitor Trees in a City Using Graph Neural Networks?

Ahmed Nassar
Sep 17 · 5 min read
Illustration of our multi-view scenario. Red circle: camera acquisition location. Green circle: target object to be detected. Orange line: distance between camera and object. (Imagery ©2020 Google)

To aid tree health and status monitoring, we use Graph Neural Networks to geo-localize trees in urban settings in an automated and efficient way.

Why monitor street trees?

Cities worldwide have initiated efforts to combat the unprecedented rise in temperatures and heatwaves caused by “urban heat islands”, which result from covering natural land with impervious surfaces (concrete, pavements, buildings, etc.). It turns out the solution is simple: plant more trees. Trees provide shade that covers the impervious surfaces and disperses radiation, while the small amounts of vapour released from their leaves create a cooling effect. Another reason to monitor trees is to enable long-term tree health studies; have a look at some of our related works here and here.

Geo-localizing trees in urban settings manually, through in-situ surveys by field crews or volunteers, is a laborious task that becomes infeasible in large cities. Fortunately, using deep learning, we can “crawl” through a city’s available imagery and perform this task at a large scale with few resources and little labour.

Our task is to detect, re-identify, and localize static objects (specifically, trees) in urban settings using multiple sources and multiple views. Most existing methods require sequences of images with depth, need the camera’s intrinsic and extrinsic parameters, or perform the task in multiple stages. We utilise Graph Neural Networks (GNNs) to achieve this in a flexible and efficient manner.

Keep it simple, no extra sensors needed…

We rely only on geotagged images that are already available for general purposes, like Street View and social-media geotagged imagery. We believe that it is unnecessary to use a special rig with sensors or depth cameras and that relying on the images’ metadata and geometry to accomplish this task is sufficient.

What do you mean by geometry?

Alongside the visual features, the images come with useful metadata. This metadata usually includes the camera’s heading and geo-coordinates. Using geometry, we can assign geo-coordinates to pixels inside the image, and vice versa: finding pixels corresponding to geo-coordinates. (explained in detail in our paper)
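To make the two projection directions concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not necessarily the formulation used in our paper: the panorama is assumed to be equirectangular with its centre column facing the camera heading, the ground is assumed flat, and the camera height (`cam_height_m`) is a hypothetical default. All function and parameter names are made up for this example.

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # WGS84 equatorial radius


def geo_to_pixel(cam_lat, cam_lon, heading_deg, obj_lat, obj_lon,
                 width, height, cam_height_m=2.5):
    """Project a ground-level geo-coordinate into an equirectangular panorama."""
    # Local east/north offsets in metres (small-distance approximation).
    east = (EARTH_RADIUS_M * math.radians(obj_lon - cam_lon)
            * math.cos(math.radians(cam_lat)))
    north = EARTH_RADIUS_M * math.radians(obj_lat - cam_lat)
    # Bearing from camera to object, clockwise from north.
    bearing = math.degrees(math.atan2(east, north))
    # Angle relative to the camera heading, wrapped to [-180, 180).
    rel = (bearing - heading_deg + 180.0) % 360.0 - 180.0
    x = (rel / 360.0 + 0.5) * width
    # A point on flat ground lies below the horizon, so the pitch is negative.
    pitch = math.atan2(-cam_height_m, math.hypot(east, north))
    y = (0.5 - pitch / math.pi) * height
    return x, y


def pixel_to_geo(cam_lat, cam_lon, heading_deg, x, y,
                 width, height, cam_height_m=2.5):
    """Inverse projection: a pixel below the horizon to a geo-coordinate."""
    rel = (x / width - 0.5) * 360.0
    pitch = (0.5 - y / height) * math.pi
    if pitch >= 0:
        raise ValueError("pixel is at or above the horizon; no ground intersection")
    dist = cam_height_m / math.tan(-pitch)
    bearing = math.radians(heading_deg + rel)
    east, north = dist * math.sin(bearing), dist * math.cos(bearing)
    lat = cam_lat + math.degrees(north / EARTH_RADIUS_M)
    lon = cam_lon + math.degrees(
        east / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat))))
    return lat, lon
```

Under these assumptions the two functions are inverses of each other, so projecting a coordinate into a view and back should return the original location up to floating-point error.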

Try it out yourself, click here. (Imagery ©2020 Google)

We created a web tool to demonstrate these functions in action. First, click anywhere on the street to grab the 4 panoramas closest to that spot. Then, as you move your mouse around, the tool grabs the geo-coordinate under the cursor and projects it to pixels inside the 4 views.
These projection functions give us a rough estimate of an object’s location, which helps us predict which trees correspond to each other across multiple views, so that no tree instance is counted more than once.
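As a rough illustration of how such location estimates could narrow down correspondences, the sketch below keeps only pairs of detections from different views whose estimated positions lie within a few metres of each other. The distance gate, the 4 m radius, and the `(view_id, lat, lon)` tuple layout are assumptions made for this example; in our method the matching itself is learned rather than thresholded.

```python
import math
from itertools import combinations

MEAN_EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two geo-coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * MEAN_EARTH_RADIUS_M * math.asin(math.sqrt(a))


def candidate_pairs(estimates, max_dist_m=4.0):
    """Return index pairs of detections that may show the same tree.

    `estimates` is a list of (view_id, lat, lon) tuples, one per detection.
    Pairs from the same view are skipped: one tree appears once per image.
    """
    pairs = []
    for i, j in combinations(range(len(estimates)), 2):
        view_i, lat_i, lon_i = estimates[i]
        view_j, lat_j, lon_j = estimates[j]
        if view_i == view_j:
            continue
        if haversine_m(lat_i, lon_i, lat_j, lon_j) <= max_dist_m:
            pairs.append((i, j))
    return pairs
```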

Representing the scene as a graph.

Grey circles depict nodes; yellow and orange edges depict matching and non-matching nodes respectively. (Imagery ©2020 Google)

In contrast to our previous works (Wegner et al., 2016; Nassar et al., 2019), we represent our data as graphs, with the nodes representing the trees and the edges between them depicting the correspondence between the different instances of the objects. The nodes carry the CNN features of the object, while the edges carry the ground-truth value (0 or 1) indicating whether the pair is a match. This setup gives us the flexibility to have a variable number of images and targets as input.
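A minimal sketch of this graph representation is shown below. The `Detection` class and its fields are hypothetical names chosen for the example: `features` stands in for the CNN feature vector, and `object_id` is the ground-truth identity that is only available at training time.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass
class Detection:
    view_id: int           # which image the detection came from
    object_id: int         # ground-truth identity (training only)
    features: List[float]  # CNN feature vector of the detected object


def build_graph(detections):
    """Build a fully connected graph over detections.

    Returns (nodes, edges, labels): node features, index pairs, and a
    0/1 ground-truth label per edge (1 = both ends are the same object).
    """
    nodes = [d.features for d in detections]
    edges, labels = [], []
    for i, j in combinations(range(len(detections)), 2):
        edges.append((i, j))
        same = detections[i].object_id == detections[j].object_id
        labels.append(1 if same else 0)
    return nodes, edges, labels
```

Because the graph is built from whatever detections are present, the same code handles any number of views and objects.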

This sets out our problem as a “link prediction” task, which can be solved with Graph Neural Networks (GNNs). The literature on GNNs is extensive; check out here and here to understand further what GNNs are and how they differ from CNNs.
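To give a feel for what link prediction means in practice, here is a toy sketch: one round of similarity-weighted message passing over a fully connected graph, followed by a cosine-similarity score on each edge. This is an untrained stand-in for illustration only, not the architecture from our paper; a real GNN learns its update and edge-classification functions from the labelled edges, and the 0.9 threshold here is an arbitrary assumption.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def message_pass(nodes):
    """Update every node as a softmax-weighted average of all nodes,
    with weights based on feature similarity (dot product)."""
    updated = []
    for h_i in nodes:
        logits = [dot(h_i, h_j) for h_j in nodes]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        updated.append([
            sum(w * h_j[k] for w, h_j in zip(weights, nodes)) / total
            for k in range(len(h_i))
        ])
    return updated


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def predict_links(nodes, edges, threshold=0.9):
    """Classify each edge as match (1) or no match (0)."""
    h = message_pass(nodes)
    return [1 if cosine(h[i], h[j]) >= threshold else 0 for i, j in edges]
```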

An end-to-end method.

With this work, we strive for an end-to-end method rather than one composed of multiple stages that each have to be trained separately and have their parameters tuned. So we created a method that matches corresponding objects in the scene, to avoid double counting, and provides their geolocation.

Our method works by following these steps:

  • A batch of images from multiple views and the corresponding camera metadata are passed through the backbone network (EfficientNet) and the multi-scale feature aggregator (BiFPN) of the object detector that provides different levels of features.
  • Anchors are then generated across the feature layers and passed through two sub-networks to provide classification and bounding box predictions. Based on the IoU of the ground truth with the anchors, we select the positive and negative anchors.
  • The features of these anchors are used to generate a dense fully connected graph.
  • The graph is then fed to a GNN to predict if the nodes should be matched by classifying the edge connecting them. In parallel, the regressed bounding boxes of the positive anchors are passed to the Geo-Localization Network to regress the geo-coordinate.
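
The anchor selection in the second step can be sketched as follows. The IoU computation is standard; the two thresholds (0.5 for positives, 0.4 for negatives, with anchors in between ignored) are common detector defaults and are assumptions here, not values confirmed for our method. Boxes are `(x1, y1, x2, y2)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Label each anchor: 1 = positive, 0 = negative, -1 = ignored."""
    labels = []
    for anchor in anchors:
        # Best overlap of this anchor with any ground-truth box.
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    return labels
```

Only the features of the selected anchors would then become graph nodes.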

What the results look like.

Here are some example results for the different outputs, object detection along with re-identification, and geo-location prediction.

Sample results for multi-view object detection and re-identification obtained on the Mapillary dataset. Here all objects (signs) were both detected and re-identified (cyan). (© 2020 Mapillary)
Sample results for multi-view object detection and re-identification obtained on the Pasadena dataset. Here all objects (trees) were both detected and re-identified (cyan) due to the higher similarity between views. (© 2020 Google)
Sample results for geo-localization obtained on the Pasadena dataset comparing different methods. Green, blue, yellow and red circles represent various methods. Check our paper for the full comparison. (Imagery ©2020 Google)

In conclusion.

We present an end-to-end multi-view detector that re-identifies and geo-localizes static objects. This method integrates Graph Neural Networks, which add flexibility in re-identification, making it possible to accommodate any number of views while remaining computationally efficient. Furthermore, our approach is robust to occlusion, neighbouring objects of similar appearance, and severe changes in viewpoint. This is achieved using only RGB imagery along with its metadata.

For the interested reader, we refer to our full research paper: here

Samy Nassar A, D’Aronco S, Lefèvre S, Wegner JD. GeoGraph: Learning graph-based multi-view object detection with geometric cues end-to-end. arXiv. 2020 Mar:arXiv-2003.


