Road Feature Detection & GeoTagging with Deep Learning

David Yu · Published in GeoAI · Jun 6, 2019

How to utilize MMS data to identify and locate road features for automated, efficient asset management tasks.

Authors: David Yu, Hayley Miller

Disclaimer: This article makes use of data released from AZGeo with consent from the Arizona Department of Transportation. The full disclaimer can be found at the bottom of this article.

Overview

In this article we will go over how to:
1. Generate augmented training data for road features from MMS imagery.
2. Perform object detection using Faster R-CNN and tie in Microsoft’s Computer Vision API for optical character recognition.
3. Infer an object’s latitude and longitude algorithmically, then apply machine learning to enhance the prediction.
4. Perform clustering analysis using ArcGIS Pro to refine the outputs.
5. Visualize the results in Operations Dashboard for ArcGIS.
6. Use Collector for ArcGIS to add mobile field capability to the workflow.

So sit back and enjoy!

Complete road feature detection & geotagging workflow. All training was carried out on the GeoAI Data Science Virtual Machine (DSVM) on Azure, which offers a ready-to-use environment for training machine learning models, with ArcGIS Pro pre-installed.

Preamble: Laying the groundwork for an ML approach to road feature detection

One of the fundamental promises of ML has always been the capacity of trained models to generalize to new data. The ease of extensibility of a well-trained model has in turn allowed for automation in both data collection and insight extraction/consumption. In the case of road features, the ability to extract actionable insights from existing repositories of data such as geo-referenced photos and videos captured from Mobile Mapping Systems (MMS) would provide a low-cost solution to a variety of use cases common to many local government departments or private entities. These use cases include but are not limited to: inventory management, damage assessment and change detection.

We begin our investigation with MMS data because they come in a variety of formats, are easy to collect en masse and, in the case of pure imagery without video or LiDAR outputs, are relatively inexpensive to collect.

A typical mobile mapping vehicle with top-mounted sensors and cameras

Of course, when it comes to cost-shaving, an often overlooked source of data is crowd-sourced imagery, such as that available from OpenStreetMap. The downside to using these images is that they may not include the metadata necessary for accurate geolocation, come in a variety of non-standard formats, and aren’t always available if a systematic sweep of a select route or region is needed.

Road features are any objects or points of interest that can be captured and identified from MMS data. They can be categorized as asset features and non-asset features. In the case of the former, depending on the owner of these assets, they can constitute items such as road signs, road damage, guard rails, hydrants, trees, roadside vegetation, vehicles, debris, etc. In this case study, Esri worked with the Arizona Department of Transportation (ADOT) to identify, geolocate and classify highway road signs as part of an inventory management demo.

Some examples of road features, each of which may be of interest to a different party

We can break down the aforementioned solution into three distinct pieces in terms of architecture and functionality: detection/classification, optical character recognition (OCR) and geotagging.

Let there be Training Data

The AZGeo dataset consists of a photolog database of ~2.5 million images taken in 2016, extensively covering the interstate freeways, state routes and streets of the State of Arizona. The images were taken from the same MMS platform and have a uniform resolution of 640×480 px.

MMS training images captured at regular intervals, only a fraction of which contains road signs

LabelImg is a great open-source Python tool that provides a user-friendly Qt GUI for generating training labels in either Pascal VOC or YOLO format. I used LabelImg to create 766 training images, which took around 3 hours. For the purpose of this workflow, because we are only interested in detecting highway signs, which don’t tend to involve a lot of variation, it suffices to use a set of weights for the ResNet101 backbone pretrained on ImageNet. For other use cases, such as detecting street signs in an urban setting, it is highly recommended that you pretrain your network on the LISA dataset, which contains a little over 6,000 images of common urban and highway street signs at a mixture of resolutions. Unfortunately, the annotations that come with these images do not conform to any standard format, so you would have to write your own parser to handle these annotation files.
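As a sketch of what such a parser might look like, the snippet below converts semicolon-delimited annotation rows into Pascal VOC XML files. The column names and the default image size are assumptions about the LISA CSV layout and should be adjusted to match your copy of the dataset.

```python
import csv
import os
from xml.etree import ElementTree as ET

def row_to_voc(row, out_dir, width=640, height=480):
    """Write one annotation row out as a Pascal VOC XML file."""
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = row["Filename"]
    size = ET.SubElement(ann, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"

    obj = ET.SubElement(ann, "object")
    ET.SubElement(obj, "name").text = row["Annotation tag"]
    box = ET.SubElement(obj, "bndbox")
    ET.SubElement(box, "xmin").text = row["Upper left corner X"]
    ET.SubElement(box, "ymin").text = row["Upper left corner Y"]
    ET.SubElement(box, "xmax").text = row["Lower right corner X"]
    ET.SubElement(box, "ymax").text = row["Lower right corner Y"]

    name = os.path.splitext(os.path.basename(row["Filename"]))[0] + ".xml"
    ET.ElementTree(ann).write(os.path.join(out_dir, name))

os.makedirs("voc_annotations", exist_ok=True)
with open("allAnnotations.csv") as f:
    for row in csv.DictReader(f, delimiter=";"):
        row_to_voc(row, out_dir="voc_annotations")
```

Note that for brevity this writes one object per row; in practice, rows that share a file name would need to be grouped into a single XML file.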

Drawing and labeling bounding boxes in LabelImg

The problem with generating such a small set of training images is that most CV models would tend to overfit on them, regardless of whether your model was pretrained or not. This problem becomes especially severe considering that you have to set aside a portion of the training set for testing and generating evaluation metrics. To get around this problem, it has become standard practice in the CV community to apply various augmentation techniques to your initial set of images. imgaug is a great library for this and offers augmentation techniques such as affine transformations, random cropping, dropout, Gaussian blurring and many more, on bounding box, keypoint or segmentation mask labels alike.

imgaug before & after: The top left image is the “before”, and everything else is “after”. Note that the bounding boxes are transformed in the same way as the images themselves. Some of these transformations also work for key points and masks.

Here’s the list of sequential transforms I used, modified from the examples in the official imgaug documentation, which I found to work quite well for street signs. I recommend checking out the documentation for details on how to apply these transforms.
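Since the original gist is not embedded here, the snippet below is a reconstruction in the same spirit, adapted from the imgaug documentation examples; the exact operators and parameter ranges may differ from what was actually used, and the file name and box coordinates are placeholders.

```python
import imageio
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# Load one MMS frame and its labeled sign bounding box
image = imageio.imread("training_image_0001.jpg")
bbs = BoundingBoxesOnImage(
    [BoundingBox(x1=220, y1=140, x2=310, y2=210, label="guide_sign")],
    shape=image.shape,
)

seq = iaa.Sequential([
    iaa.Fliplr(0.5),                                        # horizontal flips
    iaa.Crop(percent=(0, 0.1)),                             # random crops
    iaa.Sometimes(0.5, iaa.GaussianBlur(sigma=(0, 0.5))),   # occasional blur
    iaa.LinearContrast((0.75, 1.5)),                        # contrast jitter
    iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.05 * 255), per_channel=0.5),
    iaa.Multiply((0.8, 1.2), per_channel=0.2),              # brighten/darken
    iaa.Affine(
        scale={"x": (0.8, 1.2), "y": (0.8, 1.2)},
        translate_percent={"x": (-0.2, 0.2), "y": (-0.2, 0.2)},
        rotate=(-15, 15),
        shear=(-8, 8),
    ),
], random_order=True)

# Augment the image and its bounding boxes together so the labels stay aligned
image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)
```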

Detection & Classification

I will not cover too much material on detection & classification here; suffice it to say that a Faster R-CNN network was used. Please refer to this article on how Faster R-CNN may be applied to perform parking lot occupancy detection. If you are interested, this article explores the architecture in greater detail.

In terms of performance, we found through experimentation that inference takes around 0.3 seconds per image, which seems quite fast until one takes into account that covering just a single route (e.g. Interstate 15 in our example, which has ~6,600 images) could take up to 33 minutes.

A second option is to perform detection with the YOLOv3 network, which significantly reduces the inference time (down to 0.05 seconds per image, or a mere 5.5 minutes for Interstate 15).

Factoring in Optical Character Recognition

For the use case of sign detection, it may be advantageous for inventory management to also determine the text on detected signs. This would allow for easier signage indexing, or for text-based queries on the indexed signs. Traditionally, OCR tasks have been built from the ground up using Python libraries such as pytesseract. Unfortunately, these libraries were designed to parse scanned documents, so the user typically needs to apply various enhancement techniques just to obtain a somewhat usable input image, and even then, the results tend to be sub-optimal.

An example of how image preprocessing is traditionally carried out for OCR: [Top left] The original image is converted to grayscale to get rid of color channels and make subsequent operations easier. [Top right] Edge detection, in this case using the Canny edge detector from OpenCV. [Bottom left] Hough line transforms find the edges of an object and the image is rotated to straighten it (angled text generally doesn’t perform well with Tesseract OCR). [Bottom right] The image is thresholded to create sharp pixel gradients.
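A rough sketch of that traditional preprocessing chain with OpenCV and pytesseract might look like the following; the Canny and Hough parameters are illustrative, not tuned, and the input path is a placeholder.

```python
import cv2
import numpy as np
import pytesseract

def preprocess_for_ocr(path):
    # Load and convert to grayscale to simplify subsequent operations
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Canny edge detection
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)

    # Hough line transform to estimate the dominant skew angle
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=50, maxLineGap=10)
    angle = 0.0
    if lines is not None:
        angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
                  for x1, y1, x2, y2 in lines[:, 0]]
        angle = float(np.median(angles))

    # Rotate the image to straighten the text
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)

    # Threshold to create sharp pixel gradients (Otsu picks the split point)
    _, thresh = cv2.threshold(rotated, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return thresh

text = pytesseract.image_to_string(preprocess_for_ocr("sign_crop.jpg"))
print(text)
```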

Luckily, Microsoft’s Computer Vision API simplifies this process greatly by being flexible enough to consume a multitude of image formats and styles while performing all the text detection, RoI enhancements and text cleaning processes under the hood.

The Computer Vision API produces the following text detection and bounding box information that can then be visualized over the original image.
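For reference, a minimal sketch of calling the OCR endpoint with the Python requests library is shown below. The endpoint version, region and subscription key are placeholders and should be checked against the current Computer Vision API documentation.

```python
import requests

SUBSCRIPTION_KEY = "<your-subscription-key>"
OCR_URL = "https://westus.api.cognitive.microsoft.com/vision/v2.0/ocr"

with open("sign_crop.jpg", "rb") as f:
    image_bytes = f.read()

response = requests.post(
    OCR_URL,
    headers={
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        "Content-Type": "application/octet-stream",
    },
    params={"language": "en", "detectOrientation": "true"},
    data=image_bytes,
)
response.raise_for_status()
result = response.json()

# Flatten the regions -> lines -> words hierarchy into (text, boundingBox) pairs
detections = [
    (word["text"], word["boundingBox"])
    for region in result.get("regions", [])
    for line in region["lines"]
    for word in line["words"]
]
print(detections)
```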

However, as great as this API is, its performance nonetheless falls short on low-resolution images, and it doesn’t quite stack up against human vision, especially in cases where text detection is aided by environmental context, a domain where human vision excels. This is when we have to go the extra mile to squeeze the remaining bits of information from these images.

[Top] Original. [Middle] Output from super-resolution. [Bottom] Output from waifu2x

Super-resolution is the standard technique for situations where high-resolution (which does not automatically mean high-information-content) inputs are called for. We have carried out some intriguing applications of super-resolution in the past to upsample images for building damage classification and vehicle detection. Though super-resolution excels in the domain of photo enhancement, when we consider text images or scanned documents, we must strive for image clarity and sharpness rather than fidelity to the underlying distribution of ground-truth pixels, which might otherwise produce a more realistic but blurrier image. With these constraints in place, it makes more sense to use traditional methods of image upsampling such as 2D nearest-neighbor, bilinear or bicubic interpolation. However, it should be noted that there is an amazing library out there called (I kid you not) waifu2x that performs super-resolution on anime-styled images and happens to work great for road signs. In fact, this website lets you input your own images and see the end result after being waifu2x’d, and it’s a pretty good way to spend a Friday afternoon if you have nothing better to do.
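As a quick sketch of the traditional route, here is how a cropped sign might be upsampled with OpenCV before being sent to OCR; the file path and scale factor are placeholders.

```python
import cv2

# Compare traditional upsampling methods on a cropped sign
crop = cv2.imread("sign_crop.jpg")
scale = 4

methods = {
    "nearest": cv2.INTER_NEAREST,
    "bilinear": cv2.INTER_LINEAR,
    "bicubic": cv2.INTER_CUBIC,
}
for name, interpolation in methods.items():
    upsampled = cv2.resize(crop, None, fx=scale, fy=scale,
                           interpolation=interpolation)
    cv2.imwrite(f"sign_crop_{name}_x{scale}.jpg", upsampled)
```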

The GIS twist that brings everything together: GeoReferencing

Up until this point, what we have built does not constitute a geo-enabled solution. If not utilizing all your available data to the fullest is a crime in data science, then ignoring the GIS elements of MMS data should be considered a cardinal sin, given the vast amount of geo-metadata embedded within each image. For the AZGeo dataset, the following information is contained within each image.

[Top] Metadata associated with each image. Noteworthy attributes are ImageCount (unique ID), Longitude, Latitude and Bearing. [Bottom] Camera data. Note the Heading, Roll and Pitch.

The lat/long coordinates are measured by GPS with a precision of 12 mm (though not necessarily accurate to within 12 mm of the target). This is all well and fine save for the thorny problem that these coordinates are not representative of the true lat/longs of objects or points of interest (PoI) contained within the image. This is not an easy problem to tackle, nor would a working solution ever be trivial, because the most crucial piece of information is missing: how far a PoI is from the camera. Knowing this, one could easily arrive at an algorithmic solution. However, even human observers can sometimes find it difficult to ascertain exactly how far away something is from the camera.

One thing we humans do have going for us, though, is a built-in prior that shines a little context on the problem. We know a sign is roughly x meters away because we have a rough idea of how large road signs are in real life. However, without a sense of scale guiding our judgement, as in the picture below, even we would be at a loss.

How big is this rock? Can you tell without context?

This insight into how we humans judge distance gives us a starting point for tackling the problem of distance inference, if only we could determine how large an object is in physical space. A naive approach should be immediately obvious: simply classify signage to a reasonable degree of granularity, and then assign a physical size to each sign class. Fortunately, the Federal Highway Administration has made publicly available the Manual on Uniform Traffic Control Devices (MUTCD), which precisely delineates the dimensions of each class of signs. The downside is that the regulations on the dimensions of certain text signs, such as destination signs (D1–3) and jurisdictional boundary signs (I-2), only specify a range or are only semi-bounded, with no upper limit on size.

Nonetheless this gives us something to work with, and we can now quite accurately determine the physical sizes of non-text signs such as warning signs, speed limit signs, state route markers and mile markers. With this information, the distance of the sign from the camera can easily be calculated using similar triangles:

How to infer an object’s distance from the camera (F) using the focal length (f) and the size of the image on the CCD sensor. Once this distance is found, simply figure out the angle between the object and the image bisector to obtain the information needed to infer the object’s lat/long.

Finally, we must also take into account auxiliary information such as the bearing of the camera, which is made up of the bearing of the vehicle and the camera yaw. We will, for now, not consider other parameters such as camera pitch and roll and how they would contribute to changes in inferred distance.
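Putting the similar-triangles distance estimate and the camera bearing together, a rough sketch of the geotagging step might look like the following. The focal length and pixel pitch are hypothetical camera parameters, not values taken from the AZGeo metadata; in practice they come from the MMS camera calibration.

```python
import math

def infer_object_latlon(cam_lat, cam_lon, cam_bearing_deg,
                        bbox, sign_height_m,
                        focal_length_mm=6.0, pixel_pitch_mm=0.006,
                        image_width_px=640):
    """Estimate an object's lat/long from a geotagged frame.

    bbox is (xmin, ymin, xmax, ymax) in pixels; sign_height_m is the physical
    sign height looked up from its MUTCD class.
    """
    xmin, ymin, xmax, ymax = bbox

    # Similar triangles: distance = real height * focal length / height on sensor
    height_on_sensor_mm = (ymax - ymin) * pixel_pitch_mm
    distance_m = sign_height_m * focal_length_mm / height_on_sensor_mm

    # Horizontal angle between the sign centre and the image bisector
    offset_px = (xmin + xmax) / 2.0 - image_width_px / 2.0
    angle_deg = math.degrees(math.atan2(offset_px * pixel_pitch_mm, focal_length_mm))
    bearing = math.radians(cam_bearing_deg + angle_deg)

    # Move distance_m along that bearing on a spherical Earth
    R = 6371000.0
    lat1, lon1 = math.radians(cam_lat), math.radians(cam_lon)
    lat2 = math.asin(math.sin(lat1) * math.cos(distance_m / R) +
                     math.cos(lat1) * math.sin(distance_m / R) * math.cos(bearing))
    lon2 = lon1 + math.atan2(math.sin(bearing) * math.sin(distance_m / R) * math.cos(lat1),
                             math.cos(distance_m / R) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)
```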

More accurate distance inference with Deep Learning

The above approach to distance inference gave us a good starting point to improve upon, but it is by no means foolproof. Apart from the issue of non-deterministic physical sizes for select classes of signs, we also have the issue of perspective distortion should a sign get really close to the camera. What we need to figure out, then, is whether we can bake an additional prior understanding of size into the model itself. Where there’s a will, there is a way involving deep learning.

The idea of applying deep learning to distance inference is itself not very novel, but the approach taken by the monodepth library is a refreshing one: rather than train a convolutional neural net to minimize the loss between an inferred depth map and some kind of ground-truth map informed by LiDAR data, monodepth uses a dual-camera setup to capture data for training. Backpropagation in this case adjusts the network’s weights so that it is able to reproduce a right-lens image from a single left-lens input (and vice versa). The corollary is that the model builds up an intrinsic understanding of depth over time, and this is all achieved without any expensive data collection equipment.

monodepth output on one of our MMS images. The street sign is clearly visible, and has a uniform gradient, meaning monodepth has assigned it a fixed depth. There are some artifacts on the left because the model hasn’t been optimized on highways in the US.

Figuring out which pixel value corresponds to which distance is then a simple regression problem, and this can be done automatically on the testing set, where the signs have known locations.

The relationship between brightness and distance appears to be a linear one. A more accurate regression line can be obtained via transfer learning with local images captured from a dual-lens setup.
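A minimal sketch of that calibration step, assuming we have collected (depth-map pixel value, ground-truth distance) pairs from the fixed testing set; the numbers below are placeholders, not measurements from the AZGeo data.

```python
import numpy as np

# Mean monodepth output inside each sign's bbox, and the distance computed
# from the sign's surveyed location (placeholder values)
pixel_values = np.array([212.0, 180.0, 150.0, 95.0, 60.0])
true_distances = np.array([8.0, 14.0, 21.0, 38.0, 55.0])

# Least-squares fit of the linear mapping pixel value -> distance
slope, intercept = np.polyfit(pixel_values, true_distances, deg=1)

def distance_from_pixel(value):
    """Predict distance (m) from a monodepth pixel value using the fitted line."""
    return slope * value + intercept

print(distance_from_pixel(120.0))
```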

After integrating this ML solution for distance inference into our programmatic method, we are now able to determine an object’s geocoordinates from a geotagged image with much better precision. Neat!

Metrics, Metrics, Metrics

Of the five design aspects that define a machine learning workflow (data, model architecture, loss function, optimization method and performance measure), the performance measure is the most commonly overlooked yet most crucial facet. Throwing together an ML model without designing appropriate metrics would be like building a car and calling it a finished product without ever turning the key to see if it runs. The issue is that most developers are too lazy to test their models against a set of standard benchmarks, or to come up with custom metrics from the outset when standard performance measures are clearly insufficient.

To do this properly, I have created a separate metric for each of the three outputs of the model: lat/long inference, OCR and detection/classification. But before any of that, we first need to designate a set of testing data against which future benchmark tests are carried out. There are two advantages to keeping a fixed test set. First, we can be assured that all non-systematic errors are eliminated when performing model comparisons. Second, depending on where we wish to deploy the model, we can manually incorporate images from that deployment area to assess the model’s ability to generalize to foreign environments. The only downside to a fixed testing set is that the lat/longs of PoIs have to be manually assigned, which we painstakingly did over a period of two hours. As it stands, the current testing set consists of 43 images, which does not sound like a lot but is plenty to illustrate the current level of model performance.

Lat/Long Inference
Two pieces of data are collected to measure the accuracy of lat/long inference. The first is the L2 standard deviation of inferred locations. This lets us understand the spread of inferred coordinates and is something we wish to reduce over time. The other is the probability distribution of the bearing of line AB, where A is the ground-truth location and B is the inferred location. Ideally, the bearing should follow a uniform distribution, indicating a lack of systematic errors, which would otherwise manifest as clusters around certain bearing values.
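A sketch of how both quantities can be computed per sign, assuming inferred and ground-truth coordinates in decimal degrees:

```python
import math

def error_metrics(true_latlon, pred_latlon):
    """Great-circle error distance (m) and bearing of line AB (A = truth, B = prediction)."""
    lat1, lon1 = map(math.radians, true_latlon)
    lat2, lon2 = map(math.radians, pred_latlon)
    dlat, dlon = lat2 - lat1, lon2 - lon1

    # Haversine distance between truth and prediction
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    dist_m = 2 * 6371000.0 * math.asin(math.sqrt(a))

    # Bearing from A to B in degrees clockwise from north
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    bearing = (math.degrees(math.atan2(y, x)) + 360.0) % 360.0
    return dist_m, bearing
```

Aggregating the distances gives the spread statistic, while a histogram of the bearings reveals any directional bias.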

OCR
Two metrics are used to assess OCR accuracy: the number of words detected (each associated with the inferred signage distance) and the percentage of correctly identified words across all detections (based on a similarity index produced by Python’s SequenceMatcher). The ground-truth data in this case are manually entered into a Python dict. The implicit rule I used when writing these strings is that I must not spend more than a second interpreting the text on a sign, and that I write an empty ground-truth string if I deem the text indecipherable. Nonetheless, in the case of OCR, the model did not produce very accurate outputs for the few signs that contained any text at all.
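A minimal sketch of the similarity-based word metric, assuming ground-truth strings keyed by image ID; the example entries below are hypothetical, not values from the actual testing set.

```python
from difflib import SequenceMatcher

# Hypothetical ground-truth dict; an empty string marks text judged indecipherable
ground_truth = {
    "I15_001234": "LITTLEFIELD EXIT 8",
    "I15_001350": "",
}

def ocr_similarity(image_id, detected_text):
    truth = ground_truth.get(image_id, "")
    if not truth:
        return None  # no usable ground truth for this sign
    return SequenceMatcher(None, truth.upper(), detected_text.upper()).ratio()

print(ocr_similarity("I15_001234", "LITTLEF1ELD EXIT 8"))
```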

Some particularly difficult OCR metric images. If it takes you more than a second to figure out what’s written, it’s probably too difficult for current methods to solve.

Detection/Classification
This is one of the few cases where a standard metric does the job better than any metric you can concoct. Since the detection model consists of Faster R-CNN or YOLOv3, the recommended performance measure is mean average precision (mAP). A detailed explanation of what this entails is given here.
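For reference, here is a hedged sketch of how per-class average precision can be computed; it uses a simplified (non-interpolated) area under the precision-recall curve rather than the exact VOC interpolation, and mAP is simply the mean of this value over all sign classes.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (xmin, ymin, xmax, ymax) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def average_precision(detections, ground_truths, iou_thresh=0.5):
    """detections: list of (image_id, score, box) for one class.
    ground_truths: dict mapping image_id -> list of ground-truth boxes for that class."""
    n_gt = sum(len(boxes) for boxes in ground_truths.values())
    matched = {img: [False] * len(boxes) for img, boxes in ground_truths.items()}
    tps, fps = [], []
    # Sweep detections from most to least confident
    for image_id, score, box in sorted(detections, key=lambda d: -d[1]):
        gts = ground_truths.get(image_id, [])
        ious = [iou(box, gt) for gt in gts]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thresh and not matched[image_id][best]:
            matched[image_id][best] = True
            tps.append(1)
            fps.append(0)
        else:
            tps.append(0)
            fps.append(1)
    tp, fp = np.cumsum(tps), np.cumsum(fps)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    return float(np.trapz(precision, recall))
```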

Traditional metrics for VOC-style detection tasks vs. new metrics for current use case. Of course everyone is free to come up with whatever performance measure they see fit, so long as these metrics make sense in the context of the problem.

Integrating into the ArcGIS framework

The model in its current iteration is capable of consuming mass amounts of raw data and converting them into useful features. It is, however, not a full-fledged solution, because there are no mechanisms in place for an end-user to easily consume these features and derive actionable insights from them. This is where an end-to-end ArcGIS solution offers the user a decisive advantage, through offerings such as real-time status monitoring using Operations Dashboard; post-processing techniques such as pattern analysis, spatial relationship modelling and clustering methods exposed through various spatial statistics toolsets; and ground crew management and coordination platforms like Workforce or Collector.

On the topic of post-processing, we realized that a pitfall of using MMS data is that unique objects tend to get detected multiple times across different camera frames as the MMS vehicle approaches the PoI. This would not be an issue if lat/long inference and object classification could be carried out without error; however, this is typically not the case.

Using Spatially Constrained Multivariate Clustering to cluster signs by class_id in a given radius, then applying central feature to obtain unique cluster centroids. [Blue] Original sign detections. [Red] Signs that have been clustered.

Spatially Constrained Multivariate Clustering is a great ArcGIS Pro geoprocessing tool (also available as an ArcPy function) that solves our problem. The tool lets you specify which layer attribute to constrain on. In our case, we want to constrain clustering to only those points that share the same class label, and to only apply clustering within a Euclidean-distance radius around each point.
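For readers without ArcGIS Pro at hand, the same deduplication idea can be sketched with scikit-learn's DBSCAN; this is a stand-in for illustration, not the Spatially Constrained Multivariate Clustering tool itself, and the detections list below is hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical detections with inferred coordinates and predicted class
detections = [
    {"lat": 36.8920, "lon": -113.9240, "class_id": "mile_marker"},
    {"lat": 36.8921, "lon": -113.9241, "class_id": "mile_marker"},
    {"lat": 36.9050, "lon": -113.9400, "class_id": "speed_limit"},
]

RADIUS_M = 30.0       # detections of the same class within this radius are merged
EARTH_R = 6371000.0
deduplicated = []

for class_id in {d["class_id"] for d in detections}:
    pts = np.radians([[d["lat"], d["lon"]] for d in detections if d["class_id"] == class_id])
    # haversine metric expects radians; eps is the radius expressed in radians
    labels = DBSCAN(eps=RADIUS_M / EARTH_R, min_samples=1,
                    metric="haversine").fit_predict(pts)
    for label in set(labels):
        centroid = np.degrees(pts[labels == label].mean(axis=0))
        deduplicated.append({"class_id": class_id,
                             "lat": float(centroid[0]),
                             "lon": float(centroid[1])})

print(deduplicated)
```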

Operations Dashboard for ArcGIS

ArcGIS Ops Dashboard reveals insights at a glance

The Ops Dashboard for ArcGIS allows the user to use charts, gauges, maps and other visual elements to monitor geo-enabled assets. For the purpose of this demo, we partnered with ADOT to make a Dashboard to illustrate the detected road sign inventory along Interstate 15 NB in Arizona.

Ops Dashboard can also be used with Collector to reflect ground crew assessment of asset conditions in real time.

Collector for ArcGIS

The integration with Collector for ArcGIS brings mobile field capability to the workflow. We imported the sign detection feature layer into a web map and added relevant attribute fields to the dataset. Crew members can add information to the dataset and verify the machine learning outputs in the field. In this instance, we wanted to verify the location of the georeferenced points and the validity of the sign classification, so we added those fields. We used coded domains to provide field workers with a pre-determined drop-down list of attributes. Choosing from a drop-down list prevents errors introduced by fat-finger typing and ensures better data quality. Additionally, field workers can use voice-to-text capabilities to report issues with detection, sign damage, or changes to the dataset, which also avoids fat-finger errors and saves time. Another benefit of the application is its offline capability, which allows offline changes to be synced to the dataset when service is restored. This workflow streamlines the process of updating, editing, deleting, and adding points in the field.

Additional fields added to the feature layer for the Collector for ArcGIS field capabilities.

These field updates were monitored in near real-time using an Operations Dashboard, as shown above.

Field editing was enabled as well as the ability to sync data and download map areas for offline data collection and updates.

The Collector for ArcGIS app allows the end-user to view the photos of each sign, as well as the location, and data gathered from the machine learning techniques above.

Future Work

This demo project is very much a work-in-progress, and as such, there are elements of functionality which we wanted to integrate into this solution but have not had the time to do so. Expect an updated blog in short order. In the meantime, we will continue to seek innovative ways to enhance our model and drive for better accuracy.

Full disclaimer: To the maximum extent permitted by law, the liability of ADOT, its employees, officials, and agents shall not include liability for lost profits, indirect, incidental, special, punitive or consequential damages, or claims from third parties to the public, or any loss of business, revenue or data, whether based upon a claim or action of tort, contract, warranty, negligence, strict liability, breach of statutory duty, or any other legal theory or cause of action, even if advised of the possibility of such damages. In addition to the disclaimers above, to the fullest extent permitted by law, any user of the AZGeo data portal agrees to release, and hold harmless the State of Arizona, its employees, officers, officials, and agents from and against any and all claims, actions, liabilities, damages, losses, or expenses (including court costs, attorney’s fees, and costs of claim processing, investigation and litigation) arising out of the performance of this agreement. Although the Arizona Department of Transportation’s data in AZGeo has been produced from sources believed to be reliable, no warranty expressed or implied is made regarding accuracy, adequacy, completeness, legality, reliability or usefulness of any information. This disclaimer applies to both isolated and aggregate uses of the information. The Arizona Department of Transportation provides this information on an “AS IS” basis. All warranties of any kind, express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, freedom from contamination by computer viruses and non-infringement of proprietary rights are disclaimed. Changes may be periodically added to the information herein; these changes may or may not be incorporated in any new version of the publication. If the user has obtained information from the AZGeo webpage from a source other than the AZGeo Home Page, the user must be aware that electronic data can be altered subsequent to original distribution. Data can also quickly become out-of-date. It is recommended that the user pay careful attention to the contents of any metadata associated with a file, and that the originator (MPD GIS) of the data information be contacted with any questions regarding appropriate use. If the user finds any errors or omissions, we encourage the user to report them to the Arizona Department of Transportation GIS team at MPDGIS@azdot.gov.
