Machine Learning for Road Condition Analysis Part 3: “No-surrender Deep Learning”

Bertrand Perrat
Frontier Tech Hub
15 min read · Jul 1, 2020

The Z-roads team had faced a host of data challenges in preparing for modelling road conditions (covered in near painful levels of detail in the previous blog, so as to relay to the reader a sense of the weekly torment the team felt!). First there was a long battle just to get hold of the project’s key input datasets, and then the deluge of problems followed with the drone imagery itself: blurring, overexposure, misaligned tiles, inconsistent warping, censored geographies, network data that didn’t match reality — each week brought a brand new hornets’ nest of issues.

Challenges like these have to be expected in AI projects in international development contexts. And we were engaged in a process of no-surrender machine learning after all. Giving up wasn’t an option, and after hundreds of hours spent generating creative solutions to flaws in the data, we finally got there. Yes, we were down from 70,000 patches to just 5,000 with which to train our AI, but they contained valid pixels, labelled with carefully surveyed ground truths. We were ready for the project’s deep learning stages.

Convolutional Deep Learning for Road Conditions

As with most of our lab’s projects, the modelling undertaken during Z-Roads was designed in two stages. Stage 1 consisted of our exploratory work, with experiments undertaken in parallel to our data prep. It’s in this stage that we learn more about the nature of the data and the effect of its distributions, quirks and idiosyncrasies. We tend to only use rapid (and relatively rudimentary) machine learning pipelines at this point. These rough and ready approaches can be run quickly, with the initial results they output throwing a lens onto the issues that might arise. In fact, it was through these experiments that many of the key data challenges detailed in the last blog were detected. These pilot stages are crucial to ensuring our Stage 2 analysis is cogent — and this is where the project really puts AI models through their paces computationally.

Before diving into Stage 2, let’s remind ourselves of the task at hand:

Investigation of whether edge Artificial Intelligence techniques are capable of robust automated road condition analysis using UAV imagery.

To assess this we used 8-cm resolution imagery that could be paired with road condition data collected from sensors (see the first blog for how we obtained these). This allowed us to formulate a traditional supervised machine learning task, where UAV imagery segments of roads are fed to the classifier alongside the surveyed IRI (International Roughness Index — a measure of road smoothness) measurements during the training phase. So: a well-defined problem.

But there are many ways of attacking this challenge. When applying “AI” (a term no one who does AI would ever use outside of blogs!), there are in fact a plethora of models to pick from. For image analysis, the state of the art remains a class of deep learning models called Convolutional Neural Networks (CNNs), to which we restricted our analyses due to their known, excellent performance on similar tasks (and their increasing use in analysing Earth Observation data sources). Luckily we don’t have to build such networks from scratch…

You don’t have to start from nothing — Transfer learning

A convolutional neural network consists of an input layer (into which you feed your image pixels) and an output layer (a “probabilistic” road condition class prediction), as well as multiple hidden layers that act to map the relationship between the observable features and the probabilistic classification in a complex and non-linear way. Despite a huge number of moving parts (i.e. parameters), CNNs are extremely effective, due in no small part to advances over the last decade in training techniques. However, you often don’t want to start building a CNN from scratch, due to the myriad ways that they can be put together. The multiple hidden layers of a CNN typically consist of a series of what are known as “convolutional layers”, all interlinked with each other. The exact operations performed by each layer, and how these layers interact, define the network’s “architecture”. And the space of possible architectures is vast.
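To make that structure concrete, here is a minimal sketch of a tiny CNN classifier: pixels in, convolutional hidden layers in the middle, class scores out. It is written in PyTorch purely for illustration (the post doesn’t name a framework), and it is much smaller than the architectures discussed below.

```python
# A minimal, illustrative CNN classifier in PyTorch: image pixels in,
# convolutional hidden layers in the middle, class scores out.
# This is NOT the Z-Roads architecture (that was a pretrained ResNet);
# the framework choice and layer sizes here are assumptions for illustration.
import torch
import torch.nn as nn

class TinyRoadCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        # Hidden "convolutional layers": each extracts increasingly abstract features
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Output layer: raw class scores, turned into "probabilities" via softmax
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)        # (batch, 64, 1, 1)
        x = torch.flatten(x, 1)     # (batch, 64)
        return self.classifier(x)   # (batch, n_classes) logits

# Example: class probabilities for a batch of 224x224 RGB patches
# probs = torch.softmax(TinyRoadCNN()(torch.rand(4, 3, 224, 224)), dim=1)
```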

So researchers will often leverage “transfer” learning, and lean on architectures which have already been discovered and are known to work well — and then adapt them for the task at hand. The Z-Roads project took this approach, assessing two different state-of-the-art architectures — VGG and ResNet — and then loosening their constraints. Stage 1 trials revealed that our most promising architecture was ResNet, and so the rest of this post is about our experiences putting it through its paces!

Drone imagery as a CNN input

Feeding drone imagery to a CNN is not always easy. A drone traditionally captures multiple overlapping images of a region during its flight, which are then “stitched” together to obtain a global representation of the area flown over. The result is a set of very large, continuous images, or “tiles”, covering dozens of square kilometres, if not more. Tiles are far too large to be ingested by a CNN at their native resolution (especially with drones now achieving spatial resolutions on the order of centimetres per pixel!). In fact, CNNs were designed against a setting of relatively low resolution images, often of handwriting (specifically numbers), or endless pictures of cats and dogs (and for some reason benches). So the space for pixels in input images is generally pretty small. This already clashes with the incredibly high-resolution data you get from UAVs.

But we had an even more involved challenge. We only cared about roads. We simply didn’t want most of the information sitting in the tiles — non-road pixels in our data (beautiful as they were) were somewhat redundant. This is why part of our data preparation was to reduce the size of the imagery to be fed to our models, extracting only road segments in small square patches. We had about 5,500 of these, all with attached IRI measurements and ready to use in training our classifier. This was far fewer than the 70,000 we were planning (most of which did not pass quality control), but still enough to take a run at modelling.
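As an illustration of what this patch extraction looks like in practice, the hedged sketch below uses rasterio to read an 18m x 18m window around a road sample point from an orthomosaic tile. The file name, the coordinates and the assumption that road points are available in the tile’s projected CRS are all illustrative, not the project’s actual pipeline.

```python
# Hedged sketch of patch extraction with rasterio: read an 18 m x 18 m window
# centred on a road sample point from a large orthomosaic tile.
# The file name, coordinates and the existence of a list of road points in the
# tile's projected CRS are all illustrative assumptions, not the project's code.
import rasterio
from rasterio.windows import from_bounds

PATCH_METRES = 18  # spatial coverage of each square patch

def extract_patch(src, x, y, size_m=PATCH_METRES):
    """Read a size_m x size_m window centred on projected coordinates (x, y)."""
    half = size_m / 2.0
    window = from_bounds(x - half, y - half, x + half, y + half,
                         transform=src.transform)
    # boundless read pads with zeros if the window falls off the tile edge
    return src.read(window=window, boundless=True, fill_value=0)  # (bands, H, W)

with rasterio.open("zanzibar_tile.tif") as src:           # hypothetical tile
    road_points = [(538200.0, 9345100.0)]                  # hypothetical coords
    patches = [extract_patch(src, x, y) for x, y in road_points]
```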

Unnecessary contextual information

But even our small squares still had a lot of noise in them. Each patch we extracted had a spatial coverage of 18 x 18 metres, centred on a road segment (which we’d ensured in our filtering). These inputs simply had to be “square” due to the requirements of CNN classifiers, and this meant that even though patches contained road surface data, they still contained a significant amount of non-road pixels — vegetation, houses, cars, water, benches (so they are still in there, if without the cats and dogs!). And this was an issue.

Examples of the different road conditions from the Zanzibar Region.

The problem is that non-road data provides a way for a classifier to potentially “cheat” — it can learn to recognize non-road information (e.g. the quality of surrounding buildings), and from that surrounding context guess the surface quality of adjacent roads. This is highly detrimental to the generalizability of any models produced. Just because a segment is in an urban area, where roads are traditionally better, it doesn’t mean this particular road patch will have a good surface (and vice versa in rural areas). The classifier mustn’t just “assume” this, or its functionality is undermined — it’s the road surface pixels themselves that must count.

One way to address this issue would have been to cut the roads themselves out, eliminating any surrounding “contextual” pixels. But given we’d already had to painfully trace 700km of roads by hand (see the last blog for the reason!), the last thing we had time to do now was outline them too. We tested buffering around the road vectors we’d produced, but variability in road widths put an end to this. And we thought of creating more AI to detect road pixels — but again this would require a whole new project extension, so it was impractical (while road detection on paved roads has made some recent advances, detecting dirt tracks surrounded by barren wilderness, with indeterminate edges, certainly has not!).

So with an already compressed timeline, we opted to reduce patch size, to ensure that as much of each patch was covered by road pixels as possible. After empirical testing, a reduced patch size of 144x144 (~10m x 10m) was chosen, to ensure a good trade-off between road and non-road pixels in each patch. Our AI was going to be an expert in close ups!
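The cropping step itself is simple; a minimal sketch (assuming patches are stored as NumPy arrays in band-first order, which is an assumption on our part) might look like this:

```python
# Minimal sketch of the centre-crop: keep only the central 144x144 pixels of a
# patch so that road pixels dominate. Assumes patches are NumPy arrays in
# (bands, height, width) order.
import numpy as np

def centre_crop(patch: np.ndarray, size: int = 144) -> np.ndarray:
    _, h, w = patch.shape
    top, left = (h - size) // 2, (w - size) // 2
    return patch[:, top:top + size, left:left + size]
```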

The dangers of spatial correlation

Ok, so we had our data and we had our model class. But this isn’t enough of course — a crucial part of investigating AI is working out an evaluation strategy. It’s easy to get a prediction out of a model — but what is really important is how right or wrong that prediction is against some measure of “correct”. And certainly you mustn’t evaluate models just on their in-sample “fit”.

To be honest, this is really where “AI” research differs from traditional statistics (and why it works so well in the real world) — we tend to obsess about how well our models can predict new, unseen, data rather than the data we’ve seen before. “Generalisable” predictive performance is paramount to us.

We strive to achieve this by splitting our data into three independent sets: first a set of training data (used to fit an AI model’s internal parameters); then a set of validation data (to help tune overarching “hyper-parameters”); then finally a completely separate test set (to see if the final model produced stands up to new data, as it will have to in the real world).

Often people will use a “random sampling” technique to allocate data into each set (in our case, allocating each of the 5,443 road patches to one of these sets at random). Z-Roads did this initially in its pilot stages, due to its simplicity. But it would have been a mistake in Stage 2, due to the dangers of spatial correlation, which are often overlooked in geospatial AI.

The key problem would have been that, through random sampling, 10m patches from the same road would be spread across the training and test sets. This would be disastrous, as those sets would no longer be independent. The AI would be able to “cheat”, recognizing not a road surface, but the actual road itself (based on width, road characteristics, surrounding context, etc.). This sounds unlikely perhaps, but our Stage 1 tests showed it absolutely was happening. AI models, rather than assessing a patch’s own road surface, would infer information almost like an atlas lookup.

This highlights the importance of having spatial independence between patches from the training set and the testing set. Without careful consideration, in geospatial-based AI, information can easily “leak” out by accident, and give a machine learning algorithm an unfair advantage.

In order to ensure this ‘geographically biased’ sampling did not happen, Z-Roads adopted a region-based sampling scheme, where full regions were randomly selected to appear in either the test, validation or training sets. Figure 1 shows the regions that were created for this purpose. For each training run of our modelling process, road segments in 3 regions were partitioned off to form a geospatially independent test set, to ensure experimental results would correspond to real world performance. Now our AI would have to look at actual road quality itself.
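A minimal sketch of such a region-based split is shown below, assuming each patch has already been assigned a region ID from the Figure 1 grid (the function name and the use of NumPy are illustrative, not the project’s code):

```python
# Illustrative region-based split: whole geo-regions are held out for testing so
# that patches from the same road never appear in both training and test sets.
# Assumes each patch already carries a region ID from the Figure 1 grid.
import numpy as np

rng = np.random.default_rng(42)

def region_holdout(region_ids, n_test_regions=3):
    """Return boolean (train_mask, test_mask) arrays over the patches."""
    regions = np.unique(region_ids)
    test_regions = rng.choice(regions, size=n_test_regions, replace=False)
    test_mask = np.isin(region_ids, test_regions)
    return ~test_mask, test_mask

# region_ids = np.array([...])  # one region label (1-25) per road patch
# train_mask, test_mask = region_holdout(region_ids)
```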

Figure 1: Breakdown of the 25 geo-regions created to ensure we had spatial independence in our test and training sets.

Designing new Road Classes for International Development

You also have to decide what your AI is going to predict. We could have done “regression” (predicting a continuous value rather than a class) and tried to predict IRI measures in our case. But IRI values are neither easy to assess performance against nor reflective of how a system would be used in practice (engineers tend to use classes to describe roads, ranging from A: Very Good to E: Very Poor). We also focused our analysis on unpaved road classes, which really suffer during rainy seasons and are of much greater interest to the Department of Roads.

So how should we slot our IRI sensor scores for each road into a class? We tried the on-board bump integrator classes, and these turned out to be useless. They simply weren’t designed for East Africa, and class all unpaved roads as… well, simply “very poor” (or category “E”). This is not helpful to someone trying to delineate unpaved road conditions! This can be seen in Figure 2 below, which shows that while paved roads have a reasonable spread across classes (the grey distribution), unpaved roads (the red distribution) are almost all caught in a single category despite having far higher variation! Useless.

Fig 2. The traditional road classification system used has 5 categories, “A” (Very Good) to “E” (Very Poor). As can be seen in the above diagram, almost all Zanzibar unpaved roads (red) have a Bump Integrator IRI score (BMI) that places them into just one category — providing zero information to either AI or decision makers. Thus the Z-Roads project generated new road classification categories in partnership with stakeholders, which are now in use with the Department of Roads.

This issue reflects another of the data challenges that hit the project. And while it wasn’t detailed in the previous blog (given how many other issues there were), the project had to work intensively with the Zanzibar Department of Roads to design a new, extended “unpaved” road class schema just to get going — as shown in Figure 3 below.

Fig 3. The Z-Roads project generated new road classification categories in partnership with our in-country stakeholders, which were used as predictive outputs for our AI, but also in software now at the Zanzibar Department of Roads.

This gave us 4 classes with which to categorize unpaved roads, with similar distributions of roads within each. Due to having less than a tenth of our expected training data, we considered these alongside a binary classification task. Our stakeholders cared most about whether an unpaved road was “ok” (categories D/E1), or whether it really was too “poor” and needed attention (categories E2/E3). An additional “unknown” category was also incorporated after practical discussions, based on the CNN classification confidence and indicating the potential to be in that fuzzy area between classes (we referred to this category as ‘Needs Review’, as this is exactly what engineers would do).
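For the binary task, the mapping from the extended classes is straightforward; a tiny illustrative sketch (the class names come from the text above, but the dictionary itself is an assumption about implementation):

```python
# Illustrative mapping from the extended unpaved classes to the binary task:
# D/E1 -> "OK", E2/E3 -> "Poor". Class names come from the text; the dictionary
# itself is an assumption about how this might be implemented.
CLASS_TO_BINARY = {"D": "OK", "E1": "OK", "E2": "Poor", "E3": "Poor"}

labels = ["D", "E2", "E1", "E3"]                      # example per-patch labels
binary_labels = [CLASS_TO_BINARY[c] for c in labels]  # ['OK', 'Poor', 'OK', 'Poor']
```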

Putting it all together

So our setup was finally framed. We now had 5,443 patches with associated labels that could be used to build our refined model. Patches were upsampled to 224 x 224 pixels (the minimum input size for the ResNet CNN architecture used) through bilinear interpolation, as otherwise (mathematically) the successive convolutions in the architecture wouldn’t be defined. To try and combat the much reduced size of our dataset, data augmentation was also used extensively to maximise the information contained in the set. Random combinations of flips, rotations and colour jitters were applied to our patches to expand our training population. Time to press the button.
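For readers wanting to reproduce this kind of preprocessing, the sketch below uses torchvision transforms to perform the bilinear resize to 224 x 224 plus random flips, rotations and colour jitter; the exact augmentation parameters are assumptions, not the project’s actual settings.

```python
# Hedged sketch of the preprocessing/augmentation pipeline using torchvision:
# bilinear upsampling of 144x144 patches to 224x224, plus random flips,
# rotations and colour jitter at training time. The exact augmentation
# parameters are illustrative assumptions, not the project's settings.
from torchvision import transforms

train_transform = transforms.Compose([
    # assumes patches are loaded as PIL images
    transforms.Resize((224, 224),
                      interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=90),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
])
```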

Transfer Learning

As discussed, due to the limited size of the dataset at hand, it wasn’t practical to train an entire deep network from scratch — so a transfer learning approach was leveraged. Our ResNet CNN models were therefore pretrained on a very large dataset (ImageNet in our case, which contains 1.2 million images across 1000 categories) as an initialization point for the task. As is standard practice with this kind of approach, our road patches were normalized to match the ImageNet dataset’s characteristics, a requirement of transfer learning approaches to maximise the efficacy of the pre-trained weights acting as feature extractors. The first 9 layers were kept frozen and acted as pre-learnt feature extractors from the ImageNet dataset. Ideally you would gradually unfreeze more (or all) of these layers to adjust the pre-trained weights to the specific road condition classification problem, but that would require significantly more example data.
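A hedged sketch of this kind of transfer-learning setup with torchvision is given below. The choice of resnet18, the ImageNet normalisation constants and exactly which layers are frozen are illustrative (the project froze the first 9 layers of its own ResNet variant); this shows the general recipe, not the project’s code.

```python
# Hedged transfer-learning sketch with torchvision: start from an ImageNet-
# pretrained ResNet, freeze the early layers as fixed feature extractors, and
# replace the final layer with a small road-condition head. The choice of
# resnet18 and of which blocks to freeze is illustrative (the project froze the
# first 9 layers of its own ResNet variant).
import torch.nn as nn
from torchvision import models, transforms

# Normalise inputs to ImageNet statistics so they match the pretrained weights
imagenet_norm = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze everything except the last residual block and the classifier head
for name, param in model.named_parameters():
    if not (name.startswith("layer4") or name.startswith("fc")):
        param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 2)  # binary head: "OK" vs "Poor"
```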

A spatial cross-validation strategy was employed, meaning that the dataset was divided into the different subsets described previously, guided by our spatial grid. These were then iteratively used as training (further broken into training and validation sets) or testing sets. The majority class was downsampled to produce a balanced dataset (the same number of images per road condition class), ensuring the model learnt to distinguish between all categories with equal importance. This additionally enabled the evaluation accuracy to be interpreted in a more straightforward and meaningful way — a random chance classifier would predict with 50% accuracy.
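The class-balancing step can be sketched as downsampling every class to the size of the smallest one; the pandas-based snippet below is an illustration under assumed column names, not the project’s implementation.

```python
# Sketch of the class-balancing step: downsample every class to the size of the
# smallest one so each road condition contributes equally during training.
# Column names and the use of pandas are assumptions for illustration.
import pandas as pd

def balance_by_downsampling(df: pd.DataFrame, label_col: str = "road_class",
                            seed: int = 42) -> pd.DataFrame:
    n_min = df[label_col].value_counts().min()
    return pd.concat(
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(label_col)
    )
```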

Assessing Results

After much iteration, and despite the immense data challenges faced, the remedial fixes required, and the limited and slightly blurry input data — the project was able to produce AI models that were 73.3% accurate at 90% confidence. This is highly encouraging, and testament to both the work of the Zanzibar Mapping Initiative and the Deep Learning community. We could have pushed accuracy higher, but only by lowering the confidence threshold. And with both more and better data, this score can only improve in the future!

What does all this mean? Well, first, the results presented were obtained by aggregating the results of our evaluation folds (so it is as though we ran our experiments several times), so we believe this is a robust estimate of real world performance. AI can indeed enter usage in road condition analysis.

Second, as CNNs belong to the family of probabilistic models, they not only output the class predicted but also provide a confidence level at which the prediction was made. This probabilistic output is key for any real-world application, and we selected a minimum threshold of 90%. This is a somewhat arbitrary choice, made in discussion with partners, as segments that didn’t meet this confidence entered the ‘Needs Review’ category. This isn’t ideal, as at a confidence level of 90%, over 20% of road patches had to be entered into this bucket. But it turns out that this might not be such a crucial issue. We note that this ‘uncertain’ set could be rapidly reduced to just 10% if we accept a reduction in overall system accuracy of only 2%. And it seems that as we get larger datasets of better imagery this category would dwindle anyhow. Moreover, with roads being far longer than our 10m classified patches, having some tiles in this category didn’t leave many roads without some form of classification. So for all practical purposes a high 90% confidence level was deemed applicable.
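Mechanically, the ‘Needs Review’ rule is just a threshold on the softmax confidence. A minimal sketch follows (the class names and function signature are assumptions for illustration):

```python
# Minimal sketch of the 90% confidence rule: softmax the model's outputs, accept
# the predicted class only if its probability clears the threshold, otherwise
# route the patch to "Needs Review". Class names and the function signature are
# assumptions for illustration.
import torch

CLASSES = ["OK", "Poor"]   # binary classes described above
THRESHOLD = 0.90           # minimum confidence agreed with partners

def classify_with_review(logits: torch.Tensor) -> list:
    probs = torch.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)
    return [CLASSES[p] if c >= THRESHOLD else "Needs Review"
            for p, c in zip(pred.tolist(), conf.tolist())]
```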

Third, these figures are presented in terms of “classification accuracy”, which describes the percentage of the time the prediction is correct when the model makes a prediction. There are multiple alternatives to this measure of efficacy, but classification accuracy is the simplest for a balanced dataset. However, future work might assess the actual ‘cost’ of making an incorrect prediction. For now, and with other scores such as F1 and ROC producing similar results, classification accuracy paints a good picture of the credibility of the AI approach.
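For reference, these metrics are one-liners in scikit-learn; the snippet below shows accuracy, F1 and ROC AUC computed on placeholder predictions (the arrays are dummy values, not project results).

```python
# Sketch of the evaluation metrics mentioned, computed with scikit-learn on
# placeholder values (these arrays are dummies, not project results).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array(["OK", "Poor", "Poor", "OK"])   # ground-truth classes
y_pred = np.array(["OK", "Poor", "OK", "OK"])     # model predictions
y_prob = np.array([0.12, 0.95, 0.40, 0.08])       # predicted P("Poor")

accuracy = accuracy_score(y_true, y_pred)          # fraction of correct predictions
f1 = f1_score(y_true, y_pred, pos_label="Poor")    # balance of precision and recall
auc = roc_auc_score(y_true == "Poor", y_prob)      # ranking quality of the scores
```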

Wrapping Up

The Z-Roads pilot’s final model evidenced 73.3% road-condition prediction accuracy, correctly identifying almost three-quarters of 10m road segments, using a filtered drone imagery dataset.

Importantly, we should see this 73.3% figure as our baseline result, because it represents a minimum level of effectiveness for AI models. This is far from the end of the story: rather than reflecting the overall effectiveness of AI models, it is a lower bound on what we can achieve in the future (limited only by poor data inputs). These remain encouraging results. Such a practically useful level of accuracy, at just 10m granularity, indicates that automated classification is both possible and can achieve sufficient accuracy to be of use.

An odd outcome of the project was, however, to recognize that we were asking the wrong question. AI can do road condition assessment… if it has good inputs. It is the challenge of collecting high quality drone imagery in East Africa that is the real stumbling block. It is costly, needs regular collection and requires much training on the part of drone pilots. If this can be addressed, and supplied imagery improves in quality, the accuracy of the AI will also increase. There is much potential for significant further improvements in accuracy.

But the bottom line is, even with an imperfect dataset, the project estimates a reduction in workload of 65.8% compared to manual checking/examination. Which has to be a good thing in a region with limited manpower and resources.

So there we have it. The Z-Roads project was quite a journey (and this blog isn’t the end of it — we still had to deliver some working systems!). But the project taught us that:

  1. AI can do automated road condition analysis using UAV imagery
  2. There are a huge number of potential bumps along the way (pun intended)

In particular, lesson 2 led us to our final system, reflecting the workflow needed both to train and to utilize an appropriate machine-learnt classifier, as shown in Figure 4 below:

Fig 4. The final Z-Roads AI workflow

Research wise we were done. However, we’d made important relationships with our partners in the Zanzibar Department of Roads and felt committed to giving them some software they could use. We wanted to ensure our work had actual impact — and that will be the topic of the final Z-Roads blog.


Bertrand is the Geospatial lead of N/LAB at the University of Nottingham, a centre of excellence for International Analytics, and a consultant at DECS.