NATO symbol recognition with neural networks

Published in CodeX · Jun 13, 2022

Erik Kõiv, Mihkel Lepson, Mateus Surrage Reis, Karl-Kristjan Kõverik

Supervisor: Ardi Tampuu

Github link to the code notebooks: https://github.com/MihkelLepson/NATO_symbols

Introduction

Everything is becoming faster. With the rush of the digital era and the internet some decades ago, fewer and fewer people ingest their daily news by reading a newspaper over a cup of coffee or fulfil their need for entertainment with a four-hour play at the theatre. And this tendency has not slowed as the years have gone on; nowadays we consume bits and bytes of information, with the vocabulary of the technological world worming its way into colloquial speech. We are clearly looking to convey information faster and faster, yet our own methods cannot keep up. Human speech relays information at a rate dwarfed even by the first modems: “Indeed, no matter how fast or slowly languages are spoken, they tend to transmit information at about the same rate: 39 bits per second” [1]. As common data rates reach the gigabit range, a question arises: how do we bridge that gap? And that question, asked by the civilian world, is asked no differently in the military, which is where this project looks to take its first steps toward answering it.

The project was indirectly commissioned by the Estonian National Defence Academy (ENDC). It is rooted in the everlasting need of the militaries and defence forces of the NATO alliance to achieve faster and clearer communication between and within the allies’ forces. For this purpose a set of symbols has already been established for drawing tactical plans [2]. These symbols are used to accurately convey information from one officer to another about the movements units will make, and they are essential to the success of operations and to minimising casualties.

Currently the plans are drawn, using this set of symbols, on plastic covers that are overlaid onto maps. Communicating these plans generally happens at meetings of officers, where everyone is brought up to speed by leaning over a table and looking at a usually dimly lit map. However, the Estonian National Defence Academy hopes to improve on that situation in this machine intelligence age.

Which is where our work began: we were to provide a partial machine ‘translation’ of these plans from symbols to words, using images. The initial task was to identify one symbol per image, but over the course of the project it also grew into localising several symbols in any image taken of the clear plastic overlay against a white background. This approach is viable because the plastic overlays themselves are codified so that they cannot easily be connected to any map; the plans would not be jeopardised should the data be intercepted, while understanding, digitising, and clearly communicating the plans in question becomes much easier.

The Data

Every machine learning project needs a good dataset to produce an acceptable result, and in many ways obtaining one is among the most difficult parts of machine learning: a good dataset gives the neural network the best material to learn from, much like a good set of lecture materials gives a student the best chance to prepare for an exam.

In our case, the task involved 15 types of symbols selected for us from the much wider array of NATO symbols, giving us 15 detection classes. For each of these classes we were given 10 examples drawn by cadets from the ENDC. As our initial goal was to classify each symbol separately in its own image, the initial 10 × 15 = 150 symbol dataset was created accordingly: each provided symbol as a separate image, with the class as its title.

Figure 1. An example cadet-drawn symbol

You, the reader, might have already noticed our first hurdle: we only had 10 examples per class to learn from. For comparison, the MNIST database of handwritten digits (which contains data that is in essence very similar to ours) has a training set of 60 000 examples and a test set of 10 000 examples [3], orders of magnitude more than what was available to us.

As the ENDC was unable to provide us with additional data, our first step was to go ahead and draw roughly 20 more examples of each class ourselves, reaching a dataset of about 450 unique symbols; not great, not terrible. This was done on white printer paper with a black marker, replicating the conditions of the provided dataset.

Figure 2. An example symbol from each class; class names added later.

Looking at the set of symbols together, we noticed another possible issue: several symbols, namely cover, screen, and guard, are distinguished from each other only by a single character. This gives the network very few features to learn from and could result in frequent confusion between the three.

After examining the data further for other potential problems, we identified image brightness, symbol rotation, and distortion due to camera angle as the factors we thought would most impact our results.

Methodology

After creating the ~450-symbol base dataset, our work continued by formulating clear problems and then solving them one by one (or, as it turned out, sometimes in parallel):

  1. How do we further improve our dataset?
  2. Can we use a simple model like KNN to solve the classification problem? (Since our data looked so similar to the handwritten digits dataset.)
  3. If the answer to 2 is ‘no’, then which neural network, from which library/framework, should we use?
  4. What is our final goal to work towards in the scope of this project?

Dataset Augmentation

In order to squeeze more data out of our limited dataset, and to diversify and unify it in terms of the previously mentioned parameters such as rotation and distortion, a collaborative augmentation tool was created that applies random limited resizing, rotation, line-thickness changes, and image flipping to an input image. This allowed us to create an even larger and more diverse dataset than simply drawing more symbols had given us.
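To give an idea of what the tool does, here is a minimal sketch of these transformations using OpenCV; the parameter ranges and kernel sizes are illustrative rather than the exact values we used.

```python
import random

import cv2
import numpy as np


def augment_symbol(img: np.ndarray) -> np.ndarray:
    """Apply random resizing, rotation, line-thickness change and flipping
    to a single grayscale symbol image (dark ink on a light background).
    The ranges below are illustrative, not the exact ones from our tool."""
    # Random limited resize
    h, w = img.shape[:2]
    scale = random.uniform(0.8, 1.2)
    img = cv2.resize(img, (int(w * scale), int(h * scale)))

    # Random rotation around the image centre, filling the border with white
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), random.uniform(-25, 25), 1.0)
    img = cv2.warpAffine(img, M, (w, h), borderValue=255)

    # Random line-thickness change: erosion thickens dark strokes,
    # dilation thins them (since the ink is darker than the background)
    kernel = np.ones((3, 3), np.uint8)
    if random.random() < 0.5:
        img = cv2.erode(img, kernel, iterations=random.randint(1, 2))
    else:
        img = cv2.dilate(img, kernel, iterations=1)

    # Random horizontal or vertical flip
    if random.random() < 0.5:
        img = cv2.flip(img, random.choice([0, 1]))

    return img
```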

Figure 3. Examples of rotating and adding line thickness to a symbol

K-Nearest Neighbours

Our first set of results came from multiple implementations of the KNN algorithm. Here each training image is treated as a point in a space with as many dimensions as it has pixels, and distances are calculated between the image being classified and the training images. So, for instance, an all-white image would be ‘close’ to a very bright image, but ‘far’ from an all-black image. You then check the classes of the closest neighbours in that space, and the class with the most nearby examples is the prediction.
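As a rough sketch of the idea (not our exact notebook code), scikit-learn's KNeighborsClassifier can be applied directly to flattened images; load_symbol_dataset below is a hypothetical helper standing in for our data-loading code.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X: (n_samples, 100*100) array of flattened grayscale symbol images,
# y: (n_samples,) array of class labels. Loading them is assumed here.
X, y = load_symbol_dataset()  # hypothetical helper

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y)

# Each image is a point in a 10 000-dimensional space; Euclidean distance
# between points stands in for visual similarity.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

print("accuracy:", knn.score(X_test, y_test))
```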

The greatest strength of this method is its simplicity. The accuracy achieved was quite low, though better than chance. Using a larger number of generated training examples improved the accuracy considerably, but given how few real examples we had, this was likely due to overfitting. In addition, it should be noted that generating more than 100 000 symbols comes with considerable computation time, both when generating the data and when applying the model.

Figure 4. KNN accuracy based on number of generated symbols.

Convolutional Neural Network

Seeing from this result that a simple KNN was not going to be the answer to all our problems (😞), the next approach was to create our own classification CNN. Since AlexNet’s [4] debut in 2012, the convolutional neural network has remained one of the most popular types of network for image recognition and detection tasks. To simplify, the main idea behind CNNs is a large set of ‘filters’, each of which captures a specific kind of image pattern; their outputs are then composed to decide what the full image is showing.

To train the CNN model, a training dataset was generated from the existing symbols. Rotation, flipping, skewing, and changes in stroke boldness were applied to all of the training symbols, and for some of them random pixels were flipped to the opposite colour. By applying these augmentations both systematically and randomly, we generated a dataset of 361 640 symbols. During training, 5% of the data was used for validation.

Before building the CNN’s architecture, we observed that the symbols consist mainly of primitive shapes such as lines, curves, circles, and arrows, and that they appear on a clear background, which eliminates a lot of noise. Due to the simplicity of the objects, we decided to look into an architecture with three hidden layers: the first two convolutional, and the last a fully connected one. For both convolutional layers the number of filters was set to 50; this seemed to work well and was not changed during training. The filter size was 9×9 pixels, which gave better results than smaller sizes: with a 3×3 filter the model was less accurate and also overfitted, showing a 3% gap between training and validation accuracy, whereas with the 9×9 filter there was no gap.
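A rough PyTorch sketch of an architecture matching this description is shown below. The two 9×9 convolutional layers with 50 filters each, the single fully connected hidden layer, the 100×100 grayscale input, and the 15 output classes follow the text; the pooling layers and the hidden-layer width of 256 are our assumptions for illustration.

```python
import torch
import torch.nn as nn


class SymbolCNN(nn.Module):
    """Sketch of the described architecture; the pooling layers and the
    hidden-layer width are assumptions, not taken from the original notebook."""

    def __init__(self, n_classes: int = 15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 50, kernel_size=9),   # 100x100 -> 92x92
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 92x92 -> 46x46
            nn.Conv2d(50, 50, kernel_size=9),  # 46x46 -> 38x38
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 38x38 -> 19x19
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 19 * 19, 256),      # fully connected hidden layer
            nn.ReLU(),
            nn.Linear(256, n_classes),         # one output per symbol class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```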

Figure 5. Full architecture of our CNN model, from 100x100 grayscale image to classification.

The hyperparameters used for training were:

learning rate = 0.0005, batch size = 64, epochs = 10
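Continuing the sketch above, a minimal training loop with these hyperparameters might look as follows; the choice of the Adam optimiser is an assumption, and train_dataset stands in for a dataset object yielding (1×100×100 tensor, class index) pairs.

```python
import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SymbolCNN().to(device)            # model class from the sketch above

optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
criterion = torch.nn.CrossEntropyLoss()

# train_dataset is assumed to yield (image tensor, class index) pairs
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

for epoch in range(10):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```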

From the training history of the model in Figure 6 we can see that the model achieves good accuracy fairly quickly and then keeps improving gradually. It is worth keeping in mind that both the training and validation data come from the same generation process, so the real accuracy of the model may be somewhat lower.

Figure 6. CNN training history.

Additionally, due to the very large number of symbols generated, overfitting could also be a possible reason for the very high accuracy, but this was not tested for. To see how well the CNN performs on new data, it was tested on 300 drawn symbols it had not seen before; it should be kept in mind that these 300 test symbols were similar to the training ones. The model classified 282 of the 300 test symbols correctly, giving an accuracy of 94%.

Figure 7. CNN confusion matrix.

The confusion matrix can be seen in Figure 7. We see that 11 of the 18 mistakes were confusions between cover and screen. These symbols are fairly similar: the only difference is that one contains the letter C and the other the letter S.

Figure 8. Examples of some filters extracted from our own CNN. Intuitively, you can consider that each filter is ‘looking for’ its learned pattern, and activates when that pattern is found in the image.

Next Stop: Terminus

By this point we had already used a large portion of our time on the project and had determined that, although our dataset was too small (or had some other unknown issue) to serve as a basis for something like KNN, we could reliably construct a CNN to classify the symbols, provided they were presented to the network one symbol per image. This is quite far from a real-life application of the network, but it would still fulfil what we consider a minimum viable result for the project.

However, during a meeting with our supervisor we discussed the possibility of using YOLOv5 [5] to both localise and classify the symbols on an image containing multiple symbols, bringing us much closer to a real-life use case. In the same meeting we decided this was what we would do with the rest of our time on the project, and devised a rough plan for creating a dataset for YOLO and training the network.

Creating A Dataset

Another method of creating a dataset is necessary because, instead of the single-symbol images we had used so far, YOLO trains on images containing multiple symbols. For this sort of dataset the generic approach is to take a batch of images containing the detection classes and label them manually (a huge pain) or with a tool such as Roboflow [6] (a bit less of a pain).

Since we already had a 450-symbol dataset, we wanted to leverage it, which meant somehow generating a YOLO dataset from combinations of those 450 unique images. Here we were in luck: the images all had a white or greyish background with a symbol drawn in black, allowing us to mosaic them together while knowing exactly where on each generated image the symbols would be placed.

This is exactly what the Generate a YOLOv5 dataset section of the YOLOtrainer notebook accomplishes (a simplified sketch follows the list):

  1. Each single-symbol image, along with its label, is loaded from a designated source folder that can have multiple subfolders.
  2. Adaptive thresholding is applied to each image, seeking to equalise the brightness.
  3. The loaded images are passed to a generator function, configurable for the size of the final image, the number of symbols on each image, etc.
  4. During generation, each symbol gets the previously mentioned augmentations: rotation, flipping, resizing, bolding, distortion.
  5. The final mosaic, along with data about the symbols on it, is returned.
  6. The image and symbol data are saved in YOLO format.
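A simplified sketch of such a generator is shown below (not the actual notebook code). It assumes grayscale uint8 symbol images, uses illustrative threshold and size parameters, and omits the per-symbol augmentation step for brevity.

```python
import random

import cv2
import numpy as np


def generate_mosaic(symbol_images, symbol_classes, canvas_size=640, n_symbols=6):
    """Paste thresholded single-symbol images onto a white canvas and return
    the mosaic plus YOLO-format annotation lines
    (class x_center y_center width height, all normalised to [0, 1])."""
    canvas = np.full((canvas_size, canvas_size), 255, dtype=np.uint8)
    annotations = []

    for _ in range(n_symbols):
        idx = random.randrange(len(symbol_images))
        # Adaptive thresholding evens out brightness differences between photos
        sym = cv2.adaptiveThreshold(symbol_images[idx], 255,
                                    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 51, 15)
        # (random rotation, flipping, resizing, bolding and distortion,
        #  as in the earlier augmentation sketch, would be applied here)
        sym = cv2.resize(sym, (120, 120))

        h, w = sym.shape
        x0 = random.randint(0, canvas_size - w)
        y0 = random.randint(0, canvas_size - h)
        # Keep only the darker (ink) pixels when pasting onto the white canvas
        region = canvas[y0:y0 + h, x0:x0 + w]
        canvas[y0:y0 + h, x0:x0 + w] = np.minimum(region, sym)

        annotations.append(
            f"{symbol_classes[idx]} "
            f"{(x0 + w / 2) / canvas_size:.6f} {(y0 + h / 2) / canvas_size:.6f} "
            f"{w / canvas_size:.6f} {h / canvas_size:.6f}")

    return canvas, annotations
```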

Figure 9. Early iteration example of generated multi-symbol image

Figure 9 illustrates the main challenge of the symbol mosaic generation system: the thresholding was somewhat finicky to get exactly right so that it would correctly handle images with a variety of brightnesses. Eventually we got it mostly right, as seen in Figure 10.

Figure 10. Example from final generated dataset

What is a YOLO?

YOLO is an acronym for You Only Look Once. It is currently in its fifth generation of active development and is generally known for its speed and accuracy. The available pre-trained weights are trained on the COCO dataset and boast great performance.

Figure 11. Detection results of YOLOv5 on the COCO dataset [5]

Being pre-trained on the COCO dataset was another green flag for using YOLO in our project, because the COCO dataset includes classes for traffic signs, which in terms of shape are somewhat similar to the symbols we were hoping to detect.

Figure 12. YOLO object detection process [7]

The YOLO network architecture consists of three main pieces:

  1. Backbone — A convolutional network forming image features at different granularities (different levels of detail).
  2. Neck — Mixing and combining these features to pass them to the head.
  3. Head — Handles prediction based on acquired features.

For a custom trained YOLO model there are multiple metrics to help evaluate it:

  • Objectness loss — the loss per segmented cell for which detection is run, reflecting whether an object is present there
  • Classification loss — if an object is detected, this loss is calculated from the predicted class probabilities
  • Box loss — the error between the ground-truth and the detected bounding box

These can be weighted differently during training evaluation: a low box loss is important when you want the bounding box to be detected very precisely, but matters less when the main focus is on detecting a specific class, in which case the classification loss is the most important. Objectness loss is harder to characterise, but it concerns detecting the presence of an object in each cell for which detection is run.
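Conceptually, YOLOv5 combines the three losses into a single weighted sum whose gains are set in its hyperparameter file; the sketch below illustrates the idea, with gain values roughly matching YOLOv5's defaults.

```python
def total_loss(box_loss: float, obj_loss: float, cls_loss: float,
               box_gain: float = 0.05, obj_gain: float = 1.0,
               cls_gain: float = 0.5) -> float:
    """Weighted sum of the three YOLO losses. Raising one gain shifts the
    training focus toward that objective (precise boxes vs. correct classes)."""
    return box_gain * box_loss + obj_gain * obj_loss + cls_gain * cls_loss
```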

Figure 13. Calculation of precision and recall

In addition, precision, the ratio of true positives to all predicted positives, and recall, the ratio of true positives to all actual positives, are also useful for evaluating a trained model.
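In code, the two metrics reduce to simple ratios over true positives (TP), false positives (FP), and false negatives (FN):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision: of everything the model detected, how much was correct.
    Recall: of everything that was actually there, how much was found."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(precision_recall(90, 10, 30))  # (0.9, 0.75)
```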

A YOLO dataset consists of two items per image: the image itself, and a text file detailing the location of each object in the image, its bounding box, and its class. At runtime YOLO also visualises these input bounding boxes, as seen in Figure 14.

Figure 14. YOLO bounding box ground truth debugging image

In addition to those two files there is also a .yaml configuration file that the training script reads to get the location of the dataset and the names and number of classes.
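To illustrate the format (with made-up paths, coordinates, and a shortened class list), the per-image label file and the dataset .yaml look roughly like this:

```python
# One label file per image, one line per symbol:
#   <class index> <x_center> <y_center> <width> <height>   (normalised to [0, 1])
label_lines = (
    "3 0.412500 0.287500 0.187500 0.187500\n"
    "7 0.690625 0.556250 0.187500 0.187500\n"
)

# Dataset .yaml pointing the training script at the data and the class names
dataset_yaml = (
    "train: ../datasets/nato/images/train\n"
    "val: ../datasets/nato/images/val\n"
    "nc: 15\n"
    "names: ['cover', 'guard', 'screen']  # all 15 class names in practice\n"
)

print(label_lines)
print(dataset_yaml)
```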

Final Approach

Multiple training iterations were run. The more successful runs were logged with the built-in integration for the run monitoring application wandb.ai, where they can be explored further. A variety of hardware was used for the runs: a local laptop with an RTX 2070, a Google Colab GPU, and the University of Tartu HPC.

The general training command used was as follows:

python3 {train_dir} --img 640 --batch 24 --epochs 300 --data NATO-Symbols.yaml --weights yolov5m.pt --project NATO-Symbols-Log --name {run_id} --cache

Here several variables are substituted and several parameters are set at runtime:

  • train_dir is a variable pointing to the YOLOv5 training script
  • img 640 sets the size to which every training image is resized, allowing the user to have images of any size in the dataset
  • batch 24 gives the batch size
  • epochs determines the number of training epochs to run
  • data points the training script to the custom .yaml file described before
  • weights determines which starting weights to use; this can be either default YOLO pre-trained weights or custom-trained weights
  • project determines the folder name into which run logs are saved
  • name determines the run name
  • cache determines whether or not the dataset is cached in RAM for faster access during training

Another feature of YOLO is that the training script has a flag for running genetic algorithm (GA) hyperparameter evolution over the ~25 hyperparameters of the YOLO framework. This is a time- and resource-intensive process, and it was run for 100 generations of 10 epochs each with the following command:

python3 {train_dir} --epochs 10 --data NATO-Symbols.yaml --weights yolov5m.pt --project NATO-Symbols-Log --name {run_id} --cache --evolve

There are more runs available for exploration on the run monitoring site, but three runs with different hardware, different base-weight sizes, and different hyperparameters will be compared and analysed further here.

These three runs are called ‘5m-300-colabGPU’, ‘5l-300-hpc’, and ‘5m-300-local’. Chronologically, ‘local’ was run earliest in the project and ‘colabGPU’ latest. The hardware used is apparent from the run name suffix, while the prefix ‘5m’ or ‘5l’ gives the size of the base weights used, with m for medium and l for large. Small weights were not used, as they are meant for training networks for mobile applications. The larger weights were used in the HPC run because there we had access to more than 15 GB of VRAM, which was necessary to train with a reasonable batch size (>10) on those weights; the amount of VRAM required rises greatly with a larger set of weights. For ‘5m-300-colabGPU’ custom evolved hyperparameters were used, while the other runs used the default hyperparameters. All runs were run to completion at 300 epochs.

Figure 15. Box and class loss graphs for the ‘5m-300-colabGPU’, ‘5l-300-hpc’, and ‘5m-300-local’ runs

First, let us take a look at the class and box loss graphs of the runs. During training we monitored whether the losses started increasing over time, which could indicate overfitting. This did not happen; moreover, we can see that the run with evolved hyperparameters shows a faster loss decrease and a lower box loss than the other runs, despite the HPC run using a larger set of weights.

Figure 16. Precision and recall graphs for the ‘5m-300-colabGPU’, ‘5l-300-hpc’, and ‘5m-300-local’ runs

Secondly, the precision and recall metrics over the course of training show that, while the models eventually reached a similar level in both, hyperparameter evolution sped up the training process to be on par with the larger weights on more powerful hardware. This highlights the importance of hyperparameter evolution.

Figure 17. Mean average precision metrics for the ‘5m-300-colabGPU’, ‘5l-300-hpc’, and ‘5m-300-local’ runs

Finally, adding one more metric, mAP@0.5:0.95, the mean average precision averaged over IoU thresholds from 0.5 to 0.95 across all classes, which summarises precision and recall into a single measure of overall model performance, we can once again see that the hyperparameter-evolved ‘colabGPU’ run trained faster and reached notably higher results than the other models.

Performance in ideal conditions, when run on a generated image, can be seen in Figure 18.

Figure 18. Ideal test case

As feared, a ‘screen’ symbol has been mislabelled as ‘cover’, but otherwise everything is correct and confident.

Now, let us see how the model stands up to a real, non-ideal image that also contains symbols it has not been taught.

Figure 19. More realistic test case

It is clearly visible that while several symbols have been correctly identified, there are also several cases where symbols of taught classes go undetected, as well as several wrong detections of non-taught symbols.

This sort of low performance on a real image is most likely caused by the training dataset not properly representing real cases. Whether this is due to a lack of data, data quality, or the method of data generation has to be tested in the future.

Possible future fixes would be to manually label hundreds of symbols on hundreds of real images and use a non-generated dataset, and to include non-symbols in the training data so the model also knows what not to detect. A completely different approach would be to use YOLO as a localisation tool only, teaching a model to detect any symbol at all, and then passing the cropped symbol to a different network for classification.

Conclusion

This project sought to improve military communications by helping humans convey battle plans, represented with symbols, to each other faster.

Issues with data scarcity were overcome by creating scripts to augment existing data and generate more artificially.

In the beginning, many different solutions were tested as a preliminary sort of ‘dipping our toes into the water’. We discovered that, due to the lack of data and possible overfitting, a simple KNN could not solve this problem. A custom CNN was tested next and showed promise when applied to single-symbol images, although overfitting could have contributed to its success.

In an attempt to achieve both localisation and classification on multi-symbol images, YOLOv5 was selected as a suitable framework and models were trained across multiple hardware and hyperparameter setups. The best-performing model by metrics turned out to work well in ideal cases, but was severely lacking in real ones.

For future improvements we suggest creating a training dataset that better reflects real-world conditions, and perhaps even using YOLOv5 in concert with a different model, to leverage the different strengths of different networks.

Sources

[1] Human speech may have a universal transmission rate: 39 bits per second

[2] NATO Joint Military Symbology

[3] The MNIST Database

[4] AlexNet Paper

[5] YOLOv5 GitHub

[6] Roboflow

[7] YOLOv4 paper
