Semantic segmentation using an optical computer

adh1s · Optalysys · Sep 17, 2021

With Edward Cottle

Convolutional neural networks have many uses; we’ve covered some of these in detail, from the conventional through to somewhat more unusual applications. Along the way, we’ve been examining how we can disrupt this field through the use of a technology that allows us to apply a useful piece of mathematics, the convolution theorem, to dramatically shift the computational effort and power consumption involved in executing these networks.

In this article, we will take a look at a very popular and powerful application of convolutional networks: semantic segmentation, one of the most challenging yet useful computer vision tasks.

U-Net with Optical EfficientNet encoder (left is the model output, right is the ground truth)

There are many different approaches taken by architectures that perform semantic segmentation. However, most of these models have one thing in common; they have an encoder-decoder structure which utilises convolutional layers in both the feature extraction and up-sampling stages.

The motivation for this article follows the same reasoning as our previous pieces on AI. Performing convolution operations in the reciprocal domain of the data reduces the computational complexity of these operations from quadratic O(n²) to linear O(n) (where n is the input size). The Fourier transform is a mathematical tool that allows us to convert data into its reciprocal form; as such, when paired with fast AI acceleration hardware such as a GPU or TPU, using the Fourier transform can notionally provide a dramatic reduction in the workload, effectively allowing more convolutions to be performed in the same time and using the same hardware.
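As a concrete illustration of the convolution theorem in this context, the sketch below (plain PyTorch, nothing Optalysys-specific) computes a 2D convolution by pointwise multiplication in the Fourier domain. Note that torch.nn.functional.conv2d actually computes cross-correlation, hence the kernel flip in the check.

```python
import torch
import torch.nn.functional as F

def fft_conv2d(x, k):
    """2D convolution via the convolution theorem: multiply in the Fourier
    domain, then invert. Zero-padding to the full output size turns the
    FFT's circular convolution into a linear one."""
    out_h = x.shape[-2] + k.shape[-2] - 1
    out_w = x.shape[-1] + k.shape[-1] - 1
    X = torch.fft.rfft2(x, s=(out_h, out_w))
    K = torch.fft.rfft2(k, s=(out_h, out_w))
    return torch.fft.irfft2(X * K, s=(out_h, out_w))

x = torch.randn(1, 1, 8, 8)
k = torch.randn(1, 1, 3, 3)
# conv2d computes cross-correlation, so flip the kernel to get convolution;
# padding=2 gives the 'full' 10x10 output for an 8x8 input and 3x3 kernel.
direct = F.conv2d(x, torch.flip(k, dims=(-2, -1)), padding=2)
assert torch.allclose(fft_conv2d(x, k), direct, atol=1e-4)
```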

However, while GPUs can already compute Fourier transforms efficiently, the electronic approach to this calculation (regardless of specific device architecture) is bound by a fundamental O(n log n) algorithmic complexity in time. As a means of accelerating convolutions, the electronic Fourier transform would always be a bottleneck in the processing pipeline. This approach to acceleration therefore only works if you have a system that can somehow surpass the limits of the electronic FFT and deliver transformed data at a rate that can keep up with the linear complexity scaling of convolution in the Fourier domain.

When paired with a GPU, the optical approach to Fourier transform calculation can provide this capability; we’ve laid out the essentials of how this works in another article, but how much additional performance can this technique actually provide? This article aims to explore image segmentation networks in detail and provide an indication of how the structure of these networks can, when paired with the above reasoning, see some very significant advantages.

We see a future where optical hardware can work in tandem with a whole range of electronic processors, providing the end user with faster and more efficient computer vision processing solutions. Of course, the properties of the optical Fourier transform are only half of the story; there’s also a huge range of additional considerations regarding the digital-optical interface and the way an optical system interacts with an electronic host, but we’ll be covering these factors in an upcoming article.

For now, in this article we apply our existing hardware demonstrator system to executing tasks in semantic segmentation. This is an especially interesting use of AI for us, as semantic segmentation is often used in mobile and edge applications such as autonomous vehicles and drones, cases where the intensive processing needs of AI often run up against factors such as weight and power consumption. As our system is a micro-scale system that uses vastly less power per operation than an electronic system, we see the combination of greater energy efficiency and fundamental advantages in convolution processing as offering an opportunity for a significant shift in the capabilities of autonomous systems.

Optalysys Beta Program

In order to realise this ambition, we are interested in working with third parties to create next-generation AI and encryption systems leveraging the optical Fourier transform. Access to the Optalysys optical system is not yet open to everyone, but for those interested, please contact us via www.optalysys.com (or by email to info@optalysys.com) with enquiries about the beta program for benchmarking and evaluation.


What is semantic segmentation?

Computer vision researchers have been most interested in solving the tasks of image classification, object detection and segmentation, listed in order of increasing difficulty.


Segmentation itself comes in several forms, semantic, instance and panoptic segmentation, and they all have slightly differing objectives.

Types of segmentation (Source)

Items in an image that could possess more than one countable instance (people, pets, motorbikes, cars) are called ‘things’ in most academic papers, whereas amorphous regions that are harder to count (pavement, sky, dirt) are called ‘stuff’.

In semantic segmentation, every pixel of an image (whether it belongs to ‘things’ or ‘stuff’) is assigned its associated class label. Multiple objects of the same class are treated as a single entity. In instance segmentation, by contrast, different objects of the same class are treated as distinct individual instances. Given the nature of this task, instance segmentation is only concerned with pixels that are deemed to be ‘things’ (as seen in the diagram above).

Panoptic segmentation is a combination of both instance and semantic segmentation, in which two labels are assigned to every pixel of an image — a class label and an instance id. Pixels that lie within ‘stuff’ regions are assigned an instance id of ‘None’.

Applications

Self-driving cars

Real-time segmented road scene (Source)

Healthcare AI

Segmented medical imagery (Source)

Drones

Automatic control

Remote monitoring/precision agriculture

Segmented UAV Imagery (Source)

(And much more…)

How do different architectures work?

While we focus here on semantic segmentation, there are a range of models and applications built on similar principles. For example, most models used for instance/panoptic segmentation and single-shot detection build on the ideas discussed here. A few examples of such models include Mask-RCNN/PANet for instance segmentation, Panoptic FPN/EfficientPS for panoptic segmentation and the widely deployed YOLOv3 for single-shot object detection. These variants are also heavily convolution-based and can therefore be accelerated by Optalysys hardware too.

Encoder-Decoder Networks

The task for the models is to learn from RGB (3 channel) photographs of a scene and their corresponding grey-scale (1 channel) segmentation mask (in which all the pixels have integer values, corresponding to the class label). The outputs of the networks have the same number of channels as there are classes. This raw output can then either be soft-maxed to give a probability distribution or argmaxed to give a 1-channel segmentation mask (both operations act along the channel dimension).
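For example, with a hypothetical raw output tensor for the 23-class dataset used later in this article:

```python
import torch

logits = torch.randn(4, 23, 512, 512)   # raw model output: (batch, classes, H, W)
probs = logits.softmax(dim=1)           # per-pixel probability distribution
masks = logits.argmax(dim=1)            # (4, 512, 512) 1-channel segmentation masks
```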

One approach would be to naively use several convolutional layers that preserve input resolution using ‘same padding’; however this would be computationally very expensive.

Instead, most models follow an encoder/decoder structure. Within these, the spatial resolution of the input is periodically downsampled, developing lower-resolution feature mappings, which can effectively discriminate between the classes. Then, the feature representations are upsampled into a full-resolution segmentation map.


Encoder

Much like the best classification models, the encoder part of a semantic segmentation model makes the most of convolutions’ ability to capture and extract location-invariant features.

Most architectures therefore use the best such feature extractors, which are the convolution layers of state-of-the-art models designed for classification. Historically, a VGG variant would have been preferred but there are now better alternatives available — a comprehensive comparison can be found here.

Of course, the choice of encoder is very application-specific and extends beyond the model’s top-1/top-5 metrics on ImageNet. In particular, training/inference times, memory usage and the number of parameters are crucial considerations. Some of the most widely used encoders include VGG-16, ResNet, MobileNet and EfficientNet variants.

Decoders

Decoders are where most segmentation models differ and take novel approaches. The main role of the decoder stage is to upsample the outputs of the encoder back to the original image size, whilst preserving structural details.

When deciding which models we wanted to discuss and implement in this article, we were motivated to showcase the variety of models, both in terms of use-cases and architectures, that use convolutions. As such, we opted for Fully-Convolutional Network (FCN), U-Net and SegNet.

A note on Transposed convolution

Transposed convolution, also confusingly known as ‘deconvolution’ (it is not the inverse of a convolution operation), is the method of learnable up-sampling used in FCNs and in U-Nets. Transposed convolutions are characterised by padding (p), stride (s) and kernel size (k); here, these are the parameters of a hypothetical regular convolution which, applied to the output of the transposed convolution, would produce a map with the same spatial dimensions as the transposed convolution’s input.

The following equation characterizes a regular convolution, relating the input size i and output size o:

o = ⌊(i + 2p − k) / s⌋ + 1

The following equation characterizes a transposed convolution:

o = s(i − 1) + k − 2p

To achieve this spatial transformation, an intermediate feature map is created: this is simply the input feature map with z zeros inserted between its rows and columns. This intermediate map is then zero-padded by p′ and convolved as usual, using learnable filters with kernels of size k and stride s′.

The parameters for the above operations on the intermediate feature map are calculated from the assigned parameters p, s and k as follows:

z = s − 1,  p′ = k − p − 1,  s′ = 1

This process is illustrated in the transposed convolution shown below. In this case, a 3x3 input map (blue) has been expanded with zeros and padded, then convolved to produce the higher resolution 5x5 output (green).

Transpose convolution

(More details and visualizations on transposed convolutions can be found in this excellent article).
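The illustrated case maps directly onto PyTorch’s ConvTranspose2d; the sketch below checks the 3x3 to 5x5 example using the output-size formula above.

```python
import torch
import torch.nn as nn

# The illustrated case: i=3, k=3, s=2, p=1 gives o = s(i-1) + k - 2p = 5.
up = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                        kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 1, 3, 3)
print(up(x).shape)  # torch.Size([1, 1, 5, 5])
```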

FCNs

FCNs, introduced by Long et al., often use a VGG16 encoder, and the diagram below depicts this, though in principle a wide range of encoders could be used.

Image by author

The main idea behind the FCN variants (FCN-8, -16 and -32) is the element-wise addition (much like the residual connections in ResNet) between the outputs of the encoder layers and the decoder layers. This allows the underlying structure of the image to be preserved despite the down-sampling through the several max-pooling stages. The variants differ in which layers are added together and how many such additions take place; this in turn affects their ability to distinguish object boundaries. The best of the three models with regard to boundary delineation is therefore the FCN-8.

U-Net

U-Net, introduced by Ronneberger et al., is a very popular architecture and is extensively used for biomedical applications. The same general encoder-decoder structure is observed here; however, U-Nets differ in the method they use to preserve structure: they concatenate the channels between every corresponding encoder and decoder layer. These are known as ‘skip connections’. There are many variants of the architecture, with different encoders and depths (levels of down-sampling, e.g. 16x or 32x). The variant we will go on to implement, which uses a VGG-16 encoder, is shown below.

Image by author

U-Net architectures have many applications beyond segmentation, including super-resolution and other image processing techniques (e.g. image recolorization).

SegNet

The novel idea behind SegNet, introduced by Badrinarayanan et al., is to store only the max-pooling indices (the positions of the maximum value in each pooling window) at each pooling stage in the encoder. The indices, in principle, require only 2 bits for each 2x2 pooling window, which is much more memory-efficient than storing full feature maps in float precision. This makes the architecture a great fit for applications where memory is limited during training/inference and where some loss in accuracy is acceptable. The model is shown below, with a VGG16 encoder.


These indices help preserve structure and aid boundary delineation (much like the element-wise addition and concatenation in the other architectures). The indices are used at each corresponding decoder stage for unpooling, the method of non-learnable upsampling in this architecture.

Unpooling operation (Source)

The resultant sparse feature maps are then passed through the convolutional layers within the decoder stages.
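The pooling/unpooling pair is available directly in PyTorch; the snippet below shows the mechanism in isolation.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)
# Encoder side: pool and store the index of each maximum (2 bits per 2x2 window).
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
# Decoder side: place each value back at its stored position; every other
# entry of the sparse output is zero.
unpooled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
print(unpooled.shape)  # torch.Size([1, 1, 4, 4])
```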

PyTorch Implementation

To demonstrate the segmentation models described above, we implemented them all electronically in PyTorch, each with a VGG16 encoder.

Dataset and pre-processing

Given the increasing usage of remote monitoring and autonomous flight/landing procedures, we opted to develop a model to semantically understand urban scenes during drone flight. For this, we used the Semantic Drone Dataset from http://dronedataset.icg.tugraz.at/.

The images in the dataset contain over 20 houses from a bird’s-eye view, with 23 classes in total (details of the classes present are found here), all acquired at an altitude of 5 to 30 meters. There are 400 publicly available jpg images and corresponding png segmentation masks (in which all the pixels hold values from 0–22). For the purposes of this article, the segmentation masks shown are RGB visualisations of the grey-scale masks (the details of the colours used for each class are found in the ‘class_dict_seg.csv’ file in the dataset).

An example image-mask pair from the dataset:

The images are very high resolution, with dimensions of 6000x4000px (24Mpx); we therefore cropped each image and mask into six non-overlapping 2000x2000px tiles, giving 2400 images and masks. This allowed us to resize the images (to 512x512px) before passing them through our model during training/inference, without losing as much information as resizing the full images would.
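A minimal sketch of this tiling step (using PIL; the exact helper we used isn’t reproduced here):

```python
from PIL import Image

def crop_six(path):
    """Split one 6000x4000px image into six non-overlapping 2000x2000px tiles."""
    img = Image.open(path)
    return [img.crop((x, y, x + 2000, y + 2000))
            for y in (0, 2000)              # two rows
            for x in (0, 2000, 4000)]       # three columns
```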

Challenges

One of the pitfalls of this dataset (and of most urban scene datasets, e.g. CamVid) is class imbalance. For instance, there are 2700 times as many ‘paved-area’ pixels as ‘dog’ pixels!

To compound this, there are only 2400 images in the dataset (split 80/20 for train/test), which is relatively small, especially for gaining a semantic understanding of 23 classes. It is also challenging for the model to learn the different classes given the differing sizes of objects of the same class across images (the images are taken at different altitudes). Of course, CNNs are translation invariant, but the variance in scale and rotation of the objects in this dataset makes it more challenging.

The images below showcase the above point.

Example images from data-set containing ‘person’ class pixels.

Addressing the challenges

Transfer learning

We opted to initialize the weights of the VGG encoder with ImageNet weights for faster initial learning and (hopefully) a higher final accuracy. However, given the network’s 512x512px input size (ImageNet contains 224x224px images) and the aerial perspective, it did not make sense to freeze the encoder, as the training domain differed too much from this application. (Note: we still had to preprocess the images with the normalization used on ImageNet, mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225].)

Loss functions

The two loss functions investigated were weighted cross-entropy and dice loss, given their effectiveness at dealing with data imbalances.

Weighted cross-entropy

Cross entropy is the preferred loss function for semantic segmentation, given its numerical stability and well-behaved gradients during training and back-propagation (the mathematical details can be found here).

To address the data imbalance, the weights determining each class’s contribution to the loss were computed using median frequency balancing. In this method, the weight assigned to a class is the median of the class frequencies (computed on the training set) divided by that class’s frequency.
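A sketch of how such weights can be computed (this mirrors the method described above rather than our exact script; masks is assumed to be an iterable of integer label tensors):

```python
import torch

def median_frequency_weights(masks, n_classes):
    """Median frequency balancing: freq(c) is the number of class-c pixels
    divided by the total pixels of the images in which c appears;
    weight(c) = median(freq) / freq(c). Assumes every class appears
    somewhere in the training set."""
    class_pixels = torch.zeros(n_classes)
    total_pixels = torch.zeros(n_classes)
    for mask in masks:
        for c in mask.unique():
            class_pixels[c] += (mask == c).sum()
            total_pixels[c] += mask.numel()
    freq = class_pixels / total_pixels.clamp(min=1)
    return freq.median() / freq.clamp(min=1e-12)
```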

Dice Loss

One of the issues with weighted cross entropy is that it is still a pixel-wise loss and acts only as a ‘proxy’ for intersection over union (IOU), one of the most meaningful metrics of semantic segmentation performance. For some datasets and models, performance can be improved by instead using dice loss, which is given by the following equation:

Dice = 2|A ∩ B| / (|A| + |B|),   Dice loss = 1 − Dice

where A is the predicted segmentation mask and B is the ground truth.

Dice score is mathematically very similar to IOU:

IOU = |A ∩ B| / |A ∪ B|,   which gives Dice = 2·IOU / (1 + IOU)

When training, the loss is calculated by averaging the dice loss over all the classes (we used the dice loss implementation from this repo).
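For reference, a minimal multi-class dice loss of this kind might look like the following (a sketch, not the linked repo’s exact code):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1.0):
    """Multi-class dice loss, averaged over classes.
    logits: (N, C, H, W) raw model output; target: (N, H, W) class indices."""
    n_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, n_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                  # sum over batch and space
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice.mean()
```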

As dice score and IOU are positively correlated, training on dice loss could result in a much better IOU performance. However, gradients with dice loss are sometimes observed to blow up and this leads to training instability, so it must be tested.

There are many other loss functions that could have also been tested, including Tversky, Lovasz, focal and boundary losses.

Group Normalization

When setting the input size to 512x512px, we found that the batch size had to be less than or equal to 8 for all three models, as limited by GPU memory. With batch sizes this small, there is little benefit in using batch normalization (though it can still be useful by providing a regularization effect).

Therefore, we opted for group normalization. The performance of group and batch normalization over a range of batch sizes was evaluated in a recent paper and a graph summarizing the findings is shown below.


Group norm does not exploit the batch dimension; instead it normalizes over a ‘group’, a fixed number of channels.

Comparison of the normalization methods (Source)

From the image above it can be seen that group norm acts as a middle ground between instance and layer norm. We opted for a standard group size of 32 channels.
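In PyTorch this is a one-line swap for batch norm; note that 32 channels per group means num_groups = num_channels // 32.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)                          # batch size 1 is fine here
gn = nn.GroupNorm(num_groups=64 // 32, num_channels=64)   # 32 channels per group
y = gn(x)
```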

Over-fitting

There are several methods we could have employed if the model started to over-fit and struggle on the validation dataset. The main method was random transforms such as flips, rotations, hue shifts and Gaussian blur, applied using the Python module Albumentations. Other regularization methods, such as label smoothing, could also be used to reduce variance and improve generalization.
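A representative Albumentations pipeline of this kind (a sketch; the library applies the same spatial transform to image and mask, keeping them in sync):

```python
import numpy as np
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.HueSaturationValue(p=0.2),
    A.GaussianBlur(p=0.2),
])

image = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
mask = np.random.randint(0, 23, (512, 512), dtype=np.uint8)
out = transform(image=image, mask=mask)
aug_image, aug_mask = out['image'], out['mask']
```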

Code

VGG Encoder

VGG16 Encoder

We implemented our own VGG16 encoder, which allows us to conveniently pull encoder outputs at different resolutions.
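In outline, the encoder looks like this (a condensed sketch rather than our exact implementation, built on torchvision’s VGG16):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGG16Encoder(nn.Module):
    """Splits torchvision's VGG16 feature extractor at each max-pool so the
    forward pass returns feature maps at 1/2, 1/4, 1/8, 1/16 and 1/32 scale."""
    def __init__(self, pretrained=True):
        super().__init__()
        features = vgg16(pretrained=pretrained).features
        stages, stage = [], []
        for layer in features:
            stage.append(layer)
            if isinstance(layer, nn.MaxPool2d):
                stages.append(nn.Sequential(*stage))
                stage = []
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        outputs = []
        for stage in self.stages:
            x = stage(x)
            outputs.append(x)   # channels: 64, 128, 256, 512, 512
        return outputs
```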

FCN-8

Assembled FCN-8
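A condensed sketch of the assembly (reusing the VGG16Encoder sketch above; channel sizes follow VGG16’s pooling stages):

```python
import torch.nn as nn

class FCN8(nn.Module):
    """FCN-8 (sketch): score maps from pool3/4/5 are fused by element-wise
    addition, with learnable 2x upsampling between fusions and a final 8x
    upsampling back to input resolution."""
    def __init__(self, n_classes, encoder):
        super().__init__()
        self.encoder = encoder
        self.score3 = nn.Conv2d(256, n_classes, kernel_size=1)
        self.score4 = nn.Conv2d(512, n_classes, kernel_size=1)
        self.score5 = nn.Conv2d(512, n_classes, kernel_size=1)
        # kernel 4, stride 2, padding 1 doubles the spatial size exactly
        self.up_a = nn.ConvTranspose2d(n_classes, n_classes, 4, stride=2, padding=1)
        self.up_b = nn.ConvTranspose2d(n_classes, n_classes, 4, stride=2, padding=1)
        # kernel 16, stride 8, padding 4 upsamples by exactly 8x
        self.up8 = nn.ConvTranspose2d(n_classes, n_classes, 16, stride=8, padding=4)

    def forward(self, x):
        feats = self.encoder(x)                    # scales 1/2 ... 1/32
        s = self.score5(feats[4])                  # 1/32
        s = self.score4(feats[3]) + self.up_a(s)   # fuse at 1/16
        s = self.score3(feats[2]) + self.up_b(s)   # fuse at 1/8
        return self.up8(s)                         # back to full resolution
```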

SegNet

SegNet decoder class
Assembled SegNet
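A condensed sketch of one SegNet decoder stage (not the full gist): unpool with the stored indices, then convolve the resulting sparse map.

```python
import torch.nn as nn
import torch.nn.functional as F

class SegNetDecoderStage(nn.Module):
    """One decoder stage: unpool using the encoder's stored max-pooling
    indices, then densify the sparse map with convolutions."""
    def __init__(self, in_ch, out_ch, n_convs=2):
        super().__init__()
        layers = []
        for i in range(n_convs):
            c_in = in_ch if i == 0 else out_ch
            layers += [nn.Conv2d(c_in, out_ch, 3, padding=1),
                       nn.GroupNorm(max(out_ch // 32, 1), out_ch),
                       nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)

    def forward(self, x, indices, output_size):
        x = F.max_unpool2d(x, indices, kernel_size=2, stride=2,
                           output_size=output_size)
        return self.convs(x)
```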

U-Net

U-Net parts
Assembled U-Net
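A condensed sketch of one decoder ‘part’ for the bilinear variant (not the full gist): upsample, concatenate the skip connection, then apply a double convolution.

```python
import torch
import torch.nn as nn

class UNetUp(nn.Module):
    """One U-Net decoder step: upsample, concatenate the corresponding
    encoder feature map (the 'skip connection'), then double-convolve."""
    def __init__(self, in_ch, skip_ch, out_ch, bilinear=True):
        super().__init__()
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                                  align_corners=True)
        else:
            self.up = nn.ConvTranspose2d(in_ch, in_ch, 2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # channel-wise concatenation
        return self.conv(x)
```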

Evaluation metrics

Unlike classifiers, where a prediction is simply correct or incorrect, evaluating the performance of segmentation models is much more nuanced. The most widely used metrics include mIOU, overall and class-averaged pixel accuracy, dice score, focal score and distance-based metrics (see Hausdorff Loss).

The need for mIOU is demonstrated on a binary segmentation task below, where black and red pixels represent the two classes. Despite entirely misidentifying the red class (the 40x40px square), the prediction still achieves a 99% overall pixel accuracy. However, the mIOU score is only 50% (100% IOU for the black class and 0% for the red). Using only the pixel accuracy as a performance indicator, we might have unwarranted confidence in a model that clearly under-performs on the red class.

Image by author

To avoid misleading cases such as the one above, the evaluation metrics we opted to use were mIOU and the pixel accuracies (overall and class-averaged).

Function returning mIOU and pixel accuracies
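A sketch of such a function (per-class IOU and accuracy, averaged over the classes present; not our exact implementation):

```python
import torch

def segmentation_metrics(pred, target, n_classes):
    """Return (mIOU, overall pixel accuracy, class-averaged pixel accuracy).
    pred, target: (N, H, W) tensors of class indices."""
    ious, accs = [], []
    for c in range(n_classes):
        pred_c, target_c = pred == c, target == c
        union = (pred_c | target_c).sum().item()
        inter = (pred_c & target_c).sum().item()
        if union == 0:
            continue                      # class absent from both; skip
        ious.append(inter / union)
        if target_c.sum().item() > 0:
            accs.append(inter / target_c.sum().item())
    overall = (pred == target).float().mean().item()
    return sum(ious) / len(ious), overall, sum(accs) / len(accs)
```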

Training

During training and validation, 1-channel, 512x512px images were used as the ground-truth segmentation masks (the cropped 2000x2000px masks were downsized using the nearest-neighbour algorithm). Inputs to the network were also created by down-sampling the cropped images, to 3-channel 512x512px images (also using the nearest-neighbour algorithm). Electronic training took place on an NVIDIA Quadro P6000.

To decide the best loss function for this dataset, we ran a simple test using a batch size of 1 and a standard learning rate of 0.0001 with PyTorch’s Adam optimizer. All three models were trained on both loss functions for 10 epochs and the resulting curves were compared. The results were decisively in favor of the dice loss for all three models; there did not seem to be any issues with the gradients. An example of the results from one of these tests is shown below (FCN-8):

Weighted Cross Entropy (mIOU at Epoch 10: 33)
Dice Loss (mIOU at Epoch 10: 40)

After the best loss function was determined, we ran a larger test using the Python module Ray Tune. This test aimed to determine the ideal learning rate and batch size together, given their interdependence. An example of the results is shown below:

RayTune results

All three models had similar optimal learning rates, of order 10e-5. Examining the results scripts showed no significant impact from different batch sizes, but the ideal batch size for all of them was 1 (surprising, but perhaps due to the small variance within the dataset).

An elaborate, large optimisation run could have tested the interdependence of even more loss functions, learning rates, batch sizes and even the choice of optimiser for all three models, but this fragmented optimisation approach provided good results in much less time!

Results

Early stopping, based on validation mIOU, was implemented for all three models to stop them from over-training. After some testing, we found that adding many random transforms to the training set reduced performance, so we opted to apply only rotations/flips to the images.

The models were trained for 100 epochs and the results are shown below.

FCN-8 (Key metrics — mIOU: 61.1, pixel accuracy: 91.2, average pixel accuracy: 69.9)

FCN-8 results

U-Net (Bilinear variant) (Key metrics — mIOU: 60.5, pixel accuracy: 87.5, average pixel accuracy: 69.8)

U-Net results

SegNet (Key metrics — mIOU: 47.1, pixel accuracy: 78.1, average pixel accuracy: 56.9)

SegNet results

Below are GIFs of the best masks produced by the best performing checkpoint of each of the above architectures.

(The left hand side is the model output and the right hand side is the ground truth)

FCN-8

U-Net

SegNet

Some not so good ones…

FCN-8

UNet

SegNet

Discussion

Admittedly, the models were only trained for 100 epochs, and some performance would be gained by training further, particularly for SegNet. However, using a rough rule that an mIOU of 50% is a ‘good prediction’, the results are still impressive on this small, challenging dataset. The FCN and U-Net, as expected, outperform the SegNet, given that they preserve much more detail from each encoder stage.

It is interesting to visualize some of the images in the test split that the models perform worst on. For all the models, among the most incorrectly labelled classes were ‘water’ and ‘bald-tree’. These classes are notoriously challenging from an aerial perspective, as there are often visible, well-defined rocks under the water, and gravel/paved area/grass visible through the branches of the bald trees. The examples below show that some of the image-mask pairs would challenge even humans!

If the models were being trained for deployment, more performance could be squeezed out with a larger optimisation space for Ray Tune, longer training coupled with regularization techniques such as label smoothing and, of course, a larger dataset.

The optical benefit

At the start of this article, we laid out our rationale for how we can accelerate convolution operations, and gave a picture of the benefits that fundamentally less complex Fourier transform calculation can yield.

These benefits are based on the ease of carrying out convolutions in the reciprocal or Fourier domain, so not all network architectures are equal! Any convolution-heavy network can see major benefits from the Fourier-optical approach, but the most significant benefits are reserved for architectures which are especially heavy on convolutions.

As our above exploration of image segmentation networks shows, these networks are particularly dense with convolutions, which means that the optical approach should yield performance gains close to the maximum.

In the following bar chart we outline the reduction in the total number of discrete MACs required in the convolutional layers of the segmentation models implemented above. There are significant gains to be seen throughout, given the models’ use of many 3x3 convolutions (particularly with a high number of channels).

As discussed before, given the similarities in architecture and concept, similar advantages for the optical approach can also be achieved in other encoder-decoder computer vision models such as YOLOv3.

MAC operations required to execute a single inference using both pure electronic processing systems and an optical-electronic hybrid approach. While the batch size may seem unreasonably small (most electronic systems that leverage parallelism as a means of processing data need larger batches to ensure maximum efficiency), the very-high-speed serial nature of the optical system means that smaller batch sizes are possible.

Achieving speed-ups in neural network processing is always welcome; most forms of machine learning are well-known for taking a lot of computing resources, but achieving it for image segmentation is especially useful.

This is because the kinds of operational environment in which image segmentation is a common task are also often heavily resource-limited. If you’re using it as, say, part of the autonomous navigation system in a drone, then power consumption, weight and latency are all factors that must be accounted for in addition to the performance of a network.

With respect to these kinds of use-cases, the ability to achieve a fourfold improvement in the primary computational workload is not only useful, it could make a critical difference in the capability of the platform. Longer flight times, better and more reactive navigation, greater imaging detail; these benefits all stem from improvements in speed and efficiency, and all are possible as long as you can successfully integrate the technology that enables this performance into the hardware that powers the application.

That’s why both the physical architecture of the core technology and the engineering effort required to validate systems are so important, and why these things have been an especially significant focus of our work thus far.

We’ll return to these points in later articles, but it’s worth keeping in mind that success in these applications is as much about the how as it is the why. For now, we’ll move on to our implementation of segmentation using an optical approach.

Optical implementation

We have implemented several segmentation models electronically and discussed how our hardware can accelerate them, but we also wanted to implement a segmentation model that is more application specific.

It is possible to envisage an edge application of semantic segmentation on drones, where in-flight inference is needed. Here, a less memory-intensive, lighter-weight encoder than VGG16 might be preferred. Therefore, we decided to optically implement an EfficientNet-b0 encoder.

EfficientNet architecture

EfficientNet is a family of CNN architectures in which all the members are uniformly scaled variants (scaled in depth, width and input resolution; see here for more details). The architectures build on the techniques used in previous light-weight models, particularly MobileNetV2.

The performance of the smallest variant in the family (EfficientNet-b0) on ImageNet is impressive given it only has 5.3 million parameters: it achieves a Top-1 and Top-5 accuracy of 76.3% and 93.2%, respectively. Some of the features that result in such high efficiency include squeeze-and-excitation blocks, inverted residual blocks and the swish activation function. More details of these methods can be found in this article.

Code for EfficientNet

Thanks to the Optalysys PyTorch interface, accelerating truly state-of-the-art computer vision models with our unique optical hardware is simple. Every time we see a torch.nn.Conv2d layer in the architecture, we are able to replace this with our optical convolution layer: OptConvLayer.
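Schematically, the swap looks like the following sketch. (The exact OptConvLayer constructor isn’t shown here; for illustration we assume it mirrors torch.nn.Conv2d’s signature, which is why the replacement is passed in as a factory.)

```python
import torch.nn as nn

def swap_conv2d(module, make_optical):
    """Recursively replace every nn.Conv2d with an optical equivalent.
    `make_optical` builds the replacement layer from the original Conv2d;
    for the Optalysys interface this would construct an OptConvLayer."""
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(module, name, make_optical(child))
        else:
            swap_conv2d(child, make_optical)
    return module

# e.g. swap_conv2d(model, lambda c: OptConvLayer(
#          c.in_channels, c.out_channels, kernel_size=c.kernel_size,
#          stride=c.stride, padding=c.padding, groups=c.groups,
#          bias=c.bias is not None))
```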

As group normalization with 32 channels per group isn’t suitable given the channel dimensions of EfficientNet, layer normalization (which is just group normalization with a single group) was used in both the encoder and decoder.

The EfficientNet code is inspired by https://github.com/romulus0914/EfficientNet-PyTorch/blob/master/efficientnet.py.
Assembled EfficientNet-b0 encoder
Assembled U-Net with Optical EfficientNet-b0 encoder.

Comparison with U-Net with VGG16 encoder

EfficientNet U-Net total parameters: 1,992,175
VGG16 U-Net total parameters: 29,652,055

The EfficientNet variant has almost 15x fewer parameters than its VGG-16 counterpart, meaning it will undoubtedly have lower performance. However, in applications where some performance can be sacrificed, this model might be preferred for the aforementioned reasons.

In other applications where accuracy is vital, other methods such as quantization and mixed-precision training should be tested with the larger models. If the effect on the accuracy metrics is minimal, these methods provide a way to improve memory usage and inference speed without having to change encoder or architecture.

The optical benefit

There are several new features present in the EfficientNet encoder compared to a plain CNN; however, most of the computational resources are still used for convolutions (e.g. the squeeze-and-excitation blocks add less than 1% in computational cost). Though the convolutions are separable, which already provides a boost in efficiency, there are gains to be had, as the architecture still performs 3x3 and 5x5 convolutions in the depth-wise stage.

A truly optimal configuration would be our hardware in tandem with electronic GPUs, which can compute the lightweight 1x1 convolutions in the point-wise stages.

Results

The performance of this architecture can be optimized by creating a deeper U-Net (going to 32x dimension reduction instead of 16x), altering the arbitrarily set channel dimensions of the decoder (e.g. by using Bayesian methods) and by tuning the hyper-parameters. These optimizations would likely still result in a model that is much lighter-weight than the VGG-16 variant.

For the purpose of this article, however, we simply present the results from the model using dice loss and a learning rate of 0.0001 with the Adam optimizer. Aggressive early stopping halted training at the 76th epoch, after 5 successive epochs without an improvement in the validation mIOU.

(Key metrics — mIOU: 45.5, pixel accuracy: 78.1, average pixel accuracy: 56.9)

Results from U-Net with Optical EfficientNet-b0 encoder
U-Net with Optical EfficientNet encoder (left is the model output, right is the ground truth)

To conclude, it is clear that convolutional networks (as a field) are continuing to evolve, as seen in the novel architectures of several networks we have implemented in this article. Indeed, the strengths of convolutions that were historically leveraged for image classification are even more pivotal as we look towards more difficult computer vision tasks.

Using the approach we outlined at the start of this article, our hardware is therefore primed to accelerate the next generation of models, in turn enabling more ambitious, novel applications from edge to enterprise scale, in everything from drones to datacentres. If you are interested in benchmarking and modelling your enterprise’s computer vision workflows on our optical systems, contact us at optalysys.com to find out more about our beta programme.
