Table Of Contents:
- Business Problem
- ML Problem Mapping & Metric
- Data Description
- Existing Approaches
- First Cut Solution
- Exploratory Data Analysis
- Model Explanation
- Results & Deployment
- Future Work
The following is a case study of a Kaggle problem. This solution gives a top 6% position on the leaderboard using a free Colab GPU in about 5 hours; the code is written in PyTorch. A Colab notebook with the code is linked in the References.
Build a model that, given a seismic image as input, predicts every pixel as salt or not salt.
Need for Automation?
Several areas of Earth with large accumulations of oil and gas also have huge deposits of salt below the surface. Unfortunately, knowing precisely where large salt deposits are is very difficult. Professional seismic imaging still requires expert human interpretation of salt bodies, which leads to very subjective, highly variable renderings. More alarmingly, it creates potentially dangerous situations for oil and gas company drillers.
3. ML Problem Mapping & Metric:
This can be posed as a semantic segmentation problem. There are no latency constraints in this case.
The performance metric used here is mean average precision over multiple IoU thresholds, where IoU stands for Intersection over Union. The metric sweeps over a range of IoU thresholds, at each point calculating an average precision value. The threshold values range from 0.5 to 0.95 with a step size of 0.05: (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95). In other words, at a threshold of 0.5, a predicted object is considered a "hit" if its intersection over union with a ground truth object is greater than 0.5.
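This metric can be sketched in code for a single image. This is a simplified sketch that treats each whole mask as one object (the function names are mine, not from the competition kit), with the common convention that two empty masks score an IoU of 1:

```python
import numpy as np

def iou(pred, true):
    """IoU between two binary masks; both empty counts as a perfect 1.0."""
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0
    return inter / union

def mean_ap_over_thresholds(pred, true):
    """Average the hit/miss outcome over thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.arange(0.5, 1.0, 0.05)
    score = iou(pred, true)
    return float(np.mean([score > t for t in thresholds]))

# example: predicted mask covers 50 columns, ground truth covers 60
pred = np.zeros((101, 101), dtype=bool); pred[:, :50] = True
true = np.zeros((101, 101), dtype=bool); true[:, :60] = True
# IoU = 50/60 ≈ 0.833, so it is a hit at thresholds 0.50 through 0.80
```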
Submission Format: Instead of submitting masks directly for evaluation, a run-length encoding (RLE) format is used.
import numpy as np

mask = np.random.randint(2, size=(101, 101))
# column-major (top-to-bottom, then left-to-right) order, as the metric expects
dots = np.where(mask.T.flatten() == 1)[0]
run_lengths = []
prev = -2  # pixels are 1-indexed, so a correction is required
for b in dots:
    if b > prev + 1:
        run_lengths.extend((b + 1, 0))
    run_lengths[-1] += 1
    prev = b
rle = ' '.join(map(str, run_lengths))
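A decoder for this format is handy for sanity-checking submissions locally. This is a sketch (the function name is mine) that inverts the encoding above, i.e. column-major order with 1-indexed pixels:

```python
import numpy as np

def rle_decode(rle, shape=(101, 101)):
    """Invert the run-length encoding: 'start length start length ...',
    column-major, 1-indexed, back into a binary mask."""
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    tokens = list(map(int, rle.split()))
    for start, length in zip(tokens[::2], tokens[1::2]):
        flat[start - 1:start - 1 + length] = 1
    # reshape then transpose to undo the mask.T.flatten() of the encoder
    return flat.reshape(shape).T
```

Encoding a mask and decoding the result should give back the original mask exactly.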
4. Data Description:
The dataset consists of images and masks as .png files; the link for the data is below.
5. Existing Approaches:
- This kernel trains with Focal Loss first, followed by Lovász loss; BCE can replace Focal Loss.
- This kernel uses deep supervision to accelerate training to just 60 epochs, for training that would normally take 200 epochs.
- Here, snapshot ensembling and CoordConv have been used to get a really good score.
6. First Cut Solution:
At first, an easy baseline (UNet-ResNet34) was used. Only the necessary transforms (Normalize, ToTensor()) were applied here. The loss was binary cross-entropy, and the Adam optimizer was chosen with a learning rate of 3e-4.
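The baseline setup can be sketched as follows. This is a minimal sketch: the single `nn.Conv2d` is a stand-in for the actual UNet-ResNet34 (which in practice might come from a library such as segmentation_models_pytorch), and the random tensors stand in for a real data loader:

```python
import torch
import torch.nn as nn

# stand-in for the UNet-ResNet34 encoder-decoder; the training setup
# (loss, optimizer, learning rate) is what this sketch illustrates
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

criterion = nn.BCEWithLogitsLoss()                       # binary cross-entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# one illustrative training step on random data
images = torch.randn(4, 3, 128, 128)                     # batch of padded 128x128 images
masks = torch.randint(0, 2, (4, 1, 128, 128)).float()    # binary salt masks

optimizer.zero_grad()
loss = criterion(model(images), masks)
loss.backward()
optimizer.step()
```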
7. Exploratory Data Analysis:
The train dataset consists of 4000 images and the test dataset of 18000 images. The competition has very little training data, so overfitting is a risk throughout; transfer learning and data augmentation are needed.
All train and test images are of size 3x101x101, and the train and test masks are 101x101. Some anomalies are present in the dataset: black images with empty masks, and vertical masks, as shown below. The yellow part below shows salt.
Test image preprocessing involves replicate-padding the image to 128x128 and normalising it channel-wise to ((0,0,0),(1,1,1)). The 128 size is important, as the solution uses a UNet, which is easiest to work with on power-of-two dimensions.
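This preprocessing can be sketched as below. The 13/14 padding split and the interpretation of ((0,0,0),(1,1,1)) as a mean-0/std-1 normalisation applied to [0,1]-scaled pixels (the convention used by Albumentations-style Normalize) are assumptions on my part:

```python
import numpy as np

def preprocess(img):
    """Replicate-pad a 101x101x3 uint8 image to 128x128, then normalise
    channel-wise with mean (0,0,0) and std (1,1,1) on [0,1]-scaled pixels."""
    top, left = 13, 13            # 101 + 13 + 14 = 128; the exact split
    bottom, right = 14, 14        # is an assumption, not from the source
    img = np.pad(img, ((top, bottom), (left, right), (0, 0)), mode='edge')
    mean = np.array([0.0, 0.0, 0.0])
    std = np.array([1.0, 1.0, 1.0])
    return (img / 255.0 - mean) / std
```

With mean 0 and std 1 this amounts to scaling pixel values into [0, 1] after padding.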
Salt Coverage Computation: Salt coverage seemed to be a very important variable to focus on, so the salt coverage of every mask in the train dataset was computed.
coverage = mask.sum() / (101 * 101)
# rounding this value to int for easy validation
# values in salt coverage = [0,1,2,3,4,5,6,7,8,9,10]
coverage_class = round(coverage * 10)
SCSE blocks add an attention mechanism to convolutional networks. Attention, in a broad sense, is nothing but focusing on some things and not all. This is achieved here by adding parameters that judge which spatial pixels and which channels to focus on.
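A concurrent spatial and channel squeeze-and-excitation block can be sketched like this (a minimal sketch of the standard SCSE design, not the exact implementation from the notebook):

```python
import torch
import torch.nn as nn

class SCSE(nn.Module):
    """Concurrent spatial and channel squeeze-and-excitation.
    cSE learns which channels to focus on; sSE learns which pixels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cse = nn.Sequential(            # channel attention: global pool
            nn.AdaptiveAvgPool2d(1),         # then a bottleneck MLP per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.sse = nn.Sequential(            # spatial attention: one weight
            nn.Conv2d(channels, 1, 1),       # per pixel via a 1x1 conv
            nn.Sigmoid(),
        )

    def forward(self, x):
        # reweight the input by both gates and sum the two branches
        return x * self.cse(x) + x * self.sse(x)
```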
In conv networks, the deep layers are left with high-level features, while the early layers contain low-level features. Hypercolumns give the network much more predictive power by giving it access to all the lower-level details along with the higher-level ones. They achieve this by taking the output of every stage, upsampling it to the target size, concatenating everything, and passing the result to the last convolution.
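The upsample-and-concatenate step can be sketched in a few lines (the function name is mine; the decoder-stage outputs are assumed to arrive as a list of feature maps):

```python
import torch
import torch.nn.functional as F

def hypercolumn(features, size=(128, 128)):
    """Bilinearly upsample each stage's output to the target size and
    concatenate along the channel dimension, so the final convolution
    sees low-level and high-level features together."""
    ups = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
           for f in features]
    return torch.cat(ups, dim=1)
```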
Cross-Validation: A stratified 10-fold split has been used, stratifying on the salt coverage of each image.
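A sketch of this split using scikit-learn's StratifiedKFold, with random 0-10 coverage classes as placeholders for the ones computed from the masks:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# placeholder coverage classes (0-10) for the 4000 training images;
# in the real pipeline these come from the salt coverage computation above
coverage_class = np.random.randint(0, 11, size=4000)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds = list(skf.split(np.zeros(len(coverage_class)), coverage_class))
train_idx, val_idx = folds[0]   # e.g. fold 0 as the validation set
```

Stratifying on coverage keeps the proportion of empty, partial, and full masks similar across folds.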
The baseline was performing well but overfitting very fast. UNet with ResNet34 uses ResNet34 as the encoder, taking the output before every downsample in ResNet34 and connecting it to the decoder before the respective upsampling step. The decoder was kept fairly simple, with 64 channels throughout.
So different augmentations were used for train and validation. This made the train loss worse than the val loss at every part of training. After this, training was easy, as the heuristic was to train long enough that the train loss becomes better than the val loss. This corrected the overfitting to some extent. Dropout might seem a very natural solution to overfitting, but I avoided it because it significantly delayed training while providing no better outcome than these augmentations.
transform_train = Compose([...])  # full augmentation list in the Colab notebook
Then Lovász loss, which is a surrogate loss for IoU, was optimised. Training on this loss was very slow, so to accelerate it I started from the pretrained weights of the corrected baseline, and this did accelerate training by at least 80 epochs. Adam with a 1e-4 learning rate was used here. An LR scheduler also helped the training process, with a max_lr of 1e-3 and min_lr of 1e-4, 10 epochs per cycle, and 'triangular2' mode.
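The scheduler can be set up with PyTorch's built-in `torch.optim.lr_scheduler.CyclicLR`. This is a sketch: the `steps_per_epoch` value and the Conv2d stand-in for the UNet are assumptions, not values from the source:

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)     # stand-in for the UNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

steps_per_epoch = 100   # hypothetical; depends on dataset size and batch size
# 10 epochs per cycle -> 5 epochs up, 5 epochs down; 'triangular2' halves the
# amplitude each cycle. cycle_momentum=False is required with Adam, since it
# has no momentum parameter for the scheduler to cycle.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-3,
    step_size_up=5 * steps_per_epoch, mode='triangular2',
    cycle_momentum=False)
```

`scheduler.step()` is then called once per batch, after `optimizer.step()`.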
The next model was the baseline with Spatial and Channel Squeeze-and-Excitation (SCSE) blocks and hypercolumns. SCSE blocks are placed only in the decoder, and the hypercolumns use bilinear upsampling. The reduction factor in the SCSE block was kept at the default of 16. The decoder here also used upsampling instead of ConvTranspose2d.
This network showed only a little improvement, so error analysis was performed on it. The analysis showed that a considerable amount of the error comes from the network's inaccuracy in predicting whether salt is present at all: around 11 false positives and 24 false negatives were among the 0.0-IoU predictions.
The next model tried a simple addition to the same architecture: judging at the center block whether there is salt in the image. That is, take the output of the encoder, classify salt vs. no salt there, and train only the images with non-empty masks for semantic segmentation. This is a sort of hard attention, as the targets are multiplied by 1 or 0. Training ran for 200 epochs: 30 epochs with BCE loss (Adam, 3e-4), then Lovász loss with Adam at 1e-4 for 90 epochs, then a cyclic learning rate (1e-3 to 1e-4) for 80 epochs. The binary classifier on the center block was 92% accurate on the val set, while the metric was 0.87 on the val set.
The loss calculation for this model multiplies the segmentation output by 0 if the image has no salt and by 1 if salt is present; this gated output is used for the Lovász loss, while every image is used for the classification loss (BCE).
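This two-headed loss can be sketched as follows. The function name is mine, and `lovasz_fn` is a placeholder for the Lovász hinge implementation used in the notebook:

```python
import torch
import torch.nn.functional as F

def combined_loss(seg_logits, cls_logits, masks, lovasz_fn):
    """Two-headed loss sketch: the classifier head (cls_logits, taken from
    the encoder's center block) trains on every image with BCE, while the
    segmentation output is zeroed for empty-mask images before the Lovász
    loss, acting as hard 0/1 attention."""
    has_salt = (masks.flatten(1).sum(dim=1) > 0).float()   # (B,) 0 or 1
    cls_loss = F.binary_cross_entropy_with_logits(
        cls_logits.squeeze(1), has_salt)
    gate = has_salt.view(-1, 1, 1, 1)                      # broadcast per image
    seg_loss = lovasz_fn(seg_logits * gate, masks)
    return seg_loss + cls_loss
```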
This solution improved the overall training process a lot. Convergence was very fast, the false positives went from 11 to 6, and the training of non-empty images also benefited.
10. Results & Deployment:
In the figures above, the legend is the last picture: orange -> true positive, grey -> false negative, red -> false positive, blue -> true negative. These images were made by overlaying the predictions on the true masks; this was done only for masks in the val (fold 0) set with metric == 0.0.
- Figure 2 clearly shows fewer errors than Figure 1, which is expected, as Figure 2 is from the best model.
- A high number of errors involve blue and grey (true negatives and false negatives).
- Many errors are also made at the corners and edges. When the metric was computed on a (90, 90) crop of the targets, it was much better than the actual metric.
- The models also seem to confuse a complete mask with no mask at all.
- Vertical masks, as expected, were causing problems.
In this table, the iou column represents the metric groups, and the second column is the count of images from the whole dataset in that metric group.
- Table 1 is from the best model; it clearly shows overall improvement compared to Table 2.
- This is interesting, as adding the binary judge to the model was meant to influence only the 0.0 IoU range (which decreased), but the 1.0 IoU range improved (increased) as well.
The best model has been deployed on Google Cloud. The deployment process was kept fairly simple and was done using Flask.
11. Future Work:
- Pseudo-labelling looks very promising in this case, owing to such a small train set.
- A K-fold ensemble will improve the score by at least 0.01 IoU.
- Snapshot ensembling or stochastic weight averaging will also improve the score.
- Deep supervision can be added to this network for better convergence.
- CoordConv has been known to improve scores a bit.