Neural nets on seismic data: thoughts on loss function selection and tuning

Ann Antonova
Seismic data analysis using AI
16 min read · Feb 18, 2020

In our previous article, we described an experiment to determine the minimum number of manually labeled seismic cube slices needed to train a neural network on seismic data. Today we continue this topic by choosing the most suitable loss function.

Two basic classes of functions are considered, binary cross-entropy and Intersection over Union, in six variants with parameter selection, as well as combinations of functions from different classes. Regularization of the loss function is also considered.

Spoiler Alert: Network prediction quality increased dramatically.

Business goals

We will not repeat the specifics of seismic surveying, the data obtained, or the tasks of interpreting them; all of this is described in our previous article.

This study was prompted by the results of a competition on finding salt deposits in 2D slices. According to its participants, a whole zoo of loss functions was used to solve that problem, with quite different results.

Therefore, we asked ourselves: can loss function selection really give a significant quality gain for such problems on such data? Or does this hold only under competition conditions, where the struggle is for the fourth or fifth decimal place of metrics predefined by the organizers?

Typically, in tasks solved with neural networks, tuning of the training process relies mainly on the researcher's experience and some heuristics. For image segmentation problems, for example, the most commonly used loss functions are based on assessing how well the shapes of the recognized zones coincide, the so-called Intersection over Union family.

Intuitively, and based on an understanding of their behavior and published results, one expects these functions to outperform those not tailored to images, such as the cross-entropy family. Nevertheless, experiments continue in search of the best option, both for this type of task as a whole and for each task individually.

Seismic data prepared for interpretation have a number of features that can significantly affect the behavior of the loss function. For example, the horizons separating geological layers are smooth, changing sharply only at faults. In addition, the distinguished zones occupy a rather large area relative to the image, so small spots in the interpretation results are most often recognition errors.

As part of this experiment, we tried to find answers to the following local questions:

  1. Is a loss function of the Intersection over Union class really the best for the problem considered below? The answer seems obvious, but which one exactly? And is it really the best from the business point of view?
  2. Is it possible to improve the results by combining functions of different classes? For example, Intersection over Union and cross-entropy with different weights.
  3. Is it possible to improve the results by adding to the loss function various additions designed specifically for seismic data?

And to a more general question:

Is it worth bothering with loss function selection for seismic data interpretation tasks, or is the quality gain not worth the time spent on such studies? Perhaps it is better to pick a function intuitively and spend the effort on tuning more significant training parameters?

General experiment and data description

For the experiment, we took the same task of isolating geological layers on 2D slices of a seismic cube (see Figure 1).

Figure 1. Example of a 2D slice (left) and the result of marking the corresponding geological layers (right)(source)

And the same fully labeled dataset from the Dutch sector of the North Sea. The source seismic data are available at the Open Seismic Repository: Project Netherlands Offshore F3 Block. A brief description can be found in Silva et al., «Netherlands Dataset: A New Public Dataset for Machine Learning in Seismic Interpretation».

Since in our case we are talking about 2D slices, we did not use the original 3D cube, but the already made “slicing”, available here: Netherlands F3 Interpretation Dataset.

During the experiment, we solved the following problems:

  1. We looked through the source data and selected the slices whose masks are closest in quality to manual marking (as in the previous experiment).
  2. We fixed the neural network architecture, the training methodology and parameters, and the principle of selecting slices for training and validation (as in the previous experiment).
  3. We chose the studied loss functions.
  4. We selected the best parameters for the parameterized loss functions.
  5. We trained neural networks with different functions on the same data volume and chose the best function.
  6. We trained neural networks with different combinations of the selected function with functions of another class on the same amount of data.
  7. We trained neural networks with regularization of the selected function on the same amount of data.

For comparison, we used the results of the previous experiment, in which the loss function was chosen purely intuitively: a combination of functions of different classes with coefficients also picked "quasi-randomly".

The results of this experiment, in the form of evaluation metrics and the slice masks predicted by the networks, are presented below.

Problem 1. Data selection

As initial data, we used ready-made inlines and crosslines of the seismic cube from the Dutch sector of the North Sea. As in the previous experiment, simulating the interpreter's work, we looked through all the slices and selected only those with clean masks for training. As a result, 700 crosslines and 400 inlines were selected from ~1600 source images.

Problem 2. Experiment parameters

This and the following sections are of interest primarily to Data Science specialists, so the appropriate terminology is used.

For training, we chose 5% of the total number of slices, with inlines and crosslines in equal shares, i.e. 40 + 40. The slices were selected evenly throughout the cube. For validation, one slice was used between adjacent images of the training sample, so the validation sample consisted of 39 inlines and 39 crosslines.

The remaining 321 inlines and 621 crosslines fell into the test dataset on which the results were compared.

Similar to the previous experiment, image preprocessing was not performed, and the same UNet architecture with the same training parameters was used.

The target slice masks were represented as binary cubes of dimension HxWx10, where the last dimension corresponds to the number of classes, and each value of the cube is 0 or 1 depending on whether the corresponding image pixel belongs to that layer's class.
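Such a target cube can be built from an integer class mask, for example as in this minimal NumPy sketch (the function name is illustrative, not from the article):

```python
import numpy as np

def mask_to_binary_cube(mask, n_classes=10):
    """Convert an HxW integer class mask into an HxWxC binary cube,
    where channel c is 1 wherever the pixel belongs to layer class c."""
    cube = np.zeros(mask.shape + (n_classes,), dtype=np.uint8)
    for c in range(n_classes):
        cube[..., c] = (mask == c)
    return cube
```

Every pixel then contributes exactly one "1" across the channel axis.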

Each network forecast was a similar cube, whose values relate to the probability that a given image pixel belongs to the corresponding layer's class. In most cases this value is converted into a probability by a sigmoid. However, this should not be done for all loss functions, so no activation was used on the last layer of the network; instead, the corresponding conversions were performed inside the loss functions themselves.

To reduce the influence of randomly initialized weights on the results, the network was first trained for one epoch with binary cross-entropy as the loss function. All other training runs started from these weights.

Problem 3. Loss function selection

For the experiment, 2 basic classes of functions were selected in 6 versions:

Binary cross entropy:

  1. binary cross entropy;
  2. weighted binary cross entropy;
  3. balanced binary cross entropy.

Intersection over Union:

  1. Jaccard loss;
  2. Tversky loss;
  3. Lovász loss.

A brief description of the listed functions, with Keras code, is given in Lars Nieradzik's article Losses for Image Segmentation (see Further reading).

Here we present the crucial points with corresponding links (where possible) to a detailed description of each function.

For our experiment, it is important that the function used during training be consistent with the metric used to evaluate the network's forecast on the hold-out sample. Therefore, we used our own code, implemented in TensorFlow and NumPy directly from the formulas below.

The following designations are used:

  • pt for the binary target mask (Ground Truth);
  • pp for the forecast mask.

For all functions, unless otherwise specified, it is assumed that the network prediction mask contains probabilities for each pixel of the image, i.e. values in the interval (0, 1).

Binary cross entropy

Description: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

This function seeks to bring the distribution of network forecasts closer to the target distribution, penalizing not only erroneous predictions but also uncertain ones.
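In formula form, BCE(pt, pp) = −mean(pt·log pp + (1 − pt)·log(1 − pp)). A minimal NumPy sketch, taking raw network outputs (logits) and applying the sigmoid inside the function, as described above (the clipping constant is our assumption for numerical stability):

```python
import numpy as np

def binary_cross_entropy(pt, logits, eps=1e-7):
    """BCE over a batch of mask values; sigmoid is applied inside the loss."""
    pp = 1.0 / (1.0 + np.exp(-logits))   # logits -> probabilities
    pp = np.clip(pp, eps, 1.0 - eps)     # avoid log(0)
    return -np.mean(pt * np.log(pp) + (1.0 - pt) * np.log(1.0 - pp))
```

Note that a confident correct forecast yields a much lower loss than an uncertain one, which is exactly the penalty on uncertainty mentioned above.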

Weighted binary cross entropy

This function coincides with binary cross-entropy at beta = 1. It is recommended for strong class imbalance. For beta > 1 the number of false negatives decreases and recall increases; for beta < 1 the number of false positives decreases and precision increases.
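The weighted variant can be sketched as follows (probabilities as input; beta scales the positive term, and beta = 1 recovers plain BCE):

```python
import numpy as np

def weighted_bce(pt, pp, beta=2.0, eps=1e-7):
    """Binary cross-entropy with the positive ('one') term scaled by beta."""
    pp = np.clip(pp, eps, 1.0 - eps)
    return -np.mean(beta * pt * np.log(pp) + (1.0 - pt) * np.log(1.0 - pp))
```

With beta > 1 a missed positive costs more, pushing the network toward higher recall.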

Balanced binary cross entropy

This function is similar to weighted cross-entropy, but it adjusts the contribution not only of the "ones" but also of the zero values of the target mask. It coincides (up to a constant) with binary cross-entropy at beta = 0.5.
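A sketch of the balanced variant: beta in (0, 1) weights the "ones" and 1 − beta the zeros, so at beta = 0.5 it equals half of plain BCE, i.e. BCE up to a constant:

```python
import numpy as np

def balanced_bce(pt, pp, beta=0.7, eps=1e-7):
    """Cross-entropy with beta on the positive term and (1 - beta) on the negative."""
    pp = np.clip(pp, eps, 1.0 - eps)
    return -np.mean(beta * pt * np.log(pp)
                    + (1.0 - beta) * (1.0 - pt) * np.log(1.0 - pp))
```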

Jaccard loss

The Jaccard coefficient (aka Intersection over Union, IoU) measures the "similarity" of two areas: J(A, B) = |A ∩ B| / |A ∪ B|. The Dice index does practically the same thing: D(A, B) = 2|A ∩ B| / (|A| + |B|).

Since the two coefficients are monotonically related, it makes no sense to consider both. We chose Jaccard.

When both areas are defined by binary masks, the formula above is easily rewritten in terms of mask values: J = Σ(pt · pp) / (Σpt + Σpp − Σ(pt · pp)).

For non-binary forecasts, optimizing the Jaccard coefficient is a non-trivial task. We use the same formula, with probabilities in the forecast mask, as an imitation of the original coefficient, which gives the loss function JaccardLoss = 1 − Σ(pt · pp) / (Σpt + Σpp − Σ(pt · pp)).
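The resulting "soft" Jaccard loss can be sketched like this (the smoothing constant is our addition, a common trick to avoid division by zero on empty masks; the article's exact implementation may differ):

```python
import numpy as np

def jaccard_loss(pt, pp, smooth=1.0):
    """1 - soft Jaccard: intersection and union computed on probabilities."""
    intersection = np.sum(pt * pp)
    union = np.sum(pt) + np.sum(pp) - intersection
    return 1.0 - (intersection + smooth) / (union + smooth)
```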

Tversky loss

Description: https://arxiv.org/pdf/1706.05721.pdf

This function is a parameterized version of Jaccard optimization: it coincides with it at alpha = beta = 1 and with the Dice index at alpha = beta = 0.5. For other non-zero, non-coinciding values, the emphasis can be shifted toward precision or recall, just as in weighted and balanced cross-entropy.

The emphasis shift can be re-expressed with a single coefficient lying in the interval (0, 1), which distributes the weight between the false negative and false positive terms in the denominator.
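A single-coefficient sketch; here we assume beta weights the false negative term and 1 − beta the false positive term (this weight assignment is our reading; Salehi et al.'s two-parameter form uses independent alpha and beta):

```python
import numpy as np

def tversky_loss(pt, pp, beta=0.7, eps=1e-7):
    """1 - Tversky index with weights beta / (1 - beta) on the FN / FP terms."""
    tp = np.sum(pt * pp)           # soft true positives
    fn = np.sum(pt * (1.0 - pp))   # soft false negatives
    fp = np.sum((1.0 - pt) * pp)   # soft false positives
    return 1.0 - (tp + eps) / (tp + beta * fn + (1.0 - beta) * fp + eps)
```

At beta = 0.5 the denominator becomes tp + (fn + fp) / 2, i.e. the Dice-based loss.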

Lovász loss

It is difficult to give a closed formula for this function: it optimizes the Jaccard coefficient via its Lovász extension, an algorithm based on sorted errors.

A description of the function can be found here; one of its code implementations, here.

NB!

To simplify the comparison of values and graphs, hereinafter the term "Jaccard coefficient" means "one minus the coefficient itself", so lower values are better. Jaccard loss is one way to optimize this coefficient, along with Tversky loss and Lovász loss.

Problem 4. Parameters tuning for parameterized loss functions

To select the best loss function on the same dataset, an evaluation criterion is needed. We chose the average/median number of connected components in the resulting masks. In addition, we used the Jaccard coefficient for forecast masks collapsed to a single-layer argmax and then split back into binarized layers.

The number of connected components (i.e., contiguous spots of one color) in each forecast is an indirect measure of how much subsequent refinement the interpreter will need. If this value is 10, the layers are identified correctly, and at most minor horizon corrections are needed. If there are a few more, only small areas of the image need "cleaning". If there are substantially more, everything is bad, and a complete re-marking may even be needed.

The Jaccard coefficient, in turn, characterizes the coincidence of image zones assigned to one class and their boundaries.
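The article does not show how the components criterion is computed; a pure-Python sketch over the per-pixel argmax class mask could look like this (4-connectivity flood fill; the function name is illustrative):

```python
import numpy as np

def count_components(class_mask):
    """Count 4-connected same-class regions in an HxW class-index mask."""
    h, w = class_mask.shape
    seen = np.zeros((h, w), dtype=bool)
    count = 0
    for i in range(h):
        for j in range(w):
            if seen[i, j]:
                continue
            count += 1                         # new region found
            stack, c = [(i, j)], class_mask[i, j]
            seen[i, j] = True
            while stack:                       # flood-fill the region
                y, x = stack.pop()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny, nx]
                            and class_mask[ny, nx] == c):
                        seen[ny, nx] = True
                        stack.append((ny, nx))
    return count
```

On a perfectly clean forecast for this task the count equals the number of classes, i.e. 10.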

Weighted binary cross entropy

According to the results of the experiments, the parameter beta = 2 was chosen:

Figure 2. Comparison of the quality of the network forecast according to the main loss function and selected criteria

Figure 3. Statistics for the number of connected components in terms of beta parameter values

Balanced binary cross entropy

According to the results of the experiments, the parameter beta = 0.7 was chosen:

Figure 4. Prediction quality comparison according to main loss function and selected parameters

Figure 5. Statistics for number of connected components

Tversky loss

According to the results of the experiments, the parameter beta = 0.7 was chosen:

Figure 6. Prediction quality comparison according to main loss function and selected parameters

Figure 7. Statistics for number of connected components

From the above figures, two conclusions can be drawn.

First, the selected criteria correlate fairly well with each other, i.e. the Jaccard coefficient is consistent with the estimate of the required refinement volume. Second, the behavior of the cross-entropy loss functions differs noticeably from the behavior of the criteria, i.e. training with this category of functions alone, without additional evaluation of the results, is not advisable.

Problem 5. Best loss function selection

Now let's compare the results shown by the six selected loss functions on the same dataset. For completeness, we added the predictions of the network obtained in the previous experiment.

Figure 8. Comparison of prediction results of different networks with different loss functions with selected criteria

Table 1. Average criteria values

Figure 9. Comparison of network forecasts by the number of predictions with the specified number of connected components

From the presented diagrams and tables, the following conclusions can be drawn regarding the use of “solo” loss functions:

  1. In our case, the "Jaccard" functions of the Intersection over Union class indeed show better values than the cross-entropy ones. Significantly better, in fact.
  2. Among the selected approaches to optimizing the Jaccard coefficient, Lovász loss showed the best result.

Let us visually compare the forecasts for slices with one of the best and one of the worst Lovász loss values and connected component counts. The target mask is shown in the upper right corner; the forecast obtained in the previous experiment, in the lower right:

Figure 10. Network predictions for one of the "best" slices

Figure 11. Network predictions for one of the "worst" slices

It can be seen that all networks do equally well on easily recognizable slices. But even on a poorly recognizable slice, where all networks fail, the Lovász loss forecast is visually better than those of the other networks, although this slice is one of the worst by that loss function.

So, at this step we have a clear leader, Lovász loss, whose results can be described as follows:

  • about 60% of forecasts are close to ideal, i.e. require at most a few adjustments to individual sections of the horizons;
  • approximately 30% of forecasts contain no more than 2 extra spots, i.e. require minor improvements;
  • approximately 1% of forecasts contain from 10 to 25 extra spots, i.e. require substantial improvement.

At this step, simply by replacing the loss function, we achieved a significant improvement in the results compared to the previous experiment.

Can the result be improved further by combining different functions? Let's check.

Problem 6. Selecting the best loss functions combination

Combining loss functions of different nature is quite common, but finding the best combination is not easy. A good example is the result of the previous experiment, which turned out even worse than a "solo" function. The purpose of such combinations is to improve the result by optimizing according to different principles.

Let's try various combinations of the function selected in the previous step with others, but not with all of them in a row. We confine ourselves to combinations with functions of a different type, in this case the cross-entropy ones; combining functions of the same type makes no sense.

In total, we checked 3 pairs with 9 possible coefficient splits each (from 0.1/0.9 to 0.9/0.1). In the figures below, the x-axis shows the coefficient in front of Lovász loss; the coefficient in front of the second function equals one minus it. The leftmost value corresponds to the cross-entropy function alone, the rightmost to Lovász loss alone.
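The weighting scheme itself is straightforward; a sketch (names illustrative), where loss_a could be Lovász loss and loss_b one of the cross-entropy variants:

```python
def make_combined_loss(loss_a, loss_b, w):
    """Return a loss function computing w * loss_a + (1 - w) * loss_b."""
    def combined(pt, pp):
        return w * loss_a(pt, pp) + (1.0 - w) * loss_b(pt, pp)
    return combined
```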

Figure 12. Evaluation of the forecast results of networks trained on BCE + Lovász

Figure 13. Evaluation of the forecast results of networks trained on WBCE + Lovász

Figure 14. Evaluation of the forecast results of networks trained on BBCE + Lovász

It can be seen that the selected "solo" function was not improved by adding cross-entropy. A decrease of 1-2 thousandths in some Jaccard coefficient values might matter in a competition, but from the business point of view it does not compensate for the degradation of the connected components criterion.

To verify that this behavior is typical for combinations of functions of different types, we ran a similar series of trainings for Jaccard loss. Only one pair managed to slightly improve both criteria at once:

0.8 * JaccardLoss + 0.2 * BBCE

Average of connected components: 11.5695 -> 11.2895

Average of Jaccard: 0.0307 -> 0.0283

But these values are still worse than those of the "solo" Lovász loss.

Thus, it makes sense to investigate combinations of functions of different nature on our data only under competition conditions, or given spare time and resources; a significant gain in quality is unlikely.

Problem 7. “Best” loss function regularization

At this step, we tried to improve the previously selected loss function with an addition designed specifically for seismic data. This is a regularization described in the article: «Neural-networks for geophysicists and their application to seismic data interpretation».

The article mentions that standard regularizations such as weight decay do not work well on seismic data. Instead, an approach based on the norm of the gradient matrix is proposed, aimed at smoothing class boundaries. The approach is logical if we recall that the boundaries of geological layers should be smooth.
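A sketch of such a smoothness penalty on the forecast (a finite-difference version of the gradient norm; the exact form in Peters et al. may differ):

```python
import numpy as np

def smoothness_penalty(pp_cube):
    """Mean squared spatial gradient of the forecast cube; large where
    class probabilities change abruptly, i.e. at rough boundaries."""
    dy = np.diff(pp_cube, axis=0)  # vertical finite differences
    dx = np.diff(pp_cube, axis=1)  # horizontal finite differences
    return float((dy ** 2).mean() + (dx ** 2).mean())
```

The total loss is then the base loss plus this penalty times a coefficient, which has to be tuned separately.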

However, with such regularization one should expect some deterioration by the Jaccard criterion, since smoothed class boundaries are less likely to coincide with the abrupt transitions possible in manual markup. But we have one more criterion for verification: the number of connected components.

We trained 13 networks with the regularization described in the article, with the coefficient in front of it taking values from 0.1 to 0.0001. The figures below show some of the evaluations by both criteria.

Figure 15. Comparison of the quality of the network forecast by the selected criteria

Figure 16. Statistics for the number of connected components for different values of the regularization coefficient

It can be seen that regularization with a coefficient of 0.025 significantly improved the statistics on the connected components criterion. The Jaccard criterion in this case expectedly increased, to 0.0357, but this is a small price compared to the reduction in manual refinement.

Figure 17. Comparison of network forecasts by the number of predictions with the specified number of connected components

Finally, let us compare the class boundaries on the target and predicted masks for the previously selected "worst" slice.

Figure 18. Network forecast for one of the "worst" slices

Figure 19. Overlaying part of the horizons of the target mask and forecast

As the figures show, the forecast mask is, of course, wrong in places, but at the same time it smooths the oscillations of the target horizons, i.e. it corrects minor errors in the initial markup.

Summary characteristics of the selected loss function with regularization:

  • about 87% of forecasts are close to ideal, i.e. require at most a few adjustments to individual sections of the horizons;
  • approximately 10% of forecasts contain 1 extra spot, i.e. require minor improvements;
  • about 3% of forecasts contain from 2 to 5 extra spots, i.e. require somewhat more substantial refinement.

Conclusion

  1. By adjusting just one training parameter, the loss function, we significantly improved the quality of the network forecast and reduced the amount of necessary refinement by about three times.
  2. With an acute lack of time for experiments, it is worth taking any of the Intersection over Union loss functions (Lovász loss turned out to be the best in the considered problem) and selecting a coefficient for the smoothing regularization. Given more time, you can experiment with combinations of this class of functions with cross-entropy, but the chances of a significant quality gain are slim.
  3. This approach works well for business tasks, because it also corrects minor errors in the initial markup. But regularization is unlikely to be useful in a competitive environment, as it degrades the target metric score.

Further reading:

  1. Reinaldo Mozart Silva, Lais Baroni, Rodrigo S. Ferreira, Daniel Civitarese, Daniela Szwarcman, Emilio Vital Brazil. Netherlands Dataset: A New Public Dataset for Machine Learning in Seismic Interpretation (https://arxiv.org/pdf/1904.00770v1.pdf)
  2. Lars Nieradzik. Losses for Image Segmentation (https://lars76.github.io/neural-networks/object-detection/losses-for-segmentation/)
  3. Daniel Godoy. Understanding binary cross-entropy / log loss: a visual explanation (https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a)
  4. Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Tversky loss function for image segmentation using 3D fully convolutional deep networks (https://arxiv.org/pdf/1706.05721.pdf)
  5. Maxim Berman, Amal Rannen Triki, Matthew B. Blaschko. The Lovasz-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks (https://arxiv.org/pdf/1705.08790.pdf)
  6. Bas Peters, Eldad Haber, and Justin Granek. Neural-networks for geophysicists and their application to seismic data interpretation (https://arxiv.org/pdf/1903.11215v1.pdf)
