ML in seismic processing

Designing a metric for seismic denoising

Formal validation of ML models can be difficult, but one cannot do without it

Antonina Arefina
Data Analysis Center
8 min read · Mar 25, 2021


One of the most important things to do when attacking a problem with machine learning (ML) is to define a relevant metric for comparing targets and models’ outputs.
When a neural network is trained in a supervised mode, the researcher feeds it a training set that consists of input samples and desired targets. A learning algorithm adapts the model’s parameters to make the outputs similar to the targets. The researcher can also tune the model’s hyperparameters to obtain better quality. All in all, the process is highly iterative, and one needs to evaluate ML models and select those that perform better.
Evaluating ML performance, particularly when training in supervised mode, is typically reduced to comparing models’ outputs with targets using some relevant metric for a dedicated test dataset. The average metric value across the whole dataset gives the resulting score.
There are several well-established metrics for classic ML tasks: F₁ for classification or IoU for segmentation, to name a few. But when ML is applied to more specific tasks, it often turns out that the standard metrics are not suitable.
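For instance, both of these standard metrics can be computed in a few lines. This is a minimal sketch for the binary case:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1: the harmonic mean of precision and recall for binary labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn)

def iou(mask_true, mask_pred):
    """Intersection over union for binary segmentation masks."""
    inter = np.logical_and(mask_true, mask_pred).sum()
    union = np.logical_or(mask_true, mask_pred).sum()
    return inter / union
```

Both return values in [0, 1], with 1 meaning a perfect match, which makes model comparison straightforward.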

Ground roll attenuation

Our team works with tasks for the Oil & Gas industry. One of the most important methods used for investigating oil deposits is seismic exploration.

During a seismic survey, a network of seismic sensors is located on the ground, and an elastic wave is excited, for example, by an explosion. It causes vibrations in the rock mass that are recorded by the sensors. The traces, i.e. the signals recorded by individual sensors during a single shot, are combined into seismic gathers. This data is a valuable source of knowledge about the subsurface structure of a region.

Seismic exploration

A seismic signal is affected by various types of noise, and the one that is inevitably present in any land survey is ground roll. It is generated by near-surface waves that have low velocity and low frequency but significant amplitudes. It overlaps with reflected waves and masks the signal of interest. Ground roll attenuation is a basic step in seismic data processing. Its quality strongly affects the subsequent steps and the quality of the final seismic volume.

Ground roll on a seismic gather

The ground roll filtering task is traditionally solved in a semi-automatic mode, largely by trial and error. A human expert selects several seismic gathers and tunes filter parameters to reduce the low-frequency noise associated with the ground roll. Then the obtained parameters are applied to the whole dataset. Sometimes this process is repeated several times. The specific filtering steps depend on the software the geophysicist uses and the features of the field being processed.

Below is an example of raw and denoised seismic gathers and their difference.

Raw and denoised seismic gathers and their difference

You can see that the noise is not fully removed. The result of ground roll attenuation is always a “low enough” noise level. The other thing to note is that the whole seismic gather is slightly changed, not just the ground roll area. This happens due to the particularities of the attenuation process.

Our goal is to create an ML-driven ground roll attenuation procedure that is fast, precise, and doesn’t require human interference.

Ground roll attenuation quality is usually assessed by a thorough investigation of several seismic gathers. The quality control process, as well as denoising itself, varies depending on data features, available software, expertise, and preferences of a person who does the job.

This approach does not work for ML model development. Manual quality control takes too much time, and it has to be done repeatedly during hyperparameter tuning. Moreover, it is impossible to validate all output data in any reasonable time, as a single field can contain thousands of seismic gathers. And last but not least, expert assessment is often subjective, as we will show later.

All in all, when we started to develop an ML model for ground roll attenuation, we faced the absence of a formal metric for denoising quality, so we had to create one ourselves.

Metric for comparing seismic gathers

As a basis for our formal metric, we have taken one popular step from the manual assessment process. Geophysicists often compute and compare average power spectra for the raw and denoised seismic gathers in two windows, one of which is inside the ground roll area and the other outside. The spectra should be similar in the latter case, while the low-frequency components associated with the ground roll should decrease in the former.
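This manual check can be sketched as follows: take a rectangular window of a gather (traces × time samples), average the per-trace power spectra within the window, and compare raw vs. denoised. The window coordinates and the synthetic gathers below are purely illustrative:

```python
import numpy as np

def avg_power_spectrum(gather, trace_slice, time_slice):
    """Average power spectrum over the traces of a rectangular window.

    gather: 2D array of shape (n_traces, n_samples).
    """
    window = gather[trace_slice, time_slice]
    spectra = np.abs(np.fft.rfft(window, axis=1)) ** 2
    return spectra.mean(axis=0)

# Illustrative stand-ins for real data: a random "raw" gather and a
# uniformly damped "denoised" one.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 1000))
denoised = raw * 0.5

# Hypothetical window inside the ground roll area (near traces, early times).
inside = (slice(0, 20), slice(100, 356))
spec_raw = avg_power_spectrum(raw, *inside)
spec_den = avg_power_spectrum(denoised, *inside)

# Inside the ground roll window, low-frequency power should drop after denoising.
low_freq_drop = spec_raw[:10].sum() - spec_den[:10].sum()
```

In a real assessment, the same comparison is repeated for a window outside the ground roll area, where the two spectra should stay close.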

Power spectra inside and outside the ground roll area for raw and denoised seismic gathers

We decided to use a similar procedure to evaluate how close a model’s outputs are to the corresponding targets. Since we compare denoised gathers, the smaller the difference with the target, the better the model’s output.

There are several steps to be formalized:

  • the selection of windows,
  • the way to compare the average spectra.

Selection of windows for computing power spectra

One feature of our ML-based denoising is that the model tries to confine trace modifications to the ground roll area and leave the data outside it unchanged. Our targets, however, are the result of a semi-automated process that touches absolutely all samples in a seismic gather. That’s why we stick to comparing power spectra only inside the ground roll area.

When the assessment is done by a human, they visually evaluate the data and manually select a window of interest. In an automatic procedure, for each gather, we simply take 10% of the traces with the smallest offsets (they are the most likely to be affected by the ground roll because they come from the receivers closest to the source). Instead of computing spectra for some fixed times, we compute a short-time Fourier transform (STFT) along every trace. The size and shift of the window are parameters, but we have fixed them to the defaults of SciPy’s implementation. The window shape is ‘boxcar’, i.e. rectangular.
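The automatic window selection can be sketched as below, assuming each trace comes with an offset value. We use `scipy.signal.stft` with a rectangular window here; the exact parameters of our production pipeline may differ:

```python
import numpy as np
from scipy.signal import stft

def ground_roll_spectra(gather, offsets, fraction=0.1, nperseg=256):
    """STFT power spectra for the fraction of traces with the smallest offsets.

    gather: 2D array (n_traces, n_samples); offsets: 1D array (n_traces,).
    Returns power spectra of shape (n_selected, n_freqs, n_time_windows).
    """
    n_select = max(1, int(len(offsets) * fraction))
    idx = np.argsort(offsets)[:n_select]      # near-offset traces only
    # Rectangular ('boxcar') window, STFT along the time axis of each trace.
    _, _, z = stft(gather[idx], window='boxcar', nperseg=nperseg, axis=1)
    return np.abs(z) ** 2                     # power per frequency and window
```

The selection by offset replaces the expert's visual choice of a window of interest, and the STFT replaces spectra at hand-picked times.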

Comparing power spectra

Now we need to evaluate the difference between the power spectra calculated in the previous step. This can be done in a variety of ways; after some experimenting, we’ve chosen the sum of absolute differences.

Combining it all together

Finally, we average all distances between spectra across all traces and all time shifts to get the resulting difference between two seismic gathers. We have given our metric a short name: Mu.

Scheme for computing metric Mu

Note that averaging is done as the final step, which makes Mu more sensitive to small changes in the data.

Computing the average Mu value between model outputs and the corresponding targets from a test dataset gives the model’s quality score. In fact, this algorithm can be used to evaluate the result of any denoising procedure, manual as well as automated, with respect to some “gold standard”.
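Putting the steps together, a minimal sketch of Mu could look like this. It is a hedged illustration: the window parameters and implementation details of our production code may differ:

```python
import numpy as np
from scipy.signal import stft

def mu(gather_a, gather_b, offsets, fraction=0.1, nperseg=256):
    """Distance between two gathers: the sum of absolute differences
    between STFT power spectra over frequencies, averaged over traces
    and time windows, computed on the near-offset traces only."""
    n_select = max(1, int(len(offsets) * fraction))
    idx = np.argsort(offsets)[:n_select]
    _, _, za = stft(gather_a[idx], window='boxcar', nperseg=nperseg, axis=1)
    _, _, zb = stft(gather_b[idx], window='boxcar', nperseg=nperseg, axis=1)
    pa, pb = np.abs(za) ** 2, np.abs(zb) ** 2
    # Sum absolute differences over the frequency axis first, then average
    # over traces and time windows: averaging last keeps the metric
    # sensitive to small local changes.
    return np.abs(pa - pb).sum(axis=1).mean()
```

`mu(output, target, offsets)` is zero for identical gathers and grows with the spectral mismatch; averaging it over a test set gives the model's score.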

Those interested in a formal description of metric Mu can find it here.

Validating metric Mu

To check that metric Mu is adequate for the task of ground roll attenuation, we designed an experiment, which we called the Blind test, to clarify the metric’s capabilities and limitations.

Blind test scenario

  • We processed 10 seismic gathers with 4 different ML models to remove the ground roll.
  • To each set of processed gathers, the result of the standard semi-automated procedure and the raw input were added as references. We removed all information about the denoising procedure and shuffled each set.
  • 6 geophysicists were asked to rank the denoised gathers in each set from 1 to 5, where 5 corresponded to the best ground roll attenuation. We asked them not to assign the same rank twice, to prevent ties.
  • After the experts had issued their ratings, we restored the information about the denoising procedure and computed Kendall’s coefficient of concordance (Kendall’s W) for the ratings in each set.
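Kendall's W and its bootstrap p-value can be computed directly from the standard formula W = 12·S / (m²(n³ − n)), where S is the sum of squared deviations of the rank sums. A sketch, assuming no ties (as enforced in our blind test):

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's coefficient of concordance.

    ranks: 2D array (n_raters, n_items), each row a permutation of
    1..n_items (no ties assumed)."""
    m, n = ranks.shape
    r = ranks.sum(axis=0)                    # rank sums per item
    s = ((r - r.mean()) ** 2).sum()          # spread of the rank sums
    return 12 * s / (m ** 2 * (n ** 3 - n))

def w_p_value(ranks, n_boot=10000, seed=0):
    """Bootstrap p-value: fraction of random rankings whose W is at
    least as large as the observed one."""
    rng = np.random.default_rng(seed)
    m, n = ranks.shape
    w_obs = kendalls_w(ranks)
    base = np.arange(1, n + 1)
    count = 0
    for _ in range(n_boot):
        random_ranks = np.array([rng.permutation(base) for _ in range(m)])
        if kendalls_w(random_ranks) >= w_obs:
            count += 1
    return count / n_boot
```

W equals 1 when all raters produce identical rankings and 0 when the rank sums are perfectly balanced, e.g. for two exactly opposite rankings.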

This is a sample set that was given to the experts, shown with the information about the denoising procedure restored and the corresponding Mu values.

Sample set of denoising variants from the blind test

Blind test results

Kendall’s W and corresponding p-values for 10 sets from the blind test

The table presents W values for all sets. Larger W values indicate more agreement between the raters. The corresponding p-values were computed by bootstrap. Recall that the p-value is the probability of getting the same or a greater W by chance, so when the p-value is above some threshold, usually 0.05, we cannot claim that the ratings agree. Surprisingly, for 7 sets out of 10 the p-values were greater than 0.05. This means the experts significantly agreed in their ratings for only 3 sets.

The results for the 3 sets with small p-values are in the table below. Grade 1 corresponds to the worst denoising variant, as judged by the expert, and 5 to the best one. S denotes the standard semi-automated ground roll attenuation procedure; M0-M3 stand for processing with different ML models.

The experts’ ratings and metric Mu values for 3 sets with small p-values

For these 3 sets, we computed the aggregated experts’ ratings. The ranks for the semi-automated procedure were dropped because it serves as the base for calculating metric Mu. We then re-ranked each set to get grades from 1 to 4, averaged them, and compared them with the ratings generated by metric Mu. The new values are in the table below.
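The re-ranking step can be sketched with `scipy.stats.rankdata`. The grades below are hypothetical, not the actual blind-test values:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical expert grades for variants [S, M0, M1, M2, M3],
# one row per expert.
grades = np.array([
    [5, 1, 3, 2, 4],
    [4, 2, 3, 1, 5],
    [5, 1, 2, 3, 4],
])
no_s = grades[:, 1:]                  # drop the semi-automated reference S
reranked = rankdata(no_s, axis=1)     # re-rank each expert's grades to 1..4
avg_rating = reranked.mean(axis=0)    # aggregated rating per ML model
```

The resulting averages can then be compared directly with the ranking induced by metric Mu.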

Aggregated experts’ ratings compared to ratings generated by metric Mu

The average experts’ ratings match those based on metric Mu. This means that Mu reflects average human judgment, at least when people agree with each other. Based on these results, we decided to use metric Mu in our model design process.

Conclusion

To sum up, creating a good metric for a specific ML task is crucial. When our team started to develop a neural network model for ground roll attenuation, we found that the expertise-based procedure geophysicists use to assess noise reduction quality doesn’t fit the ML setting. So we designed a dedicated metric, called Mu, for evaluating the difference between seismic gathers, and studied its usability for quality control of ground roll attenuation.

Our experiment has shown that metric Mu can be used to validate denoising quality. It has also demonstrated that experts can differ significantly in their assessment of the same data, so it is really important to have a formal, automated validation procedure.
