Sergey Tsimfer
Data Analysis Center
9 min read · Mar 2, 2021


For several years, the oil & gas industry, just like many others, has been trying its hardest to implement machine learning in its workflows. If done right, deep learning can enhance accuracy and simultaneously speed up the entire pipeline of oil production, from the earliest stages of exploration to drilling.

But, needless to say, researchers must carefully inspect hundreds of architectures and training approaches before deploying neural networks into the daily work of seismic specialists. This calls for quick model prototyping, which in turn requires the ability to generate data rapidly.

Unfortunately, the core formats for storing volumetric seismic information were not built with speed in mind: the most common one, SEG-Y, was developed almost 30 years ago with entirely different concerns, such as memory footprint and ease of sequential processing. Furthermore, labels of various types (horizons, faults, facies, etc.) can come from several software packages, each of which stores them in a completely different way: point cloud, regular grid, interpolation nodes, and so on.

Working with all of these data types, and, more importantly, doing so quickly, requires a lot of code with a well-thought-out structure. This is exactly what seismiQB is: an open-source Python framework capable of interacting with a wide range of seismic data storages through highly optimized load procedures. It also provides an interface for building and deploying machine learning models by relying on our other library, BatchFlow: you can learn more about its features in a separate publication.

In this article, we will go through all the steps that must be implemented to train and deploy ML in a typical geology workflow:

  • loading data from seismic cubes
  • creating targets (i.e. segmentation masks) from multiple types of labels
  • evaluating the quality of labels
  • configuring and training neural networks
  • assembling individual patches with predictions

For each of those stages, we provide a short explanation, a code snippet, and a quick peek into the inner workings of our library together with the motivation behind design choices.

Check out the Seismic Horizon Detection with Neural Networks article: it contains a brief introduction to the field exploration pipeline, as well as an example of model deployment in one of the world’s largest petroleum companies.

Let’s dive right in!

Load seismic data

Seismic cubes are usually stored in the SEG-Y format, which contains individual traces together with detailed meta-information. During preprocessing, both the data values and the meta-information change; by the time a cube reaches the interpretation stage, we can think of it as an enormous 3D array of traces. Nonetheless, under the hood it is still stored as individual traces, which does not suit the need for fast loading of slides or subvolumes.

The easiest way to solve this problem is to convert seismic cubes into actual (physical) 3D arrays: for example, the usual scientific HDF5 format matches our needs perfectly. By transforming the data once, we are able to decrease load times by a factor of ~30!

To perform the conversion, we need to pass through the entire cube. It is a good idea to collect trace-wise statistics at the same time: means, standard deviations, etc. They help greatly later, when normalizing the data to the desired range.
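As a minimal sketch of such a single pass, here is how a conversion with per-trace statistics could be written directly with segyio and h5py (the file names are placeholders, and this is an illustration of the idea rather than the actual seismiQB conversion routine):

```python
# A single pass over a SEG-Y cube: copy traces into an HDF5 3D array
# and collect per-trace statistics along the way.
import h5py
import numpy as np
import segyio

SEGY_PATH = "cube.sgy"    # hypothetical input file
HDF5_PATH = "cube.hdf5"   # hypothetical output file

with segyio.open(SEGY_PATH) as segy:
    segy.mmap()
    n_ilines, n_xlines = len(segy.ilines), len(segy.xlines)
    depth = len(segy.samples)

    with h5py.File(HDF5_PATH, "w") as hdf5:
        cube = hdf5.create_dataset("cube", shape=(n_ilines, n_xlines, depth),
                                   dtype=np.float32)
        means = np.empty((n_ilines, n_xlines), dtype=np.float64)
        stds = np.empty((n_ilines, n_xlines), dtype=np.float64)

        # One pass over the cube: copy traces and collect per-trace statistics
        for i, iline in enumerate(segy.ilines):
            slide = segy.iline[iline]              # (n_xlines, depth) array
            cube[i] = slide.astype(np.float32)
            means[i] = slide.mean(axis=-1)
            stds[i] = slide.std(axis=-1)

        hdf5.create_dataset("trace_means", data=means)
        hdf5.create_dataset("trace_stds", data=stds)
```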

In some of the workloads, data must be generated from a cube in a predictable manner: for example, during inference, we want to sequentially look through patches of the same slice. That is where a cache (specifically, an LRU cache) comes in handy: it stores frequently accessed slides in memory and retrieves them at a moment’s notice.
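To illustrate the idea, here is a tiny wrapper around an HDF5 cube that caches the most recently used slides with functools.lru_cache; the class and method names are made up for this sketch, not taken from seismiQB:

```python
from functools import lru_cache

import h5py
import numpy as np


class CachedCube:
    """Thin HDF5 wrapper that keeps the most recently used slides in memory."""
    def __init__(self, path, cache_size=16):
        self.file = h5py.File(path, "r")
        # Wrap the loader so that up to `cache_size` slides stay cached
        self.load_slide = lru_cache(maxsize=cache_size)(self._load_slide)

    def _load_slide(self, index, axis=0):
        # Read one full 2D section from the stored 3D array
        slicer = [slice(None)] * 3
        slicer[axis] = index
        return np.asarray(self.file["cube"][tuple(slicer)])
```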

Under the hood, the main purpose of our SeismicGeometry class is to provide an API for loading a desired location of the cube. Each of the supported formats (currently SEG-Y, HDF5, and NPZ) is handled by its own subclass, and it is easy to add more supported data storages. This allows us to speed up loading pipelines immensely, depending on the exact profile of data retrieval in the task being solved.
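The sketch below shows how such format-specific subclasses might be organized behind one loading API; the class names and the `load_crop` signature are simplified stand-ins and do not mirror the real SeismicGeometry interface:

```python
import h5py
import numpy as np
import segyio


class Geometry:
    """Base class: every format-specific subclass implements `load_crop`.

    `ilines`, `xlines`, `heights` are slices defining the crop.
    """
    def load_crop(self, ilines, xlines, heights):
        raise NotImplementedError


class HDF5Geometry(Geometry):
    def __init__(self, path):
        self.cube = h5py.File(path, "r")["cube"]

    def load_crop(self, ilines, xlines, heights):
        # HDF5 stores an actual 3D array, so a crop is a single slicing call
        return np.asarray(self.cube[ilines, xlines, heights])


class SEGYGeometry(Geometry):
    def __init__(self, path):
        self.segy = segyio.open(path)
        self.segy.mmap()

    def load_crop(self, ilines, xlines, heights):
        # SEG-Y keeps individual traces, so a crop is assembled line by line
        lines = [self.segy.iline[i][xlines, heights]
                 for i in self.segy.ilines[ilines]]
        return np.stack(lines)
```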

Labels

For model training, we need both input data (seismic images) and targets. Obviously, the target is different for every task: it can be a horizon, a fault, or a set of seismic structures, to name a few.

Each of them is represented in our framework by a separate class that encapsulates the entire logic of working with it: from reading the storage to dumping predicted results. Just as the main purpose of a Geometry is to load data, every Label class’s prime objective is to create a segmentation mask that can be used as the model target.
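As an illustration of what such a class does, here is a simplified, hypothetical horizon-like label that turns a depth map (one depth value per trace) into a binary segmentation mask for a crop; the name, the storage convention, and the mask width are assumptions for this sketch:

```python
import numpy as np


class HorizonLabel:
    """A toy horizon: one depth value per (iline, xline) position."""
    def __init__(self, matrix):
        self.matrix = matrix

    def create_mask(self, ilines, xlines, heights, width=3):
        """Binary mask of a crop, with a band of `width` voxels around the surface."""
        depths = self.matrix[ilines, xlines]                  # (n_i, n_x)
        grid = np.arange(heights.start, heights.stop)         # depth axis of the crop
        mask = np.zeros((*depths.shape, len(grid)), dtype=np.float32)

        # Mark voxels whose depth is within `width // 2` of the horizon surface
        band = np.abs(grid[None, None, :] - depths[:, :, None]) <= width // 2
        mask[band] = 1.0
        return mask
```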

For every object detected by our technologies, we develop a number of quality assessment procedures and metrics. Each of them analyzes how well the labeled structure fits the seismic wavefield, manipulating huge data arrays in memory; these computations are heavily vectorized with the help of NumPy. Due to the almost identical interfaces of NumPy and CuPy, we were able to move some of the most demanding operations onto GPUs, speeding up the postprocessing and assessment stages tenfold.
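The NumPy/CuPy interchangeability boils down to swapping the array module; the toy metric below (correlation of neighbouring traces) is an illustration of the pattern, not the exact seismiQB metric:

```python
import numpy as np
try:
    import cupy as cp
    xp = cp          # run on the GPU if CuPy is available
except ImportError:
    xp = np          # otherwise fall back to NumPy


def neighbour_correlation(amplitudes):
    """Pearson correlation of each trace with its neighbour along the crossline axis."""
    a = xp.asarray(amplitudes, dtype=xp.float32)
    left, right = a[:, :-1, :], a[:, 1:, :]

    # Normalize each trace before taking the mean product
    left = (left - left.mean(axis=-1, keepdims=True)) / (left.std(axis=-1, keepdims=True) + 1e-6)
    right = (right - right.mean(axis=-1, keepdims=True)) / (right.std(axis=-1, keepdims=True) + 1e-6)

    corr = (left * right).mean(axis=-1)
    return corr if xp is np else xp.asnumpy(corr)
```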

Just like Geometry, labels are easily extendable by inheriting from the right class: that allows us to start prototyping models for new structures, for example estuaries and reefs, with just a few changes to the existing Horizon class.

Data generation

Geometry and Label-derived classes provide interfaces to load seismic images and targets; we have yet to discuss the exact locations from which they should be sampled. There are quite a few options:

  • generate patches uniformly from the entire cube
  • concentrate on the areas near labeled objects
  • sample locations close to the hardest labels, depending on their weights

To make all of the above possible, we’ve created a Sampler class. In its most basic form, it generates locations based on the histogram of the labeled point cloud, but it also provides an interface for truncating this distribution, re-weighting it, applying arbitrary transformations, and even creating mixtures of such distributions.

In most of our tasks, we want to learn from a small number of cube slices with inference on the entire field. To do so on archive projects, where hand-made labels are already available, we restrict our data generation strategy by making use of Sampler capabilities: that takes just a few lines of code!
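To make the idea concrete, here is a self-contained, histogram-based sampler with a truncation method; it mirrors the concept rather than the actual seismiQB Sampler interface, and all names are illustrative:

```python
import numpy as np


class HistogramSampler:
    """Draw crop locations in proportion to the density of labeled points."""
    def __init__(self, points, bins=(100, 100, 100)):
        # `points` is an (N, 3) cloud of labeled (iline, xline, height) triples
        self.hist, self.edges = np.histogramdd(points, bins=bins)
        self.probs = (self.hist / self.hist.sum()).ravel()

    def truncate(self, axis, low, high):
        """Keep only the bins inside [low, high) along one axis, e.g. a few ilines."""
        mask = np.zeros(self.hist.shape, dtype=bool)
        slicer = [slice(None)] * 3
        slicer[axis] = slice(low, high)
        mask[tuple(slicer)] = True

        self.probs = np.where(mask.ravel(), self.hist.ravel(), 0.0)
        self.probs /= self.probs.sum()

    def sample(self, size):
        """Draw flat bin indices and convert them to 3D bin coordinates."""
        flat = np.random.choice(self.probs.size, size=size, p=self.probs)
        return np.stack(np.unravel_index(flat, self.hist.shape), axis=-1)
```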

Dataset and Pipeline

All of the discussed entities form a dataset: multiple seismic cubes with attached labels, as well as the mechanism to generate locations from them in the desired way. The dataset contains methods for interactions between labels and cubes; nonetheless, its main purpose is to sample batches of actual training pairs: images and masks.

Before using a batch of data to update model weights, one may want to apply additional transformations and augmentations to it. We may also want to generate images or masks in a particular way, and that is where the Pipeline comes in handy. It defines the exact specification of the actions performed on the batch: it is easy to read and easy to modify.
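An illustrative training pipeline in the spirit of BatchFlow is shown below. The chained action names and their signatures are assumptions made for this sketch, not a guaranteed match with the seismiQB batch API; `dataset`, `sampler`, and `model_config` are assumed to come from the previous steps:

```python
from batchflow import B, Pipeline
from batchflow.models.torch import UNet

train_pipeline = (
    Pipeline()
    # initialize the model once, from a config dictionary
    .init_model('dynamic', UNet, 'model', config=model_config)
    # generate crop locations, then load seismic images and target masks
    .crop(points=sampler, shape=(1, 256, 256))
    .load_cubes(dst='images')
    .create_masks(dst='masks', width=3)
    # normalization and light augmentations
    .scale(mode='q', src='images')
    .flip(axis=-1, src=['images', 'masks'], p=0.3)
    # one gradient update per batch
    .train_model('model', images=B('images'), masks=B('masks'))
) << dataset

train_pipeline.run(batch_size=64, n_iters=1000, bar=True)
```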

Note the visual clarity of the Pipeline: it describes the whole training process, from start to finish, in just one cell of code. At the same time, it makes our code reproducible and self-documenting, which is rare to stumble upon in the modern world of data science.

Clarity, brevity and reproducibility are the core values of our design across multiple libraries. Readability matters.

Model

In the pipeline snippet, we initialized a neural network with the help of a model_config dictionary:
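A hedged example of what such a configuration might look like: the nested keys follow the general style of BatchFlow model configs, but the specific values here are purely illustrative.

```python
model_config = {
    'inputs/images/shape': (1, 256, 256),      # single-channel seismic patches
    'inputs/masks/shape': (1, 256, 256),       # segmentation targets
    'initial_block/inputs': 'images',          # which batch component feeds the network
    'body/encoder/num_stages': 4,              # depth of a UNet-like encoder
    'loss': 'dice',                            # segmentation loss
    'optimizer': {'name': 'Adam', 'lr': 0.0005},
    'microbatch': 4,                           # accumulate gradients over sub-batches
    'device': 'gpu:0',
}
```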

Note the microbatch parameter: it allows us to accumulate gradients during training from just a portion of the actual batch. This is absolutely crucial, as even patches of seismic data are very sizeable and don’t fit tight GPU memory constraints. For example, with a patch shape of (256, 256), the maximum batch size for a training iteration of a vanilla UNet on a video card with 11 GB of memory is only 20. In a world where state-of-the-art computer vision models are trained with thousands of items in a single batch, that is just not enough.
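The idea behind microbatching is ordinary gradient accumulation. A plain PyTorch sketch of the same effect (not BatchFlow internals; the function and its arguments are hypothetical):

```python
import torch


def train_step(model, optimizer, loss_fn, images, masks, microbatch=4):
    """One weight update using gradients accumulated over small sub-batches."""
    optimizer.zero_grad()
    n_splits = len(images) // microbatch

    for i in range(n_splits):
        part = slice(i * microbatch, (i + 1) * microbatch)
        loss = loss_fn(model(images[part]), masks[part])
        # Scale so the accumulated gradient matches a full-batch update
        (loss / n_splits).backward()

    optimizer.step()
```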

You can learn more about ways to construct and train neural networks, pipelines, and other features of BatchFlow in a dedicated article.

Inference

With a trained model at hand, we want to apply it to unseen data. Unlike most machine learning problems, where one can simply change the set of data to draw examples from, the inference stage in seismic interpretation is significantly different: predictions must be made on patches covering the entire seismic volume and then stitched together.

Thankfully, this boils down to writing just a couple of functions (see the sketch after this list) that:

  • create a regular grid over the volume to make (possibly, overlapping) predictions on
  • apply the trained model for each of them
  • combine predictions into one array, using the regular grid
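For a single 2D slide, the grid-and-stitch logic can be sketched as follows; the `predict` callable, patch shape, and stride are assumptions, and the real implementation works on 3D volumes with more bookkeeping:

```python
import numpy as np


def predict_on_slide(predict, slide, patch=(256, 256), stride=(128, 128)):
    """Apply `predict(patch_2d) -> mask_2d` over a 2D slide and average the overlaps."""
    height, width = slide.shape
    assembled = np.zeros_like(slide, dtype=np.float32)
    counts = np.zeros_like(slide, dtype=np.float32)

    for i in range(0, height - patch[0] + 1, stride[0]):
        for j in range(0, width - patch[1] + 1, stride[1]):
            window = (slice(i, i + patch[0]), slice(j, j + patch[1]))
            assembled[window] += predict(slide[window])
            counts[window] += 1

    # Average overlapping predictions; untouched border pixels stay zero
    return assembled / np.maximum(counts, 1)
```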

As the assembled array may take up a lot of memory, we implement an option to do this process in chunks. That is, essentially, a speed-memory tradeoff: the bigger the array, the faster we can apply vectorized computations to it, at the cost of eating up a significant portion of our RAM. Besides that, there are many optimizations for determining exactly which patches should and should not be used as part of the grid: for example, if the cube values fade away in a certain region, there is no need to waste time on it.

Aside from inference on a regular grid, we also provide an API for applying the model near a desired point cloud: that can be used to make predictions along a certain horizon or another surface.

Applications

seismiQB is built for and around geological tasks, which usually involve finding several objects in the seismic data and building a structural model out of them. As a developing framework, it is frequently changed to adapt to the needs of our day-to-day work. At the moment, seismiQB is a key component of the following models:

  • horizon picking: clearly visible boundaries between layers represent changes in properties and are essential for building a 3D model of the field. You can learn more about it in a demo notebook or a separate publication
  • fault detection: vertical discontinuities are as important for the structural model as horizontal ones. Here you can get an overview of the process of locating them
  • alluvial fans and feeder channels are among the most prominent oil reservoirs, and their identification is essential for petroleum production. We can locate them along already tracked horizons with an additional neural network
  • general facies recognition is essential for outlining key objects with the same sedimentation properties. A brief demo can be found here

Other models that did not make the list can be found in the directory with Jupyter notebooks: each of them follows the same structure, with detailed descriptions of datasets, models, the training process, inference, quality evaluation, and suggestions for improvements.

Summary

In this article, we overviewed the key components of our seismic interpretation framework: Geometry to work with seismic cubes, Label to create target masks for a dozen geological structures, and Dataset, Pipeline, and Sampler to train neural networks for a particular task, as well as to assemble predictions over the entire production field.

Combining all of this, seismiQB provides a blueprint for accelerating various stages of the exploration workflow by an order of magnitude. It also frees geologists from routine daily work, making room for far more interesting and engaging tasks.

What next

There are a lot of potential improvements to the library: creating segmentation masks right on the GPU and speeding up inference even further, to name a few. We are also expanding the range of tasks we can tackle: for example, by connecting seismic data with well logs, or by including inversion in our tools.

For now, the plan is to share more details on the tasks we’ve successfully solved, as well as keep you posted on our more recent advances. Stay tuned!
