Sergey Tsimfer
Mar 2 · 9 min read

For several years, the Oil & Gas industry, just like many others, has been trying its hardest to bring machine learning into its workflows. If done right, deep learning can enhance accuracy and simultaneously speed up the entire pipeline of oil production, from the very early stages of exploration to drilling.

But, needless to say, researchers must carefully inspect hundreds of architectures and training approaches before deploying neural networks into the daily work of seismic specialists. This calls for quick model prototyping, which in turn requires the ability to generate data rapidly.

Unfortunately, the core formats for storing volumetric seismic information were not built with speed in mind: the most common one, SEG-Y, was developed almost 30 years ago, and its design addressed entirely different concerns, such as memory footprint and ease of sequential processing. Furthermore, labels of various types (horizons, faults, facies, etc.) can come from several software packages, each of which stores them in a completely different way: point cloud, regular grid, interpolation nodes, etc.

Working with all of these data types, and, more importantly, making it fast, requires a lot of well-structured code. This is exactly what seismiQB is: an open-source Python framework capable of interacting with a wide range of seismic data storages through highly optimized load procedures. It also provides an interface for building and deploying machine learning models by relying on our other library, BatchFlow; you can learn more about its features in a separate publication.

In this article, we will go through all the steps that must be implemented to train and deploy ML in a typical geology workflow:

  • loading data from seismic cubes
  • creating targets (i.e. segmentation masks) from multiple types of labels
  • evaluating the quality of labels
  • configuring and training neural networks
  • assembling individual patches with predictions

For each of those stages, we provide a short explanation, a code snippet, and a quick peek into the inner workings of our library together with the motivation behind design choices.

Check out the Seismic Horizon Detection with Neural Networks article: it contains a brief introduction to the field exploration pipeline, as well as an example of deploying a model in one of the world’s largest petroleum companies.

Let’s dive right in!

Load seismic data

The easiest way to speed up data access is to convert seismic cubes into actual (physical) 3D arrays: for example, the usual scientific HDF5 format matches our needs perfectly. By transforming the data once, we are able to decrease load times by a factor of ~30!
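The payoff of the one-time conversion can be illustrated with a minimal stand-in: seismiQB's actual conversion works on SEG-Y files through its geometry classes, but the same idea applies to any dense on-disk array, here sketched with NumPy's `.npy` format and memory mapping (the file name and shapes are made up for the example).

```python
import os
import tempfile

import numpy as np

# One-time "conversion": materialize the cube as a dense 3D array on disk,
# so that arbitrary crops can later be read without parsing trace headers.
rng = np.random.default_rng(0)
cube = rng.normal(size=(64, 64, 128)).astype(np.float32)  # (ilines, xlines, depth)

path = os.path.join(tempfile.mkdtemp(), "cube.npy")
np.save(path, cube)

# Opening the converted file is cheap; reads are lazy via memory mapping.
view = np.load(path, mmap_mode="r")

# A random crop is now just a slice, touching only the bytes it needs:
crop = np.asarray(view[10:42, 20:52, 30:94])
```

The same access pattern against a SEG-Y file would require seeking through thousands of individually headered traces, which is where the ~30x gap comes from.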

To perform the conversion, we need to pass through the entire cube. It is a good idea to collect trace-wise statistics at the same time: means, standard deviations, etc. They help greatly when normalizing the data to the desired range later.
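The trace-wise statistics above can be sketched in a few lines; this is only an illustration on a synthetic in-memory cube, not seismiQB's actual collection code, which streams over the file during conversion.

```python
import numpy as np

rng = np.random.default_rng(42)
cube = rng.normal(loc=3.0, scale=2.0, size=(16, 16, 256)).astype(np.float32)

# Per-trace statistics, collected in the same pass over the data:
means = cube.mean(axis=-1)   # shape (ilines, xlines), one value per trace
stds = cube.std(axis=-1)

# Later, any crop can be normalized trace-wise using the cached stats:
normalized = (cube - means[..., None]) / stds[..., None]
```

After this, every trace has zero mean and unit variance, which is exactly the well-behaved input range neural networks prefer.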

In some of the workloads, data must be generated from a cube in a predictable manner: for example, during inference, we want to sequentially look through patches of the same slice. That is where cache (specifically, LRU cache) comes in handy: it stores frequently accessed slides in memory and retrieves them at a moment’s notice.
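The caching idea maps directly onto Python's standard `functools.lru_cache`; the sketch below (with a made-up `load_slide` helper over an in-memory array) shows how repeated requests for the same slide skip the expensive read.

```python
from functools import lru_cache

import numpy as np

rng = np.random.default_rng(0)
cube = rng.normal(size=(32, 32, 64)).astype(np.float32)

@lru_cache(maxsize=8)
def load_slide(iline):
    """Load a full 2D section; repeated requests are served from cache."""
    # A real geometry would read from disk here; we slice an array instead.
    return cube[iline]

first = load_slide(5)    # miss: the slide is actually "read"
second = load_slide(5)   # hit: the cached object is returned as-is
info = load_slide.cache_info()
```

During sequential inference over one slice, nearly every request after the first becomes a cache hit, so the disk is touched only once per slide.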

Under the hood, the main purpose of our SeismicGeometry class is to provide an API for loading a desired location of the cube. Each of the formats (currently, we can work with SEG-Y, HDF5, and NPZ) is handled by its own subclass, and it is easy to add support for more data storages. This allows us to immensely speed up loading pipelines, depending on the exact data-retrieval profile of the task being solved.
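A stripped-down sketch of that design might look as follows; the class and method names here are illustrative, not seismiQB's real signatures, but they convey how one interface fans out into per-format subclasses.

```python
import numpy as np

class Geometry:
    """Format-agnostic interface: subclasses implement the actual reads."""
    def load_crop(self, ilines, xlines, depths):
        raise NotImplementedError

class NPZGeometry(Geometry):
    """One concrete backend: the cube lives in an NPZ-like dict of arrays."""
    def __init__(self, arrays):
        self.cube = arrays["cube"]

    def load_crop(self, ilines, xlines, depths):
        return self.cube[ilines, xlines, depths]

cube = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
geom = NPZGeometry({"cube": cube})
crop = geom.load_crop(slice(0, 2), slice(1, 3), slice(0, 2))
```

Adding a new storage format then means writing one more subclass, while every downstream pipeline keeps calling the same `load_crop`-style entry point.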


Each type of label is represented in our framework by a separate class that encapsulates the entire logic of working with it: from reading the storage to dumping predicted results. Just as the main purpose of a Geometry is to load data, the prime objective of every Label class is to create a segmentation mask that can be used as the model target.
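For a horizon, mask creation boils down to rasterizing a (iline, xline) → depth map into a binary cube; the sketch below is an assumed, simplified version of that step, not the Horizon class itself.

```python
import numpy as np

# A toy horizon: one depth value per trace, stored as a dense height map.
heights = np.full((8, 8), 20, dtype=np.int64)   # flat surface at depth 20
heights[4:, :] = 25                             # with a structural step

# Rasterize into a binary segmentation target of shape (iline, xline, depth):
mask = np.zeros((8, 8, 64), dtype=np.float32)
i, x = np.indices(heights.shape)
mask[i.ravel(), x.ravel(), heights.ravel()] = 1.0   # one voxel per trace
```

Other label types (faults, facies) fill the mask differently, but expose the same "give me a target for this crop" interface to the training loop.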

For every object detected by our technologies, we develop a number of quality-assessment procedures and metrics. Each of them analyzes how well the labeled structure fits the seismic wavefield, manipulating huge data arrays in memory; these computations are heavily vectorized with the help of NumPy. Thanks to the almost identical interfaces of NumPy and CuPy, we were able to move some of the most demanding operations to GPUs, speeding up the postprocessing and assessment stages tenfold.
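The NumPy/CuPy interchangeability works because a metric can be written once against whichever module is available; below is a sketch of that pattern with a hypothetical trace-correlation metric (the function name is made up, the backend-switch idiom is the point).

```python
import numpy as np

try:                       # use the GPU backend when present,
    import cupy as xp      # fall back to NumPy otherwise
except ImportError:
    xp = np

def local_correlation(a, b):
    """Trace-wise Pearson correlation, written once for both backends."""
    a = (a - a.mean(axis=-1, keepdims=True)) / a.std(axis=-1, keepdims=True)
    b = (b - b.mean(axis=-1, keepdims=True)) / b.std(axis=-1, keepdims=True)
    return (a * b).mean(axis=-1)

rng = np.random.default_rng(0)
traces = rng.normal(size=(4, 4, 100))
corr = local_correlation(xp.asarray(traces), xp.asarray(traces))
```

Since a signal correlates perfectly with itself, every entry of `corr` is 1; the same source runs unchanged on a GPU whenever CuPy is importable.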

Just like Geometry, labels are easily extendable by inheriting from the right class: that allows us to start prototyping models for new structures, for example estuaries and reefs, with just a few changes to the existing Horizon class.

Data generation

Depending on the task at hand, we may want to:

  • generate patches uniformly from the entire cube
  • concentrate on the areas near labeled objects
  • sample locations closest to the hardest labels, weighted accordingly

To make all of the above possible, we’ve created a Sampler class. In the most basic form, it generates locations based on the histogram of the labeled point cloud, but it also provides an interface for truncating this distribution, re-weighting it, applying arbitrary transformations, and even creating mixtures of such distributions.
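The most basic form of that sampler, the histogram-based one, can be sketched in plain NumPy; the helper name below is hypothetical, but the mechanics mirror the description: bin the labeled point cloud, normalize to a distribution, and draw locations from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled point cloud: (iline, xline) coordinates of horizon points.
points = rng.integers(0, 64, size=(10_000, 2))

# Histogram of the cloud becomes the sampling distribution:
hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=(64, 64))
probs = (hist / hist.sum()).ravel()

def sample_locations(size):
    """Draw patch origins proportionally to the local label density."""
    flat = rng.choice(probs.size, size=size, p=probs)
    return np.stack(np.unravel_index(flat, (64, 64)), axis=-1)

locations = sample_locations(128)
```

Truncating, re-weighting, or mixing such distributions then reduces to arithmetic on the `probs` array before sampling.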

In most of our tasks, we want to learn from a small number of cube slices and run inference on the entire field. To do so on archive projects, where hand-made labels are already available, we restrict our data generation strategy by making use of Sampler capabilities: that takes just a few lines of code!

Dataset and Pipeline

Before using a batch of data to update model weights, one may want to apply additional transformations and augmentations to it. We may also want to generate images or masks in a particular way, and that is where Pipeline comes in handy. It defines the exact specification of actions performed on the batch: easy to read and easy to modify.
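To give a feel for the idea without reproducing BatchFlow's real API, here is a deliberately simplified stand-in: a chain of named actions applied to every batch in order, declared once and reused everywhere.

```python
import numpy as np

class Pipeline:
    """Toy stand-in for a declarative batch pipeline (not BatchFlow's)."""
    def __init__(self):
        self.actions = []

    def add(self, fn):
        self.actions.append(fn)
        return self                  # chaining keeps the spec readable

    def run(self, batch):
        for fn in self.actions:
            batch = fn(batch)
        return batch

pipeline = (Pipeline()
            .add(lambda b: b.astype(np.float32))       # cast
            .add(lambda b: (b - b.mean()) / b.std())   # normalize
            .add(lambda b: np.flip(b, axis=-1)))       # augmentation
out = pipeline.run(np.arange(10))
```

The real Pipeline is far richer (named variables, model training actions, prefetching), but the declarative chain is the core of why it reads like a specification.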

Note the visual clarity of the Pipeline: it describes the whole training process, from start to finish, in just one cell of code. At the same time, it makes our code reproducible and self-documenting, which is rare to stumble upon in the modern world of data science.

Clarity, brevity and reproducibility are the core values of our design across multiple libraries. Readability matters.


Note the microbatch parameter: it allows us to accumulate gradients during training, processing only a portion of the actual batch at a time. It is absolutely crucial, as even patches of seismic data are very sizeable and don’t fit tight GPU memory constraints. For example, with a patch shape of (256, 256), the maximum batch size for a training iteration of a vanilla UNet on a video card with 11 GB of memory is only 20. In a world where state-of-the-art computer vision models are trained with thousands of items in a single batch, that is just not enough.
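Why accumulation is a valid substitute for a big batch can be shown with a toy linear model: averaging the gradients of equal-sized microbatches reproduces the full-batch gradient exactly (a NumPy sketch, independent of any framework).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))   # "batch" of 20 items
y = rng.normal(size=(20,))
w = rng.normal(size=(8,))

def grad(X, y, w):
    """Gradient of mean squared error for a linear model."""
    return 2 * X.T @ (X @ w - y) / len(y)

full = grad(X, y, w)           # gradient on the whole batch at once

# Microbatching: accumulate over 4 chunks of 5 items, then average.
acc = np.zeros_like(w)
for chunk in np.split(np.arange(20), 4):
    acc += grad(X[chunk], y[chunk], w)
acc /= 4
```

Memory usage is that of one microbatch, while the weight update is identical to the full-batch one; the only cost is extra forward/backward passes.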

You can learn more about ways to construct and train neural networks, pipelines, and other features of BatchFlow in a dedicated article.


Thankfully, inference boils down to writing just a couple of functions that:

  • create a regular grid of (possibly overlapping) locations over the volume
  • apply the trained model to each of them
  • combine the predictions into one array, using the same regular grid

As the assembled array may take up a lot of memory, we implement the option to do this process in chunks. That is, essentially, a speed-memory tradeoff: the bigger the array, the faster we can apply vectorized computations to it, and the more RAM it eats up. Besides that, there are many optimizations for determining exactly which patches should and should not be part of the grid: for example, if the cube values fade away in a certain region, there is no need to waste time on it.
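The grid-and-assemble steps can be condensed into a short NumPy sketch: build overlapping patch origins, add each patch prediction into an accumulator, and divide by a coverage counter (the patch/stride numbers and constant "prediction" are placeholders for a real model's output).

```python
import numpy as np

field = np.zeros((100, 100), dtype=np.float32)   # assembled prediction
counts = np.zeros_like(field)                    # how often each pixel is covered

patch, stride = 32, 16                           # overlapping regular grid
origins = [(i, j)
           for i in range(0, 100 - patch + 1, stride)
           for j in range(0, 100 - patch + 1, stride)]

for i, j in origins:
    prediction = np.ones((patch, patch))         # stand-in for model output
    field[i:i + patch, j:j + patch] += prediction
    counts[i:i + patch, j:j + patch] += 1

assembled = field / np.maximum(counts, 1)        # average the overlaps
```

Averaging over overlaps smooths patch-boundary artifacts; chunked assembly simply runs this loop over sub-regions of `field` so the full array never has to live in RAM at once.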

Aside from inference on a regular grid, we also provide an API for applying the model near a desired point cloud: this can be used to make predictions along a certain horizon or another surface.


  • horizon picking: clearly visible boundaries between layers represent changes in properties and are essential for building a 3D model of the field. You can learn more about it in a demo notebook or a separate publication
  • fault detection: vertical discontinuities are as important for the structural model as horizontal ones. Here you can overview the process of locating them
  • alluvial fans and feeder channels are among the most prominent oil reservoirs, and their identification is essential for petroleum production. We can locate them along already tracked horizons with an additional neural network
  • general facies recognition is essential for outlining key objects with the same sedimentation properties. A brief demo can be found here

Other models that did not make the list can be found in the directory with Jupyter Notebooks: each of them follows the same structure, with detailed descriptions of the dataset, model, training process, inference, quality evaluation, and suggestions for improvement.


Combining this all, seismiQB provides a blueprint for the acceleration of various stages of exploration workflow by an order of magnitude. It also frees geologists from routine daily work, making room for far more interesting and engaging tasks.

What’s next

For now, the plan is to share more details on the tasks we’ve successfully solved, as well as keep you posted on our more recent advances. Stay tuned!

Data Analysis Center

Machine learning for Oil and Gas Exploration. Advanced methods to locate oil deposits
