Sergey Tsimfer
Data Analysis Center
10 min read · May 15, 2023


The machine learning community has come a long way in terms of working with data. In the past, we could store everything in RAM and enjoy blazingly fast access to inputs and targets for model training. Nowadays, with data becoming much bigger and spanning terabytes of disk space, researchers have to employ optimized ways of loading chunks of datasets into memory, utilizing parallelization for I/O operations and preprocessing.

Unfortunately, models for seismic exploration are not at the same level of optimization; that is, until now. To squeeze out those speedups for our automated processing and interpretation pipelines, we’ve developed a library, segfast, that outperforms other SEG-Y loaders when it comes to raw reading time. Moreover, it is developed with machine learning in mind, with an API convenient for continuous loading of training batches.

Spoiler: average time taken to load a huge batch of data from SEG-Y with different methods

Throughout the article, we will be using examples from segfast and our other library, seismiqb, which is built upon it and automates seismic interpretation with machine learning.

SEG-Y standard intro

SEG-Y is the most popular format for transferring seismic data at both the processing and interpretation stages. Under the hood, each file consists of two logical parts:

  • a file-wide header: 3200 bytes of textual information followed by a 400-byte binary header (and, possibly, extended headers) that contains file-wide information about the survey. Data type, number of samples and sample interval are all stored there (see the sketch right after this list);
  • trace-wise information, where each trace consists of a fixed-length 240-byte header (again, describing meta-info about that trace) followed by the actual amplitude values.
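
Being a fixed binary layout, the file-wide header is easy to peek into even without a dedicated library. A minimal sketch, assuming a standard SEG-Y rev. 1 byte layout (the path is hypothetical):

import struct

path = 'cube.sgy' # hypothetical SEG-Y file

with open(path, 'rb') as f:
    f.seek(3200) # skip the 3200-byte textual header
    binary_header = f.read(400) # the binary file header

# Binary header fields are big-endian integers at standardized offsets
sample_interval = struct.unpack('>h', binary_header[16:18])[0] # in microseconds
n_samples = struct.unpack('>h', binary_header[20:22])[0] # samples per trace
format_code = struct.unpack('>h', binary_header[24:26])[0] # 1 is IBM float, 5 is IEEE float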

As traces are stored separately, loading a contiguous chunk of data (2D slice or 3D sub-volume) would not be very efficient, especially if done within Python, where most ML models live. Fortunately, an excellent library by Equinor, segyio, somewhat alleviates that problem by providing Python bindings to compiled core functions. But, while providing a faster way to load particular patterns from a SEG-Y file, this approach lacks flexibility.

Bufferization is key

One of the best ways to speed things up is to avoid unnecessary copies and read everything directly into the same memory space that will later be used for model training or inference. Consider a trivial example:

import numpy as np

paths = ...

images = [load_image(path) for path in paths] # one allocation per image
batch = np.stack(images, axis=0) # second allocation; used for model training

Here we first load images as separate objects, which are later stacked into a contiguous array. That requires (at least!) two allocations of batch-sized memory blocks: one for the list of separate images and the other for the batch.

One of those allocations can be removed: by specifying which block of memory to load the image into, we could spare one copy. Imagine if we were able to do something like this:

n = 128
crop_shape = (1, 512, 512)
dtype = np.float32
paths = ...

buffer = np.empty((n, *crop_shape), dtype=dtype)
for i, path in enumerate(paths):
    load_image(path, buffer=buffer[i]) # make buffer[i] equal the image

This second type of API, where we target a specific memory block to load the data into instead of allocating new memory, can be significantly faster, and migrating to it is well worth the effort. When working with seismic data, processing-stage cubes can span terabytes of disk space and take a lot of time just to scan through: making even one additional copy along the way is prohibitive. Sadly, this fine control over the loading process is not something the current segyio provides.

It is worth noting that by using some not-so-public methods of segyio (looking at you, xfd.gettr) one can achieve bufferization at the level of individual traces: that is, we can specify a memory address for a single trace to be loaded into. Unfortunately, that requires querying each trace load at the Python level, at the mercy of the GIL. It is still faster than the public methods of the same library, at least for some shapes and numbers of loaded subvolumes, but there must be a better way!

Data conversion

Before digging deeper into faster SEG-Y loads, are there any easy (and better) ways to make I/O quick? Indeed there are! As we’ve discussed, the main drawback of the SEG-Y format is that it is trace-wise: any other loading pattern is inherently slow.

Thankfully, we don’t have to stick to that one format! There are well-designed and optimized scientific formats like HDF5 and Zarr that have convenient Python bindings and provide chunking of arbitrary shape as well as compression, which can both speed up reads and conserve disk space. A conversion sketch follows.
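
For instance, converting a volume into HDF5 with inline-sized chunks makes every inline read touch exactly one chunk on disk. A minimal sketch with h5py (shapes and file names are hypothetical):

import h5py
import numpy as np

volume = np.random.rand(600, 800, 1500).astype(np.float32) # (inlines, crosslines, samples)

with h5py.File('cube.hdf5', 'w') as f:
    f.create_dataset('cube', data=volume,
                     chunks=(1, 800, 1500)) # one inline per chunk

with h5py.File('cube.hdf5', 'r') as f:
    inline = f['cube'][345] # reads a single contiguous chunk from disk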

This approach is great for research but pretty much unusable in production or applications. For example, our horizon detection model takes just a few (4–7) minutes to complete. Adding the conversion time, which can easily take another 15 minutes for formats with compression and tight chunking, would kill any interactivity. The same goes for fault probability computation: instead of plugging in your seismic data and getting images nearly in real time, one would need to wait for the conversion step to finish. Last but not least, managing multiple extra files on top of the usual hundreds of horizons, faults, wells, facies, etc. in your project is incredibly cumbersome, and you want to avoid it at all costs.

Note that all the points about memory bufferization still apply: data loading can be significantly sped up by preallocating memory and using it as the target buffer when interacting with either of the scientific storages, as sketched below.
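
A sketch of such a buffered read with h5py’s read_direct, which writes straight into an existing array instead of allocating a new one (reusing the hypothetical cube.hdf5 from the sketch above):

import h5py
import numpy as np

buffer = np.empty((20, 256, 800), dtype=np.float32) # preallocated once

with h5py.File('cube.hdf5', 'r') as f:
    dset = f['cube']
    for i in range(20):
        # No intermediate array: HDF5 reads directly into buffer[i]
        dset.read_direct(buffer[i], source_sel=np.s_[i, :256, :800])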

Making data go brrr

Going back to SEG-Y, we’ve noticed some missed optimization opportunities with regard to memory preallocation. Moreover, except for the first ~3600 bytes, the file is one huge, well-structured binary. Luckily, numerical Python provides great tools for working with such media!
As you might have already guessed, efficient work with such a file comes down to creating a special memory map over it.

A memory map is, essentially, an instruction on how the memory is laid out. By carefully selecting the parameters of its creation, we can extract requested headers and/or amplitude values from the memmap, treating the file as a huge 2D array of (traces, samples) shape. The parameters to initialize the memory mapping are:

  • offset, which in our case is almost always fixed to 3600 bytes. This signals to skip the file-wide header and get straight to the trace-wise data;
  • dtype, which tells the memory map how big each element is and how to interpret it. In our case, each element is a whole trace, so we want the dtype to cover the size of the trace header (240 bytes) plus the number of samples multiplied by the item size.
    For example, if each trace consists of 1500 float32 samples, the correct dtype would take 240 + 1500 * 4 = 6240 bytes in total;
  • size/shape, which tells how many dtype-sized elements there are. You can compute it by dividing the total size of the file (minus the 3600-byte header) by the size of your selected dtype.

By using numpy’s memory map, you can also conveniently name different parts of your dtype. Combining everything, we get:

import numpy as np
import os

path = ...
offset = 3600 # all of those can be found by
n_samples = 1500 # inspecting the binary header of SEG-Y
dtype = '>f4' # and file properties (size)

# mmap_dtype: each trace is 240 bytes of headers and then amplitudes
mmap_dtype = np.dtype([('headers', np.void, 240),
                       ('data', dtype, n_samples)])
n_traces = (os.path.getsize(path) - offset) // mmap_dtype.itemsize

mmap = np.memmap(path, mode='r', dtype=mmap_dtype,
                 shape=n_traces, offset=offset)

mmap['data'] # convenient access to data, returns (n_traces, 1500)-shaped array
mmap['headers'] # access to headers as bytes

This freshly created memory map can be used for both header and data retrieval. Importantly, as it is part of numpy, one of the core packages of the Python ecosystem, it is much better optimized than custom solutions; out of the box, it is already 10–15% faster even for slice loading (which is a forte of segyio)!

Another benefit is that the entire reading process is in pure Python and takes just a few lines of code. The higher level of abstraction makes it trivial to implement new ideas analogously to the ones already developed.

The last piece of the puzzle is indexing: suppose we want to get the traces on inline #345. How do we know which trace indices to use?

For now, indexing is not a part of the segfast library: instead, we create indexers in our processing and interpretation modules. The reason is simple: the way a file is structured depends on the stage you are working at. In interpretation, a rigorous 3D structure is imposed: the file can be viewed along (inlines, crosslines, samples) dimensions. For processing tasks that is usually not the case, and the structure changes from task to task. A toy indexer for the interpretation case is sketched below.
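
For a regular interpretation-stage geometry, mapping line coordinates to trace indices can be as simple as a linear formula. A toy sketch under strong assumptions (a full survey, traces sorted inline-first, exactly one trace per (inline, crossline) pair; this is not segfast code):

import numpy as np

i_min, n_crosslines = 100, 800 # hypothetical survey geometry

def inline_trace_indices(inline):
    # For such a geometry, trace index is a linear function of coordinates
    start = (inline - i_min) * n_crosslines
    return np.arange(start, start + n_crosslines)

indices = inline_trace_indices(345) # ready to be used as mmap['data'][indices]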

segfast wraps the entire memory mapping machinery into one MemmapLoader class. As a fallback and for a fair comparison, we also implement SegyioLoader and SafeSegyioLoader, which follow the exact same API but rely on the private and public functions of segyio, respectively. Each of them provides tools for loading trace headers, traces by indices and depth slices.
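
To give a feel for that API, here is a short usage sketch. The class names are the ones mentioned above, but the method signatures are assumptions of ours: consult the segfast documentation for the exact API.

import segfast

loader = segfast.MemmapLoader(path) # the memory-map based engine

# Method names below follow the description above and may differ slightly
headers = loader.load_headers(['INLINE_3D', 'CROSSLINE_3D'])
traces = loader.load_traces([0, 1, 2]) # traces by indices
slices = loader.load_depth_slices([500]) # horizontal (depth) slices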

Under the hood, we rely on time-proven segyio routines to infer stats from the file-wide header. This way, we don’t reinvent the wheel and make sure that everything works with different revisions and flavors of SEG-Y.

Timings, timings, timings!

All that is left is to compare the performance of different ways to interact with the same SEG-Y file. For our tests, a 21 GB file of amplitude data in IEEE float32 format is used. We test the following methods of reading:

  • public segyio functions: .ilines, .xlines and .trace.raw are used;
  • private segyio functions, wrapped into SegyioLoader;
  • segfast with MemmapLoader, which relies on the memory map mechanism;
  • seismiqb, our interpretation library. It relies on MemmapLoader for interacting with SEG-Y data and provides an additional layer of bufferization for the batch dimension;
  • HDF5, also provided by seismiqb. The conversion takes about 4 minutes, during which 3 projections in all orientations are created with neither chunking nor compression. In our experience, the performance is relatively similar to Zarr storage, so we don’t compare it separately.

Lastly, we include the segfast-quantized loader: it takes the best from both worlds of convenient geological transport and fast scientific formats. Essentially, the SEG-Y standard allows storing values in types other than the usual float32. By carefully quantizing the data, we can use one-byte unsigned integers instead, reducing disk usage 4x and speeding up loading comparably. Moreover, many geological programs do work with uint8 cubes, which lets us reap the benefits during both research and production. A sketch of such quantization follows.
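
The quantization itself can be as simple as clipping amplitudes to robust quantiles and mapping the result onto the uint8 range. A minimal sketch (not the exact segfast routine; the volume is a stand-in):

import numpy as np

data = np.random.randn(256, 256, 500).astype(np.float32) # stand-in volume

# Clip to the central 98% of values, then rescale to [0, 255]
left, right = np.quantile(data, [0.01, 0.99])
quantized = ((np.clip(data, left, right) - left)
             / (right - left) * 255).astype(np.uint8)

print(data.nbytes / quantized.nbytes) # -> 4.0, hence the 4x disk savings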

The same inline for the original float32 dtype and the quantized uint8 one. Visually, there are hardly any changes!

Each loader was tested on multiple tasks: loading slices in different (inline, crossline, depth) orientations, loading 3D crops and even loading a whole batch of 3D crops. The shape of crops is fixed to (256, 256, 500), and a batch consists of 20 such crops.

Average time taken for slide loading across different orientations

Each experiment is repeated a few hundred times to get reliable averages. The key takeaways are:

  • as expected, memory-map based methods are much faster, achieving at least a 30% speedup compared to traditional segyio usage even before quantization;
  • crossline-oriented slices are almost twice as slow as inline ones: that is natural due to the data alignment;
  • depth-oriented slices are waaay slower, as the SEG-Y format is completely inadequate for working with them. HDF5 absolutely smokes the competition here, and if you need those slices frequently (for example, while performing attribute analysis), conversion may be your best bet;
  • quantization provides a massive speedup, while not incurring any meaningful information loss for a lot of tasks. For example, all kinds of structural interpretation can be done on quantized cubes;
  • curiously, the public methods of segyio are faster than the private ones for inline slide loading. That is due to a slightly better optimization for this exact task, and this advantage is lost in other settings.

Time taken for data loading

Another way of speeding up your model is to do all CPU- and I/O-bound operations in a separate thread or process. Our main library, batchflow, does so seamlessly: be sure to give it a go! The general idea is sketched below.
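
A generic prefetching sketch (not batchflow’s API; load_batch, train_step and n_batches are hypothetical): the next batch is loaded in a background thread while the model crunches the current one.

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

future = executor.submit(load_batch, 0) # start loading the first batch
for i in range(1, n_batches):
    batch = future.result() # wait for the prefetched batch
    future = executor.submit(load_batch, i) # immediately schedule the next one
    train_step(batch) # I/O now overlaps with computation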

Even more timings

Before loading any data from SEG-Y, we need to infer its structure by analyzing trace-wise headers. By utilizing the same memory mapping mechanism, we speed up this part of the workflow massively, loading all headers of the entire file in a fraction of the time taken by other packages. The trick is sketched below.
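
The trick is to describe only the needed header fields in the dtype and let numpy skip over the rest. A sketch assuming the standard trace header layout (INLINE_3D at bytes 189–192, CROSSLINE_3D at bytes 193–196) and reusing path and n_samples from the earlier example:

import numpy as np

header_dtype = np.dtype([('skip', np.void, 188),
                         ('INLINE_3D', '>i4'),
                         ('CROSSLINE_3D', '>i4'),
                         ('rest', np.void, 44)]) # 240 - 196 remaining bytes
trace_dtype = np.dtype([('headers', header_dtype),
                        ('data', '>f4', n_samples)])

mmap = np.memmap(path, mode='r', dtype=trace_dtype, offset=3600)
inlines = mmap['headers']['INLINE_3D'] # all values in one vectorized read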

Another crucial step of most algorithms is to export the data in SEG-Y format: for example, our models compute fault probability for each voxel in the original file, resulting in a same-shaped cube with the same trace-wise meta. Again, memmap is far superior to other approaches; it is also trivial to parallelize for large tasks like header reading and array export. A sketch of such an export is below.
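
Export can reuse the very same dtype: open a writable copy of the original file and overwrite only the amplitude values, leaving every header byte intact. A sketch (path_copy and fault_probabilities are hypothetical; trace_dtype comes from the previous sketch):

import numpy as np

mmap = np.memmap(path_copy, mode='r+', dtype=trace_dtype, offset=3600)
mmap['data'] = fault_probabilities # (n_traces, n_samples)-shaped array
mmap.flush() # make sure the changes hit the disk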

Time taken for misc tasks

Finally, SEG-Y files can come with legacy IBM floats inside: that happens surprisingly often as a result of old processing software. Working with such files requires converting the data to regular IEEE floats at each access, which takes a lot of time. To circumvent this, we have a utility function to convert the file once into a faster data format.

Conclusion

Today you learned about a new library for working with the most popular seismic transport format. As it is developed for the needs of machine learning, it compares favorably against other packages on the most demanding task, data reading, as well as on working with trace headers and exporting results in the same SEG-Y format.

On the GitHub page of segfast, you can find tutorials and notebooks with examples. Be sure to star the repo and stay in touch for more updates on our processing/interpretation advances!

Check out our packages that automate seismic processing and interpretation, and learn how to use neural networks for first break picking, structural interpretation, facies analysis and more!
