BatchFlow: a framework for ML development

Alexey Kozhevin
Data Analysis Center

--

When work on a data science project starts, you write code to get acquainted with the data, test hypotheses and train a baseline model. At some point, this code needs to be organized in a way that allows experimentation and collaboration. This is where the BatchFlow library comes to the rescue: it helps you structure the code conveniently and implement data processing as pipelines.

BatchFlow is an open-source Python framework for data handling, ML model training and related tasks. You can use the library to construct clear and expressive pipelines that describe:

  • data loading and processing
  • model training
  • model validation
  • inference
  • metrics computation

The framework also includes a variety of methods for image processing and ready-to-use architectures of neural networks in TensorFlow and PyTorch. Moreover, BatchFlow has a convenient interface for running parallel experiments on multiple GPUs.

This article begins a quick tour of BatchFlow functionality. For a more detailed description, see the documentation and tutorials.

Full-featured example

Consider an example of BatchFlow usage where we train and validate a model to segment images from the PASCAL dataset. It may seem long, but it is clear and self-explanatory.
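The original post embeds this example as a code gist; since it is not reproduced here, below is a minimal sketch of what such a pipeline could look like. It assumes the PyTorch model zoo, a placeholder data path and the components images/masks; exact config keys and action signatures may differ between BatchFlow versions.

    from batchflow import FilesIndex, Dataset, ImagesBatch, Pipeline, B, R, P
    from batchflow.models.torch import UNet

    # one folder per item, holding an image and its mask (placeholder path)
    index = FilesIndex(path='/path/to/pascal/*', dirs=True)
    dataset = Dataset(index=index, batch_class=ImagesBatch)
    dataset.split(0.8)                          # 80% train / 20% test

    model_config = {
        'loss': 'ce',                           # cross-entropy for segmentation
        'optimizer': 'Adam',
    }

    train_pipeline = (Pipeline()
                      .init_model('dynamic', UNet, 'unet', config=model_config)
                      # loading details depend on how images and masks are stored
                      .load(fmt='image', components=['images', 'masks'])
                      .rotate(angle=P(R('uniform', -30, 30)), p=0.8)
                      .train_model('unet', B('images'), B('masks'))
                      ) << dataset.train

    test_pipeline = (Pipeline()
                     .import_model('unet', train_pipeline)
                     .load(fmt='image', components=['images', 'masks'])
                     .predict_model('unet', B('images'), save_to=B('predictions'))
                     ) << dataset.test

    train_pipeline.run(batch_size=16, n_epochs=10, shuffle=True, drop_last=True)
    test_pipeline.run(batch_size=16, n_epochs=1)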

The code is well structured and can be easily read and edited. The pipeline describes the way each batch passes from data loading to model training/evaluation. It is a lazy description, and computations are only executed after the run call.

Why bother with creating yet another library?

Our main focus is the development of applied libraries: seismiqb, SeismicPro and PetroFlow for geological applications, RadIO and CardIO for medical applications, and others. BatchFlow serves as a common base and contains functionality shared by all these tasks: creating batches, chaining methods into pipelines, item-wise parallelization of processing, training models, etc.

Each library takes advantage of this functionality and adds task-specific methods, such as loading seismic data in SEG-Y format for SeismiQB or heart signal processing methods in CardIO.

There are many similar frameworks. Take a look at the short comparison:

As mentioned before, we focus on applied libraries that can be built by any interested party. BatchFlow stands out from existing libraries because it can be understood and adopted from scratch, since:

  1. BatchFlow allows you to store almost all data processing and model methods in a single pipeline, rather than splitting it up into separate parts implemented using different libraries.
  2. It contains a flexible instrument to configure ML models.
  3. All projects have the same structure and style so it is very easy to start a new project and join a team as a new user, developer or even maintainer.

How to use BatchFlow

Our framework has a well-organized hierarchy of classes which can be reused to add the functionality necessary for a specific task with minimal effort. Let’s consider the main classes from the point of view of the tasks that they solve:

  • data loading and processing
  • creating pipelines
  • creating and training models
  • pipeline profiling

How to work with datasets

Often you deal with huge datasets, and batches have to be created on the fly from persistent storage. For example, satellite images can take tens of gigabytes. With the FilesIndex class you can create a unique index for each data item, consisting of paths to images:
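For instance, a sketch of creating such an index (the path is a placeholder):

    from batchflow import FilesIndex

    # every *.png file becomes one dataset item, keyed by its file name
    index = FilesIndex(path='/path/to/images/*.png', no_ext=True)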

FilesIndex contains the necessary methods to generate batches of indices that refer to dataset items. When the next_batch method is called, FilesIndex produces a batch of indices. For example,
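a sketch of drawing one batch of indices (no image data is loaded at this point):

    # a batch of 4 item indices, drawn in random order
    batch_index = index.next_batch(batch_size=4, shuffle=True)
    print(batch_index.indices)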

Note that the index just defines the way to refer to data and create batches of item indices. The data itself will be loaded later, when pipelines are run.

You can also use an index in the more complicated (and more probable) case when each item consists of an image and a mask stored in one folder. To make an index of folders, just add the dirs flag to the FilesIndex creation.
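A sketch, again with a placeholder path:

    # each subfolder holds an image and its mask; the folders become dataset items
    index = FilesIndex(path='/path/to/items/*', dirs=True)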

If the dataset can be stored in memory as an array (e.g. MNIST), each item can be specified by a numerical index. DatasetIndex makes it possible to work with such data in the same way.
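For example, a purely numerical index for an in-memory array:

    import numpy as np
    from batchflow import DatasetIndex

    # one integer index per item of an array that fits in memory
    index = DatasetIndex(np.arange(70000))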

Thus, DatasetIndex and FilesIndex are intended for data indexing and batch creation. If you need a custom index class, you can simply inherit from DatasetIndex or FilesIndex and implement the necessary features.

How to work with batches

It’s not enough to load batches to train a model; they must also be processed in some way (e.g. augmented, or with certain parts retrieved). The Batch class contains transformations to be applied to a batch. To define specific methods for your data which can be chained into a pipeline:

  1. create a class inherited from Batch, or from ImagesBatch if you want to use predefined methods to load, process and augment images
  2. add methods decorated with action

You can use the inbatch_parallel decorator to process batch items in parallel:
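Here is a minimal sketch of such a batch class; the class name, the some_method action and the per-item normalization are illustrative, assuming an 'images' component that holds numpy arrays:

    import numpy as np
    from batchflow import ImagesBatch, action, inbatch_parallel

    class MyBatch(ImagesBatch):
        @action
        @inbatch_parallel(init='indices', target='threads')
        def some_method(self, ix):
            # called once per batch item, in parallel threads
            pos = self.get_pos(None, 'images', ix)
            image = np.asarray(self.images[pos], dtype='float32')
            self.images[pos] = (image - image.mean()) / (image.std() + 1e-6)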

The action some_method gets an index of a batch item and returns a processed value. Batch items are handled separately, depending on the target parameter which specifies a parallelization engine: threads, async, mpc or for. Thus you can define the processing of a single data item without thinking about how the batch is split. Parallelism is implemented in pure Python, with all the ensuing consequences (e.g. the GIL limits thread-based speedups for CPU-bound code).

Besides, you can chain methods from the current namespace into a pipeline. In the PASCAL example, we define the process_mask method to transform labels.

How to make pipelines

A Pipeline is a way to chain Batch actions with standard actions (e.g. init_model, train_model). For example, let’s describe a pipeline that loads and augments a batch:
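A sketch of such a pipeline, assuming the index created above and ImagesBatch actions (parameter names may vary between BatchFlow versions):

    from batchflow import Pipeline

    augmentation = (Pipeline()
                    .load(fmt='image', components='images')   # read images from disk
                    .rotate(angle=30)                         # same angle for every item
                    .scale(factor=0.8))                       # same factor for every item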

As mentioned in the example, all the actions are lazy. They are not executed until their results are needed, for instance when you request a processed batch. The actions load, rotate and scale are defined in ImagesBatch, and we use them to load and transform images.

The attentive reader might notice that this pipeline does not augment data particularly well: each item is transformed in the same way, so we don’t add diversity to our dataset. To fix this, the parameters must be sampled randomly.
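A sketch with randomly sampled parameters (R) and an application probability p:

    from batchflow import Pipeline, R

    augmentation = (Pipeline()
                    .load(fmt='image', components='images')
                    .rotate(angle=R('uniform', -30, 30), p=0.8)    # new angle per batch
                    .scale(factor=R('uniform', 0.8, 1.2), p=0.8))  # new factor per batch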

The class R (for “Random”) samples a new angle and a new scale factor from uniform distributions for each batch. We also use the parameter p to apply an action to batch items with 80% probability. Now the parameters are randomly sampled, but each item still gets the same values. Let’s fix that.
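A sketch with P wrapped around R, so each item gets its own parameters:

    from batchflow import Pipeline, R, P

    augmentation = (Pipeline()
                    .load(fmt='image', components='images')
                    .rotate(angle=P(R('uniform', -30, 30)), p=0.8)    # new angle per item
                    .scale(factor=P(R('uniform', 0.8, 1.2)), p=0.8))  # new factor per item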

Here we add the wrapper P (for “Parallel”) to make the augmentations correct: each batch item gets its own sampled parameters. Since batches are created only when the run method of a pipeline is called, an instance of the batch does not exist at pipeline creation time, and even the batch size is not known. That’s why parameters can’t be sampled immediately; instead, you define how they should be sampled once the batch is created. BatchFlow includes many wrappers (so-called named expressions) to refer to objects that exist only at the pipeline execution stage.

You can split the pipeline into logical parts and then sum them.
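For example (the pipeline names are illustrative):

    # pipelines are combined with '+', preserving the order of actions
    full_pipeline = load_pipeline + augmentation + train_pipeline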

As we said previously, a pipeline describes computations lazily. To execute all of them, call the run method. The bitwise left shift operator (<<) joins a dataset to the pipeline.
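A sketch of attaching a dataset and running the pipeline:

    from batchflow import Dataset, ImagesBatch

    dataset = Dataset(index=index, batch_class=ImagesBatch)
    (full_pipeline << dataset).run(batch_size=16, n_epochs=1, shuffle=True)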

A Dataset represents a collection of elements; it knows how to create batches of them using a DatasetIndex/FilesIndex and how to process those batches with a Batch-inherited class.

How to define and use the model

You have prepared pipelines and described loading and processing of batches, but how do you plug in models? Don’t worry, BatchFlow includes a predefined (not pretrained) model zoo written in TensorFlow and PyTorch. It’s enough just to import and train the most widely used architectures, for example VGG, ResNet, ResNeXt, DenseNet, UNet, EfficientNet and many others.

The first example in this article shows how to use the UNet class. Model initialization and training actions in the pipeline can be defined in the following way:
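A sketch assuming the PyTorch UNet from the model zoo; config keys and train_model arguments vary between BatchFlow versions:

    from batchflow import Pipeline, B
    from batchflow.models.torch import UNet

    model_config = {
        'loss': 'ce',          # cross-entropy loss
        'optimizer': 'Adam',
    }

    train_pipeline = (Pipeline()
                      .init_model('dynamic', UNet, 'unet', config=model_config)
                      .train_model('unet', B('images'), B('masks')))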

In the simplest case, the model config includes only hyperparameters that configure the output of the network, but it can also include many other things. You can even define the entire architecture with just a model config.

B is a named expression like R and P; it refers to batch components and is used to define the inputs of the model. Inputs are defined this way because batches are created only at the pipeline execution stage.

How to profile pipelines

Sometimes batch methods are written inefficiently and become a bottleneck. To spot the problem, pass profile=True to the run call of your pipeline. As a result, you get a detailed report with the execution times of each action in the pipeline.
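A sketch of enabling profiling (show_profile_info, assumed here, prints the collected timings):

    # collect timings during the run, then print the per-action report
    full_pipeline.run(batch_size=16, n_epochs=1, profile=True)
    full_pipeline.show_profile_info()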

What else can BatchFlow do?

You now know the main ideas and structures of the library, but this is only the tip of the iceberg: BatchFlow can do much more, for example run parallel experiments on multiple GPUs, as mentioned above.

Articles about this functionality will be published and linked to this section.

Summary

We introduced the BatchFlow library, intended to make the development of ML models easier and clearer through self-explanatory pipelines. It provides the necessary instruments to

  • refer to dataset items and create batches,
  • organize processing methods in a convenient way using the Batch class,
  • implement within-batch parallelism,
  • create pipelines,
  • use predefined TensorFlow and PyTorch architectures,
  • profile pipelines to find performance bottlenecks.

Thus the library helps you produce structured and reproducible code when working on ML projects.
