BatchFlow: a framework for ML development
When a data science project starts, you write code to get acquainted with the data, test hypotheses and train a baseline model. At some point, this code needs to be organized so that the team can experiment and collaborate on it. This is where the BatchFlow library comes to the rescue: it helps structure the code conveniently and implement data processing as pipelines.
BatchFlow is an open-source Python framework for data handling, ML model training and related tasks. You can use the library to construct clear and expressive pipelines that describe:
- data loading and processing
- model training
- model validation
- inference
- metrics computation
The framework also includes a variety of methods for image processing and ready-to-use architectures of neural networks in TensorFlow and PyTorch. Moreover, BatchFlow has a convenient interface for running parallel experiments on multiple GPUs.
This article begins a quick tour of BatchFlow functionality. For a more detailed description, see the documentation and tutorials.
Full-featured example
Consider an example of BatchFlow usage where we train and validate a model that segments images from the PASCAL dataset. It may seem long, but it is clear and self-explanatory.
The code is well structured and can easily be read and edited. The pipeline describes the path each batch takes, from data loading to model training and evaluation. The description is lazy: computations are executed only after the run call.
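The original listing is not reproduced here; the following condensed sketch, based on the public BatchFlow API, shows what such a pipeline looks like end to end. The paths, config keys and hyperparameters are illustrative assumptions and may differ between BatchFlow versions.

```python
# A condensed sketch of a segmentation workflow with BatchFlow.
# Paths, config values and hyperparameters are illustrative only.
from batchflow import Pipeline, FilesIndex, Dataset, ImagesBatch, B, R, P
from batchflow.models.torch import UNet

# Each dataset item is a folder holding an image and its mask
index = FilesIndex(path='/data/pascal/*', dirs=True)
dataset = Dataset(index, batch_class=ImagesBatch)

model_config = {
    'loss': 'bce',                              # binary cross-entropy
    'optimizer': {'name': 'Adam', 'lr': 1e-3},
}

train_pipeline = (Pipeline()
    .init_model('dynamic', UNet, 'unet', config=model_config)
    .load(fmt='image')                                # lazy: runs per batch
    .rotate(angle=P(R('uniform', -30, 30)), p=0.8)    # per-item random angle
    .scale(factor=P(R('uniform', 0.8, 1.2)))
    .train_model('unet', B('images'), B('masks'))
) << dataset

# Nothing has been executed so far; computations start here
train_pipeline.run(batch_size=16, n_epochs=10, shuffle=True)
```

Every construct used here (indices, actions, named expressions, model config) is explained piece by piece in the sections below.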
Why bother with creating yet another library?
Our main focus is the development of applied libraries: seismiqb, SeismicPro and PetroFlow for geological applications, RadIO and CardIO for medical applications, and others. BatchFlow serves as a common base and contains functionality shared by all of these tasks: creating batches, chaining methods into pipelines, item-wise parallelisation of processing, training models, etc.
Each library takes advantage of this functionality and adds task-specific methods, such as loading seismic data in SEG-Y format for SeismiQB or heart signal processing methods in CardIO.
There are many similar frameworks. Take a look at the short comparison:
As mentioned before, we focus on applied libraries that any interested party can build upon. BatchFlow stands out from existing libraries because it can be understood and adopted from scratch:
- BatchFlow allows you to keep almost all data processing and model methods in a single pipeline, rather than splitting them up into separate parts implemented with different libraries.
- It contains a flexible instrument to configure ML models.
- All projects have the same structure and style so it is very easy to start a new project and join a team as a new user, developer or even maintainer.
How to use BatchFlow
Our framework has a well-organized hierarchy of classes which can be reused to add the functionality necessary for a specific task with minimal effort. Let’s consider the main classes from the point of view of the tasks that they solve:
- data loading and processing
- creating pipelines
- creating and training models
- pipelines profiling
How to work with datasets
Often you deal with huge datasets, and batches have to be created on the fly from persistent storage. For example, satellite images can take tens of gigabytes. With the FilesIndex class you can create a unique index for each data item from the paths to the images.
FilesIndex contains the methods needed to generate batches of indices that refer to items in the dataset: calling the next_batch method produces a batch of indices.
Note that an index only defines the way to refer to data and to create batches of item indices. The data itself is loaded later, when pipelines run.
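A minimal sketch of this, assuming the standard FilesIndex signature (the path pattern is illustrative and may vary by BatchFlow version):

```python
# Index every JPEG in a folder; the file name without its extension
# becomes the item id (no_ext=True). The path is illustrative.
from batchflow import FilesIndex

findex = FilesIndex(path='/data/images/*.jpg', no_ext=True)

# Produce a batch of 4 item indices; no image data is loaded yet
batch = findex.next_batch(batch_size=4, shuffle=True)
print(batch.indices)
```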
You can also use an index in the more complicated (and more likely) case when each item consists of an image and a mask stored in one folder. To build an index from folders, just add the dirs flag to the FilesIndex creation.
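For instance, assuming a directory layout where each sub-folder holds one image and its mask:

```python
# Treat each sub-folder (containing an image and its mask) as one item.
# The path pattern is illustrative.
from batchflow import FilesIndex

findex = FilesIndex(path='/data/pascal/*', dirs=True)
```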
If a dataset fits in memory as an array (e.g. MNIST), each item can be specified by a numerical index. DatasetIndex makes it possible to work with such datasets in the same way.
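A short sketch, assuming the dataset has 70,000 in-memory items:

```python
# Index in-memory items (e.g. MNIST images) by their position.
from batchflow import DatasetIndex

dindex = DatasetIndex(range(70_000))
batch = dindex.next_batch(batch_size=32, shuffle=True)
```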
Thus, DatasetIndex and FilesIndex are intended for data indexing and batch creation. If you need a custom index class, simply inherit from DatasetIndex or FilesIndex and implement the necessary features.
How to work with batches
Loading batches is not enough to train a model: they must also be processed in some way (e.g. augmented, or with certain parts retrieved). The Batch class contains the transformations to be applied to a batch. To define specific methods for your data that can be chained into a pipeline:
- create a class inherited from Batch (or from ImagesBatch, which provides predefined methods to load, process and augment images);
- add methods decorated with action.
You can also use the inbatch_parallel decorator to process batch items in parallel:
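A minimal custom batch class might look like this. The decorators and their arguments follow the public BatchFlow API; the per-item transformation and the exact item-access call are hypothetical placeholders:

```python
# A custom batch class with one parallelized action.
from batchflow import Batch, action, inbatch_parallel

class MyBatch(Batch):
    @action
    @inbatch_parallel(init='indices', target='threads')
    def some_method(self, ix):
        # called once per batch item (index ix), in parallel threads;
        # the transformation itself is a hypothetical placeholder
        item = self[ix]          # fetch one item by its index
        return transform(item)   # hypothetical per-item function
```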
The action some_method receives the index of a batch item and returns the processed value. Items are handled separately, depending on the target parameter, which specifies the parallelization engine: threads, async, mpc or for. Thus you can define the processing of a single item without thinking about how the batch is split. Note that parallelism is implemented in Python, with all the ensuing consequences: for example, the GIL means thread-based parallelism pays off mainly for I/O-bound or native-code workloads.
Besides, you can chain methods from the current namespace into a pipeline. In the PASCAL example, we define a process_mask method to transform the labels.
How to make pipelines
Pipeline is a way to chain Batch actions and standard actions (e.g. init_model, train_model). For example, let's describe a pipeline that loads and augments a batch:
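A sketch of such a pipeline; the angle and scale values are illustrative:

```python
# A lazy pipeline: nothing is executed at definition time.
# load, rotate and scale are actions defined in ImagesBatch.
from batchflow import Pipeline

augmentation = (Pipeline()
    .load(fmt='image')
    .rotate(angle=30)
    .scale(factor=0.8))
```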
As mentioned in the example, all the actions are lazy: they are not executed until their results are needed, for instance when you request a processed batch. The actions load, rotate and scale are defined in ImagesBatch, and we use them to load and transform the images.
The attentive reader might notice that this pipeline does not augment the data particularly well: each item is transformed in the same way, so we add no diversity to the dataset. To fix this, the parameters must be sampled randomly.
The class R (for "Random") samples a new angle and a new scale factor from uniform distributions for each batch. We also use the parameter p to apply an action to batch items with 80% probability. Now the parameters are sampled randomly, but every item in a batch still gets the same values. Let's fix that too.
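With R, the previous sketch becomes (distribution bounds are illustrative):

```python
# Sample a new angle and scale factor once per batch with R;
# p=0.8 applies rotate to each item with 80% probability.
from batchflow import Pipeline, R

augmentation = (Pipeline()
    .load(fmt='image')
    .rotate(angle=R('uniform', -30, 30), p=0.8)
    .scale(factor=R('uniform', 0.8, 1.2)))
```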
Wrapping the sampler in P (for "Parallel") makes the augmentation correct: each batch item gets its own sampled parameters. Since batches are created only when the run method of a pipeline is called, no batch instance exists at pipeline creation time, and even the batch size is unknown. That's why parameters can't be sampled immediately: you have to define how they will be sampled once the batch is created. BatchFlow includes many such wrappers (so-called named expressions) to refer to objects that exist only at the pipeline execution stage.
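The per-item version of the same sketch:

```python
# P makes R sample a separate value for every item in the batch.
from batchflow import Pipeline, R, P

augmentation = (Pipeline()
    .load(fmt='image')
    .rotate(angle=P(R('uniform', -30, 30)), p=0.8)
    .scale(factor=P(R('uniform', 0.8, 1.2))))
```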
You can split a pipeline into logical parts and then sum them.
As said previously, a pipeline describes computations lazily. To execute them, call the run method. The bitwise left shift operator joins a dataset to the pipeline.
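Putting both together; loading, augmentation and training are assumed to be pipelines defined earlier, dataset a Dataset instance, and the run arguments illustrative:

```python
# Compose pipelines with '+', attach a dataset with '<<',
# and execute everything with run.
full_pipeline = (loading + augmentation + training) << dataset
full_pipeline.run(batch_size=16, n_epochs=10, shuffle=True)
```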
Dataset represents a collection of elements: it knows how to create batches of them using a DatasetIndex or FilesIndex, and how to process those batches with a Batch-inherited class.
How to define and use the model
You have prepared the pipelines and described how batches are loaded and processed, but how do you plug in a model? Don't worry: BatchFlow includes a predefined (not pretrained) model zoo written in TensorFlow and PyTorch. It's enough to just import and train the most widely used architectures, for example VGG, ResNet, ResNeXt, DenseNet, UNet, EfficientNet and many others.
The first example in this article shows how to use the UNet class. Model initialization and training actions in the pipeline can be defined in the following way:
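A sketch of those two actions; the config keys and values are illustrative and depend on the backend and BatchFlow version:

```python
# Initialize and train a UNet inside a pipeline.
from batchflow import Pipeline, B
from batchflow.models.torch import UNet

model_config = {
    'loss': 'bce',                              # illustrative values
    'optimizer': {'name': 'Adam', 'lr': 1e-3},
}

training = (Pipeline()
    .init_model('dynamic', UNet, 'unet', config=model_config)
    .train_model('unet', B('images'), B('masks')))
```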
In the simplest case, the model config includes only the hyperparameters that configure the network's output, but it can include much more: you can even define the entire architecture through the config alone.
B is a named expression, like R and P, and is used to define the inputs of the model. Inputs are defined this way because batches are created only at the pipeline execution stage.
How to profile pipelines
Sometimes batch methods are written inefficiently and become a bottleneck. To find the culprit, pass profile=True to the run call of your pipeline. As a result, you get a detailed report with the execution times of each action in the pipeline.
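For example, assuming full_pipeline was built as above (show_profile_info is part of the BatchFlow API; its output format may vary by version):

```python
# Collect timing statistics during a run, then print the report.
full_pipeline.run(batch_size=16, n_epochs=1, profile=True)
print(full_pipeline.show_profile_info())
```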
What else can BatchFlow do?
You now know the main ideas and structures of the library, but this is only the tip of the iceberg. With BatchFlow you can also:
- speed up the training process in several ways by utilising multiple GPUs,
- prefetch batches,
- run multiple parallel experiments with ML models,
- use monitoring tools to check resource utilisation,
- easily construct complex sampling schemes.
Articles about this functionality will be published and linked to this section.
Summary
We introduced the BatchFlow library, intended to make the development of ML models easier and clearer through self-explanatory pipelines. It provides the necessary instruments to:
- refer to dataset items and create batches,
- organize processing methods conveniently using the Batch class,
- implement within-batch parallelism,
- create pipelines,
- use predefined TensorFlow and PyTorch architectures,
- profile pipelines to find performance bottlenecks.
Thus, the library helps you produce structured and reproducible code when you work on ML projects.