Research: a tool for multiple ML experiments

Alexey Kozhevin
Data Analysis Center
11 min read · Jul 5, 2021


The process of working on a data science project can be divided into three main steps:

  • getting acquainted with the data and developing a baseline solution
  • experimenting with different hyperparameters, data pre- and post-processing
  • deploying machine learning models into production applications

Tools for different steps depend on the task. For example, in seismic data analysis, it can be segyio for data loading and numpy and pandas for data manipulation. Models can be trained using pytorch or tensorflow. At the production stage, the resulting model can be wrapped into a web application using Docker and web frameworks.

As for the second step, you can use DVC to describe and evaluate a separate experiment, and MLFlow and TensorBoard to monitor and collect the results. However, sometimes the plan of experiments becomes complex and extensive: for example, when you need to try different neural network architectures, each with its own hyperparameters, as well as different data preprocessing procedures. Then the question arises: how do you generate configurations in cases more complex than what, say, GridSearchCV from sklearn can cover? And how easy is it to implement automatic (preferably parallel) execution of experiments with different parameter configurations?

Research is a part of BatchFlow which allows for easy evaluation of multiple parallel experiments with different configurations. It includes tools to

  • describe complex domains of parameters
  • create a flexible experiment plan as a sequence of callables, generators, and BatchFlow pipelines
  • parallelize experiments by CPUs and GPUs
  • save and load results of experiments in a unified form

To introduce Research, we will demonstrate how it works with a series of simple examples, gradually uncovering new possibilities.

Basic example

Let’s consider the simplest experiment: call a function power and save its output.
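A minimal sketch of what this looks like (assuming the usual batchflow.research import path; exact signatures, including where dump_results is passed, may differ between versions):

from batchflow.research import Research

def power(a=2, b=3):
    return a ** b

research = Research()
research.add_callable(power, save_to='power')   # store the output under the name 'power'
research.run(dump_results=False)                # keep results in RAM instead of dumping to disk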

Generally speaking, run will iterate over all possible configurations of parameters from the domain (here it is empty, so there is only one, empty configuration). Then, for each configuration, it will run several iterations of the experiment (only one by default) and save the result. By default, Research creates a folder and stores its results there, but we specify dump_results=False to store the results in RAM.

The results can be seen in a special table even during the research execution. They are stored in research.results and can be transformed into a pandas.DataFrame via the research.results.df property.
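For the default call of power (a=2 and b=3, hence the value 8) it looks roughly like this (the id is a placeholder and the exact column layout may differ):

research.results.df
#               id   name  value  iteration
# 0  1a2b3c4d5e6f7  power      8          0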

Here you can see the unique identifier of the experiment, the name of the stored variable, its value, and the iteration of the experiment at which the value was obtained. For now we have only one experiment with one iteration (we will discuss iterations later). The function power has default parameters a=2 and b=3, so it was executed with them.

Now we will complicate the experiment. Keyword arguments can be specified in the add_callable method.
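For example, to call power(a=3, b=2) in every experiment (the same hedged sketch as above):

research = Research()
research.add_callable(power, a=3, b=2, save_to='power')
research.run(dump_results=False)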

Positional arguments can be passed via the args parameter of add_callable.
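For example (assuming args takes a list of positional values):

research = Research()
research.add_callable(power, args=[3, 2], save_to='power')   # equivalent to power(3, 2)
research.run(dump_results=False)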

You can also add callables into research like this:

In that case, positional and keyword arguments are treated in the same way as in a normal call of this function. Auxiliary keyword arguments like save_to can also be used.

The result of all these researches will be the same: the value of power in the results will be equal to 9.

So we know that Research can execute something, but it is still not clear what the benefits of using it are. Next, we will reveal them one by one.

A flexible way to define parameters domain

The first benefit is the Domain class, which is intended for defining tricky domains of experiment parameters. In the described experiment, we have two parameters: a and b. Let’s say we want to run an experiment for all possible combinations of a and b defined by the lists [2, 3] and [2, 3, 4], respectively.
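A sketch of such a research (attaching the domain through the Research constructor is an assumption; EC is explained just below):

from batchflow.research import Research, Domain, EC

domain = Domain({'a': [2, 3], 'b': [2, 3, 4]})

research = Research(domain=domain)
research.add_callable(power, a=EC('a'), b=EC('b'), save_to='power')
research.run(dump_results=False)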

EC (an abbreviation for experiment config) is a named expression that refers to items of the config assigned to the experiment. In general, a named expression is a way to refer to objects that don’t exist at the moment of the definition. Thus, EC('key') stands for an experiment config item, while EC() without arguments stands for the entire experiment config.

The most common named expression is E, which gives access to the Experiment instance and thereby to all the attributes of the current experiment. For example, EC() is an alias for E().config.

The results will have two additional columns for config parameters from the domain:

It includes results for six experiments that correspond to the parameter combinations. Such a Domain instance produces a Cartesian product of parameters. The same domain can be defined as a product of two domains.
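In code that would look something like this (assuming multiplication is spelled with the * operator):

domain = Domain({'a': [2, 3]}) * Domain({'b': [2, 3, 4]})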

But this is still nothing special; it is a typical grid search.

In addition to multiplication, several other operations are defined on domains that allow you to create complex parameter grids. For example, the + operation means concatenation of domains.
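For instance, concatenating a domain over a with a domain over b (only the domain is sketched here, since the exact way of forwarding the whole experiment config to power, described below, depends on the library version):

domain = Domain({'a': [2, 3]}) + Domain({'b': [2, 3, 4]})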

The parameters of add_callable change here as well: we pass the entire experiment config into power as a dict of keyword arguments. For example, exactly power(a=2) will be called for the config {'a': 2} from the domain.

The first domain will generate the configs {'a': 2}, {'a': 3} and the second {'b': 2}, {'b': 3}, {'b': 4}, which is why the combined domain produces five configs.

Since each configuration specifies only one parameter, and we pass the entire config as keyword arguments to the function, the values of the remaining parameters are taken from the function’s defaults. The resulting data frame will have NaN values for that reason.

Domain also has the @ operator for a scalar (elementwise) product of lists of values of the same length. For example, Domain({'a': [2, 3]}) @ Domain({'b': [3, 4]}) will produce two configs: {'a': 2, 'b': 3} and {'a': 3, 'b': 4}.

With these three operations, you can create domains as complex as you like. The list of possibilities of Domain is much wider:

  • you can run experiments with the same config several times (which may come in handy if your experiment is non-deterministic, like neural network training)
  • randomly sample configs from Domain
  • use BatchFlow samplers instead of lists in Domain (which is extremely useful to deal with continuous parameters)
  • modify the domain during research execution based on results
  • one-parameter Domain like Domain({'a': [1, 2]}) can be defined as Option('a', [1, 2]).

All these features are described in the documentation. Note that Domain itself can be used without Research.

Experiment description

Let’s explain what exactly we mean by an experiment. It is a repetitive process of calling some actions depending on some parameters. Moreover, some actions may be executed only on certain iterations. A good example of an experiment is model training with periodic validation (say, every epoch). In that case, iterations correspond to training steps, and the actions are the training steps themselves and the computation of test metrics. An experiment formulated in this way can arise not only in data science.

Before we show a real ML example, let’s look at another toy experiment with an iteration generator.
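A hedged sketch of such an experiment (the parameter names start, b and length follow the results described below; the concrete values, the body of sequence, and addressing the generator unit by its function name are illustrative assumptions):

from batchflow.research import Research, Domain, EC, O

def sequence(start, length):
    # yields `length` consecutive integers starting from `start`
    for item in range(start, start + length):
        yield item

def power(a=2, b=3):
    return a ** b

domain = Domain({'start': [2, 3], 'b': [2, 3], 'length': [2, 3]})

research = Research(domain=domain)
research.add_generator(sequence, start=EC('start'), length=EC('length'))
research.add_callable(power, a=O('sequence'), b=EC('b'), save_to='power')
research.run(dump_results=False)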

At the first iteration, the function sequence creates a generator with the arguments specified in add_generator. At each iteration, the generator produces a new item which is used by the power unit. That item is stored until the next one is produced, so we can pass it into power using O('sequence') (just a reminder about named expressions: it is the same as E('sequence').output).

The number of iterations for each experiment is specified in the run call. By default, it is equal to 1 if the research contains only callables and None if it has generators. None is interpreted as infinity, and the experiment will continue until a generator in the research is exhausted. Here, the duration of the experiment depends on length.

We don’t use the save_to parameter with sequence, so its output will not be saved into the results. As a result, we will get the following data frame.

Here id is the id of the experiment, the columns start, b and length hold the parameters from the domain, and power is the column with the saved value (under the name we passed to save_to).

Note that here we can see a different number of iterations for different experiments. The point is that the generator is exhausted at the second or the third iteration (depending on length). By default, all units are executed at the last iteration, and then the experiment completes its work. At the last iteration, the generator's output is taken from the previous iteration.

The number of units (callables and generators added into a Research are called executable units) is not limited. The order in which they are added determines the order in which they are executed at each iteration. If the number of iterations is not specified, the experiment will be executed until one of the generators in the research is exhausted. The parameter when specifies the iterations at which a unit should be executed. By default, all added units are executed on each iteration.
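For instance, a sketch of a unit executed only at the last iteration (whether when accepts the value 'last' is an assumption):

def plus_one(x):
    return x + 1

research.add_callable(plus_one, x=O('power'), when='last', save_to='plus_one')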

Here we add one to the power output at the last iteration.

The parameter when specifies either the period of execution (when=100 means that the unit will be executed every 100 iterations) or exact iterations ('%5' stands for the fifth iteration). It can also be a list of such values.

We have several rows for each iteration, but we can reshape the results with research.results.to_df(pivot=False):

Now we have columns name and value instead of several columns for each variable.

In addition to callables and generators, you can add classes to the experiment.
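A rough sketch (both the add_instance method and addressing the method through the string 'calc.power' are assumptions based on the description below):

class MyCalc:
    def power(self, a=2, b=3):
        return a ** b

research = Research()
research.add_instance(MyCalc, name='calc')
research.add_callable('calc.power', a=3, b=2, save_to='power')
research.run(dump_results=False)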

An instance of the class MyCalc will be initialized at the start of the experiment under the name calc. Its attributes (callables and generators) can be referred to by names like calc.power.

Parallel experiments

All the previous toy experiments were too simple, so there was no need to execute them in parallel. But if you have heavier computations, you can speed up the research by executing experiments in parallel. Let’s emulate a heavy workload with a callable.
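For illustration, a stand-in that simply sleeps for about as long as the article's callable took (the original used real computation):

import time

def workload():
    time.sleep(40)   # emulates roughly 40 s of computation
    return 0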

The approximate execution time of that callable on my machine, outside of a research, is around 40 s. We will run it twice in a research by specifying n_reps. As we said earlier, Research can run experiments with the same config several times. In our case the config is empty, but we can still use n_reps.
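A sketch of the sequential run (passing n_reps to run is an assumption):

research = Research()
research.add_callable(workload, save_to='workload')
research.run(n_reps=2, dump_results=False)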

The execution time of the research is 1 min 14 s. To execute experiments in parallel, just define workers in the run method.
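Continuing the sketch, with two workers:

research = Research()
research.add_callable(workload, save_to='workload')
research.run(n_reps=2, workers=2, dump_results=False)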

Now we have two parallel workers that run the experiments within the research, and the full execution time is just 39.2 s, which is about twice as fast.

You can also specify additional worker configs that will be merged with the experiment configs from the domain. For example, workers=[{'device': 0}, {'device': 1}] means creating two workers, each of which adds its own config with a GPU device index to the experiment. As for GPUs, you can also just define workers=2 and devices=[0, 1], and each worker will start execution by setting the “CUDA_VISIBLE_DEVICES” environment variable to specify the available GPUs (the first worker will get the device with index 0 and the second the device with index 1).

Results processing

Until now, we have always run researches with dump_results=False. If we run a research with dump_results=True (which is the default), we will find a new folder named research in the working directory. Its structure looks like this:

research
├── env
│   ├── commit.txt
│   ├── diff.txt
│   └── status.txt
├── experiments
│   └── 74a701b639792
│       ├── config
│       ├── experiment.log
│       └── results
│           └── power
│               └── 0
├── monitor.csv
├── research.dill
└── research.log

You will not have to work directly with the contents of this folder, except for the cases when you save some artifacts (for example, trained models and predictions). In general, you have three ways to save anything in a research:

  • the save_to parameter. The output of the unit for the current iteration will be stored in the results under that name.
  • the save method of the research. It is a more flexible way because you can save not only the output of a unit but anything that you can refer to with named expressions, for example, any attribute of an added instance.
  • custom callables. The full_path attribute of the Experiment instance (which can be obtained with the EP() expression) is a relative path to the folder with the experiment results and can be used as a callable parameter.

All results saved by the first two methods will be placed in the results subfolder of the experiment. During the experiment execution, results are accumulated and saved to disk at the end of the experiment. The df property and the load method of research.results will load all of them. Besides, the load and to_df methods have parameters that allow you to filter results and load only what is required; thus you can load results for specific iterations or configs. This can be useful if the results contain heavy objects, for example, prediction examples.

Note that pandas.DataFrame is just a way to represent results so you can store anything. This is very convenient since we can use the entire pandas functionality for filtering, aggregating, etc.

The rest of the contents of the research folder is required to load the Research object and to obtain information about the state of the environment where the research was performed.

  • env folder stores the state of the git repository corresponding to the working directory
  • monitor.csv stores execution information
  • research.dill is a serialized Research object
  • research.log is a log of the whole research.

Besides, each experiment folder includes its config and its own log.

Roots and branches

Research can execute experiments described by one template but with different parameter configurations in parallel. However, sometimes loading and processing data (as is the case with CT images and seismic data) can take longer than training the model itself. If there is a common part in experiments with different configurations, it can be moved into a separate unit and evaluated once for several experiments.
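A sketch of such a pair of experiments before any sharing (the names load_data, mean and stats follow the text; the random data and the use of n_reps to get two experiments are illustrative assumptions):

import numpy as np
from batchflow.research import Research, O

def load_data(size=1_000_000):
    # emulates an expensive loading step; the data is random,
    # so independent calls produce different samples
    return np.random.random(size)

def mean(data):
    return data.mean()

research = Research()
research.add_callable(load_data)
research.add_callable(mean, data=O('load_data'), save_to='stats')
research.run(n_reps=2, dump_results=False)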

Without any sharing, each of the two experiments executes its own load_data and mean.

The resulting data frame will be the following:

As we can see, the stats values for the different experiments are different. Now let’s add root=True to the load_data callable and branches=2 to run.
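A sketch of this change (again, the exact parameter placement is an assumption):

research = Research()
research.add_callable(load_data, root=True)   # executed once and shared by the branches
research.add_callable(mean, data=O('load_data'), save_to='stats')
research.run(n_reps=2, branches=2, dump_results=False)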

In this case, load_data will be executed once for the two experiments, and then its output will be used by the mean units in the experiments (branches), which are executed in parallel threads.

Now both experiments work with the same dataset, so the results are the same for both. Note that a root function is shared by several branches with different configs. That’s why it is essential not to put anything that depends on the config into root units!

Instances added into a research also have a root parameter and can be shared between several branches.

Conclusion

Conducting many parallel experiments with resource allocation and complex domains of investigated parameters is not an easy task. To solve this problem, we developed Research, which provides tools that make it easier to describe parameter domains and experiment plans and to collect results. Here we described how to

  • add callables and generators into Research,
  • get results,
  • define parameters domains,
  • run experiments in parallel,
  • make some callables common for several experiments.

In the following articles, we’ll go into more detail on how to use Research to perform experiments with models.
