MLOps in Academic Research — Streamlining Machine Learning Experiments.

Published in omi-uulm · 8 min read · Aug 31, 2021

DevOps

DevOps comprises a set of continuous software engineering practices for improving the collaboration between software development and IT operations. The focus is on the implementation of so-called DevOps pipelines, which are automated processes that simplify the production and deployment of software applications while ensuring a reliable, efficient, and standardized workflow. As a result, both the number of potential errors and the production time are reduced. In a modified form, the DevOps principles are also applied to the machine learning domain. The goal of MLOps is to standardize data acquisition, data processing, model training, and model evaluation through automated processes, and thereby to streamline the lifecycle of machine learning applications.

Challenges

Besides the efficient realisation of cloud and cluster systems, the Institute of Information Resource Management (OMI) at Ulm University examines the cross-domain integration of cutting-edge machine learning algorithms. The research focuses on the prediction and synthesis of time series data as well as the optimization of decision-making processes. To increase the efficiency of academic research projects, an in-house DevOps pipeline for machine learning applications was designed. The challenge was not only to implement an application-independent automation framework, but also to consider the specific requirements of academic research. The focus was on the following points:

Complying with security standards
Enabling a parameterized execution of machine learning models
Realizing an automated deployment on high-performance servers

To integrate the automation pipeline into the institute’s existing project workflows, the MLOps principles were implemented using the DevOps platform GitLab. Moreover, GitLab comes with established features such as source code management, container registry and CI/CD infrastructure, making it more convenient to automate work processes.

Approach

The basic approach was to implement an automation framework that provides the foundation for newly created machine learning repositories. To set up the automation framework, the required credentials of the high-performance servers, such as the SSH key, server address, and access tokens, have to be provided. For security reasons, this operation is performed locally. The automation framework then creates the underlying folder structure of the GitLab repository as well as a list of all protected GitLab CI/CD variables used during the execution of the pipeline. Apart from these variables, no security-related information is stored in the GitLab repository. The GitLab CI/CD pipeline is further linked to the state of the GitLab repository, allowing users to decide manually after every incoming commit whether the source code of the GitLab repository should be executed. For the implementation, it was crucial that the results could also be obtained via the GitLab CI/CD pipeline.

Containerization

The first challenge was to design an automation framework capable of building parameterized machine learning models while satisfying the various compatibility requirements of the high-performance servers of the state of Baden-Württemberg. To provide machine learning models that can be used universally, Docker images were chosen as deployment units, containing both the source code and all necessary software dependencies such as system libraries, system tools, and the runtime environment. Docker is software for isolating applications by means of container virtualization. From the Docker images, so-called Docker containers are instantiated, each having its own file system, its own network stack, and its own resource constraints. Processes running within a Docker container are thus separated from the resources of the underlying host operating system, which provides an efficient and software-independent way to automate the deployment and execution of machine learning models, especially in MLOps pipelines.

Processes running within a container cannot affect processes running in another container, or in the host system.

Since the automation framework was meant to be accessible to a large number of academic staff and students, it was necessary to ensure that no unauthorized access to the GitLab Runner could be gained while executing the GitLab CI/CD pipeline. The greatest security risk was posed by Docker itself, as the underlying Docker daemon requires root privileges to create and run Docker images. In addition, Docker containers are instantiated with root privileges by default, which can allow unauthorized users to gain privileged access to the host operating system. To address these security vulnerabilities, Docker was replaced by the standalone, daemon-less, and unprivileged container image build software img, which also features a higher cache efficiency and can perform multiple build steps simultaneously. Since img inherits its commands from Docker, the transition from Docker to img could be accomplished without further problems.

To guarantee a parameterized and cost-efficient execution of the machine learning models, topic-related processes of the machine learning lifecycle were grouped, modularized, and saved as stand-alone Docker images. The GitLab CI/CD pipeline distinguishes between data-related and model-related processing steps, each of which is packaged as its own image. In addition, a third Docker image is created containing the parameter configurations of the machine learning model.

After creation, the three Docker images are stored in the container registry of the automation framework and, as .tar.gz archive files, in the GitLab Runner’s cache. By using .tar.gz archive files, Docker images can be transferred to the high-performance servers without exposing the credentials of the GitLab repository. The cache also ensures that the successive processing steps of the GitLab CI/CD pipeline retain access to the Docker images, even if non-contiguous sections of the source code have been modified multiple times. In addition, users can download the created Docker images from the GitLab registry to run them in their local environment.

GitLab CI/CD pipeline for creating a data image. Using the jess/img:latest image embeds the image build software img.
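The build, registry upload, and archive steps described in this section can be sketched as a list of img invocations. This is only an illustration, not the pipeline definition used at the institute: the function name, image name, and paths are hypothetical, and the sketch assumes the `build`, `push`, and `save` subcommands of img with their Docker-style `-t`, `-f`, and `-o` options. The commands are composed but not executed.

```python
from typing import List


def img_commands(image: str, dockerfile: str, context: str, archive: str) -> List[List[str]]:
    """Compose the img calls for one image (hypothetical helper, nothing is executed)."""
    return [
        # Build the image without a daemon and without root privileges.
        ["img", "build", "-t", image, "-f", dockerfile, context],
        # Upload the image to the container registry.
        ["img", "push", image],
        # Export the image as a tar archive that can be compressed and cached.
        ["img", "save", "-o", archive, image],
    ]


# Example values are placeholders, not taken from the original pipeline.
for command in img_commands("registry.example.org/project/data:latest",
                            "data/Dockerfile", "data", "data.tar"):
    print(" ".join(command))
```

In a CI job, each of these lists could be passed to `subprocess.run`, or written directly as script lines of the job.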

Deployment & Execution

The deployment of the created Docker images to the external high-performance servers is automated by the GitLab CI/CD pipeline, using two-factor authentication (2FA). In this way, the security of the high-performance servers is maintained while the error susceptibility of a manual transfer is reduced. To accommodate the different container virtualization solutions in use, the GitLab CI/CD pipeline has also been designed to match the respective high-performance server configuration by converting the transferred Docker images, for example to Singularity images, if needed.

The benefits of Singularity images lie in the management of user privileges. In contrast to Docker containers, the user privileges within a Singularity container are mapped to the individual users of the high-performance server and, by default, cannot be overridden to acquire the root privileges needed to access security-critical areas. Furthermore, Singularity containers are executed as sub-processes, which on the one hand eliminates the need for an additional daemon and on the other hand simplifies resource management. The latter enables the usage of GPUs to calculate complex machine learning models more efficiently. Alternatively, Podman images can be created, which serve a similar purpose.

In contrast to Docker, Singularity images are just normal files on your filesystem.
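The conversion of a transferred Docker archive into a Singularity image file can be sketched as follows. This is a minimal sketch under assumptions: the helper name and paths are hypothetical, the archive is assumed to be already unpacked to a plain .tar file, and the command relies on Singularity's `build` subcommand with a `docker-archive://` source. The command is only composed here, not run.

```python
from pathlib import Path
from typing import List


def singularity_build_command(archive: Path, output_dir: Path) -> List[str]:
    """Compose the call that turns a Docker tar archive into a .sif file (hypothetical helper)."""
    # Derive the image name from the archive name, e.g. model.tar -> model.sif
    image_name = archive.name.split(".")[0]
    target = output_dir / f"{image_name}.sif"
    return ["singularity", "build", str(target), f"docker-archive://{archive}"]


# Placeholder paths for illustration only.
print(" ".join(singularity_build_command(Path("/cache/model.tar"), Path("/project"))))
```

The resulting .sif file is an ordinary file in the project directory, which is what makes it easy to copy, cache, and execute with the privileges of the calling user.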

For the computation of the parameterized machine learning models, the Singularity images are executed sequentially, starting with the parameter image. The parameter image includes one or more parameter configurations in a JSON file. For each parameter configuration it contains, a separate environment file is created and saved to the mounted project directory of the high-performance server. In the next step, the data image is executed, which, depending on the implementation, either contains a previously created dataset or obtains and processes a new dataset during execution. Like the environment files, the dataset is stored in the project directory for further processing. Finally, for each stored environment file, an independent instance of the model image is executed, obtaining both the associated environment file and the required dataset via the project directory of the high-performance server. The environment files are used to set the environment variables of the containers, which in turn are linked to the parameters of the machine learning model; this realizes the parameterized execution. The machine learning model is then trained on the dataset.
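The hand-over from parameter image to model image can be sketched in a few lines. The sketch is illustrative only: the file naming scheme (`run_0.env`), the upper-casing of keys, and all function names are assumptions, and the run command assumes a Singularity version that supports the `--env-file` option. The command is composed but not executed.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_env_files(param_json: str, project_dir: Path) -> List[Path]:
    """Write one environment file per parameter configuration (hypothetical layout)."""
    configs = json.loads(param_json)
    # Accept a single configuration object as well as a list of them.
    if isinstance(configs, dict):
        configs = [configs]
    project_dir.mkdir(parents=True, exist_ok=True)
    env_files = []
    for index, config in enumerate(configs):
        env_path = project_dir / f"run_{index}.env"
        lines = [f"{key.upper()}={value}" for key, value in config.items()]
        env_path.write_text("\n".join(lines) + "\n")
        env_files.append(env_path)
    return env_files


def read_env_file(env_path: Path) -> Dict[str, str]:
    """Parse KEY=VALUE lines back into a dictionary, as a model container might."""
    pairs = (line.split("=", 1) for line in env_path.read_text().splitlines() if line)
    return {key: value for key, value in pairs}


def model_run_command(env_file: Path, model_image: Path) -> List[str]:
    """Compose one model run per environment file (assumes --env-file is available)."""
    return ["singularity", "run", "--env-file", str(env_file), str(model_image)]
```

Because every environment file leads to its own command, the model runs are independent of each other and could be submitted to the cluster separately.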

Documentation

Moreover, to keep the manual documentation of the experimental results as simple as possible, both the real-time tracking of the computational progress and the subsequent evaluation of the experimental models are automated by integrating the machine learning platform Weights & Biases. Especially for artificial neural network models, Weights & Biases offers a simple but efficient way to visualize the learning progress of different experimental setups within a central dashboard, allowing scientists to share their results with colleagues and students. Furthermore, a variety of neural network architectures can be both created and evaluated by using the automated parameter permutation, which supports a better-founded conclusion about the most suitable architecture of an artificial neural network model.
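The idea behind the automated parameter permutation can be illustrated with a small grid expansion. This is a generic sketch, not the framework's implementation: the function name and the parameter names and values are made up for illustration.

```python
from itertools import product
from typing import Any, Dict, List


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return every combination of the listed parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[name] for name in names))]


# Hypothetical search space for a small neural network.
grid = {
    "hidden_units": [32, 64],
    "learning_rate": [0.01, 0.001],
    "dropout": [0.0, 0.5],
}
configs = expand_grid(grid)
print(len(configs))  # 2 * 2 * 2 = 8 configurations
```

Each resulting dictionary corresponds to one experimental setup and could be handed to the experiment tracker as the configuration of a single run, so that all setups appear side by side in the dashboard.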

Performance of different parameter permutations, executed with Weights & Biases.

Results

Finally, by submitting a GitLab CI/CD hook, the results of the parameterized machine learning models are copied from the high-performance servers and stored as job artifacts in the repository of the automation framework. The source code management of GitLab also provides the administration and documentation of each experimental run, allowing scientists and students to reproduce the results based on the changes in the source code.

Conclusion

The implemented automation framework streamlines the execution of machine learning experiments by running an in-house GitLab CI/CD pipeline. Besides following the MLOps principles, the pipeline fulfils the demands of academic research, especially regarding data security and the deployment of machine learning applications. Even though the workflow solves only one specific problem within academic research, the automation framework shows the potential of MLOps and DevOps approaches: once the GitLab CI/CD pipeline has been started manually, the complete process of containerization, deployment, and execution of machine learning models on the high-performance servers of the state of Baden-Württemberg, as well as the subsequent retrieval of the experimental results, runs without any further user intervention.

The resulting time savings and data security boost both the academic research and the scientific exchanges between scientists and students.