Mono-repo for model delivery — Pros and Cons

Published in Thomson Reuters Labs, Apr 13, 2023

There are two general approaches to structuring code repositories for machine learning model development: multi-repository and mono-repository. A multi-repository setup separates code by some facet, say “area of concern”, whereas a mono-repository keeps all of a project’s code in one place. We have typically used a multi-repository approach to model delivery. Both approaches have pros and cons, which this article discusses.

Current Approach

Here is how one TR Labs project structured its code across separate GitHub repositories:

Experimentation — This contains all the experiments, trials and notebooks that researchers have worked on to train/develop the best model for a task.
Re-training of models — Contains code for creation of a training pipeline on SageMaker for model re-training.
Inferencing — Involves code, scans, tests and CICD pipelines that are required to package and deploy the model as a docker image to ECR before it can be consumed by the App teams.

However, this process introduces several gaps when it comes to maintainability of the project.

Multiple repositories to keep track of and maintain for a single project. As the model changes, the re-training and inferencing repositories must be kept in sync, which can be challenging, and care is needed to carry every change into all 3 repositories.
Common code shared between the above 3 repositories. Code used by experimentation, inferencing and re-training has typically ended up in a separate repository of its own, which must also be maintained and adds overhead.

Tooling

TR Labs has tooling that provides scaffolding around AWS and SageMaker. MLTools-CLI is a wrapper around AWS SageMaker experiments that helps generate boilerplate code for training and processing jobs. It also gives projects structure and isolates the environments of individual experiments.

Python Package Template: We use a Python cookie cutter template to set up our Python projects. It establishes the directory structure (separation of concerns), as well as code automations such as style, type, and other pre-commit checks.

MLTools-CLI: During research work for a particular project, several experiments and trials are typically performed in Jupyter notebooks. This makes it hard to later isolate the code for a particular experiment or trial. MLTools-CLI provides code scaffolding that organizes experiments and trials into their own folders, which map to corresponding locations on S3. It also encourages writing scripts for training and processing jobs by generating boilerplate code. Based on these scripts, the tool can also generate the SageMaker model re-training pipeline.
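The idea of experiment and trial folders mapping to matching S3 locations can be sketched in a few lines of Python. This is an illustration only and does not show how MLTools-CLI is implemented: the function name `s3_prefix_for`, the `experiments/<experiment>/<trial>` layout, and the bucket name are all hypothetical.

```python
from pathlib import PurePosixPath


def s3_prefix_for(local_dir: str, project_root: str, bucket: str) -> str:
    """Map a local experiment/trial folder to an S3 URI with the same relative path.

    Hypothetical helper: the real tool's mapping rules are not described in the article.
    """
    relative = PurePosixPath(local_dir).relative_to(PurePosixPath(project_root))
    return f"s3://{bucket}/{relative.as_posix()}"


# Assumed layout: <project root>/experiments/<experiment>/<trial>
print(
    s3_prefix_for(
        "/work/project/experiments/exp-001/trial-01",
        "/work/project",
        "my-hypothetical-bucket",
    )
)
```

Keeping the local path and the S3 prefix identical below the project root means the artifacts of a trial can be located from its folder name alone.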

Mono Repo approach

Over the past year, Labs has been working on a new project layout that allows experimentation, training and inferencing code to reside within a single repository, a mono-repo. The goal is to determine whether the above 3 repositories can be combined into one repository on which researchers and engineers collaborate.

To test feasibility of this approach, we used an existing multi-repository solution consisting of experimentation, model re-training, and inferencing repositories.

The first step was to see if the Experimentation and Model Re-training repos could be combined using SageMaker experiments. The use of MLTools-CLI (see Tooling) made this possible. Once the experiments and trials are organized into scripts and a defined folder structure using the tool, a model re-training pipeline on SageMaker can easily be generated. The tool generates this pipeline by referencing one or more scripts used for pre-processing, training, evaluation and/or inferencing of a model.

Secondly, we needed to understand if researchers and engineers could work in the same repository (Inferencing), which is used to deliver the model. The challenges here were ensuring Python virtual environment isolation and sharing common code between engineers and researchers.

Proposed Project Structure

The top-level project is created using the Python package template (see tooling section). This level contains the code required to package and deploy the model as an endpoint as well as common code that is shared between engineers and researchers.

Engineers only work at the top level of the project. This level is subject to quality and security scans as it contains production facing code.

All researcher experiments and code are contained within the folder “research”. This level is also created using the Python package template. Files at this level are not subjected to quality scans since they are not deployed to production. It contains an experiments folder in which each experiment has its own virtual environment. The MLTools-CLI tool (see tooling section) assists in generating this structure.
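A minimal sketch of this layout, created in a temporary directory, might look like the following. Only `src`, a common folder under `src`, `research` and its `experiments` folder come from the article; the `tests` folder, the experiment name `exp-001`, and the names `LAYOUT` and `create_skeleton` are assumptions made for illustration. Per-experiment virtual environments are not created here.

```python
import tempfile
from pathlib import Path
from typing import List

# Relative folder -> who owns it. Production folders are subject to quality and
# security scans; research folders are not. "tests" and "exp-001" are hypothetical.
LAYOUT = {
    "src/common": "production",
    "tests": "production",
    "research/experiments/exp-001": "research",
}


def create_skeleton(root: Path) -> List[str]:
    """Create the folders in LAYOUT under root and return them sorted."""
    for relative in LAYOUT:
        (root / relative).mkdir(parents=True, exist_ok=True)
    return sorted(LAYOUT)


with tempfile.TemporaryDirectory() as tmp:
    for folder in create_skeleton(Path(tmp)):
        print(f"{folder:32} {LAYOUT[folder]}")
```

In a real project each folder under `research/experiments` would additionally get its own virtual environment, for example one created with `python -m venv`.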

Migration of researcher experimentation repo with history

As with several projects at TR, the researcher often creates their own experimentation repository before the engineer joins the effort. In this case the repository needs to be migrated, with its history, into the inferencing repository’s “research” folder by following the steps below:

Prerequisites

Install git-filter-repo (see the warnings about using git’s filter-branch):

pip3 install git-filter-repo

Steps

To import source_repo/sub/dir to target_repo/sub/dir:

1. Clone the desired branch of the source repo to a local directory

git clone {source_repo URL} source_repo_clone --no-local --branch main 
cd source_repo_clone

2. Remove remote reference to avoid accidental changes to original source repository

git remote rm origin

3. Rewrite history to include only entries related to sub/dir

git filter-repo --path sub/dir --force

To import the whole source repository, use this filter instead

git filter-repo --to-subdirectory-filter {destination dir} --force

4. Merge into the target repo

cd ../target_repo
git remote add source_repo ../source_repo_clone/
git fetch source_repo main
git merge remotes/source_repo/main --allow-unrelated-histories
git remote remove source_repo
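The steps above can be collected into a small Python helper. This is a sketch rather than part of the original workflow: the names `migration_steps` and `run_steps`, the default directory names, and the example URL are hypothetical. It uses the whole-repository variant (`--to-subdirectory-filter`), assumes the clone and the target repo sit side by side in the current directory, and defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List, Tuple

Step = Tuple[str, List[str]]  # (working directory, command as argv)


def migration_steps(
    source_url: str,
    branch: str,
    dest_dir: str,
    clone_dir: str = "source_repo_clone",
    target_dir: str = "target_repo",
) -> List[Step]:
    """Return the git commands from the steps above, in order."""
    remote = "source_repo"
    return [
        (".", ["git", "clone", source_url, clone_dir, "--no-local", "--branch", branch]),
        # Drop the remote so the original source repository cannot be changed by accident.
        (clone_dir, ["git", "remote", "rm", "origin"]),
        # Rewrite history so every file lives under dest_dir.
        (clone_dir, ["git", "filter-repo", "--to-subdirectory-filter", dest_dir, "--force"]),
        (target_dir, ["git", "remote", "add", remote, f"../{clone_dir}/"]),
        (target_dir, ["git", "fetch", remote, branch]),
        (target_dir, ["git", "merge", f"remotes/{remote}/{branch}", "--allow-unrelated-histories"]),
        (target_dir, ["git", "remote", "remove", remote]),
    ]


def run_steps(steps: List[Step], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cwd, argv in steps:
        print(f"[{cwd}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)


# Placeholder URL; replace with the real source repository before a real run.
run_steps(migration_steps("https://example.invalid/experiments.git", "main", "research"))
```

Reviewing the dry-run output before passing `dry_run=False` is a cheap safeguard, since `git filter-repo` rewrites history in the clone.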

Branching strategy

For engineers and researchers to collaborate on the same repository, the branching strategy needs to be clearly defined.

The branches in green and yellow (shown above) are owned by the engineering team. The branches in red are owned by the researchers.

Engineers in the mono-repo work only within the following 3 kinds of branches:

The main branch: used for releases of new versions of the solution to the AI Model Registry
The development branch: used for continuous integration from the feature branches, building and deploying the Docker image containing the packaged model to the Labs ECR Registry
Feature branch(es): one or more feature branches created off the “develop” branch

Researchers will typically work off the two branches as described below:

The experiments branch — which is created from and kept in sync with the engineering development branch
One or more feature branches created off the experiments branch

Figure 4 — Researcher branching strategy
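The branch ownership and merge directions described in this article can be written down as a small set of rules. The sketch below is illustrative only: the branch names `main`, `develop` and `experiments` follow the article, while the `feature/` and `research/` prefixes and the function names are hypothetical conventions. The experiments-to-develop promotion reflects the workflow section, and the syncing of develop back into experiments is not modeled.

```python
from typing import Optional


def branch_owner(branch: str) -> str:
    """Return which team owns a branch under the assumed naming conventions."""
    if branch in ("main", "develop") or branch.startswith("feature/"):
        return "engineering"
    if branch == "experiments" or branch.startswith("research/"):
        return "research"
    return "unknown"


def pr_target(source: str) -> Optional[str]:
    """Return the branch a pull request from source should target, or None."""
    if source.startswith("feature/"):
        return "develop"  # engineering feature work integrates into develop
    if source.startswith("research/"):
        return "experiments"  # researcher feature work merges into experiments
    if source == "experiments":
        return "develop"  # promotion of a finished experiment, done by an engineer
    if source == "develop":
        return "main"  # release
    return None


for name in ("feature/add-endpoint", "research/new-tokenizer", "experiments", "develop", "main"):
    print(f"{name:24} owner={branch_owner(name):12} target={pr_target(name)}")
```

Rules like these could be enforced with branch protection or a CI check, although the article does not say whether TR Labs does so.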

Proposed workflow

Engineers work only at the root level of the project, committing to folders at the root level and any folders within src. Researchers, on the other hand, work only within the research folder and use MLTools-CLI (see tooling) to create and track experiments. Researchers work off the experiments branch, which is always kept in sync with the develop branch. They create new feature branches off the experiments branch and merge them back into it when done.

When a particular experiment is ready to move into the develop and main branches, the engineer creates a new PR from the experiments branch into the develop branch. If any code needs to be extracted into the common folder under src, the engineer does the refactoring before merging the PR into the develop branch. When it’s time to release the model artifact, the engineer creates a PR from the develop branch into the main branch.
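The division of work by folder lends itself to a simple check on the files a change touches. The following is a hypothetical sketch, not something the article says is in use: the role names, the function name `violations`, and the example paths are assumptions.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

ROLES = ("engineer", "researcher")  # assumed role names


def violations(role: str, changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that fall outside the area the role works in.

    Researchers are expected to touch only research/, engineers everything else.
    """
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    outside = []
    for path in changed_paths:
        in_research = PurePosixPath(path).parts[:1] == ("research",)
        if (role == "researcher") != in_research:
            outside.append(path)
    return outside


changed = ["research/experiments/exp-001/train.py", "src/common/utils.py"]
print(violations("researcher", changed))
print(violations("engineer", changed))
```

A check like this could run in a pre-commit hook or a PR pipeline, with the refactoring of shared code into src remaining an engineer's task as described above.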

Conclusion

Having engineers and researchers share a single repo for a given project reduces the maintenance overhead of multiple repositories. With all the code for a project (inferencing, CICD, model re-training and experiments) in the same repository, future enhancements and refactoring become easier.

In addition, keeping the common code shared by researchers and engineers in the same repository reduces the number of packages that need to be built and avoids the need for multiple licenses per project (for security and Docker image scans).