Migrating to Git — A Piece of Cake or a Full-Blown Engineering Project?

Pavel Mrázek
MANTA Engineering Blog
12 min read · Jun 6, 2022

How It All Began. Do We Really Need Git?

Before migrating to Git, we had been using Subversion to store all our source code for more than a decade. SVN did its job well and accompanied us on the path from a very small experimental tool to a commercially successful automated data lineage solution. However, discussions about having all our code versioned in Git were becoming more frequent. Our developers had been asking for it (well, some of them). Git is popular among the majority of our potential future colleagues out there, and we should stay competitive. Google Trends favors Git, and Git also holds the biggest share of the market. Most importantly, MANTA is a leading technology in the data lineage space, so we are continuously evaluating which tools to adopt and which to abandon.

Personally, I am not a fan of heedlessly experimenting with the shiny allurements of the latest trends. While it’s important to adapt and innovate, the “latest and greatest” trend can sometimes overshadow the importance of established principles and practices, such as good SW architecture, design patterns based on OOP, CI systems, static code analysis, a well-defined software process regardless of the selected methodology, automated tests, etc.

That said, there are many things I love about Git, such as easy branching, fast command execution, pull requests, integration with numerous tools and systems, powerful scripting possibilities, and the ability to scale to support our growing team of developers.

These benefits are the low-hanging fruit. Besides migrating to another source code management system, there is another dimension — the overall tuning of our development pipeline.

Planning the Migration to Git

Once we started thinking about the migration, we uncovered more and more complexities. We found out that it would affect the structure of the repository, the branching model, the build system, dependency management, continuous integration, the release process, the daily development routine, code review, etc. It was already clear that we were not talking about a “one script does it all” task that could be finished over a weekend. We needed a plan to make sure we all understood the what, who, when, and how. The plan contained a work breakdown list of approximately 100 tasks comprising 200 man-days (MD) of work across several teams.

Do not underestimate planning. Your boss’s first questions will probably be: “How much will it cost? How many people will you need? When will you be done?” I admit I was thankful that the VP of Engineering was absolutely on the same page, and we approved the plan while understanding it was a real project with all its phases, prerequisites, schedule, and dependencies. Artificially pushing down the costs to win people’s approval usually leads to even bigger disenchantment in the later phases of the project.

Here we were, a little team of three engineers and a plan for the migration of the whole codebase from SVN to Git. We worked out a list of activities needed to finish the whole migration:

  • Craft a migration plan to streamline the final migration
  • Rollback plan or plan B if things go wrong
  • Communication with the development team about the planned changes that will affect their work
  • Analysis, divided into several parts (we will get to it in the next chapter), describing the impacts and the technical solution
  • Acceptance criteria to make sure we did not miss anything important and the final migrated codebase was complete, uncorrupted, and consistent
  • Training to ensure everyone was familiar with even advanced Git commands and techniques
  • Configuration and administration (authorization, authentication, access rights, user groups, establishing repositories, branch settings, disabled force push, protected branches, notifications)
  • Infrastructure (HW, SW, hosting, RAM, HDD, CPU, monitoring)
  • Test environment (Git, Jenkins, SVN, everything backed up)
  • Implementation (migration scripts, impacts on the build tools, utilities, CI, release scripts, Git hooks)
  • Be prepared to run the migration procedure repeatedly before everything is perfect. The migration will probably require manual steps.
  • Install the local development environment (all developers will need to install Git, configure it, generate SSH keys, clone the repo, set up the IDE, and migrate work in progress manually; see the sketch after this list).
  • Baby-sitting: after-migration support, solving issues, helping others, fixing broken builds, new requests, configuration, performance tuning, monitoring
  • Initial lower efficiency — the developers will need some time to get to full speed
  • Time for some overhead activities
    - Meetings with the migration team and the stakeholders (architects, DevOps team, managers)
    - Consultations
  • Documentation of the development routines (specific branching strategy, release procedure, changes in CI, changes in code review, impacts on existing utilities, installation instructions)
  • Buffer for the unknown. I’d recommend a 20% to 50% risk buffer, depending on how brave you are and how established you are with your team and environment.

A simple Gantt chart of these activities showed which of them could run in parallel, their dependencies, and the expected schedule.
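
As an illustration of the “local development environment” item above, the per-developer setup boils down to a handful of commands (the name, e-mail, and repository URL below are placeholders):

    # One-time Git configuration (name and e-mail are placeholders)
    git config --global user.name "Jane Developer"
    git config --global user.email "jane.developer@example.com"

    # Generate an SSH key and register its public part with the Git server
    ssh-keygen -t ed25519 -C "jane.developer@example.com"
    cat ~/.ssh/id_ed25519.pub

    # Clone the monorepo (URL is a placeholder) and start from develop
    git clone git@git.example.com:manta/monorepo.git
    cd monorepo
    git checkout develop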

Analysis

Knowing there would be so many important topics to consider, we decided to proceed as if we were developing a piece of software. Before jumping into implementation, we needed to make sure all relevant solutions, with their impacts, pros, and cons, were considered. Each topic was analyzed, and all assumptions were approved by the group of architects and selected team leaders of the development teams.

Monorepo Versus Many Repos

Our codebase resided in hundreds of independent SVN repositories and a few existing smaller Git repositories with their own commit history. Additionally, each repo was structured into trunk, branches, and tags. Our task was to decide whether to merge all product-related repos into one monorepo or to go with many little repos independently.

Initially, we gathered all the requirements for the repository structure and how we wanted to work with it. Some questions we asked ourselves included:

  • Can new projects be added with ease?
  • How do we restrict user rights to grant access to certain user groups to certain repositories or parts of a repository?
  • How fast will the clone be?
  • How could the checkout of a branch in a monorepo influence the IDE? Could reindexing the whole project cause the IDE to crash?
  • How will we refactor a code spanning several modules?
  • How to merge the history of the individual SVN repositories into a single history log?
  • How to merge the existing Git repositories into the monorepo while persisting their history log?
  • Shall we retain individual module-related release branches or merge them into one big branch?
  • What do we do with the historical module-related tags?
  • How would the code review be structured in a monorepo / multiple repos?
  • How to merge the changes atomically?
  • How shall the experimental projects be treated?
  • How shall the folders be restructured?
  • How to carry SVN revision IDs over into the Git commit log?
  • How to transform svn:ignore properties into .gitignore files?
  • What to do with the deleted files and folders?
  • What to do with the files stored aside from the standard trunk-branches-tags structure?

The conclusion of the analysis was: “The fewer repos we maintain, the fewer problems we will have. Especially releasing and inter-module development will be easier.” It showed that scaling a single Git repo is not a problem: a full Git clone took 20 minutes, compared to several hours for the equivalent SVN checkout. The final monorepo hosted all application modules that are released together and whose versions should be kept in sync. We also created twenty additional small Git repositories dedicated to utilities and experimental projects.
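
Merging an existing Git repository into the monorepo while keeping its history (one of the questions above) was eventually handled by the ToMono utility described later. As a rough sketch of the underlying idea, the same effect can be achieved with plain Git commands (the repo name and path here are illustrative):

    # Inside the monorepo: import an existing repo, keeping its history
    git remote add utility-x ../utility-x      # illustrative local path
    git fetch utility-x

    # Merge the unrelated history; a follow-up commit can then move the
    # imported files under the module's own subdirectory
    git merge --allow-unrelated-histories utility-x/master
    git remote remove utility-x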

Selecting a Proper Branching Model

Now that the structure of the repository was defined, the next step was to lay out a branching strategy. To start with, we adopted the widely used Git-flow model. It was a huge step to move away from trunk-based development to more complex but powerful feature branches.

Inspired by https://nvie.com/posts/a-successful-git-branching-model

However, this model does not fit a software product like MANTA, which is deployed and maintained in several parallel production versions. Also, tagging must be done differently to match our release process. We also practice a so-called “minor release,” which follows a slightly different branching and merging procedure.

We reshaped the given model to respond better to our needs:

  1. Only 1 permanent integration branch (develop)
  2. No master branch.
  3. There can be several parallel release branches. All tags are placed on the release branches.
  4. A feature branch is merged to develop using the --no-ff switch to create a merge commit, for traceability and easier cherry-picking of a single merge commit (see the sketch below).
  5. All development (features and bugfixes) is done via feature branches (as short-lived as possible).
  6. Feature branches branch off from the integration branches (develop, release branch).
  7. Regular pushing to origin, voluntary squashing of commits, forbidden force-push, and regular rebasing of feature branches onto the head of the integration branch.
  8. We use pull requests and all code must pass a code review.
Branching model
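
In day-to-day terms, the life of a feature branch under these rules looks roughly like this (branch names are illustrative):

    # Branch off the up-to-date integration branch
    git checkout develop
    git pull --rebase origin develop
    git checkout -b feature/lineage-export      # illustrative name

    # ...commit work, push regularly (rule 7)...
    git push -u origin feature/lineage-export

    # Keep the branch current by rebasing it onto the head of develop
    git fetch origin
    git rebase origin/develop

    # After the pull request passes review, merge with an explicit
    # merge commit (rule 4)
    git checkout develop
    git merge --no-ff feature/lineage-export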

Continuous Integration

Our branching model was completely redefined. The workflow consists of feature branches merged via pull requests. In contrast to the past, we wanted to incorporate the release branches into our CI to be able to package and test the new versions heading to production.

We used Jenkins Maven jobs to continuously integrate the changes in the main development branches, such as develop and the release branches. The chain of dependencies is represented by upstream and downstream Jenkins projects based on the Maven configuration (pom.xml files). The build of one subsidiary module triggers the superordinate jobs in Jenkins.

Some types of branches, such as feature branches, are created on the fly and are automatically tracked in Jenkins. We used the “Multibranch Pipeline” and Git plugin to notify Jenkins about recent changes. The pipeline is defined in the “Jenkinsfile” describing the build behavior for different types of branches and separate stages of the build. In fact, we build only the pull requests, not the underlying feature branches.

Having more than 150 Jenkins jobs required a certain level of automation. We call it the Jenkins utility, and it simplifies bulk operations on Jenkins jobs such as creating, pausing, and cloning them.
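
The Jenkins utility itself is internal, but as a sketch of the idea, bulk job creation can be driven through Jenkins’ standard REST API; the URL, credentials, template file, and module list below are placeholders:

    JENKINS_URL="https://jenkins.example.com"      # placeholder
    AUTH="admin:api-token"                         # placeholder user:token

    # Create one job per module from a shared config.xml template
    for module in module-a module-b module-c; do   # illustrative list
      sed "s/@MODULE@/${module}/g" job-template.xml > config.xml
      curl -s -X POST "${JENKINS_URL}/createItem?name=${module}" \
           --user "${AUTH}" \
           -H "Content-Type: application/xml" \
           --data-binary @config.xml
    done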

If you are curious about the nitty-gritty details, check my blog post about CI and a multi-modular monorepo. It elaborates on more advanced topics:

  • When to use regular polling
  • How to detect and build only the changed modules in a single PR
  • Managing direct and transitive dependencies
  • Avoiding conflicting versions of artifacts in different branches
  • When to deploy the artifacts into the Nexus repository
  • Git sparse-checkout to fetch only the sources in the affected parts of the repo (sketched below)
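
For a taste of that last item, sparse checkout reduces a monorepo working copy to just the modules a given job needs (the repository URL and module paths are illustrative):

    # Clone without checking out, skipping file contents where supported
    git clone --filter=blob:none --no-checkout git@git.example.com:manta/monorepo.git
    cd monorepo

    # Restrict the working tree to the modules this job builds
    git sparse-checkout init --cone
    git sparse-checkout set module-a module-b   # illustrative module paths
    git checkout develop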

Build and Release Process

This was a real challenge. Remember, our codebase is fragmented into one thousand modules that form a complicated web of dependencies. These, however, are assembled into a few end products (applications) and distributed via an installer wizard and Docker images in a public Nexus repository. The release process is done centrally for all modules. We support three kinds of releases: major, minor, and hotfix. Each has its unique characteristics but also some common features that were automated and simplified. And we were not allowed to break anything.

In the middle of it all stands the Maven release plugin, wrapped in our automated script and extended with a custom Maven release strategy. The tricky part was to get the versioning schema right; more specifically, automatically assigning the right SNAPSHOT and release versions to all modules in different Git branches, while branches are born, die, and get merged, and while artifacts and their aggregate modules are released and re-released when something goes wrong.
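
Our wrapper script is internal, but underneath it drives the standard Maven release plugin. Stripped of the custom strategy, the core calls look roughly like this (version numbers and the tag name are illustrative):

    # Tag the release and bump working copies to the next development version
    mvn -B release:prepare \
        -DreleaseVersion=4.2.0 \
        -DdevelopmentVersion=4.3.0-SNAPSHOT \
        -Dtag=R4.2.0

    # Check out the tag, rebuild, and deploy the released artifacts
    mvn -B release:perform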

Our release process is far from perfect. We knew that before the migration and we know it now. However, we had to descope several improvement ideas from the migration backlog to finish in a reasonable time.

Migration Script

Streamlining the migration itself required a maximum amount of automation. The following flow chart depicts the distinct parts of the migration shell script.

Migration script flow chart
  • The flow diagram depicts the steps of the automated migration script.
  • The left branch represents the original SVN repo on the input.
  • The right branch shows how the original smaller Git repos were transformed into their raw (bare) format to enable fast processing. Additionally, tags and release branches were given a unique name to reflect the module name.
  • The original SVN repository was cloned (checked out).
  • Mapping of SVN commits authors to new Git user accounts.
  • The rules for the migration were captured in the *.rules files. They consisted of
    - List of all SVN modules
    - Mapping of original SVN addresses to new addresses in Git.
    - Mapping of tags and branches and folders.
  • The “*.rules” files were used by the svn2git utility to transform the SVN repo into a single Git repo. We slightly customized svn2git to use our merge strategy and to enrich the commit message with the original SVN revision ID (a minimal example follows this list).
  • The history of commits was cleaned.
  • The first test suite was triggered to verify correctness and completeness:
    - The content of develop branch corresponds to the trunks of the individual modules
    - No files or directories can be missing or be redundant
    - All feature branches from the original repositories should be present
  • Then, the main monorepo was merged with the existing smaller Git repos via the ToMono utility.
  • The individual modules’ release branches were post-processed (merged) so that every MANTA major release was represented by a single release branch common for all modules.
  • The final post-processing consisted of several steps
    - Maven aggregation pom.xml was added to the root of every branch.
    - Jenkinsfile was added to the root of every branch.
    - CRLF settings in .gitattributes
    - One central .gitignore constructed
  • An additional test suite (T2) was triggered.
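
For a flavor of what the inputs look like (ours were far more elaborate, and our svn2git was customized), with the KDE svn2git (svn-all-fast-export) the author mapping, rules, and invocation might look like this; all names and paths are illustrative:

    # authors.txt maps SVN user names to Git identities, e.g.:
    #   jdoe = Jane Doe <jane.doe@example.com>

    # monorepo.rules routes each module's trunk into the develop branch:
    #   create repository monorepo
    #   end repository
    #
    #   match /([^/]+)/trunk/
    #     repository monorepo
    #     branch develop
    #     prefix \1/
    #   end match

    # Run the conversion; --add-metadata appends the original SVN revision
    # to each commit message (we customized this step further)
    svn-all-fast-export \
        --identity-map authors.txt \
        --rules monorepo.rules \
        --add-metadata \
        /path/to/svn/repo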

The Final Migration Weekend

After months of preparation and several dry-run migrations, we approached D-Day: the final migration weekend. I can’t stress enough the importance of the dry runs we undertook to fine-tune the migration script and the manual steps of the process. We tried the usual development scenarios with full integration with other systems such as Jenkins. We had to be sure everything would go smoothly and on time so that the migration could be completed over a single weekend.

Finally, everything was in place and ready to start:

  • Detailed migration plan with clearly described steps and responsibilities including a time schedule
  • Production environment with sufficient resources (HDD, CPU, RAM)
  • All systems and tools installed and pre-configured (Git, Gitea, SonarQube, Jenkins, Nexus, etc.)
  • Repositories prepared and user groups configured
  • Migration scripts ready and tested
  • Documentation ready to be published in Confluence
  • SVN backed up
  • Acceptance criteria written down
  • Jenkinsfile implemented and pom.xml changes prepared
  • Jenkins utility ready to ramp up all 150 Jenkins jobs pointing at the Git repo using sparse checkout
  • Informative emails and user instructions written in advance
  • Developers trained in advance

Despite all the preparations, we were under time pressure and made a few mistakes. The migration succeeded on the third try, and we finished literally 5 minutes before midnight.

Avoiding Trouble

Let’s briefly mention the challenges and risks we were facing, or those we were afraid of. Some of them were (luckily) avoided completely, and some were only partly mitigated. Some of the issues were clear from the very beginning; some pits were hidden deeper under the surface, waiting for us to fall into them. When undertaking a migration of your own, weigh these risks and prepare possible mitigations in advance.

Final Words

Migrating to Git was preceded by careful consideration of whether we really needed it and if we actually wanted it. We wanted to take advantage of the powerful scripting possibilities and Git’s ability to scale to support our fast-growing team. Apart from switching the versioning systems, we substantially changed our development process which proved even more challenging and time-consuming.

However, what helped us manage it was treating it as a regular software project, with all its requirements, phases, risks, people, resources, milestones, schedule, issue tracking, and testing.

In the end, the project took one year (twice as long as expected) to finish, and we spent 300 MD in total.

Finally, hundreds of SVN modules were transformed into twenty-five independent Git repositories, dominated by a multimodule monorepo hosting eight million lines of source code. We were able to preserve the history of commits, and we utilize an enhanced branching model. The developers can enjoy more granular units of work in the form of pull requests undergoing code review and regular continuous integration. The release process was centralized, and access rights to the repositories were simplified.

Pavel Mrázek
MANTA Engineering Blog

Engineering manager at MANTA supervising multiple teams, software developer, SW process improvement, hiring, mentoring TLs, lecturing.