Subversion to Git migration

At one company, we had to migrate a huge Subversion repository to Git, splitting it into multiple Git repositories. I want to share the reasons, the approach to the migration, and the lessons learned from it.

Reasons to move

  • A finer logical and physical boundary was needed to give each project a better code structure (less spaghetti).
  • Subversion was treated as a place to save files, and its features were barely used. As a result, the repository had drifted from an organized project repository into something closer to a personal filesystem, and it was easy to get lost in it. This was possibly due to software entropy (the broken windows theory?). Needless to say, the need for documentation outside the code increased.
  • In a large team, with multiple developers touching the same parts of the codebase, committing can be a hassle if it is all done on trunk. The fear of overwriting someone else’s changes was real.
  • The giant repository made it hard to write testable code, to write tests, and to automate test runs. Since it was easy to include dependencies via filesystem paths, modularity started to decline. Making tests easy to write required restoring that modularity.
  • Automated testing and deployment required a more robust workflow and a more flexible system, which Git could provide through its easy branching.
  • Changes that piled up while waiting for the current release caused conflicts. If working in Subversion was producing those conflicts, then in theory a decent Git workflow should help.
  • There was minimal code review and a lot of silos. Although a few developers worked on each project, communication about the code and its possible issues was minimal because of poor code visibility. In theory, pull requests, a standard feature of Git hosting services such as GitHub and Bitbucket, should address this.

The task

  • Move a giant repository with about 20 top-level directories into multiple Git projects by first identifying the projects hidden within these directories.
  • Some directory and file movement was needed due to inter-dependencies. Hence, projects required picking and choosing from various directories in the SVN repository.
  • The boundaries between projects were unclear and needed to be defined.
  • Preserve the history, as much as possible.

The process of the migration

  • Identify the projects in the large Subversion repository and, for each one, its participants: developers and watchers.
  • Create empty repositories on the Git server for each of those identified projects, followed by adding the watchers and developers to the settings and permissions of each new Git repository.
  • Freeze the Subversion repository, so no new changes can make their way in.
  • Collect the names and emails of all authors who contributed to the Subversion repository. In our case, a small script pulled that information from our LDAP server into a file called authors.txt. git-svn expects one line per Subversion username in this file, in the form username = Full Name <email>. We decided to keep all of the history.
  • Feed the list to git-svn. This may take some time, depending on the size of your Subversion repository. We had to wait a couple of hours because, well, history.
git svn clone --authors-file=authors.txt https://svn.yourserver.com/svn-repo git-repo

This command creates one giant Git repository, a replica of the large Subversion repository — a Monolith.
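
We built our authors.txt from LDAP. Where no directory is available, a rough alternative is to derive the usernames from the output of svn log --quiet. The sketch below assumes that alternative and is not the script we used; the function name svn_log_to_authors and the example.com domain are hypothetical, and the generated names and addresses are placeholders that need to be corrected by hand before cloning.

```shell
#!/bin/sh
# Sketch: convert `svn log --quiet` output (read from stdin) into
# git-svn authors-file lines of the form: username = Full Name <email>
# The function name and the default domain are illustrative assumptions.
svn_log_to_authors() {
  domain="${1:-example.com}"
  # Revision lines look like:
  #   r42 | alice | 2015-03-01 10:00:00 +0000 (Sun, 01 Mar 2015)
  # Separator lines (dashes) are skipped by the pattern.
  awk -F ' [|] ' '/^r[0-9]+ [|] / { print $2 }' |
    sort -u |
    while IFS= read -r user; do
      # Placeholder name and email: replace with real values.
      printf '%s = %s <%s@%s>\n' "$user" "$user" "$user" "$domain"
    done
}

# Usage (assumes an svn client and a reachable repository):
#   svn log --quiet https://svn.yourserver.com/svn-repo |
#     svn_log_to_authors yourcompany.com > authors.txt
```

Each username appears once in the output no matter how many revisions it has, which is what git-svn needs for its lookup.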

  • Slice the Monolith into projects, giving each one its own source code structure where needed. The master branch already has all the files and directories needed. Take advantage of Git branches: create a branch in the Monolith for each new project repository, and later map each branch to a remote of the same name. The empty remotes already exist.
git branch project1
git branch project2
# ...
git branch projectN
  • Now that you have branches, match them to their proper remotes. For consistency, we gave each remote and each branch the same name as its project repository, for example project1, project2 … projectN.
git remote add project1 git@server://link.to/project1.git
git remote add project2 git@server://link.to/project2.git
# ...
git remote add projectN git@server://link.to/projectN.git
  • In each of those branches, remove everything the project does not need and rearrange what remains into the new code structure.
git checkout project1
# ...make changes to the branch, e.g. git mv, git rm...
git add -A
git commit -m "setup project repo for project1"
  • Once committed, push that branch to its corresponding remote repository, to the master branch.
git push project1 project1:master
git push project2 project2:master
# ...
git push projectN projectN:master

Note that the first projectN is the remote name, and the projectN in projectN:master is the local branch in the Monolith; master is the branch it becomes on the remote. The branch does not have to be checked out for the push to work.
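
With many projects, the branch, remote, and push commands above become repetitive. A minimal sketch of how they could be wrapped follows, assuming every remote lives under one base URL and is named after its project. The function names setup_project and push_project and the base-URL argument are hypothetical, not part of our original process.

```shell
#!/bin/sh
# Sketch: run these from inside the Monolith clone.
# Function names and the base-URL argument are illustrative assumptions.

# Create a branch and a same-named remote for one project.
setup_project() {
  base="$1"
  project="$2"
  git branch "$project" &&
    git remote add "$project" "$base/$project.git"
}

# Push the trimmed project branch to master on its own remote.
push_project() {
  project="$1"
  git push --quiet "$project" "$project:master"
}

# Usage:
#   for p in project1 project2; do setup_project "<base-url>" "$p"; done
#   ...check out each branch, trim it, and commit...
#   for p in project1 project2; do push_project "$p"; done
```

Keeping setup and push as separate steps leaves room for the manual trimming of each branch in between, which is the part that cannot be scripted generically.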

  • New repositories have been created!

Lessons

  • Remember that Git has a learning curve. Make sure that team members who have never used Git before gain a decent understanding of it. Allow them time and provide support. No one will understand Git until they have built some muscle memory for it. There will be a few Gitastrophes (term not coined by me) before everyone is on the same page.
  • Pay close attention to team members who, for whatever reason, have not used Subversion either and have a different understanding of what a VCS is. They need extra attention and explanations, not only of Git but of version control in general. Mitigate these issues with recurring training sessions, tips-and-tricks emails with bite-sized posts, internal/external documentation, and direct support from members who know Git.
  • Git does not record file moves explicitly; it detects them by comparing file contents. If a plain git log stops at the point where a file was moved, use the --follow switch (git log --follow <file>) to continue the history of that single file across renames using those heuristics.
  • If the team uses Linux as well as Windows, you will run into line-ending issues and may need a .gitattributes file with the relevant settings. Adding the following worked for us:
* text=auto
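
To see how a .gitattributes rule will be applied before committing it, git check-attr reports the attributes Git resolves for a given path. The sketch below writes the rule shown above plus two extra rules that are only illustrative assumptions (shell scripts forced to LF, PNG files treated as binary); the helper names write_gitattributes and attr_of are hypothetical.

```shell
#!/bin/sh
# Sketch: run in the root of a Git working tree.
# Only the first rule comes from our setup; the other two are examples.
write_gitattributes() {
  cat > .gitattributes <<'EOF'
* text=auto
*.sh text eol=lf
*.png binary
EOF
}

# Print the resolved value of one attribute for one path.
# git check-attr prints "<path>: <attribute>: <value>"; keep the value only.
attr_of() {
  git check-attr "$1" -- "$2" | sed 's/.*: //'
}

# Usage:
#   write_gitattributes
#   attr_of text README.md
#   attr_of eol build.sh
#   attr_of text logo.png
```

The paths passed to attr_of do not need to exist, so rules can be checked against file names the team expects to add later.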
  • A single workflow will not suit every project, but keep at least some parts consistent. For example, we decided that pull requests had to be part of the workflow for every project.
  • Git gives more freedom to work with Continuous Integration tools like Jenkins and TeamCity. A CI build that runs on every push, for every branch, and tests the code continuously is worth its weight in gold. It gives developers early warning of possible failures and errors.
  • It is easier to focus on tests once each project is more modular and has its own repository. Of course, there has to be some extra effort in this direction. However, this facilitation of test-writing goes hand-in-hand with the Continuous Integration. It is a double win!
  • Pull requests provide more code visibility. They are not a full-fledged code review, certainly not for every pull request, but they are a great way to start.
  • Try to stay away from Git submodules. They create overhead, in my opinion. Try other methods of dependency management.
  • If your project’s large history bothers you, use a shallow clone. Shallow cloning was greatly improved in Git 1.9 and later.
git clone --depth <depth> <remote-url>
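
A shallow clone can later be turned into a full one, so picking a small depth is not a one-way decision. A minimal sketch follows, assuming a Git version that supports git fetch --unshallow and the -C option; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: clone only the most recent commits, then deepen later if needed.
# Function names are illustrative assumptions.

shallow_clone() {
  depth="$1"
  url="$2"
  dir="$3"
  # --depth is ignored for plain local paths; pass a URL
  # (file://, ssh://, https://) instead.
  git clone --quiet --depth "$depth" "$url" "$dir"
}

# Number of commits reachable from HEAD in the given clone.
commit_count() {
  git -C "$1" rev-list --count HEAD
}

# Fetch the rest of the history into an existing shallow clone.
unshallow() {
  git -C "$1" fetch --quiet --unshallow
}

# Usage:
#   shallow_clone 1 <remote-url> project1
#   commit_count project1
#   unshallow project1
```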

Thanks for reading this post.


Originally published at github.com.