One massive TFS to VSTS migration

From TFS2015.2 to VSTS in one weekend.

This Easter, Jasper and I completed one massive TFS to VSTS migration using the TFS Database Import Service. The migration was more complicated than usual because the environment was far from up to date and the upload speed was low, to say the least.

The legacy

TFS 2015.2.1 on Windows Server 2008 R2 SP1

SQL2012 on Windows Server 2008

As you can see, the setup is quite old and requires some serious upgrades before it is ready for import. Our ultimate goal was to migrate to VSTS with the least possible impact on end users, meaning no intermediate upgrade of TFS and no multi-day downtime during workdays.

Route map into VSTS

Because of these limitations we drew up a map beforehand of all the different routes we could take to move the data into the cloud. We discussed the pros and cons of each route with the team to come to a decision.

Due to the database size, the network capacity and the desire to move over in only one weekend, we ended up with only one feasible route. But first, let me show you what made this migration so special.

The numbers

One Team Project Collection database of a whopping ~500GB (look here for size details and query) containing:

  • 78 Team Projects
  • 730 Teams
  • 896502 work items
  • 1700+ vNext build definitions
  • 2600+ XAML definitions
  • 900+ vNext release definitions

On a flaky internet connection with a network speed varying between 10 Mbps and 50 Mbps, a quick calculation reveals that it takes at least 22 hours to copy that beast into the cloud. So we came up with a route to cram all the hard work into only one (somewhat longer) weekend.
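The quick calculation can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch that assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second), a sustained link speed, and no compression or protocol overhead; the function name is made up for illustration.

```python
def transfer_hours(size_gb: float, speed_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes at a sustained speed_mbps."""
    bits = size_gb * 1e9 * 8            # gigabytes -> bits
    seconds = bits / (speed_mbps * 1e6)  # megabits per second -> bits per second
    return seconds / 3600


best_case = transfer_hours(500, 50)   # fastest speed we observed
worst_case = transfer_hours(500, 10)  # slowest speed we observed
print(f"{best_case:.1f} h at 50 Mbps, {worst_case:.1f} h at 10 Mbps")
```

The best case lands just above 22 hours, which is where the "at least" comes from. At the slow end of the range the same copy would take well over four days.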

The vote we held with the team on the pros and cons of each route

The route

To minimize downtime and maximize throughput we came up with the following strategy:

  1. Prepare an Azure VM with TFS2017.3/TFS2018 and SQL2014/SQL2016. We used this VM for several upgrade tests and dryrun imports before starting the actual import.
  2. We took a full backup on Monday to copy the bulk of the data into the cloud, followed by an incremental backup on Thursday to copy the changes made since. With this approach we managed to reduce the actual wait time (i.e. downtime) to only a couple of hours instead of multiple days.
  3. On that VM we configured an Application Tier to upgrade the database and ran some pre-migration scripts to prepare the import. We needed to stop the identity sync job as soon as possible to prevent losing identities!
  4. From there on we queued the import and waited for Microsoft to pull our data into VSTS. In our case it took about 10h for the dry-run imports to complete and about 13h for the actual production import. I have no idea where that difference comes from. Maybe it has something to do with business hours or replication.
  5. As soon as the account was available we started executing a handful of post-migration scripts that we had crafted upfront to recreate artifacts that are not migrated by the import and to repair items that work differently (the price you pay for postponing updates).
  6. We hooked up the old agents to bring the build and release pipelines back to life. We chose to do an as-is migration first and prove that what worked before still works. From there on we can introduce newly available features and improve on existing parts.
Picture we used to communicate progress and lead time

The tools

For the copy we used robocopy to get a reliable transfer over an unreliable channel.
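As an illustration of what such a robocopy call can look like, here is a small Python sketch that builds the command line. The switches are standard robocopy options (`/Z` restartable mode, `/R` retry count, `/W` wait between retries, `/NP` no progress percentage). The paths, the `*.sqb` file pattern and the retry values are assumptions for the example, not the settings we actually used.

```python
import subprocess


def robocopy_command(source_dir, target_dir, file_pattern="*.sqb",
                     retries=100, wait_seconds=30):
    """Build a robocopy argument list suited to a flaky connection."""
    return [
        "robocopy", source_dir, target_dir, file_pattern,
        "/Z",                  # restartable mode: resume a partly copied file
        f"/R:{retries}",       # number of retries on a failed copy
        f"/W:{wait_seconds}",  # seconds to wait between retries
        "/NP",                 # keep the log free of progress percentages
    ]


def copy_succeeded(exit_code):
    """Robocopy exit codes below 8 mean no file failed to copy."""
    return exit_code < 8


def run_copy(source_dir, target_dir):
    """Run robocopy (Windows only) and report whether the copy succeeded."""
    result = subprocess.run(robocopy_command(source_dir, target_dir))
    return copy_succeeded(result.returncode)
```

Note that robocopy does not follow the usual "zero means success" convention, so a wrapper script should compare the exit code against 8 instead of checking for 0.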

Backups were created and restored using Redgate SQL Backup Pro. This was beneficial because it a) is the tool used by the DBA team, b) chopped some serious megabytes from the backup thanks to better compression and c) was able to create encrypted backups (which helps in a highly-regulated environment).

The DBA team had licenses, but we did not. We probably could have used the licensing mechanism to temporarily activate the server and deactivate it later, but we got away with the (extended) trial period.

PowerShell. Lots of it. We crafted some handy-dandy scripts to automate most of the migration and to repair or recreate things. I will dedicate a separate story to that.

The caveats

From the dry-run imports we learned about several caveats, most of which we were able to solve with a bit of scripting. With such a large user base there will always be the unexpected, but we got almost everything smoothed out during the migration weekend.

Identities

When restoring an Application Tier that was domain-joined to a disconnected VM in Azure, you are practically combining an environment-based move with a hardware-based move, which is normally not recommended.

The danger is that you start losing identities as soon as the scheduled sync job kicks in. Luckily this job can be disabled, and we altered that script a little to stop the job as soon as the services are online.

Agent pools

Agent pools and queues are not imported (yet?). Therefore all your definitions are bound to a non-existing queue after migration. You can fix this by hand, but with this many definitions you probably want to script it. We did that by exporting all pools and queues from TFS, recreating them in VSTS and updating all definitions by mapping the old queue id (which is luckily still present on the imported definition).
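The mapping step can be sketched as follows. This is not the script we ran, just a minimal Python illustration that works on plain dictionaries. It assumes queues are matched by name and that a definition carries its queue as `definition["queue"]["id"]`; the function names and the sample data are hypothetical.

```python
def build_queue_id_map(old_queues, new_queues):
    """Map old TFS queue ids to new VSTS queue ids by matching queue names."""
    new_id_by_name = {queue["name"]: queue["id"] for queue in new_queues}
    return {
        queue["id"]: new_id_by_name[queue["name"]]
        for queue in old_queues
        if queue["name"] in new_id_by_name
    }


def remap_definition_queue(definition, queue_id_map):
    """Point a definition at its recreated queue; return True if it changed."""
    old_id = definition["queue"]["id"]
    if old_id not in queue_id_map:
        return False  # leave untouched so it can be reviewed by hand
    definition["queue"]["id"] = queue_id_map[old_id]
    return True


# Hypothetical sample data: queues exported from TFS and recreated in VSTS.
old_queues = [{"id": 3, "name": "Default"}, {"id": 7, "name": "Deploy"}]
new_queues = [{"id": 41, "name": "Default"}, {"id": 42, "name": "Deploy"}]
queue_id_map = build_queue_id_map(old_queues, new_queues)

definition = {"name": "Web-CI", "queue": {"id": 7}}
changed = remap_definition_queue(definition, queue_id_map)
```

Definitions whose old queue id has no counterpart are skipped rather than guessed at, which makes the leftovers easy to list after a batch run.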

Build artifacts

One problem we ran into was release definitions that used the build definition name variable to construct the path to build drop artifacts. Formerly that name was the original build definition name, which would always be the same as the build source alias. Nowadays the variable reflects the actual build definition name, which can differ from the build source alias. Therefore we came up with a script to synchronize these two again.
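Finding the affected release definitions is the first half of that job. The Python sketch below only detects mismatches and leaves the repair open. It assumes a release definition is available as a dictionary in which each build artifact has an `alias` and a `definitionReference.definition.name`; that shape and the function name are assumptions for illustration.

```python
def find_alias_mismatches(release_definition):
    """Return (alias, build definition name) pairs that are out of sync."""
    mismatches = []
    for artifact in release_definition.get("artifacts", []):
        if artifact.get("type") != "Build":
            continue  # only build artifacts have a build definition name
        alias = artifact["alias"]
        name = artifact["definitionReference"]["definition"]["name"]
        if alias != name:
            mismatches.append((alias, name))
    return mismatches


# Hypothetical release definition with one renamed build definition.
release_definition = {
    "name": "Web-Release",
    "artifacts": [
        {"type": "Build", "alias": "Web-CI",
         "definitionReference": {"definition": {"name": "Web-CI"}}},
        {"type": "Build", "alias": "Api-CI",
         "definitionReference": {"definition": {"name": "Api-CI-v2"}}},
    ],
}
mismatches = find_alias_mismatches(release_definition)
```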

Tasks

We had several tasks that required reconfiguration or needed to be disabled. For example, the old version of the npm task had the command and arguments in a single input field, whereas the lowest version available on VSTS required them to be split. One test task was not migrated and had a different version after installation, so we needed to update all build definitions to match that version number. For that we had scripts to batch update all definitions.
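For the npm task, the core of such a batch update is splitting the combined field. Here is a minimal Python sketch that treats the first word as the command and everything after it as the arguments. That rule is an assumption, as is the function name; a real script would also have to write the two values back into the task inputs of each definition.

```python
def split_npm_command(combined):
    """Split 'install --production' into ('install', '--production')."""
    parts = combined.strip().split(None, 1)  # split on the first whitespace run
    command = parts[0] if parts else ""
    arguments = parts[1] if len(parts) > 1 else ""
    return command, arguments


command, arguments = split_npm_command("install --production")
```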

Git repository as private npm registry

Due to the absence of package management, several teams used a git repository as a private registry. We got this working during the dry run in VSTS but ran into difficulties with the production import. Ultimately we brought in package management as the solution, but I would rather have kept this outside the migration.

The scripts we created will be shared in a (near) future post.

The wrap up

This strategy for migrating one gigantic and heavily used TFS collection database worked for us. Beware of the caveats, start scripting in advance and plan for a few days of downtime to smooth things out. Make options explicit by drawing pictures and make decisions collectively.

This is our story. Shout out if it was helpful to you, or if you want to know more about it.