TeamCity Upgrade

reachnow-tech
Published in REACH NOW · Oct 2, 2019

Introduction

Upgrading internal services can be as challenging as upgrading those used directly by customers. Many engineers depend on tools such as CI being highly available, making service degradation costly, especially when the company offers flexible working hours (as REACH NOW does) and/or spans time zones. This blog post follows our DevOps engineer Ben Tisdall’s daily routine as he upgraded our TeamCity CI running on AWS back in summer 2019. We hope this post will be of help to others performing similar migrations.

Background

We’ve been using TeamCity as our CI system since an evaluation of various options in mid-2016. While we were broadly happy with our TeamCity setup, there were several things that needed to change:

  • We hadn’t taken advantage of major version upgrades and were now two major versions off the pace. Not only were we missing out on new features, but our version was no longer officially covered by our support contract!
  • Many of the system components were not created using infrastructure code.
  • Since TeamCity’s introduction we had moved to a multi-account AWS model and this system was designated to be relocated.

Diary

Day 1

Today, I started by looking at the infrastructure code previously written by a colleague for this project. It was sound, but I decided to move the Elastic File System (EFS) code out into a separate template — this will promote faster testing cycles, cleaner and more comprehensive templates and, in unhappy cases, reduce blast radius.
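
To illustrate the split (the stack and template names here are hypothetical, not our actual ones), the EFS resources can then be deployed and iterated on independently of the TeamCity server stack:

#!/usr/bin/env bash
# Sketch: deploy the EFS resources as their own stack so the server stack can be
# torn down and rebuilt without touching the filesystem.
set -euo pipefail

# The EFS template contains only the filesystem and mount targets, and exports
# the filesystem ID for other stacks to import via Fn::ImportValue.
aws cloudformation deploy \
  --stack-name teamcity-efs \
  --template-file efs.yaml \
  --no-fail-on-empty-changeset

# The TeamCity server stack imports that export rather than owning the EFS.
aws cloudformation deploy \
  --stack-name teamcity-server \
  --template-file teamcity.yaml \
  --no-fail-on-empty-changeset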

Day 2

We now have an Amazon Machine Image (AMI) with the latest Amazon Linux 2 and the latest TeamCity build. I’ve also automated the database user setup via a VPC-bound Lambda that’s invoked by our CloudFormation wrapper script; it accepts the endpoint and other pertinent details via stack exports, together with secrets from Parameter Store, and uses this information to create the empty teamcity database and the “teamcity” user and password.
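
For illustration, here is roughly the equivalent of what that Lambda does, expressed as shell for brevity; the export name and Parameter Store paths below are invented, not our actual ones:

#!/usr/bin/env bash
# Rough shell equivalent of the DB-setup step; the export name and Parameter Store
# paths below are hypothetical.
set -euo pipefail

# RDS endpoint published by the database stack as a CloudFormation export.
DB_HOST=$(aws cloudformation list-exports \
  --query "Exports[?Name=='teamcity-db-endpoint'].Value" --output text)

# Master and application passwords held as encrypted Parameter Store entries.
MASTER_PW=$(aws ssm get-parameter --name /teamcity/db/master-password \
  --with-decryption --query Parameter.Value --output text)
TEAMCITY_PW=$(aws ssm get-parameter --name /teamcity/db/teamcity-password \
  --with-decryption --query Parameter.Value --output text)

# Create the "teamcity" role and its empty database.
PGPASSWORD="$MASTER_PW" psql -h "$DB_HOST" -U postgres -d postgres <<SQL
CREATE ROLE teamcity LOGIN PASSWORD '$TEAMCITY_PW';
CREATE DATABASE teamcity OWNER teamcity;
SQL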

Today I also created a new GitHub repo for the infrastructure code. Large, long-running feature branches are a bad smell and since we want to keep the old infrastructure running in parallel it makes sense to fork the code. Large teams probably don’t have the luxury of doing this!

Day 3

Today I worked on improving the code, with a fair bit of refactoring and the addition of some deployment wrapper scripts. I also started building a TeamCity v10.0.3 AMI for use in the new infrastructure context (the current live version not being suitable) so that we can easily test the upgrade and rollback process, and I started to think about the migration process itself.

Day 4

I’m thinking out loud here about the decision to go with VM-based deployment as opposed to running on ECS, which is our standard way of running workloads. Some background: I took this project over from a colleague who has since left the company. He had favoured ECS, but our TeamCity deployment uses EFS as persistent storage and it had been concluded that this was not a good fit for ECS. He had hoped to switch to EBS but decided this wasn’t feasible, perhaps because he wasn’t aware of the ECS Rexray integration, or because he had examined and rejected it. In any case, our current setup is reliable and performant, and since we’re not only upgrading two TeamCity major versions but also moving everything to a different AWS account in the same operation, not trying to change other things at the same time seems like a good idea.

Would we do it differently if we were starting now? Maybe, but we’d probably want to keep it on separate infrastructure anyway, so there wouldn’t be any savings from a dedicated ECS cluster versus TeamCity on a VM. At any rate we have three years’ experience running TeamCity on a VM, and once the migration is done, changing the platform while keeping the TeamCity version the same would be a far easier operation than what we need to do here.

Days 5 and 6

I spent these days familiarising myself with basic PostgreSQL administration.

Day 7

Today I’ve started to develop the migration process. The first phase involves spinning up a TeamCity 10.x environment, creating a configuration and then doing whatever’s necessary to migrate that to 2018.x. As I go along there will no doubt be some repetitive work that should be automated; as with any automation, this imposes some upfront cost but will pay off as we iterate on the migration steps.

Upon initialisation of a new database via the UI, TeamCity helpfully prints out some optimisation tips; these will be incorporated into the final RDS parameter group for the DB instance.
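
The exact tips depend on the TeamCity version, so the values below are placeholders rather than the real recommendations; the point is simply that such settings end up in a dedicated parameter group (in our case defined in CloudFormation, shown here via the CLI for brevity):

# Placeholder values only; the real settings come from the tips TeamCity prints.
aws rds create-db-parameter-group \
  --db-parameter-group-name teamcity-postgres \
  --db-parameter-group-family postgres10 \
  --description "PostgreSQL settings recommended by TeamCity"

aws rds modify-db-parameter-group \
  --db-parameter-group-name teamcity-postgres \
  --parameters "ParameterName=max_connections,ParameterValue=150,ApplyMethod=pending-reboot"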

Day 8

Today I started performing tests on database backup/restore and looking into filesystem backup/restore. Along the way, I created some infrastructure code to spin up a migration instance in the same VPC as the TeamCity server for faster connectivity during backup/restore operations.

At some point I noticed:

$> df -t nfs4
Filesystem 1K-blocks Used Available Use% Mounted on
eu-west-1c.fs-34926efd.efs.eu-west-1.amazonaws.com:/ 9007199254739968 5394827264 9007193859912704 1% /mnt/teamcity

Oops, we have over 5TB of data on EFS! This makes backing up and restoring quite a mountain to climb (it’s also very expensive compared to S3), let’s see what we might be able to prune…

Learning: the S3 artifact plugin we were using creates a duplicate store of artifacts in the specified bucket. This plugin will be disabled in the new environment and we’ll switch to the bundled S3 artifacts plugin which is able to replace the default artifact store within the TeamCity data directory.

Day 9

Yesterday, thanks to a colleague from an Android app squad, we got to the bottom of where most of the disk usage was occurring — it turns out that at some point Android builds were profiled and every build was saving a 1.3G .hprof file as an artifact! Profiling has been turned off now and some cleanup rules put in place. A find for hprof files is now running, the output of which will then be used to clean them up. We should give more thought to cleanup policies in future. Today I’ve been writing database backup and restore scripts.
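
For reference, the cleanup boils down to something like the following, assuming the data directory is mounted at /mnt/teamcity (as in the df output above):

# List every .hprof artifact with its size, summarise, then delete once reviewed.
find /mnt/teamcity -type f -name '*.hprof' -printf '%s\t%p\n' > /tmp/hprof-files.txt
awk -F'\t' '{ total += $1 } END { printf "%.1f GiB in %d files\n", total/2^30, NR }' /tmp/hprof-files.txt

# After sanity-checking the list:
cut -f2- /tmp/hprof-files.txt | xargs -r -d '\n' rm --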

Just for fun, here’s the upgrade screen we’ll face come the hour:

Day 10

Today I’m going to test the process of backing up the TeamCity EFS data.

File Backup/Restore

The most time-consuming part of the operation was the data backup and restore. Since the system must be quiesced while the backup runs, it also means downtime. A typical way to minimise downtime in this kind of situation is to use a file synchronisation tool such as rsync: do an initial synchronisation while the process(es) accessing the source filesystem are still running, then at the appointed time stop the processes and re-run the synchronisation using the tool’s option to delete files on the destination that no longer exist on the source.

The first approach to backup/restore was to use awscli’s aws s3 sync command to sync the filesystem up to a bucket and then, in a separate pass, sync down to the target filesystem:
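
The script itself isn’t reproduced here, but a minimal sketch of the idea (bucket name and paths hypothetical) looks something like this:

#!/usr/bin/env bash
# Sketch of sync-data-to-s3 (bucket name hypothetical). When DELETE=1 is set the
# run also removes objects in the bucket that no longer exist on the filesystem.
set -euo pipefail

SRC=/mnt/teamcity/                            # EFS mount point
DST=s3://example-teamcity-migration/data/     # hypothetical staging bucket

DELETE_FLAG=""
if [[ "${DELETE:-0}" == "1" ]]; then
  DELETE_FLAG="--delete"
fi

# --only-show-errors keeps the output manageable for multi-hour runs.
aws s3 sync "$SRC" "$DST" --only-show-errors $DELETE_FLAG

# The restore side is the same command with source and destination swapped,
# run from an instance in the target account.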

Here’s the output from the first attempt with the production filesystem:

time sudo -u teamcity DELETE=1 ./sync-data-to-s3
real    655m16.028s
user    304m32.466s
sys     165m40.343s

So, around 11 hours just to do the initial upwards sync. How about refreshing the backup?

real    387m11.638s
user    16m59.872s
sys     4m7.619s

So, it took over six hours to refresh, and the final refresh can’t be started until TeamCity is shut down. Then we need to consider the restore side: though I haven’t tested this part, it’s likely to take a comparable amount of time. In view of this I’ve decided to investigate other means of synchronisation. One piece of good news is that our syncing operations haven’t eaten into the EFS burst balance; here’s an interesting chart from CloudWatch:

The big red arrow indicates the start of a TeamCity cleanup operation that removed a lot of data related to Android projects. Since (in default mode) EFS accrues a burst balance in proportion to the metered size of the filesystem, that reduction in size explains the drop in burst credits. The yellow arrow coincides with the start of the first sync to S3, while the green one lines up with the refresh runs some hours later. The good news is that we never come within 50% of the IOLimit, the non-bursted performance of EFS in non-provisioned mode. Were we to go over the IOLimit and start eating into the burst balance, that might impact normal operation; but that said, we never seem to need it. Even with normal usage we see a not dissimilar pattern to the above:

Other backup/restore options

I looked into other solutions, none of which turned out to be fit for purpose:

AWS DataSync

This is a (mostly) managed service aimed at easing the process of copying on-premise data to S3 or EFS. It is strictly on-premise-to-cloud and can’t do EFS → EFS.

AWS Backup

This is AWS’ managed “Cloud Native” backup service. Restoring a backup creates a new resource with the data at the specified point in time, but it doesn’t work cross account.

EFS-EFS Copy

EFS-EFS Copy is an AWS “solution”, ie something you can build yourself by combining other AWS services using provided CloudFormation and other code. It does work cross-account but is unnecessarily complex for our use case, where we don’t care about scheduled backups (strip away all the orchestration and it’s rsync running on an EC2 instance; we already have these pieces available in code).

Using plain rsync to sync the two EFS filesystems proved pretty slow: around 12 hours to copy a quarter of the data (400G). Although this might be an acceptable transfer rate in some contexts, we want to proceed on the basis that:

  • The process is repeatable within time spans of days rather than weeks.
  • Refreshes are fast.

We therefore looked at what AWS uses in their EFS-EFS solution and saw that it makes use of the fpsync utility, which leverages the fpart command to partition the source file set and spawn an rsync process for each partition. We’re testing this now (while digging into fpsync a bit more I came across another interesting AWS resource).
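
For reference, a typical fpsync invocation between two mounted filesystems looks something like this (mount points hypothetical; the option values are only examples):

# fpsync drives fpart plus parallel rsync workers:
#   -n 8      run eight rsync jobs concurrently
#   -f 10000  at most 10000 files per job
#   -o "..."  options handed to each rsync worker
fpsync -n 8 -f 10000 -o "-a --numeric-ids" /mnt/teamcity-src/ /mnt/teamcity-dst/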

File Backup/Restore conclusions

Having chatted with AWS support today, they confirmed that the best approach would be to mount the source and target EFS on an EC2 instance and use rsync over a VPC peering connection. We’ll now proceed to set up peering between the VPCs and begin testing.

Unfortunately fpart with 8 processes did not produce an appreciable speed-up over a single rsync process; the problem here seems to be that we’re dealing with a very large number of files and the NFS overhead is the most significant factor. Next time round it would be interesting to use fpart’s option to execute each sharded process on a separate host, as this might be a way to exploit EFS’ performance capabilities, but our deadline was approaching and stakeholders were content to tolerate some overnight-into-morning downtime.

Another problem is that for the final filesystem sync with deletion of target objects, rsync --delete cannot be used by fpart, since each rsync worker would have a different idea of what can be deleted on the target filesystem. AWS’s EFS copy solution uses fpart to sync new files from the source, followed by a final, single-process rsync that includes the --delete --existing --ignore-existing flags. In our case it was just as fast to use a single rsync --delete in place of these two steps.
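
With both filesystems mounted on a single EC2 instance over the peering connection, the sync itself therefore reduces to something like this (mount points hypothetical):

# Bulk copy, repeatable while TeamCity is still running:
rsync -aH --numeric-ids /mnt/teamcity-src/ /mnt/teamcity-dst/

# Final pass after TeamCity is stopped; --delete removes anything on the target
# that has been deleted on the source since the previous run:
rsync -aH --numeric-ids --delete /mnt/teamcity-src/ /mnt/teamcity-dst/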

In the end we were certain a “refresh” would take over six hours so at a high level our plan became:

  1. 19:00: Stop TeamCity and initiate the final sync. Get a night’s rest.
  2. 06:00: Check that the sync completed and continue.

This plan required us to work with stakeholders to ensure there were no overnight CI jobs that would create customer impact if they didn’t run.

Database Backup/Restore

For this we kept things simple and used shell scripts based around pg_dump and pg_restore that pushed and pulled the backup to/from S3 respectively. The other obvious option would have been to create a manual RDS snapshot, share it with the target account and use it with the DBSnapshotIdentifier parameter in our CloudFormation template. The main problem with the second approach is that we would have had to defer the creation of the new RDS instance until after capturing the backup during the migration operation, whereas with a pg_dump backup we could create the instance ahead of time. Both the database dump and restore took only ~2 minutes to complete, so they added no significant time to the migration.
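
Stripped of error handling and credentials plumbing, the pair of scripts amounts to something like this (hostnames and bucket are hypothetical):

# Backup (source account): custom-format dump, pushed to S3.
# Passwords come from PGPASSWORD / .pgpass in the real scripts.
pg_dump -h "$SRC_DB_HOST" -U teamcity -Fc -f /tmp/teamcity.dump teamcity
aws s3 cp /tmp/teamcity.dump s3://example-teamcity-migration/db/teamcity.dump

# Restore (target account): pull the dump and load it into the new, empty database.
aws s3 cp s3://example-teamcity-migration/db/teamcity.dump /tmp/teamcity.dump
pg_restore -h "$DST_DB_HOST" -U teamcity -d teamcity --no-owner /tmp/teamcity.dump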

The Migration Operation

The cornerstones of a successful migration are in my opinion:

Have a detailed plan

You need a plan. It should include:

  • Guidelines for how the plan should be executed in general terms, eg the role of each participant and guidelines for handling the unexpected (eg no recovery steps to be taken unilaterally).
  • The roles should include a coordination role — one person whose job is to ensure that all participants are in sync and move the operation forward.
  • The expected duration of each step so that you can monitor progress as you go.
  • Success and failure criteria.
  • Rollback steps and the cutoff time at which they must be executed.

Publicise your plan

We inform stakeholders over Slack (eg our #developers channel) about operations that will create downtime but we also find it really effective to create a calendar event and invite our @devs group — this gives us a way of seeing who’s paying attention via acceptance of the “meeting” and provides the invitees with a convenient reminder.

Automate things (but know when to stop)

Operations of this type generally involve running commands from various machines within different networks/VPCs/cloud provider accounts. Given the cycle time required to test the overall orchestration it probably doesn’t make sense to automate everything unless you intend to repeat it many times, but you definitely want to script the steps. Scripts should of course be treated like any other code, ie be subject to your team’s usual development processes. Scripts should also take as close to zero arguments and options as possible; spare a thought for the humans who will probably be working a long stretch at an abnormal hour.

We used machines in different accounts/VPCs for different steps, which helped to reduce blast radius: for example, our file synchronisation script ran from within one AWS account and the startup script detected this and mounted the source EFS read-only. Our scripts mostly inferred any options from the AWS account context (aws sts get-caller-identity) and some had multiple safety mechanisms (eg refuse to restore a database backup to the live database, even where the networking wouldn’t have permitted it anyway). Machines for the purpose of running scripts were spawned via auto scaling groups and synced the scripts from S3 on startup (these having been synced from SCM during CI). The VM startup scripts set a different command prompt colour for each account, with the live (source) account/VPC machines using bold red.
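
As a sketch of those startup-time safeguards (the account IDs, filesystem DNS name and prompt details below are invented for illustration):

#!/usr/bin/env bash
# Sketch of the startup-time account check; account IDs and EFS DNS name are hypothetical.
set -euo pipefail

EFS_DNS_NAME="fs-12345678.efs.eu-west-1.amazonaws.com"   # placeholder
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

case "$ACCOUNT_ID" in
  111111111111)  # live (source) account: read-only mount, bold red prompt
    MOUNT_OPTS="ro"
    echo 'PS1="\[\e[1;31m\][SOURCE] \u@\h:\w\$ \[\e[0m\]"' > /etc/profile.d/prompt.sh
    ;;
  222222222222)  # target account: writable mount, green prompt
    MOUNT_OPTS="rw"
    echo 'PS1="\[\e[1;32m\][TARGET] \u@\h:\w\$ \[\e[0m\]"' > /etc/profile.d/prompt.sh
    ;;
  *)
    echo "Refusing to run in unknown account ${ACCOUNT_ID}" >&2
    exit 1
    ;;
esac

mount -t nfs4 -o "${MOUNT_OPTS},nfsvers=4.1" "${EFS_DNS_NAME}:/" /mnt/teamcity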

Be conservative about the operation duration

Things have a habit of taking a bit longer than expected and in an operation with many changes snags are inevitable. Provide a conservative estimate of the time you’ll hand the resource back to users and hope to exceed expectations. We’d promised our developers they’d have TeamCity back at midday and were able to hand it over at 11:00 despite a few snags along the way.
