At Strava, each of our major platforms (Web, Android, iOS) is known as a Guild. Each Guild likes to keep its platform as up to date as possible so that we get all the new security and performance fixes along with some of the latest features. This year the Web Guild decided to upgrade our monolith from Rails 4.2 to Rails 5.0. Generally, every engineer spends some amount of time doing this kind of platform work, but for larger projects, we staff at least one full-time engineer (which is what we did for the Rails 4 upgrade as well). The roulette wheel was spun and I was the lucky winner selected to undertake this task. In this blog post, I would like to take you through how we successfully upgraded our platform from Rails 4.2 to Rails 5.0.
Stage 1 — Sulking with worrisome hair chewing
I have always been a product engineer. I had been at Strava for about four months shipping new features and bug fixes every week. My world fell apart when I was put on a project that was not customer facing but had the potential to be highly impactful. Was it challenging? Yes. Was I excited? Absolutely not! I was relatively new to the Strava product and the Rails framework. I had only done small features and fixed bugs here and there in Ruby until then so it wouldn’t be an exaggeration to say that I was a blank paper coming in.
I was confused. I knew what I had to achieve, but I had no idea where to start or what problems needed to be resolved to get there. I was not equipped to estimate the time it would take to do the upgrade, which was scary. After a simple change to bump the Rails gem version in our Gemfile, the bundle install command was failing, all the specs were failing, the server was not booting, and I was chewing my hair over not having a plan. The strategy was to have a strategy, and the immediate goal was to break down the project into achievable pieces with clear outcomes. The real question was: where do I even start?
Stage 2 — Identify help, short term goals and success metric
First things first, I cut my hair short. When you are floating in the sea in every which way without knowing the direction you need to sail towards, the first thing you need is an anchor, just to stop, reflect and determine the direction. At Strava, there are many engineers who are not only great at what they do but are compassionate and very helpful. My first action was to get help. One of the distinguished engineers at Strava, Pan Thomakos, who is popular for setting up the sail on unwieldy projects like this one, came to the rescue. He has been the driving force behind the upgrade ever since. Now that I had a person to reflect my ideas on and get direction, I could start sailing.
We began with a separate Git branch with the Gemfile change for the upgrade. We defined the set of things we needed to accomplish to succeed:
- Resolve all of our Gem dependency conflicts.
- Boot Rails (ensure all initializers run).
- Go through the list of changes in the Rails upgrade guide and investigate and resolve any issues that could be problematic for our system.
- Make the builds green again: fix the unit tests and the QA failures.
- Sanity check the website via manual exploratory testing.
- Merge, Deploy and Breathe.
Okay, now we are getting somewhere. We had a finite list but still a long way to go to achieve each of these tasks. One thing I would like to point out here is that although we had a higher level objective for each task, we did not know how much work each task was going to involve. Despite having a set of achievable goals, we still could not estimate the time it would take to be at a place where we would be ready to roll out the upgrade.
The most important success metric for us was a smooth release. At Strava, we strive for high availability with minimal, and preferably no, interruption for our athletes. But this was a project that required code changes to hundreds of files and impacted every feature in the application. So one or more controlled deploys with a minimum number of rollbacks was the best we hoped for.
Stage 3 — Define a strategy
Our strategy was simple: keep the Rails 5 Git branch changes as small as possible. For any task, determine whether the code changes were backwards compatible with Rails 4.2 and, if so, make the changes there instead. The biggest advantage of this approach was that we were able to release the changes incrementally ahead of time. This reduced the risk of site outages, which would have been very high had we released thousands of lines of code changes on a single day.
Stage 4 — Get down to business
The first step was to install the gems successfully. We identified the gems that had dependency conflicts with Rails 5 and began addressing them. Most of those were relatively easy to resolve: they were either internal libraries that needed to be upgraded to support Rails 5 or gems that we could do without. For the latter, we had to adjust code to remove the unnecessary gems. However, the changes to resolve two gem dependency issues deserve a special mention.
RGeo Spatial Adapter
We had been using the MySql2Spatial ActiveRecord adapter Gem to represent MySQL geo-spatial data in Ruby. With this upgrade, we had to switch to the MySQL2Rgeo adapter. While we were able to do this relatively easily, one catch was that the new Gem required some functions that are only available in MySQL 5.6 or higher. Although we run MySQL 5.6+ in production, almost all the developers at Strava were running MySQL 5.5 locally. Thankfully, we have productivity engineers at Strava who took on the strenuous task of writing and iterating on a script to make this local upgrade smooth for all engineers.
About three years ago, we created our very own library, TxUtils, that kept track of database transactions and allowed dynamic registration of code blocks to be executed after the current transaction commits. If there were no open transactions, the code block would just execute inline. If the transaction rolled back, the code would never run.
Sound familiar? Rails introduced after_commit and the other callback hooks in Rails 3.0, and Strava was on an older Rails version when this library was created. All these years, we had been upgrading this library to support each new Rails version. Well, upgrades are a great time to address technical debt. We wanted to take this opportunity to retire this library and use the built-in Rails after_commit callback hook instead.
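To make those semantics concrete, here is a toy, plain-Ruby model of "run this block after the current transaction commits". It is only a sketch: TxTracker and its methods are hypothetical names for illustration, not our library's API and not how Rails implements after_commit. It ignores edge cases such as an exception that is rescued inside an outer transaction.

```ruby
# Toy model of deferred "after commit" callbacks.
# TxTracker is a hypothetical name, not a real library or Rails API.
class TxTracker
  def initialize
    @depth = 0     # number of currently open (possibly nested) transactions
    @pending = []  # blocks waiting for the outermost transaction to commit
  end

  # Runs the given block inside a (pretend) transaction.
  # An exception counts as a rollback and discards pending callbacks.
  def transaction
    @depth += 1
    begin
      yield
    rescue StandardError
      @depth -= 1
      @pending.clear if @depth.zero?  # rolled back: callbacks never run
      raise
    end
    @depth -= 1
    run_pending if @depth.zero?       # outermost commit: fire callbacks
  end

  # Registers a block to run after commit, or runs it inline
  # when no transaction is open.
  def after_commit(&block)
    if @depth.zero?
      block.call
    else
      @pending << block
    end
  end

  private

  def run_pending
    callbacks = @pending
    @pending = []
    callbacks.each(&:call)
  end
end

tx = TxTracker.new
events = []

tx.after_commit { events << :inline }        # no open transaction: runs now

tx.transaction do
  tx.after_commit { events << :committed }   # deferred until commit
  events << :work
end

begin
  tx.transaction do
    tx.after_commit { events << :never }     # dropped by the rollback
    raise "boom"
  end
rescue RuntimeError
  # the rollback is expected in this example
end
```

The three calls cover the three cases described above: inline execution, execution after commit, and no execution after a rollback.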
Easier said than done. There were about 50 to 55 places in the entire code base that referenced this library. The Rails after_commit hook is tied to ActiveRecord, which makes it easy to track which transaction the current thread is operating on. The TxUtils library provided the same functionality but in a more generic way, so it was not as easy as moving these calls to after_commit hooks. We had to reverse engineer the transaction that TxUtils was using in each case and change the code to keep the behavior intact. We solved this by writing a monkey patch which logged the transaction (along with a stack trace) to our centralized Logstash system. This helped us with tracking, addressing and verifying our fixes.
Even then, there were places which were too complex to make a straightforward migration feasible. For example, there is a set of rules in our codebase that issue updates to our challenge leaderboard system whenever an activity is created, updated, or destroyed. The existing code was very complex and had actually been causing data inconsistencies. We had two engineers who would spend some time every week running a script to repair the inaccurate data. If we could take the opportunity to refactor this code while removing TxUtils references, we could also resolve the data inconsistencies and make our system more reliable and efficient.
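A logging patch of this kind could look roughly like the sketch below. Everything here is an assumption made for illustration: the TxUtils.after_commit signature is a stand-in (the real API is not shown in this post), TxUtilsCallLogging is a made-up name, and a plain Ruby Logger stands in for the Logstash pipeline. This stub has no transaction to record, so it only logs the call site.

```ruby
require "logger"
require "stringio"

# Stand-in for the real library. Its actual API is not shown in the post;
# this stub simply runs the block so the example is self-contained.
module TxUtils
  def self.after_commit(&block)
    block.call
  end
end

# Hypothetical patch: record where each callback was registered from,
# then delegate to the original implementation.
module TxUtilsCallLogging
  class << self
    attr_accessor :logger
  end

  def after_commit(&block)
    TxUtilsCallLogging.logger&.info(
      "TxUtils.after_commit called from:\n" + caller.first(5).join("\n")
    )
    super(&block)
  end
end

# Prepending to the singleton class wraps the module-level method.
TxUtils.singleton_class.prepend(TxUtilsCallLogging)

output = StringIO.new                       # stand-in for a log shipper
TxUtilsCallLogging.logger = Logger.new(output)
TxUtils.after_commit { :some_side_effect }
puts output.string
```

Using prepend rather than reopening the method keeps the original implementation untouched, so the patch can be removed without side effects once the call sites are migrated.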
We moved the challenge leaderboard recalculation logic to an idempotent background job, which also allowed us to remove 700 lines of code. Given how much of my own time was being consumed by refactoring the challenge leaderboard, I needed help with completing the remaining TxUtils retirement work. We had already recognized some common patterns for solving the TxUtils cases and started delegating some work out to other Web Guild engineers.
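The value of idempotency here is that the job can be retried or run twice without corrupting the leaderboard. A minimal sketch of the idea, assuming a job that recomputes totals from the activities themselves instead of applying incremental updates (Activity, RecalculateLeaderboardJob and the field names are hypothetical, not our actual code):

```ruby
# Hypothetical record type used only for this illustration.
Activity = Struct.new(:athlete_id, :distance_km)

# Idempotent by construction: the result depends only on the current
# activities, never on what the leaderboard held before the run.
class RecalculateLeaderboardJob
  # activities:  current source-of-truth records for one challenge
  # leaderboard: Hash of athlete_id => total distance, overwritten in place
  def perform(activities, leaderboard)
    totals = activities
             .group_by(&:athlete_id)
             .transform_values { |list| list.sum(&:distance_km) }
    leaderboard.replace(totals)
  end
end

activities = [
  Activity.new(1, 10.0),
  Activity.new(1, 5.0),
  Activity.new(2, 7.5)
]
leaderboard = { 1 => 999.0 }  # stale, inconsistent data

job = RecalculateLeaderboardJob.new
job.perform(activities, leaderboard)
job.perform(activities, leaderboard)  # a retry leaves the result unchanged
```

An incremental version ("add 5 km to athlete 1") would double count on a retry, which is the kind of inconsistency a full recalculation avoids.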
Now that we had a plan and resources in action to resolve the Gem dependency conflicts, I could move on to the other upgrade changes that were documented nicely in the official Rails Upgrade guide. It was just a matter of going through the list, investigating and resolving the issues.
We didn’t run into any significant issues, so once the Rails Upgrade Guide issues were resolved, we were able to successfully boot a Rails server and load the console. We did some spot checking to see if Strava’s basic functionality worked without issues and the tests were positive. At this point, we were seeing the light at the end of the tunnel and were pretty confident about the changes.
Stage 5 — Make the builds green again
We were in a good place. There was a finite set of failures that we needed to fix and the chance of surprises was relatively low. There were just a few things left to do: fix the failing specs (about 10% of our specs were failing), add any missing coverage to the QA test suite, and test the application manually. Although it seemed like a lot at that point, we were able to bring in all the Web Guild engineers to help with this task.
I. Fix the specs
It was time to come together as a team to resolve all of these spec failures. As chaotic as it was, we managed to work together and fix these efficiently. There were hundreds of spec failures. Many of those were either related (single fix addressed multiple spec failures) or they were failing for a similar reason in different places. So it was just a matter of coordination and helping each other out via code reviews to fix the issues. Except for a few failures, the majority of them could be categorized into two groups:
Failures due to controller spec changes
Rails 5 controller tests expect keyword arguments for all of the HTTP helper methods, as opposed to the positional arguments that the Rails 4 ones use (Rails 5.0 deprecates the positional form). This was a little worrisome because we were trying to keep our Rails 5.0 branch as small as possible and this change was not backwards compatible at all. Thanks to the blog by AppFolio Engineering, with the combination of their Gem and a couple of RuboCop rules, we were able to make the master Git branch (Rails 4) enforce keyword arguments in the controller specs.
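In spec terms, the change is from get :show, id: 1 (Rails 4) to get :show, params: { id: 1 } (Rails 5). The sketch below illustrates the general idea of a forward-compatibility shim that lets old-style helpers accept the new call shape. It is a plain-Ruby toy under stated assumptions: LegacyTestCase and ForwardCompatibleRequests are hypothetical names, and this is not the AppFolio gem's implementation.

```ruby
# Stand-in for a Rails 4 style test helper with positional arguments.
class LegacyTestCase
  def get(action, params = {}, session = {}, flash = {})
    { action: action, params: params, session: session, flash: flash }
  end
end

# Hypothetical shim: if the single hash argument only uses the new-style
# keys, unpack it into the old positional form. Otherwise pass it through.
module ForwardCompatibleRequests
  NEW_STYLE_KEYS = %i[params session flash].freeze

  def get(action, *args)
    opts = args.first
    new_style = args.size == 1 && opts.is_a?(Hash) && !opts.empty? &&
                (opts.keys - NEW_STYLE_KEYS).empty?
    if new_style
      super(action,
            opts.fetch(:params, {}),
            opts.fetch(:session, {}),
            opts.fetch(:flash, {}))
    else
      super  # old positional call, forwarded unchanged
    end
  end
end

class LegacyTestCase
  prepend ForwardCompatibleRequests
end

test_case = LegacyTestCase.new
old_style = test_case.get(:show, { id: 1 })          # Rails 4 shape
new_style = test_case.get(:show, params: { id: 1 })  # Rails 5 shape
puts old_style == new_style
```

With both shapes accepted on the Rails 4 branch, specs can be migrated to the new form ahead of the upgrade, and a lint rule can then forbid the old form.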
Failures due to changes in ActionController::Parameters
With Rails 5.0, ActionController::Parameters no longer inherited from HashWithIndifferentAccess. While this did not look like a big change for us, it did turn out to be a culprit for a lot of spec failures mainly because we treated params as a hash in a lot of places in our code.
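The failure mode is easy to reproduce without Rails. In the sketch below, FakeParameters is a deliberately tiny, hypothetical stand-in for an object that wraps a hash without being one; it is not the Rails class. In real Rails 5 code the explicit conversions are permit(...).to_h for whitelisted keys or to_unsafe_h for everything.

```ruby
# Hypothetical stand-in: wraps a hash but does not inherit from Hash.
class FakeParameters
  def initialize(hash)
    @hash = hash.transform_keys(&:to_s)
  end

  def [](key)
    @hash[key.to_s]
  end

  # Explicit conversion back to a plain Hash (copy of the data).
  def to_unsafe_h
    @hash.dup
  end
end

# Code written when params was a Hash subclass: assumes Hash methods.
def query_string(filters)
  filters.map { |key, value| "#{key}=#{value}" }.join("&")
end

params = FakeParameters.new(sport: "run", page: 2)

puts params.is_a?(Hash)  # false: hash-only assumptions no longer hold

begin
  query_string(params)   # raises NoMethodError: no #map on the wrapper
rescue NoMethodError => e
  puts "broken: #{e.class}"
end

# The fix pattern: convert explicitly at the boundary.
puts query_string(params.to_unsafe_h)
```

Most of our fixes followed this shape: find the place that assumed a hash and convert explicitly before passing the value on.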
If each engineer had worked in isolation on their tickets, it would have taken a very long time to complete this project because they all would have had to rediscover the same issues. Strava introduced the concept of Guild Week this year. During Guild Weeks, engineering teams take a week-long break from regular product work to focus on their core technologies and platforms. Luckily, the third Guild Week of the year fell around this time, which gave all the Web engineers the opportunity to work together to fix these issues. The key was spotting the pattern and coordinating the fix. Two engineers, myself included, reviewed almost every PR, which helped us spot common issues and communicate the solution to others.
II. Test the Application
While most of the Guild members worked on fixing the specs, some of us worked on adding more Cypress QA tests. If we had not added more Cypress QA tests, we would have ended up doing a lot of manual testing. Cypress is our automated end-to-end testing tool, which we started using earlier this year; you can find more information about it in this blog post. This upgrade was a great opportunity for us to add more tests, which increased our confidence in the entire QA suite. These tests caught two major issues with the upgrade, which is a testament to their reliability and robustness.
The build was finally green on the second day of the Guild Week. On the third day, we were all doing one final round of manual exploratory testing on our Staging environment. Our branch had changes to 39 files with 170 additions and 181 deletions, most of which were the Gemfile changes.
Things were looking good. We were in our final lap and ready to deploy.
Stage 6 — Hyperventilation with loss of appetite and sleep
The day came faster than I thought it would. Although I had spent four months working on this upgrade, the final release day arrived suddenly and was over almost as quickly. I am a reluctant optimist. What I mean by that is, I always hope that things go well but if they don’t, I am not a person who will start digging a well when the house is on fire.
I was a couple of minutes late to the scheduled guild meeting where I was supposed to deploy the Rails 5 branch to production. I came into a room with about fifteen sets of eyes looking at me. It was deploy time! I was too nervous to say any final words, so I just started the deploy. Our health dashboard was on the big TV screen. The deploy took all of three minutes, and throughout it the dashboard looked normal. There were no spikes or any indication of anything extraordinary happening. Panic hit me. Did I accidentally deploy the wrong branch? No, the branch looked correct. Did the Gemfile in the branch I just deployed say 4.2.7 instead of 5.0? No, I checked, and it was correct. I was on the edge of my seat for an hour, my eyes glued to the dashboard looking for some proof, some error that would validate us being on Rails 5.0. The fact is, we absolutely, 100% were. It was a very smooth release.
It took us about five months from the day we created the Rails 5.0 branch to the day we were live on Rails 5.0 in production. These were the best and the worst months of my career. Worst, because before I came into this project, I thought I was a good engineer. Well, it’s Ruby and Rails, how hard could it be? The image of me being a good engineer came crashing down, and fast at that. But then, these were also the best months of my career. I was working with some of the brightest engineers at Strava. When it was time to huddle, everyone came together and pushed the Rails upgrade across the finish line. Today, I am more confident about planning and executing big projects independently. I have a benchmark for all future projects. From now on, every time I am on a daunting or scary project, all I need to do is ask myself: remember Rails 5? So, will I do this again if given a chance? Absolutely, and this time I would definitely skip the sulking and hair chewing phase and jump straight into action.
There is a saying that it is all about the journey and not the destination. I am happy to report that while I have learnt some invaluable lessons on the way, it was always about the destination in this case. I would like to thank Pan Thomakos for his wisdom, guidance and motto (to quote him): Always be upgrading! Onward and upward.