Migrating our SaaS Product to AWS

Jamie Redfern · Published in Street Group · Aug 5, 2021 · 7 min read

Spectre is a mature Software as a Service (SaaS) product at the intersection of the property industry and technology (aka prop-tech). It’s arguably the most powerful prospecting tool available for estate agents. The following is the story of one of our development projects, migrating Spectre to Amazon Web Services (AWS).

One of the greatest challenges for any modern tech company is finding the right balance between new development and paying down technical debt. Too great a focus on new development, and eventually your system becomes difficult to work with, repels talent from joining your team, and ultimately drowns in a sea of bugs. Too great a focus on technical debt, and you risk falling behind the competition and haemorrhaging customers from all sides. If we are being honest with ourselves, it isn’t exactly a fair fight: new development will win more often than not. This is why it’s important for a development team to give technical debt visibility across the company, so that when a change does need to be made it has broad-based support. This was the case with our team migrating Spectre to AWS.

Spectre has expanded rapidly over its six-year life, and was experiencing growing pains that needed to be addressed. Historically, there were times when spikes of heavy usage made the product slow for customers for several hours a day, and occasionally brought the application down for minutes at a time. To resolve this we needed both scalability and the ability to run certain processes across multiple servers to balance the load. This is where AWS enters the picture.

AWS has become an industry standard for hosting, used by many household names. It would let us run our application across multiple servers, and it is well suited to scaling on demand. We particularly liked the example of Comic Relief, where AWS handled the rapid scaling required for one day a year; if it could do that, it could handle our daily usage spikes. There was also a company-minded reason for opting for AWS: it was already used by other teams. It’s important for a development team to have the independence to find its own solutions, but consistency across teams is certainly something to consider, and it made AWS the obvious choice for us.

There were two major blockers in our way almost immediately. The first was that our team had little experience of working with AWS, and we would need to resolve this before doing anything else. The second was a matter of prioritisation. Our team was relatively small and managing a large product. The effects of the coronavirus pandemic on the property industry meant we had to take a more reactive approach to reflect changing conditions, and committing significant resources to something with relatively little visible impact on the end user was a difficult sell. We managed it by being open with the rest of the company about the importance of managing technical debt. When it came time for us to insist on the importance of the migration, the team was listened to and given the space to act.

With the prioritisation problem resolved, we immediately set about fixing the knowledge gap. Every member of the team took an AWS training course to become familiar with the basics. The initial aim was to create a User Acceptance Testing (UAT) environment on AWS with a read-replica database, to ensure that all existing functionality was in place before tackling the production migration proper. Whenever the team ran into a problem, members of other teams were available to answer questions or pair-program on a particular issue. This allowed the team to gradually build knowledge and experience in the new technology, and let us catch the unexpected issues that arose, such as an issue with our Google Maps API integration that caused all maps to be covered in watermarks. As the UAT environment stabilised, we started to prepare for the actual AWS migration.
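To make that UAT setup concrete, a read-replica RDS database of this kind can be declared in a few lines of Terraform. The sketch below is only illustrative, not our actual configuration: the identifiers, region and instance size are assumptions, and it presumes the source database is already an RDS instance (replicating from a database outside RDS has to be configured at the MySQL level instead).

```hcl
# Minimal sketch of a UAT read-replica database in Terraform. Names, sizes and
# region are illustrative assumptions, not Spectre's actual configuration.
provider "aws" {
  region = "eu-west-2" # assumed region
}

resource "aws_db_instance" "uat_replica" {
  identifier          = "spectre-uat-replica"   # hypothetical identifier
  replicate_source_db = "spectre-production-db" # source must already be an RDS instance
  instance_class      = "db.t3.medium"          # deliberately small for a test environment
  skip_final_snapshot = true

  tags = {
    Environment = "uat"
  }
}
```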

As we neared the point where we thought the migration was ready to go live, we started to plan how we would actually roll out the change. We ran through various options and concluded that the migration would cause significant disruption to our customers, so it should take place outside of usual business hours. We would also need to give our clients advance notice that the application would be unavailable for a period of time. We liaised with marketing and sent out emails informing users that the application would be in maintenance mode each evening from Wednesday to Friday. This gave us enough opportunity for a trial run, a rollback, and a retry if we ran into issues.

On the Tuesday of migration week, we held a team meeting to run through the actions we would need to take. Whilst the migration we had planned worked, it relied on performing actions through the AWS console and running commands from a shell. This added a significant amount of risk, so we opted to automate as much of the migration as possible with Terraform. When migration day finally came, on the Friday, this reduced the deployment to running a single command: a much simpler and safer process. However, running that single command had a single result: an error screen, and at that point we were unable to diagnose it. We knew we had to either press on and try to solve the problem, or back out and admit the attempted migration had been a failure. We chose to back out.
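For context, the "single command" in a Terraform-driven deploy is essentially `terraform apply` run against a root configuration that describes the whole environment. A hypothetical sketch of what that root configuration might look like (the module paths and variable names are illustrative, not our real repository layout):

```hcl
# Hypothetical root configuration: one `terraform apply` of this stands up (or
# recreates) the whole environment. Module paths and variables are assumptions.
module "app" {
  source        = "./modules/app"
  environment   = "production"
  instance_type = "m5.large"
}

module "database" {
  source         = "./modules/database"
  environment    = "production"
  instance_class = "db.m5.large"
}
```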

It was disheartening, but in hindsight it was the correct decision. The team came back refreshed on Monday morning and set about solving the problem. It turned out to be caused by a long-lived Amazon Elastic Compute Cloud (EC2) instance that had been created at the beginning of the project by the AWS Launch Wizard. The instance had never been brought under Terraform’s management, so when the environment was recreated it was destroyed and never rebuilt.
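The general fix for this class of problem is to bring any hand-created resource under Terraform’s management: declare it in configuration and adopt the live resource with `terraform import`, so that recreating the environment no longer silently drops it. A hedged sketch, where the resource name, AMI and instance ID are placeholders rather than the actual instance:

```hcl
# Declare the long-lived instance so Terraform knows it exists. The real instance
# is then adopted with:
#   terraform import aws_instance.legacy_host i-0123456789abcdef0
# (placeholder instance ID), after which it is recreated along with the rest of
# the environment instead of disappearing.
resource "aws_instance" "legacy_host" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI
  instance_type = "t3.small"              # assumed size

  tags = {
    Name = "spectre-legacy-host"
  }
}
```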

Having learned our lessons, we set about trying again. This time we ran through a risk-mitigation exercise and took pre-emptive action to solve each problem before it arose. We cancelled our usual meetings and instead mobbed on the problems from Friday. We then tested the process of spinning up the AWS environment with Terraform from start to finish multiple times, to flush out any remaining issues of the kind that had caused Friday’s failure. By Tuesday we felt confident enough to make a second attempt. We spoke with marketing and sent out another email announcing that our application would be unavailable the following evening. When it came time to actually run the migration on the Wednesday, it went without a problem. The process was completed in a couple of hours.

There was only one relatively minor problem post-launch. In our UAT environment we used server sizes appropriate for testing, and opted for larger sizes in production, but with an oversight in one area: the database. When our UAT environment was slow, we had increased the size of the EC2 instance, but the size of the Relational Database Service (RDS) instance remained the same. Once we moved to production, the RDS instance size quickly proved unsuitable for our application under a standard load.
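One way to guard against this kind of oversight is to keep the application and database sizing for each environment side by side in Terraform, so that scaling one without at least seeing the other is hard to do by accident. A sketch under assumed names and sizes, not our actual configuration:

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment, e.g. uat or production"
}

# Application and database sizes defined together per environment, so a
# production rollout can't bump one while forgetting the other. Values are illustrative.
locals {
  sizes = {
    uat = {
      instance_type  = "t3.medium"
      instance_class = "db.t3.medium"
    }
    production = {
      instance_type  = "m5.xlarge"
      instance_class = "db.m5.xlarge"
    }
  }
  size = local.sizes[var.environment]
}

# local.size.instance_type would feed aws_instance.instance_type, and
# local.size.instance_class would feed aws_db_instance.instance_class.
output "app_instance_type" {
  value = local.size.instance_type
}

output "db_instance_class" {
  value = local.size.instance_class
}
```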

Fortunately, with the migration now in effect, scaling up the RDS instance in place proved to be trivial. We had made a concerted effort to keep as much of the infrastructure as possible within Terraform; however, given the urgency of the situation, we took a brief period of downtime and applied the change directly in the AWS console. Once the RDS upscale was confirmed to be successful, the ad-hoc change was brought back into Terraform.
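Since the RDS instance was already under Terraform’s management, reconciling the emergency change mostly meant updating the declared instance class to match what had been applied in the console, so that the next `terraform plan` reported no drift. Roughly, with illustrative identifiers and sizes:

```hcl
variable "db_password" {
  type      = string
  sensitive = true
}

# Declared configuration updated to match the live instance after the emergency
# resize, so the next `terraform plan` reports no drift. Values are illustrative.
resource "aws_db_instance" "main" {
  identifier          = "spectre-production-db"
  engine              = "mysql"
  instance_class      = "db.m5.2xlarge" # raised from db.m5.large in the console
  allocated_storage   = 200
  username            = "spectre"
  password            = var.db_password
  skip_final_snapshot = false
}
```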

Previously, performing this task by hand would have taken hours: manually provisioning and configuring the new, larger hardware, installing and configuring MySQL, setting up replication, and so on. Now it was reduced to a few clicks. There was minimal customer impact, and we were back at full operating capacity within 20 minutes. All in all, the migration had to be considered a success. With AWS and infrastructure as code, we now had a blueprint for future infrastructure changes, both well planned and ad-hoc.

We took several key learnings from this experience. The first was the value of risk mitigation. Running that exercise before the second launch, where we outlined everything that could go wrong and the actions we could take to minimise each risk, was invaluable, and a major factor in why the second launch went so smoothly. It is now something we do as part of every major release. We also learnt the lesson of maintaining perspective.

We were eager to get the project over the line during the first launch, and it was difficult to admit we needed to step back and see the bigger picture. However, deciding to back out of the migration was the right decision. We gained fresh insight, prevented further problems, and ultimately made the platform more stable than if we had pressed on that Friday evening.

The final lesson was the importance of paying down technical debt. The platform has been much more stable in the four months since we made the migration, which has allowed us to do more work developing features that more directly impact our customers. Far more time would have been spent over those four months putting out fires caused by the system’s shaky foundations than we actually spent on the AWS migration itself. Making sure that this value is recognised throughout the company has placed our team in a very strong position as we look to face new challenges in the future.

To find out more about life at Street Group, visit streetgroup.co.uk, our Glassdoor page, or visit our careers site.
