Story of a successful migration to Google Cloud Platform
For people who started their company on a cloud platform from day one, what you’ll read below will sometimes look obvious or even funny. Still, a lot of startups and more established companies rely on classic bare-metal server infrastructure. For all of them, we hope it’ll help them think about migrating to a cloud platform.
MeilleursAgents is a real estate platform which provides high-quality information on the French real estate market. We act as a powerful catalyst by matching people who have a project (e.g. buy / sell / rent / invest) with the best real estate agencies for their needs.
When we first thought about moving our infrastructure to a cloud platform, we were just in the middle of raising new funds and had a lot of features to deliver.
However, we thought it was exactly the right time to prepare ourselves to speed up our pace of product innovation.
Servers were getting old and we were facing huge delays to get new ones (a couple of weeks, to say the least). Improving your infrastructure in that context is quite unproductive. In a cloud world, we felt like we were living 20 years in the past.
We had no DevOps skills in-house and knew we had to find help to successfully tackle this big challenge.
As a first step and before diving into the migration, we defined a few principles to rely on. We think they can help other people facing the same challenges.
- Define an aggressive deadline to prioritize: 4 months in our case. Trust us, it is quite a challenge but it will help you avoid a never-ending project.
- Corollary: don’t change too much. Migrate the infrastructure “AS IS” then refactor it when you’ve landed. It means that your infrastructure, once migrated, will not be perfect. But you’ll have time to improve it afterward.
- Follow infrastructure as code* principles (see “Infrastructure as code & on-click deployment” chapter below) and automate everything you can.
- Build a mixed team (this one highly depends on which skills you have in your team). As we were lacking DevOps skills, we teamed up with a Google Cloud Platform partner in France (Skale-5) to help us implement automation best practices and learn Google Cloud Platform. Since we wanted to go fast, one of our senior engineers and the CTO were dedicated full time to the project to ensure that every aspect of our tech stack and infrastructure would be taken into account.
- Involve top management: weekly reports are enough but, even if the project is very technical, you need to share every key aspect of it, including risks and potential impacts on team velocity.
Choosing the right platform
This really is the easiest part of the project.
Our goal isn’t to compare GCP to AWS or Microsoft Azure, to name a few. They are all solid platforms and you’ll find tons of very detailed material to help you find out which one suits you best.
We decided to go for Google Cloud Platform, and here are a few reasons we thought it was the right choice for us:
- The service catalog was easy to understand and well suited to our needs.
- Great feedback from similar startups already hosted on GCP.
- The Google tech team and their partners made a great first impression: they understood our context and were quite helpful in mapping our infrastructure to GCP services.
Defining a clear migration scope
An AS IS migration (well, almost)
At first, you’ll want to change everything. While that is intellectually appealing, it is also quite risky.
In our case, saying we migrated “AS IS” is not exactly true. We didn’t make any major change to the way our services worked, but we used this infrastructure migration as an opportunity to upgrade every major component of our technical stack:
- PHP from 5.4 to 5.6
- Every Python 2 app to Python 3
- PostgreSQL 9.2 to 9.6
- Redis 2.2 to 3
We also had to change a couple of services and infrastructure components that did not fit Google Cloud Platform standards.
Mandatory changes in our tech stack
We were sending all of our emails through internal Postfix servers. As it is impossible to do so on Google Cloud Platform, we migrated (for the better) to a well-known email service, SendGrid. Nothing really interesting to mention there, except that, in terms of project management, you’ll have to plan for it ahead of time.
As a real estate platform, we were relying on a big NetApp storage array to store a few TB of pictures (listings, past sales, etc.). As it would have implied a lot of changes in our code base, we couldn’t afford a migration to Google Cloud Storage and went for an alternative: a GlusterFS cluster.
Infrastructure as code & on-click deployment
Infrastructure as code is not optional. We wanted to secure our migration and guarantee that it would be self-documented and easy to deploy and maintain. Getting expertise on DevOps practices was quite useful on that point. The team learnt to use Ansible and we added deployment tasks to our Jenkins server, which was already building all our apps. Ansible scripts are managed on GitHub like the rest of our code base, with Pull Requests to review and validate any changes.
We also defined a clear deployment process so that everyone in the team can deploy. We basically rely on the following main tools:
- Jenkins for continuous integration and build orchestration, with direct feedback to GitHub for every Pull Request
- Docker for our dev containers
- Ansible to script our infrastructure and app deployment
Our build and deploy process is as follows:
The more homogeneous your app packaging scripts are, the simpler your Ansible deployment configuration will be, which eases the migration.
In our app stack, we use 2 main languages: Python and PHP. For both languages, we rely on strictly similar Makefiles to lint, test and package our apps. It was a huge job to standardize everything, but it was also a key success factor that let us migrate without any big problem.
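One way to keep such Makefiles from drifting apart is a small check that every app declares the same targets. The following is only a minimal sketch: the target names `lint`, `test` and `package` are assumptions derived from the three steps named above, and the helper names and sample recipes are hypothetical, not our actual build files.

```python
import re

# Assumed target names, taken from the "lint, test and package" steps.
REQUIRED_TARGETS = ("lint", "test", "package")

# Matches a rule line such as "lint:" or "test: lint", but not "VAR := value".
# Recipe lines start with a tab, so they never match.
_TARGET_RE = re.compile(r"^([A-Za-z0-9_.-]+)\s*:(?!=)")


def declared_targets(makefile_text):
    """Return the set of target names declared in a Makefile's text."""
    targets = set()
    for line in makefile_text.splitlines():
        match = _TARGET_RE.match(line)
        if match:
            targets.add(match.group(1))
    return targets


def missing_targets(makefile_text, required=REQUIRED_TARGETS):
    """Return the required targets the Makefile does not declare, in order."""
    declared = declared_targets(makefile_text)
    return [name for name in required if name not in declared]


if __name__ == "__main__":
    # Hypothetical Makefile contents, for illustration only.
    python_app = "lint:\n\tflake8 .\ntest: lint\n\tpytest\npackage:\n\ttar czf app.tgz .\n"
    php_app = "lint:\n\tphp -l index.php\ntest:\n\tphpunit\n"
    print(missing_targets(python_app))  # []
    print(missing_targets(php_app))     # ['package']
```

Run against every repository in a CI job, a check like this flags an app whose packaging has diverged before the deployment configuration has to special-case it.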
Working on processes and culture as well
When a team is used to a monolithic infrastructure, it has a clear impact on the way they think about and architect the system.
The most common bad habit is to reuse your servers as much as possible by installing a lot of things (apps, jobs, services, you name it) on the same machine to optimize costs. It often leads to poor performance and big scaling issues, and it makes bugs harder to investigate.
If you stick to that strategy, you’ll be going against the philosophy of a cloud platform.
Since we migrated our infrastructure AS IS, we’re now splitting every service into much smaller components: code and its infrastructure.
This is mandatory in order to scale as your traffic grows.
A few weeks after we finished the migration, we hired a full-time DevOps engineer who already had in-depth experience with a cloud platform. This is a game changer for working on culture, as he pairs with developers and improves their day-to-day jobs with smart automation tricks and tools.
From day one, we gave developers full access to Google Cloud Platform so that they could get used to it. The most common usage is to spin up Compute Engine instances to test something and trash them when they’re done.
Preparing for the big day
- As obvious as it may seem: play, replay and replay again your migration scenario. And stick to it. We were a bit too confident about some “simple” services and almost every one of them had to be fixed right after the big day. Yes, I’m sure you know this one: “no need to test this, if the deploy works it’ll work”.
- Do huge QA runs on a production-like environment and ask as many people in the company as possible to do their daily jobs on the platform. Yes, it looks like doing QA on all your services but, trust us, you’ll need that to feel comfortable when you actually migrate.
- Performance and load testing are non-negotiable. If, like us, you don’t have time to build a large performance test suite (with Gatling for instance), you can effectively use your web logs and a few homemade scripts to replay real traffic and find your system’s limits.
- Determine whether downtime is acceptable for your business. It has a huge impact on your migration scenario. So if you can afford it, just go for it. This was an option for us and we basically did the migration during a weekend night.
- Define a team responsible for the migration and clarify everyone’s role for the big day.
- After the migration is done, monitor everything and dedicate team members to fixing every problem you may face. In our case, we had a few to handle but nothing major.
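The log replay mentioned in the load-testing point above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scripts we actually used: it assumes access logs in the common/combined log format, replays only GET requests (which should be safe to repeat), and the function names, host and file name are hypothetical.

```python
import re
import time
import urllib.error
import urllib.request

# Request part of a common/combined log format line: "GET /path HTTP/1.1"
_REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[0-9.]+"')


def extract_paths(log_lines):
    """Yield the path of every GET request found in access log lines."""
    for line in log_lines:
        match = _REQUEST_RE.search(line)
        if match:
            yield match.group(1)


def fetch_status(url, timeout=10):
    """Fetch a URL and return its HTTP status code (4xx/5xx included).

    Connection failures (urllib.error.URLError) are left to propagate.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def replay(base_url, paths, fetch=fetch_status):
    """Replay paths against base_url; return (path, status, seconds) tuples."""
    results = []
    for path in paths:
        started = time.monotonic()
        status = fetch(base_url.rstrip("/") + path)
        results.append((path, status, time.monotonic() - started))
    return results
```

A typical call would look like `replay("https://staging.example.com", extract_paths(open("access.log")))`, where both the host and the file name are placeholders. This version sends requests one at a time, so it measures latency rather than saturation; finding the system’s limits means running several replays in parallel against the production-like environment.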
The migration is just the beginning
Since we switched to Google Cloud Platform, we defined a roadmap and a strategy: we’ll rely on as many managed services as possible.
We’ve already made some interesting changes:
- As one of the core features of our platform relies on advanced map usage, we heavily use Cloud CDN to cache our map tiles.
- We split services as much as possible to ease analysis when a problem occurs and to avoid potential side effects.
- We replaced our on-premise Docker registry with Google Container Registry.
In the upcoming weeks we’ll rely more and more on managed services:
- We’re actively looking into Google Cloud Storage to get rid of our GlusterFS cluster.
- The managed PostgreSQL service, still in beta on GCP, will be benchmarked as soon as it reaches GA. It would be a great opportunity to avoid complex administration tasks on our database.
- At some point, Google App Engine could be an option for us.
Engaging in such projects is always a gamble, even if you do everything possible to reduce the risks. Still, making this change was probably the best move we could make for our platform and our team.
To name a few benefits:
- It allows us to adapt to our traffic growth. We’re now able to refactor our infrastructure and rely on classic scaling patterns.
- It was the perfect moment to industrialize every aspect of our infrastructure management: this is a huge velocity improvement as all our infrastructure and deployment processes are now versioned and reproducible.
- For all our new projects, building the underlying infrastructure is not a problem anymore.
- On the performance side, without any major refactoring of our code base, we’ve cut our response time by a factor of 2.
- The development team’s day-to-day job is easier: whether it is for a POC or for an on-demand database dump to investigate something (disk snapshots of a 700+ GB PostgreSQL database are just blazing fast), everything is so much easier.
- Infrastructure as Code: https://www.thoughtworks.com/insights/blog/infrastructure-code-reason-smile