Skyscanner and AWS: cloudy, with a chance of lessons learnt
Ashley Sole knows a thing or two about moving to the cloud. In this piece he shares his hard-won knowledge from Skyscanner’s recent project: the migration of 300 services, from an estate comprising five data centres and 7,000 VMs, to AWS.
Here at Skyscanner, we completed our migration to the cloud (AWS) at the end of 2018. This concluded a three-year journey of moving Skyscanner from on-premises data centres to being fully hosted in the cloud. For the people working on the migrations, it has at times felt like a never-ending journey. We are immensely proud of our achievement: cloud migrations are not simple and we’ve learned a lot along the way. Many other companies have been on a journey to the cloud for far longer than us and made less progress.
At the end of 2018, the hardware hosting Skyscanner in the data centres was approaching seven years old, so at the start of last year we had to make a decision: would we reinvest in data centre hardware at a huge cost, or execute an aggressive “all-in” cloud strategy? We chose to go all in, so we rolled up our sleeves and got stuck in.
What was involved?
The Skyscanner site was hosted across five on-premises data centres: two in the UK, one in the Netherlands, one in Hong Kong and one in Singapore. Each data centre had a VMware installation, and together they hosted over 7,000 VMs, along with various other physical hardware for databases, firewalls, storage controllers, and so on. In total, over 300 different services, owned by over 40 engineering teams within Skyscanner, were hosted across our data centres.
How did we approach it?
You build it, you run it
- Skyscanner engineering principle
Skyscanner operates a “you build it, you run it” approach, so each team was responsible for formulating and executing a plan to migrate their own services. There was no central “migration team” responsible for migrating software; teams know their services better than anyone else, so they were responsible for deciding and implementing the most appropriate migration for them. For this approach to succeed, communication between teams and clear expectations of deadlines were critical.
Throughout this process, we regularly referenced the 6 Strategies for Migrating Applications to the Cloud. Some of our migrations were rehost and some were replatform efforts, but in large part our migrations were cloud-native rewrites. The most dramatic piece of software modernisation we carried out for the migration was to rewrite the core Flights stack, turning a .NET, SQL Server-backed monolith into stateless Java microservices running in Kubernetes on AWS spot instances. This was a huge undertaking, but the gains in software modernisation, resiliency, ability to scale and cost optimisation are substantial.
Every team was in charge of their own migration and most teams’ Plan A was a cloud-native rewrite. But we soon learned that for some migrations this was either not feasible or not appropriate. We used project roadmaps to define milestones and deadlines, along with a tonne of communication between teams to make sure everyone was clear about what was expected. This made conversations about executing a rehost instead of a rewrite much easier.
What did we learn?
Lesson 1 — Rewriting software for cloud takes a very, very long time
Anyone who thinks you can take your existing stack, update a few scripts and magically become cloud native is desperately misinformed. Full cloud-native rewrites are the stuff of dreams, but for most they stay just that: dreams. If you don’t have the luxury of time (and let’s be honest, nobody does) then you must consider taking shortcuts via a rehost or replatform path if you wish to complete your cloud migration this century. To be specific, it took over two years to rewrite our core Flights services. By comparison, some of our rehosted services took a matter of days to migrate.
“It always takes longer than you expect, even when you take into account Hofstadter’s Law.” — Hofstadter’s Law
Lesson 2 — Tech debt is necessary
We made a conscious decision to bring some tech debt forward into AWS strategically to hit a deadline. When you accept there’s always going to be a level of tech debt in your estate, you can start to understand these decisions. With the pace of software evolution today, a piece of software is pretty much out of date as soon as it’s written. The important aspect is ensuring your tech debt is manageable. When your tech debt translates to consistent customer disruption, or significant pain for engineers, then it’s time to address it. If you can live with it, then your time is better spent delivering customer value.
The biggest piece of tech debt was our SQL Server estate, which we rehosted to AWS. There’s nothing inherently wrong with SQL Server; it is still at the core of Skyscanner services. It is, however, a monolithic database that, in order to be modernised, will need to be carved up into different technologies appropriate to how each part is used. For example, certain tables in the SQL DB rarely change, so would fit well with a static file store like S3; other parts change more frequently, so a caching solution may be appropriate. It will take many more years to migrate completely, but this was a compromise we had to make to hit our targets.
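To make that carving-up idea concrete, here is a minimal sketch of how tables might be triaged by how often they change. It is an illustration only, not a description of what Skyscanner built: the table names, the write counts and the one-write-per-day threshold are all hypothetical assumptions.

```python
from dataclasses import dataclass

# Hypothetical threshold (an assumption, not from the article): tables
# written to less than once a day are treated as "rarely changing".
STATIC_WRITES_PER_DAY = 1.0


@dataclass(frozen=True)
class TableStats:
    """Observed write activity for one table in the monolithic database."""
    name: str
    writes_per_day: float


def suggest_target(table: TableStats) -> str:
    """Suggest a destination for a table based on how often it changes."""
    if table.writes_per_day < STATIC_WRITES_PER_DAY:
        # Rarely changing data could be exported to a static file store.
        return "static file store"
    # Frequently changing data is a candidate for a caching solution.
    return "caching solution"


def triage(tables):
    """Map each table name to its suggested destination."""
    return {t.name: suggest_target(t) for t in tables}


if __name__ == "__main__":
    # Made-up example tables and write rates, purely for illustration.
    sample = [
        TableStats("airports", 0.01),
        TableStats("route_prices", 50_000.0),
    ]
    for name, target in triage(sample).items():
        print(f"{name}: {target}")
```

In practice the split would depend on far more than write frequency (read patterns, consistency needs, data size), so a rule like this is only a starting point for the conversation.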
Lesson 3 — Constant planning and excellent communication are key to success
Skyscanner has an immensely fast-paced environment that is core to our culture. It means that plans can change quickly and we need everyone to be on board when a decision to pivot is needed. Constantly re-evaluating what you are doing is critical in this environment, and it’s super important that you’re transparent about everything. When communication and transparency break down, people start to question why things are being done and are no longer bought in to the plans.
“I have always found that plans are useless, but planning is indispensable.” — DWIGHT D. EISENHOWER
The migration project was run in a truly agile way: there was no “500 page cloud migration strategy”, just lean principles. We could constantly ask ourselves:
- What’s the current status?
- Are we on track?
- What are the blockers?
- How could we go faster?
Status emails would turn into meetings, which would turn into actions, which would turn into update reports, all in the space of hours. This enabled us to move fast and deliver successfully.
For now, our cloud migration has ended, but our constant drive to evolve, modernise and keep our software at the cutting edge is a never-ending journey. If you’re interested in hearing more, feel free to reach out to me or anyone in the Skyscanner team.
Join Skyscanner, see the world
Life-enriching travel isn’t just for our customers — it’s for our employees too! Skyscanner team members get £500 (or their local currency equivalent) towards the travel trip of their choice in 2019 — and that’s just one of the great benefits we offer. Read more about our benefits and have a look at all of our open roles right here.
About the Author
I’m Ash Sole, engineering manager and technical leader in the Edinburgh office, with over 10 years’ experience in industry. I believe in a blameless culture and that the most important thing for delivering products is a strong, happy team. I have a passion for being extremely well organised and believe that a well-organised mind is best placed to produce success. I love agile methodologies and am also a keen property investor.
If you liked this post, here is another one I wrote about my daughter arriving early and my company actually caring.