How Expedia Group Platform Engineering Revamped Their Compute Platform
A runtime compute platform built for developers by developers
In the last two years, the pandemic dealt a huge blow to the travel industry. With things now opening up, an enormous demand for travel is approaching again, a moment some of us like to call the “Travelution”.
We stayed positive with the thought that “there is always sunshine after a dark night”.
Those who stand tall have the vision and farsightedness to look beyond. This blog tells the story of how Expedia Group™️, a tech-first company, built a unified technology platform that empowers developers to build applications providing an unmatched traveller experience, without worrying about the underlying infrastructure.
Stepping stone: choose your destination
The place you always dreamt of visiting one day.
For us, the platform engineering team, the dream was “a platform that stands tall on technology and can scale immensely as traveller demand skyrockets”.
Expedia Group comprises various brands, as shown below. Many of them had their own platform tech stacks: some used Amazon Elastic Container Service, some used Kubernetes, and some had no containerisation at all. This variation caused feature disparity between environments.
The unified platform approach became the first motivation behind creating one common centralised platform that serves the needs of all the various brands and operates at scale to share the best use of technologies. This has fostered innovation, reduced time to production, enabled a standardised workflow and reduced cloud costs.
Now, let’s talk about scale:
- 35+ segmented functional domains identified
- 10k+ apps identified to host on this new compute platform
- 3k+ apps to be decommissioned, refactored, containerised
- 500+ Kubernetes clusters
- 40k+ nodes
- Deployment spread across 4 regions
- 3 environments: Dev, Test and Prod
Plan the itineraries
2 Major attractions (Strategies)
- Shift-Left approach
Build continuous testing into your CI/CD pipeline, and back your design approaches with spikes and proofs of concept. Each code commit going into the repo is not only unit tested but also tested for functional dependencies, security and even scalability. Remove toil with automation as much as possible, in line with site reliability engineering principles.
- Paved road
Managed platform: create a control plane where app developers, i.e. platform users, are empowered with automation and self-service, and their resource fulfilment is handled dynamically. Developers need not know the nitty-gritty of the platform, not even Kubernetes, and should not worry about platform maintenance or patching. It also lets their hosted apps seamlessly leverage new platform-specific features on the fly.
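The shift-left strategy above can be sketched as a CI pipeline. This is purely illustrative, written in GitHub Actions syntax; the job names and `make` targets are hypothetical stand-ins for whatever build tooling your organisation uses.

```yaml
# Illustrative shift-left pipeline: every commit is unit tested,
# security scanned, and checked for functional and scale behaviour.
name: shift-left-ci
on: [push]
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test                 # unit tests on every commit
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make scan                 # dependency and image vulnerability scan
  functional-and-load:
    needs: [unit-tests, security-scan]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make integration-test     # functional-dependency tests
      - run: make load-test            # scalability check before merge
```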
7 Sights not to miss when you build your platform (Big boulders)
Zero trust, mTLS (mutual TLS) and strong identity: different brands, as well as functional units, require clear isolation and segmentation for their applications to run on a common platform.
The Istio service mesh is leveraged to control inter- and intra-communication between apps. And to enable a clear demarcation, i.e. RBAC (role-based access control), between developers and platform admins, Teleport can be used.
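As a minimal sketch of the mTLS piece, Istio can enforce mutual TLS mesh-wide with a single `PeerAuthentication` resource (this assumes Istio’s default root namespace, `istio-system`):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject any plain-text traffic between workloads
```

Per-namespace or per-workload policies can then loosen or tighten this baseline where a tenant genuinely needs it.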
Metrics, logging and tracing: for any product (yes, it’s a platform as a product), running huge live traffic on an application needs best-in-class monitoring, troubleshooting assistance and service-level metrics for product robustness. And don’t forget to measure and chart your p50, p90 and p99 latency across the various component flows.
At Expedia Group, we utilise lots of open-source tools such as Prometheus, Grafana, ELK, Vector, and a few enterprise tools like Splunk and Datadog.
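The percentile tracking mentioned above can be sketched as a Prometheus recording rule, assuming request latencies are exported as a histogram; `http_request_duration_seconds` and the `service` label are hypothetical names:

```yaml
groups:
  - name: latency-percentiles
    rules:
      # Pre-compute p99 latency per service from histogram buckets
      - record: service:http_request_duration_seconds:p99
        expr: >
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
      # p50 and p90 follow the same pattern with 0.50 and 0.90
```

Recording the quantiles keeps Grafana dashboards cheap to render even across hundreds of clusters.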
Security has different layers: Kubernetes security sits on top of cloud security, and each layer needs secret management, certificate management and regular vulnerability scanning. Compliance, such as PCI and CIS benchmarking, can also be an organisational target. A variety of open-source and enterprise tools serve this purpose, e.g. kube-bench, Qualys and Prisma.
Resiliency can be thought of in terms of robustness, flexibility, scalability and trust. Designs should consider disaster recovery, workloads spread across zones, regular backups, performance measurement, fault tolerance via chaos engineering frameworks and autoscaling. Karpenter is an autoscaler that provisions nodes to match workload demand, enabling optimised, customisable use of instance pools and supporting platform strength.
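The autoscaling idea can be sketched with Karpenter’s `NodePool` API (`karpenter.sh/v1beta1`); the pool name, limits and instance requirements below are illustrative, not our production values:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose        # hypothetical pool name
spec:
  template:
    spec:
      requirements:
        # Let Karpenter choose between spot and on-demand capacity
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: "1000"                # cap total provisioned CPU for this pool
  disruption:
    consolidationPolicy: WhenUnderutilized   # repack and remove idle nodes
```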
While working with many different app workloads, networking isolation and segmentation are of the utmost importance. A few apps must communicate irrespective of region, a few may or may not, and a few must not communicate at all, based on corporate policies or the data they contain. Defining the right tenants and boundaries helps with functional segregation, with the allocation of cloud accounts and their networking connectivity needs, and with cost tagging per silo.
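At the cluster level, the “should not communicate” default can be expressed as a deny-all Kubernetes `NetworkPolicy` per tenant namespace (`tenant-a` is a hypothetical namespace), with explicit allow policies layered on top for the flows a tenant is entitled to:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-a      # hypothetical tenant namespace
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress              # deny all inbound traffic by default
    - Egress               # deny all outbound traffic by default
```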
Life cycle management: rapid growth in cloud-native infrastructure demands multi-cluster management, along with growing challenges like configuration propagation, upgrades and runtime operations. KubeFed v2 has served us for multi-cluster config propagation, with some in-house operators and custom controllers built on top to handle the load. We also followed a GitOps approach, using Flux for the lifecycle management of core platform components.
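The GitOps flow with Flux can be sketched with two resources: a `GitRepository` pointing at a config repo and a `Kustomization` that reconciles a path from it. The repository URL and path here are hypothetical placeholders:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-components
  namespace: flux-system
spec:
  interval: 5m                       # how often to poll the repo
  url: https://github.com/example-org/platform-config   # hypothetical repo
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform-components
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: platform-components
  path: ./clusters/prod              # hypothetical path per cluster/environment
  prune: true                        # delete resources removed from Git
```

With this in place, a merged pull request is the unit of change for platform components, and drift between Git and the cluster is continuously reconciled.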
Cloud cost optimisation (FinOps): with the adoption of cutting-edge technology there is always a cost involved, and platform engineering has to balance its approaches when targeting cost optimisation. Continuous monitoring and analytics of cloud cost burn by resource is the first and foremost requirement. We need to know where we can save, for example by using spot instances for non-critical workloads or right-sized instances, and what our logging, storage and high-end instance costs are. Having all resources strictly tagged and attributed for cost management achieves this goal.
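As one hedged example of that tagging discipline, Kubernetes namespaces can carry labels that cost tooling aggregates on; the label keys and values below are hypothetical:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: checkout-team          # hypothetical tenant namespace
  labels:
    cost-center: cc-1234       # drives chargeback reports
    business-unit: brand-x     # attributes spend to a brand silo
    environment: prod          # separates prod burn from dev/test
```

Enforcing such labels at namespace creation time (e.g. via an admission policy) keeps the cost data complete rather than best-effort.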
Know your destination
Get acquainted with your destination’s currency, language, transport, weather and so on.
Similarly, in the platform world, you need to know the minutiae of the platform. Ask these questions and prepare adequate details for onboarding.
- Does the platform support containerised apps, non-containerised apps, or both? If non-containerised apps are not supported, consider the time and cost of containerising legacy apps.
- Does the platform have log stream capabilities, which could be needed for troubleshooting, auditing, running analytics and so on?
- What monitoring and alerting capabilities does the platform offer, and how do applications integrate with them?
- Is there any production readiness score for the platform?
- Is there any paved road for providing infrastructure resources for applications in a developer-friendly way?
- Is there a service level definition for the platform that matches your application’s uptime requirement?
- Are there any security exceptions and networking isolation needed?
- Any specialised needs for your app or edge cases? Does the platform need further development and customisation?
Pack your bags, get set go
Create your ultimate packing checklist as you embark on a new journey
One of the biggest challenges is migration. For application developers, accelerating adoption of the new platform should be a priority, so that they develop new use cases with its cutting-edge features and, while developing new features, avoid sticking with older infrastructure that is going to be deprecated and decommissioned.
Platform engineers should listen to feedback from early adopters and modify the platform roadmap accordingly, remembering also to keep adopters informed of roadmap changes as they happen.
“A ship in harbour is safe, but that is not what ships are built for.” — John A. Shedd
The platform team also needs to cover various traffic migration scenarios, so that applications can be lifted and shifted from the legacy platform to the new one with a smooth transition and ZERO disruption to live traffic.
There are many traffic deployment strategies which may suit depending on your application requirements, such as canary, blue-green, rolling etc.
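As a sketch of the canary pattern on an Istio-based platform, a `VirtualService` can split live traffic by weight; `my-app` is a hypothetical service, and the `stable`/`canary` subsets assume a matching `DestinationRule` defining them:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app               # hypothetical application
spec:
  hosts:
    - my-app
  http:
    - route:
        - destination:
            host: my-app
            subset: stable
          weight: 90          # 90% of traffic stays on the current version
        - destination:
            host: my-app
            subset: canary
          weight: 10          # 10% exercises the new version
```

Shifting the weights gradually (10 → 50 → 100) while watching error rates and latency lets the cutover proceed, or roll back, without disrupting live traffic.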
Mix a balance of complex and simple apps in your migration strategy, so that valuable feedback from live traffic reaches platform engineering early, and the team can keep building its backlog and delivering against prioritised needs as traffic scales up.
That’s all — bon voyage!
In the end, it takes huge grit and determination, along with exceptional talent, to build a centralised platform as a product that covers the growing business needs of a large organisation. And since a robust platform is the backbone of any organisation’s innovative products and services, it is surely worth it.
Wishing you all the best in your platform journey!