This is part one of our ongoing series on the Cruise PaaS:
Stay tuned for more on networking, observability, and deployment!
Every day, our self-driving cars navigate the streets of San Francisco. Our autonomous vehicles validate our software as they chauffeur Cruise employees around the city, continuously improving their driving ability by tackling the challenges of a complex urban environment. To operate continuously and safely, our fleet is supported by thousands of servers and interconnected cloud services. In addition to the web applications, mobile back-ends, and business tools common to every modern company, we manage 3D maps, navigation services, driving simulations, machine learning, data processing, test pipelines, security suites, and a lot more — both on-premises and in the cloud.
To handle such a wide variety of workloads — and support the hundreds of developers building and operating them at Cruise — we depend on our internal multi-tenant compute platform, which helps us reach production readiness at scale, enhances engineering productivity, and provides uncompromising security.
In the early days of Cruise, teams with disparate requirements moved quickly by managing their own cloud infrastructure and deployment. However, as our company grew, we saw an opportunity to accelerate development by reducing duplicate effort. We compiled the best practices from each team, re-evaluated our evolving requirements, expanded our team by hiring additional infrastructure experts, and started building a unified compute platform. The Platform as a Service (PaaS) team was created to develop this container platform.
Over a series of blog posts, we’ll share how we designed our container platform, the technical challenges we’ve faced, and how we’ve overcome them. In this first post, we’ll talk about some of the groundwork we laid in order to support a multi-cluster, multi-tenant, and multi-environment PaaS, built on top of Kubernetes.
Kubernetes, at its core, is a platform for building platforms. It’s not a highly opinionated turn-key solution — it’s more of a foundational layer built for flexibility, which is exactly what we needed to support Cruise’s wide variety of workloads.
This meant the PaaS team had our work cut out for us. Deploying a production-quality Kubernetes cluster on a hybrid-cloud private network is not a simple matter, and building a fully functional multi-tenant PaaS spanning multiple Kubernetes clusters is even more challenging. With an eye towards simplifying our development and operational complexity, we evaluated both deploying our clusters from scratch and various managed Kubernetes offerings.
After several prototypes, we chose Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP). GKE provides a number of features and integrations that supplement “vanilla” Kubernetes, and these additional features saved us time both short-term (by not having to develop and provision them) and long-term (by not having to manage and operate them directly). Even with GKE’s extra features, we still had a lot of work to do to effectively manage multiple clusters in multiple environments, with interconnected Virtual Private Cloud (VPC) networks and production-ready security and deployment automation.
Environments & Tenants
When using Kubernetes, it’s tempting to assume you can get by with a single cluster, co-locating all your workloads to optimize efficiency and cost. However, it quickly became apparent to us that we still needed multiple clusters, even with our multi-tenant strategy. Co-locating containerized workloads gives us higher compute utilization, less operational overhead, and faster deployment. The trade-off is that we sacrifice some resource, network, and tenant isolation. Sometimes it makes sense to share, and other times it makes sense to isolate.
With multiple clusters and environments, we can choose which level of isolation best suits each workload. To ensure that hardened and secure production code isn’t running alongside more volatile development code, we opted to isolate clusters across environmental lines: development, staging, and production. This allows us to run production-like validation, load, and scale tests on our staging clusters without risking production availability or costly development interruption.
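As a rough sketch of this environment split, separate GKE clusters per environment could be declared in Terraform along the following lines. The project IDs, region, and network names here are hypothetical placeholders, not our actual configuration:

```hcl
# One GKE cluster per environment, each in its own project and VPC.
# Project IDs, region, and network names are illustrative placeholders.
variable "environments" {
  type    = list(string)
  default = ["development", "staging", "production"]
}

resource "google_container_cluster" "env" {
  for_each = toset(var.environments)

  name     = "paas-${each.key}"
  project  = "example-${each.key}-project" # separate project per environment
  location = "us-west1"

  network    = "example-${each.key}-vpc"   # separate VPC per environment
  subnetwork = "example-${each.key}-gke-subnet"

  # Manage node pools separately from the cluster resource.
  remove_default_node_pool = true
  initial_node_count       = 1
}
```

Keeping the environments as distinct resources (rather than one cluster with soft boundaries) means a bad rollout or scale test in staging cannot take production tenants down with it.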
At the infrastructure layer, we configured GCP with vertical and horizontal boundaries:
- Vertically: GCP folders and projects separate development, staging, and production environments from each other. These environments each have their own VPC, which is shared between GCP projects in the same environment. Each network has its own subnets, firewalls, interconnects, NAT gateways, and private DNS. The networks are connected, but traffic between them can be easily audited and regulated.
- Horizontally: GCP projects within each environment allow for distinct permission management and visibility constraints between tenants. Projects make it easy for tenants to focus on what matters to them. Separation between tenants makes it harder for malicious actors to gain unauthorized access, and prevents tenants from accidentally impacting each other.
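A minimal Terraform sketch of these vertical and horizontal boundaries might look like the following, shown here for a single staging environment and one example tenant. The organization ID, folder layout, and project IDs are hypothetical placeholders:

```hcl
# Vertical boundary: a folder per environment under the organization.
resource "google_folder" "staging" {
  display_name = "staging"
  parent       = "organizations/000000000000" # placeholder org ID
}

# A host project owns the environment's shared VPC.
resource "google_project" "staging_network" {
  name       = "staging-network"
  project_id = "example-staging-network"
  folder_id  = google_folder.staging.name
}

resource "google_compute_network" "staging_vpc" {
  name                    = "staging-vpc"
  project                 = google_project.staging_network.project_id
  auto_create_subnetworks = false
}

resource "google_compute_shared_vpc_host_project" "staging" {
  project = google_project.staging_network.project_id
}

# Horizontal boundary: each tenant gets its own service project,
# attached to the environment's shared VPC.
resource "google_project" "tenant_maps" {
  name       = "maps-staging"
  project_id = "example-maps-staging"
  folder_id  = google_folder.staging.name
}

resource "google_compute_shared_vpc_service_project" "tenant_maps" {
  host_project    = google_compute_shared_vpc_host_project.staging.project
  service_project = google_project.tenant_maps.project_id
}
```

Tenants share the environment's network (and its firewalls, NAT, and DNS) while keeping their own project-level IAM and visibility boundaries.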
While GCP provides the primitives for infrastructure level isolation, Kubernetes provides the primitives for platform level isolation. At the infrastructure level, tenants can take advantage of GCP managed services such as CloudSQL, Stackdriver, BigQuery, Cloud Machine Learning Engine (CMLE), and Google Compute Engine (GCE). At the platform level, tenants can manage their container workloads, like applications, services, and jobs. Both layers help improve availability by increasing isolation, and provide boundaries on which to manage permissions, allowing for both coarse-grained and fine-grained permission management. Both layers also make it easy for legal and security teams to audit who has access to what, and where each resource came from.
At the platform layer, we configured Kubernetes with vertical and horizontal boundaries:
- Vertically: Each environment and region gets its own Kubernetes cluster, aligning with infrastructure boundaries. These clusters each have distinct subnets, requiring traffic between them to route through ingress and egress channels which can be monitored and controlled using firewalls and proxies. This provides highly available load balancing which tenants don’t need to implement themselves, while simultaneously discouraging them from making their production workloads depend on non-production services.
- Horizontally: Kubernetes namespaces act as tenant boundaries between workloads. This allows cluster admins to manage permissions, quotas, and resources at the cluster level while delegating similar control on a smaller scale to tenants as namespace admins. Namespaces are used as security boundaries for role-based access control (RBAC) both in Kubernetes and in integrated systems, like Vault and Spinnaker. Namespaces are similar to GCP projects in that they make it easy for tenants to focus on what matters to them.
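As an illustration of this namespace-level delegation, onboarding a tenant might consist of manifests like the following. The tenant name, quota values, and identity group are hypothetical examples, not our actual configuration:

```yaml
# Horizontal tenant boundary: a namespace with its own quota and admins.
apiVersion: v1
kind: Namespace
metadata:
  name: maps-service
  labels:
    team: maps
---
# Cluster admins cap the tenant's resource consumption.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: maps-service-quota
  namespace: maps-service
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 160Gi
    pods: "200"
---
# RBAC: delegate administration of this namespace to the tenant's team.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: maps-team-admin
  namespace: maps-service
subjects:
  - kind: Group
    name: maps-team             # example identity group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin                   # built-in namespace-scoped admin role
  apiGroup: rbac.authorization.k8s.io
```

Binding the built-in `admin` ClusterRole within a single namespace gives the tenant broad control over their own workloads without granting any cluster-wide privileges.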
Platform as a Service
The overall Cruise Platform as a Service (PaaS) is a collection of integrated services, which enhance Kubernetes and GKE, and provide additional functionality for our engineers.
These enhancements can be grouped into a few areas, including security, networking, observability, and deployment. Each of these areas deserves attention from anyone using Kubernetes, but they are particularly complicated when supporting multiple environments, clusters, and tenants. There’s always more to automate and integrate, and the more common functionality we can provide at the platform layer, the more our application engineering teams can focus on their unique domain-specific problems.
To Be Continued…
Through this series of blog posts, we’ll cover the security, networking, observability, and other technical challenges we experienced while building the Cruise PaaS. Continue reading with our deep dive on container platform security!
Interested in helping us build the platform paving the way for autonomous vehicles? Check out our open positions.
Interested in hearing more about Cruise? Watch some of our recent talks: