Infinity — our compute platform
Infinity — an introduction
Infinity is Scout24's turnkey compute platform. It enables builder teams to run their applications reliably and securely at scale, without worrying about the complexities of the underlying infrastructure.
Infinity aims to increase builder teams' efficiency and productivity by providing the following features out of the box:
- Efficient load balancing with AWS Load Balancers
- Human readable DNS records with auto-secured connections
- Auto-scaling on CPU utilisation
- Auto-ingestion of logs, metrics, and events to central solutions
- Integration with Datadog and Cloudwatch
- Automated dashboards, SLOs, and baseline alarms
- Distributed trace collection and publication to Datadog
Currently, we host more than 700 services across Scout24, including our most important traffic drivers such as the homepage and search services.
Infinity transparently connects to Scout24 AWS accounts, so builders can benefit from all of its features while retaining the freedom to use resources in their own team's AWS account.
Infinity is a group of (currently 8) Kubernetes clusters built on AWS EKS, complemented by custom and open-source controllers to fulfil our use cases. We include components in all our clusters for:
- Metrics and Tracing (Datadog and metrics-server)
- Cross-account AWS service access via IAM roles (KIAM)
- Logging (fluentd)
- Backups (velero)
- Certificate generation and management (cert-management)
- AWS Load Balancer management (AWS Load Balancer controller)
- DNS management (CoreDNS and external-dns)
- Cluster scaling (cluster-autoscaler)
and a slew of other controllers needed to implement business logic.
This article covers various aspects of our compute platform:
- Architecture Overview
- Infinity Service Kubernetes Components
- Infinity Service AWS account structure
- Connecting to AWS services
- Scaling Infinity Services
- Monitoring Infinity Services
All Infinity service interactions go through a custom CloudFormation resource. This lets us define a well-constructed interface with our builders: as long as that interface is satisfied, we can change the internals as necessary to improve the platform while keeping a consistent user experience and minimising interruptions for users.
The minimal requirements for an Infinity service creation are the following:
- The Service Token — The SNS topic that is the entry-point of the Infinity platform
- The Service Name — The name of the Infinity service
- The Image Name — A fully qualified Docker image
- The Tags — Required for compliance and cost allocation
Everything else, including minimum and maximum resource allocation, containers, health-check endpoints and timings, has sane defaults enabled out of the box.
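To make the interface concrete, here is a minimal sketch of how the required properties might be validated and merged with platform defaults. The field names, default values, and example ARNs below are illustrative, not the actual Infinity schema.

```python
# Illustrative platform defaults; the real Infinity defaults may differ.
PLATFORM_DEFAULTS = {
    "MinReplicas": 2,
    "MaxReplicas": 10,
    "Cpu": "250m",
    "Memory": "512Mi",
    "HealthCheckPath": "/health",
    "HealthCheckTimeoutSeconds": 5,
}

# The four properties the article names as the minimal requirements.
REQUIRED_FIELDS = {"ServiceToken", "ServiceName", "ImageName", "Tags"}

def build_service_spec(user_properties: dict) -> dict:
    """Validate required fields, then fill in sane defaults."""
    missing = REQUIRED_FIELDS - user_properties.keys()
    if missing:
        raise ValueError(f"missing required properties: {sorted(missing)}")
    # User-supplied values always win over platform defaults.
    return {**PLATFORM_DEFAULTS, **user_properties}

spec = build_service_spec({
    "ServiceToken": "arn:aws:sns:eu-west-1:123456789012:infinity-entrypoint",
    "ServiceName": "hello-service",
    "ImageName": "registry.example.com/team/hello-service:1.0.0",
    "Tags": {"team": "platform", "cost-center": "42"},
})
```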
Two custom controllers handle all the interactions for an Infinity service on the cluster, including updating the status, creating the relevant resources, and translating between the CloudFormation and Kubernetes specs.
Those controllers publish a few metrics that help us answer the following questions, among others:
- How long did the service operation take?
- Did the operation update an existing service or create a new one?
- Did the operation fail or was it successful?
- Was the CloudFormation specification valid?
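As a rough sketch of the bookkeeping behind those questions, the snippet below records operation counts and durations in-process; the metric names and labels are invented for illustration, and the real controllers publish to a metrics backend rather than local counters.

```python
from collections import Counter

operations = Counter()      # (action, outcome) -> count
durations_by_service = {}   # service -> list of durations in seconds

def record_operation(service: str, action: str, outcome: str, seconds: float) -> None:
    """action: 'create' or 'update'; outcome: 'success' or 'failure'."""
    operations[(action, outcome)] += 1
    durations_by_service.setdefault(service, []).append(seconds)

record_operation("hello-service", "create", "success", 12.3)
record_operation("hello-service", "update", "failure", 3.1)

# "Did the operation fail or was it successful?"
failures = sum(n for (_, outcome), n in operations.items() if outcome == "failure")
# "How long did the service operation take?" (average, here)
samples = durations_by_service["hello-service"]
avg_seconds = sum(samples) / len(samples)
```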
Infinity Service Kubernetes Components
Each Infinity service is composed of the following Kubernetes components:
- A Deployment — To create the Pods
- A Horizontal Pod Autoscaler — To handle scaling
- An Ingress — To route incoming traffic to the service
- A Pod Disruption Budget — To ensure minimum availability of the service
- A ReplicaSet — To guarantee the availability of x number of Pods for the Infinity service
- A NodePort Service — To ensure connectivity to the Infinity service Pods
- Labels — To uniquely name the Infinity service. These also act as selectors for all the resources listed above
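As a sketch of the labels-as-selectors idea, the following builds a label set for a service and renders it as a selector string shared by all the resources above. The label keys are hypothetical; the actual keys Infinity uses may differ.

```python
def infinity_labels(service_name: str) -> dict:
    # Hypothetical label set following the common app.kubernetes.io convention.
    return {
        "app.kubernetes.io/name": service_name,
        "app.kubernetes.io/managed-by": "infinity",
    }

def selector(service_name: str) -> str:
    """Render the labels as a Kubernetes label-selector string,
    as accepted by e.g. `kubectl get pods -l <selector>`."""
    return ",".join(f"{k}={v}" for k, v in sorted(infinity_labels(service_name).items()))

sel = selector("hello-service")
```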
Infinity Service AWS account structure
All Infinity Kubernetes clusters run in their own AWS accounts, which are owned by the Platform team. This ensures that we, the Platform team, can maintain the Kubernetes clusters and make infrastructure changes as necessary.
Owners of Infinity services deploy their services into their own accounts alongside any other relevant infrastructure such as databases, caches, and queues.
Connecting to AWS services
For services that can communicate via AWS IAM roles, we provide KIAM. KIAM enables a container to assume a role with the necessary permissions in the product account (where the service is deployed alongside its dependencies). Although this role is configurable, out of the box a role with the same name as the Infinity service is assumed in the product account. KIAM lets builders focus on which AWS services their application should use, rather than on how to connect to them.
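As an illustration of that default, the role ARN assumed in the product account can be derived from the account ID and the service name alone (the account ID and service name below are made up):

```python
def default_role_arn(product_account_id: str, service_name: str) -> str:
    """Build the IAM role ARN that would be assumed out of the box:
    a role in the product account sharing the Infinity service's name.
    (Standard AWS IAM ARN format.)"""
    return f"arn:aws:iam::{product_account_id}:role/{service_name}"

arn = default_role_arn("123456789012", "hello-service")
```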
For services that require network connectivity, such as RDS or ElastiCache, we provide connectivity via AWS Transit Gateways. These are disabled by default to minimise costs and can be enabled by updating a configuration file.
Scaling Infinity Services
Scaling within the cluster is handled by each service's own HorizontalPodAutoscaler, one of the components that make up an Infinity service. Currently, the only scaling mechanism available is CPU-based: whenever the CPU utilisation of an Infinity service crosses the threshold, new Pods are created automatically to handle the additional load. The default threshold is 50%, and developers can set a maximum allowed number of Pods.
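The underlying rule is the standard Kubernetes HorizontalPodAutoscaler formula, desired = ceil(current × currentUtilisation / targetUtilisation), clamped to the configured bounds. A small sketch:

```python
import math

def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float = 50.0, max_replicas: int = 10) -> int:
    """Kubernetes HPA scaling rule:
    desired = ceil(current * currentUtilisation / targetUtilisation),
    clamped between 1 and the configured maximum."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(1, min(desired, max_replicas))

# At 80% average CPU against the 50% default target, 4 Pods scale to 7.
desired_replicas(4, 80.0)
```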
Monitoring Infinity Services
We use the Datadog platform for metrics and tracing, and provide a custom logging solution based on Elasticsearch for log ingestion. On the cluster, metrics are scraped periodically by the Datadog Agent DaemonSet and published directly to Datadog, where builders can use them as needed. Tracing is handled similarly, through the Datadog Agent DaemonSet.
All logs written to the standard output streams are automatically scraped by a Fluentd DaemonSet running on each cluster. The logs are then viewable via a Kibana dashboard that makes searching across services easier.
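Because anything written to stdout is scraped automatically, a service only needs to print one structured line per event. A minimal sketch, with illustrative field names:

```python
import json
from datetime import datetime, timezone

def format_log(level: str, message: str, **fields) -> str:
    """Render one JSON log line; field names here are illustrative,
    not a platform-mandated schema."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,
    }
    return json.dumps(record)

# Printing to stdout is all the service has to do; the DaemonSet does the rest.
line = format_log("info", "request handled", path="/search", status=200, duration_ms=12)
print(line)
```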
We have covered a few topics in greater detail in our Scout24 Engineering blog.
For more information on how we implemented SLIs and SLOs for our Infinity architecture, please read Adopting SLIs and SLOs for internal cloud platform.
For more information on how we upgraded our EKS clusters from 1.15 to 1.22, please read How did we upgrade our EKS clusters from 1.15 to 1.22 without K8s knowledge?
If our compute platform intrigues you, please look at our open job listings here and send in your resume. We are always looking for talented, curious, and passionate individuals to grow our teams. Thanks for reading!