Episode-VII Burst OR Not To Burst!

Fatih Nar
Published in Open 5G HyperCore
8 min read · Jul 18, 2022

Authors: Fatih Nar (Chief Architect at Red Hat) and Brandon Jozsa (Principal Solution Architect at Red Hat).

1.0 Introduction

The term “Cloud Burst” is used in the Telecom, Media and Entertainment (TME) industries to describe an unexpected, unplanned-for peak in demand for the services and products on offer, one that often cannot be accommodated with existing capabilities and capacity.

One of the key reasons TME service providers sign partnership agreements with hyperscalers is to handle “Cloud Burst” without making unnecessary capital investments to cover a temporary increase in consumption of their services portfolio.

Figure-1 Bursting Flow

The key application platforms of TME service providers have stayed on-premises for various reasons, such as:

  • Regulatory compliance (e.g. data locality).
  • A better total cost of ownership (TCO) versus return on investment (ROI) ratio for stable/saturated service consumption, with optimized infrastructure and platform characteristics.
  • End-to-end ownership and administration, from infrastructure to platform and application stacks.

In this article we look for ways to address “Cloud Burst” with hyperscalers while trying not to compromise on the reasons listed above.

2.0 Solution Architecture

The key characteristics of “Burst” are:

  • It can be unpredictable in date/time (please see section 2.4 for burst management).
  • It is temporary/ephemeral in duration.
  • It usually yields a low ROI against the capital expenditures (Capex) and operational expenditures (Opex) needed.
  • The applications to “Burst” must be truly cloud-native, with easy horizontal scalability and easy integration into the customer/consumer traffic flow.

Hence the solution shall offer on-demand horizontal scaling on ephemeral resources, with the fastest time to market and the lowest TCO, including cloud spend and the talent required.

“TL;DR: Bursting shall be implemented with the highest level of automation and the lowest infrastructure cost/investment possible.”

There are two major layers to bursting 5G: the 5G application stack that implements the 3GPP 5G standards, and the application platform that hosts that stack. Both are subject to bursting. Bursting the application platform can be achieved in two different ways:

  • [Option-A] Expanding the size of the existing platform towards hyperscaler infrastructure.
  • [Option-B] Adding ephemeral new cluster(s) on hyperscaler(s) to the overall application platform farm.

Option-A: Although this is technically possible, we do not recommend it, because it increases the size of the failure domain and grows a common attack surface. The main cluster is already under heavy traffic, and adding new worker capacity will not relieve the cluster control plane (it will actually add load to it). In addition, mixing different infrastructure types within one cluster creates non-homogeneous configuration models (i.e. snowflakes) for platform life-cycle management.

Please note that cluster autoscaling is possible and can be recommended as long as the serving infrastructure layer stays consistent. However, it does not cover the use case of bursting from on-premises to a cloud/hyperscaler.

Option-B: Additional ephemeral cluster(s) come at the cost of an additional cluster control plane; however, this shrinks the failure domain and segregates attack surfaces through control plane isolation. Hence we picked this option to build our recommended solution architecture in the coming sections.

Figure-2 Platform Bursting Solution Topology

Main solution components:

  1. [Application Platform] Red Hat OpenShift Container Platform (RH-OCP) as the 5G application platform. For a detailed analysis of the 5G application platform please refer to: Link1 and Link2.
  2. [Platform Management] Red Hat Advanced Cluster Management (RH-ACM) as the base for managing a “Burst-able” 5G application platform, with an internal policy engine to match declared available capacities. For a detailed analysis of 5G observability versus service placement please refer to: Link3 and Link4.
  3. [5G Stack Operations] Red Hat Ansible to automate the configuration management of 5G service components and to update traffic management patterns so that incoming 5G traffic is distributed accordingly.

2.1 Burst The Platform

Let’s delve into the details of platform management, the heart of the platform burst operation, which acts as an on-demand 5G platform (i.e. cluster) dispenser.

RH-ACM cluster pool functionality provides rapid and cost-effective access to configured RH-OCP clusters on-demand and at scale. Cluster pools provide a configurable and scalable number of OCP clusters on Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure that can be claimed when they are needed.

Figure-3 Ready to Consume Clusters with Dispenser Pools on AWS, GCP and Azure.

Cluster pools are especially powerful when providing or replacing cluster environments for development, continuous integration, and production scenarios as well as addressing on-demand capacity increases (i.e. bursting).

Figure-4 Cluster Pool/Dispenser Sizing.

You can specify a number of clusters to keep running so that they are available to be claimed immediately for burst to cloud, while the remainder of the clusters will be kept in a hibernating state so that they can be resumed and claimed in a short period of time (compared to cluster creation).
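As a rough illustration of the sizing idea above, the sketch below builds a cluster pool definition as a plain Python dictionary. The field names (`size`, `runningCount`, `baseDomain`, `imageSetRef`, `platform`) follow our understanding of the Hive `ClusterPool` resource that RH-ACM cluster pools are built on, and should be checked against the version you run. All values (pool name, namespace, region, domain, image set and secret names) are hypothetical placeholders.

```python
import json


def cluster_pool_manifest(name, namespace, size, running_count,
                          region, base_domain, image_set):
    """Build a ClusterPool-style manifest as a plain dict.

    Field names are assumed from the Hive ClusterPool API; all values
    passed in are illustrative placeholders, not real resources.
    """
    if not 0 <= running_count <= size:
        raise ValueError("running_count must be between 0 and size")
    return {
        "apiVersion": "hive.openshift.io/v1",
        "kind": "ClusterPool",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "size": size,                   # total unclaimed clusters kept in the pool
            "runningCount": running_count,  # how many of those stay running
            "baseDomain": base_domain,
            "imageSetRef": {"name": image_set},
            "platform": {
                "aws": {
                    "region": region,
                    # hypothetical secret name holding cloud credentials
                    "credentialsSecretRef": {"name": "aws-creds"},
                }
            },
        },
    }


def hibernating_count(manifest):
    """Clusters kept hibernating = pool size minus running clusters."""
    spec = manifest["spec"]
    return spec["size"] - spec["runningCount"]


if __name__ == "__main__":
    pool = cluster_pool_manifest(
        name="burst-pool-aws", namespace="burst-pools",
        size=4, running_count=1,
        region="us-east-1", base_domain="example.com",
        image_set="example-ocp-image-set",
    )
    print(json.dumps(pool, indent=2))
    print("hibernating:", hibernating_count(pool))
```

With a pool size of 4 and one running cluster, three clusters stay hibernated: one is claimable immediately and the rest can be resumed faster than a fresh install.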

Figure-5 Available vs Hibernated Clusters Ready to be Claimed/Used.

When a cluster claim is requested, the pool assigns a running cluster to it. If no running clusters are available, a hibernating cluster is resumed to provide the cluster or a new cluster is provisioned.
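The claim order described above can be sketched as a small simulation. This is only a toy model of the behavior described in this section, not RH-ACM code; the `Pool` class, its fields and the cluster names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Pool:
    """Toy model of a cluster pool (hypothetical, for illustration only)."""
    running: List[str] = field(default_factory=list)
    hibernating: List[str] = field(default_factory=list)
    provisioned: int = 0  # count of clusters created on demand

    def claim(self) -> Tuple[str, str]:
        """Return (cluster_name, how_it_was_obtained) for a new claim."""
        if self.running:
            # Fastest path: a running cluster is handed over immediately.
            return self.running.pop(0), "assigned-running"
        if self.hibernating:
            # Slower path: a hibernating cluster has to be resumed first.
            return self.hibernating.pop(0), "resumed-hibernating"
        # Slowest path: nothing in the pool, provision a brand-new cluster.
        self.provisioned += 1
        return f"new-cluster-{self.provisioned}", "provisioned-new"


if __name__ == "__main__":
    pool = Pool(running=["pool-a"], hibernating=["pool-b"])
    for _ in range(3):
        print(pool.claim())
```

The three successive claims walk through the three paths in order: running, hibernating, then newly provisioned.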

Figure-6 Available vs Hibernated Cluster Nodes on AWS EC2 Console Look-Out.
Figure-7 Claiming a Cluster.

The cluster pool automatically creates new clusters and resumes hibernating clusters to maintain the specified size and number of available running clusters in the pool.

Figure-8 Claim Cluster Success.

A cluster claim is complete once the assigned cluster is running and ready. The cluster pool then automatically creates new running and hibernated clusters to get back to the size and running count specified for the pool.

Figure-9 Claimed Cluster as a Managed Cluster.

When bursting ends, i.e. traffic levels return to normal and you determine that you no longer need the extra capacity, you can initiate the destruction of the cluster pool. When a cluster pool is destroyed, all of its unclaimed hibernating clusters are destroyed and their resources are released. Please see section 2.4 for burst management.

2.2 Burst The 5G Stack

From the RH-ACM perspective, the 5G Core is an application (with multiple 5G microservices inside), and the application model is based on subscribing to one or more Kubernetes resource repositories (channel resources) that contain the resources deployed on managed clusters. Both single-cluster and multi-cluster (i.e. burst case) applications use the same Kubernetes specifications, but multi-cluster applications involve more automation of the deployment and application management lifecycle.

Figure-10 Application Lifecycle Management (LCM) Model with RH-ACM.

Placement rules define the target clusters where resource templates can be deployed. Use placement rules to facilitate the multi-cluster deployment (i.e. bursting) of your 5G core. Placement rules are also used for governance and risk policies. Please see the following informational resources for details on multi-cloud placement rules: Link5 and Link6.
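Conceptually, a placement rule selects target clusters by matching labels. The sketch below shows that selection idea in plain Python; it is not the RH-ACM API, and the cluster names and labels (`purpose`, `cloud`) are hypothetical.

```python
def select_clusters(clusters, match_labels):
    """Return the names of clusters whose labels contain every
    key/value pair in match_labels (names sorted for stable output)."""
    return sorted(
        name
        for name, labels in clusters.items()
        if all(labels.get(key) == value for key, value in match_labels.items())
    )


if __name__ == "__main__":
    # Hypothetical managed clusters and their labels.
    managed = {
        "onprem-core": {"purpose": "5g-core", "cloud": "onprem"},
        "burst-aws-1": {"purpose": "5g-core", "cloud": "aws", "burst": "true"},
        "dev-sandbox": {"purpose": "dev", "cloud": "aws"},
    }
    # Deploy the 5G core everywhere it is meant to run, including burst clusters.
    print(select_clusters(managed, {"purpose": "5g-core"}))
    # Target only the burst capacity on the hyperscaler.
    print(select_clusters(managed, {"purpose": "5g-core", "burst": "true"}))
```

Under this model, a freshly claimed burst cluster picks up the 5G core deployment as soon as it carries the labels the rule selects on.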

Figure-11 5G Application Stack Deployment Look-Out via RH-ACM.

2.3 Burst The Traffic

Post-placement work for customer/consumer traffic management takes place once the additional platform(s) on the hyperscaler are ready, together with the 5G stack deployed on them: we need to plug the new 5G capacity into the incoming traffic path. This should account for ingress controllers, FQDNs and reachability to microservices in the other cluster. It can be done in various ways, individually or in combination.

  • [Option-X] Leveraging Service Mesh: Using Istio Ingress with Istio virtual services to load balance across multiple 5G deployments on multiple clusters with a federated mesh. Please see Section 3.A in the following work for details: Link5.
  • [Option-Y] Leveraging External DNS for Kubernetes: Adding the newly created 5G deployment on the hyperscaler to the existing DNS record resolution path allows seamless service scaling. Please visit the following link for details on External DNS for K8s: Link6. This approach can also be coupled with geo-proximity information to serve each 5G user/consumer from the closest deployment location, hence it is our favorite option so far. Please visit the following link for details: Link7.
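To make the geo-proximity idea in Option-Y concrete, here is a minimal sketch that picks the deployment closest to a user by great-circle distance. In practice this decision is made by the DNS provider's routing policy, not by application code; the deployment names and coordinates below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def closest_deployment(user_location, deployments):
    """Return the name of the deployment nearest to user_location."""
    return min(deployments,
               key=lambda name: haversine_km(user_location, deployments[name]))


if __name__ == "__main__":
    # Hypothetical 5G deployments: one on-premises, one burst cluster.
    deployments = {
        "onprem-dallas": (32.78, -96.80),
        "burst-aws-virginia": (38.95, -77.45),
    }
    print(closest_deployment((40.71, -74.01), deployments))  # user near New York
    print(closest_deployment((29.76, -95.37), deployments))  # user near Houston
```

Once the burst deployment is added to the record set, users near it are steered there while everyone else keeps resolving to the on-premises deployment.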

Note: More detailed study/work/guidance will be published on “Burst The Traffic” separately.

2.4 Burst Management

Figure-12 Burst Season (Online Shopping) Stats by Google (Ref: Link)

For the best price/performance operational model, we should be deliberate about resource usage over time. So when should you destroy a cluster versus a cluster pool? Here are some approaches:

[1] When to destroy a claimed cluster: a burst instance (e.g. Thanksgiving) has completed and traffic has returned to normal for the time being. However, the burst season is not over yet, so the cluster pool stays up and ready to provide additional cluster(s) when needed.

Figure-13 Destroying Cluster at the end of Burst Period.

[2] When to destroy a cluster pool: the burst season (e.g. the festive season in the US, which spans the last two months of the year) has completed.

Figure-14 Destroying Cluster Pool.
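The two approaches above can be summarized as a small decision helper. This is a sketch under simplifying assumptions: the function name, the action labels, the traffic numbers and the idea of comparing traffic against a single "normal capacity" threshold are ours, for illustration only.

```python
from datetime import date


def teardown_action(today, season_end, current_traffic, normal_capacity):
    """Decide what to tear down, following approaches [1] and [2].

    All parameters are hypothetical inputs: season_end is the last day
    of the burst season, normal_capacity is what on-premises can serve.
    """
    if today > season_end:
        # [2] Burst season is over: the whole pool is no longer needed.
        return "destroy-cluster-pool"
    if current_traffic <= normal_capacity:
        # [1] This burst instance is over, but the season is not:
        # release the claimed cluster and keep the pool ready.
        return "destroy-claimed-cluster"
    # Still bursting: keep the claimed cluster and the pool.
    return "keep-bursting"


if __name__ == "__main__":
    season_end = date(2022, 12, 31)
    print(teardown_action(date(2022, 11, 24), season_end, 180, 100))
    print(teardown_action(date(2022, 11, 28), season_end, 90, 100))
    print(teardown_action(date(2023, 1, 2), season_end, 90, 100))
```

In a real setup the traffic signal would more likely be averaged over a window than read at a single instant, to avoid tearing down capacity during a short dip.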

3.0 Summary & Closure

Alright, here we are; we talked and talked and talked, lol. Better to summarize what we aimed to say in a simple flow diagram:

Figure-15 5G To be (Burst), or not to be (Not to Burst)!

Before we close this article, please remember that there are multiple paths to fixing a problem or addressing a need. In this article we first tried to explain the problem/need and then followed with a solution, including the multiple choices inside it and their pros and cons.

Our solution approach and the choices made inside the solution blueprint may or may not fit the particular technical context and/or business realities of a possible consumer of the solution. However, we are open-minded and ready to listen and adapt (or carve out a better/different solution) based on your needs.
