Episode-VI Distributed Scalable 5G with Observability

Fatih Nar · Open 5G HyperCore · Jun 2, 2022

Authors: Fatih Nar, Chief Architect at Red Hat; Steven Lee, Sr. Principal Software Engineer at Red Hat; Eric Lajoie, Senior Solution Architect at Red Hat.

1.0 Abstract

Telecommunications, Media and Entertainment (TME) solutions (a multiverse of application catalogs working together in harmony, effectively and efficiently) have traditionally been deployed with a mindset of providing high availability (with redundancy to cover possible disaster scenarios), accuracy and low latency (for better user experience and satisfaction), and meeting regulatory requirements. To achieve these goals, TME solutions usually come with their own operational support systems (OSS) and business support systems (BSS). The OSS is usually the part where an all-seeing eye observes the running systems and generates insights for proactive and predictive operations. The BSS is the part where the running systems are tied into business workflows.

Figure-1 5G CNF Interconnect Map

With the recent developments and changes that microservices have brought to the application development and delivery world, we have faced issues with the scalability and complexity of the design, implementation and maintenance of solutions such as 5G core platforms (see Figure-1 above for the mesh of 5G services interacting with each other to deliver 5G connectivity); hence various approaches have been brought to market, driven by different agendas and personas.

In our previous study (article link) we analyzed and offered a working solution for a self-scaling 5G core implementation. In this particular study we collect and analyze requirements for a geo-distributed 5G core solution spanning different infrastructures, including on-premises and multi-cloud; revisit possible solution options for observability and traffic management via different technology stacks; present the pros and cons of each approach; and finally conclude with our experience-driven thoughts and prayers for the future.

2.0 Introduction

In traditional hyperscaler infrastructure, there are regions with separate availability zones, inside which compute, network and storage services are presented together with distributed global network connectivity.

Hyperscalers aim to match their service/product catalog offerings across all of their regions and zones. However, there can be significant variations in availability, accessibility and affordability among them, which may limit tenant workload portability. To avoid such discrepancies, we base our solution blueprint on infrastructure-agnostic platform components such as Red Hat OpenShift, Red Hat Advanced Cluster Management, Red Hat Advanced Cluster Security and Red Hat OpenShift Service Mesh.

Figure-2 5G Core Look-Out on Hyperscaler Infrastructure

Our study scope (GitHub Repo Link) is based on fulfilling the basic needs/requirements of a 5G core deployment drawn from real-world examples, such as having multiple interfaces available to 5G core microservices, enabling service discovery between them, creating performant networking across all services as well as across multiple cluster footprints and geographies, and complying with local telco service provider security requirements.

3.0 Solution Overview

Before we lay our solution foundation, it is important to understand how 3GPP has envisioned a distributed 5G core, especially user plane function placement and selection with respect to where the users/consumers are.

Figure-3 3GPP TS 23.501 SMF discovery and selection
  • 3GPP allows the AMF to select a close-by/near-proximity SMF using UE location information, through service discovery over an NRF query. In this approach we would need to have SMF + UPF as distributed 5G core CNF bundles.
Figure-4 3GPP TS 23.501 UPF Selection by SMF
  • The SMF can be allowed (or disallowed) to control multiple distributed UPF instances, depending on the 5G core vendor's implementation and/or how the solution is configured/deployed in the field. In this approach only the UPF CNF needs to be distributed across different locations.

These are significantly different approaches, and each comes with its own challenges. In our testbed setup we configured our central-site SMF to have access to multiple UPFs (central and remote), and hence only the UPF CNF is distributed across different clusters/locations.

Selection of the right UPF instance can be determined by various attributes received from the gNB/UE by the AMF/SMF: DNN, TAC (the example we used below, Figure-5), cell ID, etc.

Figure-5 Centralized SMF with Multiple UPF Provisioning with TAC Selection
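To make the TAC-driven selection concrete, here is a minimal configuration sketch in the style of an Open5GS smf.yaml; the addresses and TAC values are illustrative, and the exact schema will differ depending on which 5G core implementation is actually deployed.

```yaml
# Illustrative only: an Open5GS-style smf.yaml fragment for DNN/TAC-based UPF selection.
# Addresses and TAC values are made up; the testbed's actual vendor configuration may differ.
upf:
  pfcp:
    - addr: 10.10.0.7        # central-site UPF (N4/PFCP endpoint)
      dnn: internet
      tac: 1                 # sessions from tracking area 1 stay on the central UPF
    - addr: 10.20.0.7        # remote-site UPF reached over the inter-cluster fabric
      dnn: internet
      tac: 2                 # sessions from tracking area 2 break out locally at the remote site
```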

One last stop before delving into solution options: we would like to emphasize the importance of automation, which has been fueling the speed of delivering agile solution deployments across different infrastructures. We have been working heavily on the design, test and deployment of autonomously provisioned 5G core solutions from the bottom (IaaS) to the top (SaaS) layers.

Zero Touch Provisioning (ZTP) leverages the GitOps pattern to deliver a consistent, drift-free (i.e., no snowflakes) software delivery experience. Combining ZTP with a healthy operational control loop that collects, analyzes and triggers the necessary actions on existing deployments prevents vulnerabilities, and also brings the elasticity and performant resource consumption approaches of the cloud era into our deployments.
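As a small illustration of the GitOps pattern behind ZTP, below is a minimal sketch of an Argo CD (OpenShift GitOps) Application that keeps one site's 5G core manifests continuously reconciled from Git; the repository URL, path and namespaces are hypothetical.

```yaml
# Minimal sketch: an Argo CD Application reconciling a site's 5G core manifests from Git.
# The repository URL, path and namespaces are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prod2-5gcore-site
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/example/5gcore-gitops.git   # hypothetical repo
    targetRevision: main
    path: sites/prod2-5gcore
  destination:
    server: https://kubernetes.default.svc
    namespace: 5gcore
  syncPolicy:
    automated:
      prune: true       # remove resources that were deleted from Git (no drift, no snowflakes)
      selfHeal: true    # revert out-of-band changes made directly on the cluster
```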

Figure-6 High Level Solution Blueprint for Distributed 5G.

There are mainly three network fabrics that connect the distributed 5G core deployments: the cluster management fabric (hub to spokes), the inter-cluster connect (Nx spoke to spoke), and the private access to enablers fabric (hub/spokes to enablers-access-gateway / hyperscaler-internal-api).

In this study we will delve into the details of three possible solution approaches:

  1. Service Mesh Peering Driven Interconnect, Traffic Observability & Steering
  2. L2/L3 Inter-Connectivity Driven Traffic Observability & Steering
  3. L7 Mesh Connectivity with/without Agents.

3.A Service Mesh Federation

Federation is a deployment model that lets you share services and workloads between separate meshes managed in distinct administrative domains. Service Mesh federation assumes that each mesh is managed individually and retains its own administrator. The default behavior is that no communication is permitted and no information is shared between meshes. The sharing of information between meshes is on an explicit opt-in basis (aligned with cloud era least privileged operations). Nothing is shared in a federated mesh unless it has been configured for sharing.

Figure-7 Service Mesh Peering Architecture for 5G

You configure the ServiceMeshControlPlane on each service mesh to create ingress and egress gateways specifically for the federation, and to specify the trust domain for the mesh. Federation also involves the creation of additional federation files. The following resources are used to configure the federation between two or more meshes.

  • A ServiceMeshPeer resource declares the federation between a pair of service meshes (a minimal sketch follows this list).
  • An ExportedServiceSet resource (e.g., the UPF exported from 5GCore-Site2 for geo-local break-out for user equipment, UE) declares that one or more services from the mesh are available for use by a peer mesh.
  • An ImportedServiceSet resource (e.g., the UPF imported into 5GCore-Site1 so it can serve gNBs registered from that particular geo-location, for example based on tracking area code, TAC) declares which services exported by a peer mesh will be imported into the mesh (an export/import sketch follows Figure-8 below).
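Below is a minimal ServiceMeshPeer sketch, assuming two meshes named site1-mesh and site2-mesh with control planes in site1-mesh-system and site2-mesh-system; the gateway, service account, address and ConfigMap names follow the OpenShift Service Mesh federation conventions but are placeholders here.

```yaml
# Minimal sketch (all names/addresses are placeholders): on 5GCore-Site1, declare
# the Site2 mesh as a federation peer.
apiVersion: federation.maistra.io/v1
kind: ServiceMeshPeer
metadata:
  name: site2-mesh
  namespace: site1-mesh-system
spec:
  remote:
    addresses:
      - federation-ingress.apps.site2.example.com   # reachable address of the peer's federation ingress gateway
  gateways:
    ingress:
      name: ingress-site2-mesh     # federation ingress gateway defined in this mesh's ServiceMeshControlPlane
    egress:
      name: egress-site2-mesh      # federation egress gateway defined in this mesh's ServiceMeshControlPlane
  security:
    trustDomain: site2-mesh.local
    clientID: site2-mesh.local/ns/site2-mesh-system/sa/egress-site1-mesh-service-account
    certificateChain:
      kind: ConfigMap
      name: site2-mesh-ca-root-cert   # peer mesh root CA, shared out of band
```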

Some details on which k8s services are being exported/imported and why: as you will notice from the 3GPP 5G core architecture diagram, the UPF has three main network interfaces (see Figure-1), each serving a special purpose.

  • N3: This is essentially the interface where the GTP tunnel gets established to carry user plane traffic. In our testbed setup this is a host device (net/tun) mapped via admin privileges to the UPF container.
  • N4: This is the control plane interface of the UPF, where the SMF supervises the UPF for upcoming user plane session establishment. This is the interface that needs to be accessed by the SMF hosted at the 5G core central site; it is exported from the remote site to the central site and included in the central-site SMF configuration.
  • N6: Break-out to the internet over the remote site's internet uplink.
Figure-8 Exported vs Imported 5g-UPF Service for Local Break-Out
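Below is a minimal sketch of the export/import pair for the UPF's N4 (PFCP) service, assuming the UPF k8s Service is named upf-n4 in a 5gcore namespace on the remote site and reusing the placeholder mesh names from the ServiceMeshPeer sketch above.

```yaml
# On 5GCore-Site2 (remote): export the UPF N4 service to the peer mesh.
# Minimal sketch; service, namespace and mesh names are placeholders.
apiVersion: federation.maistra.io/v1
kind: ExportedServiceSet
metadata:
  name: site1-mesh                  # must match the ServiceMeshPeer name of the consuming mesh
  namespace: site2-mesh-system
spec:
  exportRules:
    - type: NameSelector
      nameSelector:
        namespace: 5gcore
        name: upf-n4
---
# On 5GCore-Site1 (central): import the remote UPF N4 service under a local alias,
# which is then referenced as the UPF association in the central-site SMF configuration (Figure-8).
apiVersion: federation.maistra.io/v1
kind: ImportedServiceSet
metadata:
  name: site2-mesh                  # must match the ServiceMeshPeer name of the exporting mesh
  namespace: site1-mesh-system
spec:
  importRules:
    - type: NameSelector
      nameSelector:
        namespace: 5gcore
        name: upf-n4
        alias:
          namespace: 5gcore
          name: upf-remote-n4       # local alias used by the central-site SMF
```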

Service Mesh Federation inherits native Istio virtual service routing capabilities such as traffic mirroring and load balancing, based on various approaches (weights, label/path selectors, etc.). However, this is completely abstracted from how 3GPP envisioned UPF selection, and hence Istio virtual service HTTP routing abilities have no value here (!). Instead, we use the imported service's local alias name as the UPF association in the SMF configuration (Figure-8) for remote-site local break-out.

Figure-9 Service Mesh Graph with Peered Services between 5g Core Clusters/Sites
Figure-10 Central vs Remote 5G User Plane Break-Out

3.B L2/L3 Integrated Clusters with Submariner

Submariner enables direct networking between Pods and Services in different Kubernetes clusters, either on-premises or in the cloud.

Figure-11 Multi-Cluster Interconnect with L2/L3 peering

With this approach, we do not need to export/import any services or deal with extra configuration; we simply tie multiple clusters together and let the SMF reach out directly to the remote-site UPF over either service IP or pod IP addresses. Red Hat Advanced Cluster Management enables Submariner setup in an easy and seamless way, with no manual configuration needed on the clusters.
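For reference, a minimal sketch of enabling the Submariner add-on for a managed cluster from the ACM hub is shown below; the cluster-set name is made up, the namespace follows our testbed's cluster naming, API versions may vary across ACM releases, and the ACM console drives the same resources for you.

```yaml
# Minimal sketch: enabling Submariner for a spoke cluster from the ACM hub.
# Names are placeholders/testbed conventions; API versions may differ per ACM release.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: ManagedClusterSet
metadata:
  name: 5gcore-clusterset
  # each ManagedCluster joins the set via the
  # cluster.open-cluster-management.io/clusterset=<name> label
---
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: submariner                  # the add-on name must be "submariner"
  namespace: prod2-5gcore           # hub namespace named after the managed cluster
spec:
  installNamespace: submariner-operator
```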

Figure-12 Streamlined Submariner Interconnect between 5g Core Sites

Enabling network accessibility across different 5G core sites opens the door for central observability/management solutions for 5G CNFs, two of which (the NetObserv Operator and RH-ACS) will be presented in the coming sections.

Figure-13 Inter-POD Reachability Across Different Clusters/Sites with Submariner

3.C L7 Integrated & Observed Clusters

There are various market-ready OSS/BSS products, as well as software-as-a-service (SaaS) offerings, available for Kubernetes stacks and containerized applications; however, very few of them are able to address the needs of TME application stacks, especially the complex microservices architecture of the 5G core. Usually these OSS/BSS solutions assume that inter-networking is already in place between where the solution is deployed and where we place probes, agents, and metric/log/trace collection.

3.C.A Security Centric/Driven Observability

In this approach, the main focus is usually on network policies and how the current network configurations at the pod, namespace and cluster levels behave with respect to them. Such solutions are usually based on add-on probes/agents installed on cluster nodes and/or delivered via ambassador/sidecar containers.

Figure-14 Red Hat Advanced Cluster Security (RH-ACS) High Level Architecture
Figure-15 One Central Dashboard for All Clusters

We leveraged Red Hat Advanced Cluster Security (RH-ACS) as a platform deployed on the hub cluster that remotely plugs itself into the 5g-core central and 5g-ran remote sites and presents the collected data in a human-understandable and actionable manner.
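For reference, below is a minimal sketch of the SecuredCluster resource that attaches a spoke cluster to the RH-ACS Central instance running on the hub; the endpoint and cluster names are placeholders, and the init-bundle secrets that RH-ACS requires are created separately and omitted here.

```yaml
# Minimal sketch: pointing a spoke cluster's RH-ACS secured-cluster services at Central on the hub.
# Endpoint/cluster names are placeholders; the init-bundle secrets are omitted for brevity.
apiVersion: platform.stackrox.io/v1alpha1
kind: SecuredCluster
metadata:
  name: stackrox-secured-cluster-services
  namespace: stackrox
spec:
  clusterName: prod2-5gcore                                   # how this cluster appears in the central dashboard
  centralEndpoint: central-stackrox.apps.hub.example.com:443  # RH-ACS Central exposed from the hub cluster
```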

Figure-16 Central Network Flow Monitoring for All Clusters & Services

RH-ACS can present network flows for one cluster at a time, with detailed flow breakdowns at the pod and namespace levels.

Figure-17 Network Flow Monitoring Per 5G CNF Breakdown

RH-ACS can evaluate the collected findings against well-defined policies, either out of the box or custom-tailored, with drift analysis and possible remediation suggestions.

Figure-18 Recommendations for Secure 5G with NIST/CISA Guidance

3.C.B Network Fabric Centric/Driven Observability

Red Hat OpenShift Container Platform (OCP) has had monitoring capabilities from the start: you can view monitoring dashboards and manage metrics and alerts. With the OCP 4.10 release, Network Observability was introduced in preview mode. The Network Observability feature provides the ability to export, collect, enrich, and store NetFlow data as a new telemetry data source. There is also a frontend that integrates with the OpenShift web console to visualize these flows and to sort, filter, and export this information.

Figure-19 Sending NetObserv Data from Spoke Clusters to Hub Cluster

In our testbed setup, we created the observability backend (i.e., Loki) only on the ACM hub cluster, and we plugged the spoke clusters (where the 5G core CNFs are distributed over two production sites, prod1-5gcore and prod2-5gcore) into the Loki backend on the ACM hub cluster.
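A minimal sketch of the spoke-side configuration is shown below: a FlowCollector that sends enriched flow data to the Loki instance exposed from the hub cluster. The Loki URL is illustrative, and the FlowCollector spec fields were still evolving while the feature was in preview, so adjust them to the operator version in use.

```yaml
# Minimal sketch: a spoke cluster's FlowCollector writing flows to the hub cluster's Loki.
# The URL is illustrative; spec fields changed between preview and later operator releases.
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowCollector
metadata:
  name: cluster                     # a single FlowCollector per cluster, conventionally named "cluster"
spec:
  namespace: network-observability  # where the flow collection components run on the spoke
  loki:
    url: https://loki-hub.apps.hub.example.com/   # Loki backend hosted on the ACM hub cluster
```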

Figure-20 Single Pane of Network Observability on Hub Cluster with Spokes’ Data Plugged-In

Collecting and storing network data opens up a wide range of possibilities that can aid in troubleshooting networking issues, determining bandwidth usage, planning capacity, validating policies, and identifying anomalies and security issues. This data provides insights into questions such as:

  • How much 5G user/control traffic is flowing between any 5G CNF pods?
  • What percentage of the overall traffic from the gNB is control plane (i.e., SCTP) vs. user plane (GTP)?
  • What are the peak times when the UPF handles the highest amount of user plane traffic?
  • How many bytes per second are coming in and out of a single UPF pod?
  • How much traffic was handled by a particular SMF instance?
  • Is any traffic using insecure protocols, such as HTTP, FTP, or Telnet?

4.0 Summary of Findings

In summary: the 5G core needs to be distributed in order to offer the right scalability relative to where user demand is happening. The way to achieve that is to have multiple 5G core deployments; some may have all CNFs and some only a few (e.g., the UPF for local break-out, as we presented above from our testbed).

When an application domain (i.e., a 5G core solution) spans multiple network boundaries, the key challenge for observability lies in how we connect/plug the observability solution into these leaf/spoke locations.

In this study, we have explored options from Layer 2/3 (Submariner) up to Layer 7 (Federated Mesh), each of which comes with its own pros and cons. Based on our experience from past and present 5G projects, we foresee that the observability solution that is easiest to install, use and scale alongside a 5G core deployment will win the day. The Federated Mesh roadmap item of offering a central Istio control plane may sound like an ideal solution; however, the nature of 3GPP-driven 5G solution architecture implementations so far does not point to the same outcome (e.g., service discovery based on Istio vs. the 5G NRF).

Unless we find a way to merge/marry Istio service discovery and the 5G NRF, it looks like the possible solutions will be based on either:

  1. External agents/probes at the host level with privileged access to network namespaces (and/or the underlying virtual switching).
  2. Pulling data natively from the underlying network fabric constructs (e.g., the NetObserv Operator, Section 3.C.B).

We favor Option 2 for security and performance, as it does not introduce an additional attack surface and avoids the overhead of hosting/maintaining additional compute-intensive applications (i.e., agents/probes).

5.0 Closure & Prayers for Future

We wish the 3GPP 5G architecture working groups and the open source service mesh communities would collaborate to address the complex needs of TME solution stacks in the future (aka Mission Impossible).

We would like to see the NetObserv Operator officially reach general availability with multi-cluster support (which we used in our testbed in this study), together with actionable insight generation against defined network policy drifts, like RH-ACS offers (aka Top Gun Maverick). Also, things like securing the network data pull with authentication and authorization (per namespace) would be nice to have. Lol.

Our short presentation at Open Infrastructure Summit:
