Taming Tactical Cluster Federation at the Edge… with Liqo!

Published in The Liqo Blog · Jun 27, 2023

Authors: Claartje Barkhof (TNO) & Johan van der Geest (TNO)

Introduction

We, from the research institute TNO and the Dutch Ministry of Defense, recently presented at KubeCon + CloudNativeCon about our research on applying federated cloud computing in tactical edge situations. In our proof of concept, we used Liqo and tested its functionality in disadvantaged edge conditions. In this blog we will dive into more details about our experimental set-up and findings.

In a military context, tactical federation can be described as the ability of manned or unmanned vehicles to spontaneously join ad-hoc cloud constellations, creating a resilient, distributed, and collaborative platform. We found Liqo to be an interesting candidate to implement this decentralized and dynamic cloud federation use case. However, whereas cloud computing solutions are often designed for data center environments where resources are plentiful and network conditions stable, in a tactical context this is anything but the case. The network conditions are a major constraint, so it is important that the scheduler takes them into consideration. By making the scheduler network-aware, we explored exciting new applications for Liqo.

We were able to use existing open-source technologies to monitor the network conditions and to influence scheduling in Kubernetes. We used the Optimized Link State Routing protocol (OLSR) to monitor the network latency between cluster gateway nodes. This information is then exposed through a custom metrics API application, the standard Kubernetes way to make custom metrics available on a cluster. Finally, we used Intel’s Telemetry Aware Scheduling (TAS) with a custom policy to decide where to schedule a pod in the federated cluster.
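To illustrate how these pieces fit together, below is a minimal sketch of a TAS policy that steers pods towards nodes with a low network cost. The metric name olsr_network_cost and the threshold are hypothetical placeholders, not the exact policy used in the demo; TAS reads such node-scoped metrics from the custom metrics API.

```yaml
apiVersion: telemetry.intel.com/v1alpha1
kind: TASPolicy
metadata:
  name: network-cost-policy
  namespace: default
spec:
  strategies:
    # Prefer the nodes whose reported network cost is lowest
    scheduleonmetric:
      rules:
      - metricname: olsr_network_cost
        operator: LessThan
    # Never schedule on nodes whose network cost exceeds the threshold
    dontschedule:
      rules:
      - metricname: olsr_network_cost
        operator: GreaterThan
        target: 400
```

Pods opt in by carrying the label telemetry-policy: network-cost-policy. Since Liqo exposes peered clusters as (virtual) nodes, the same mechanism can decide whether a pod stays local or is offloaded to a foreign cluster.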

Scenario of the demo presented at KubeCon Europe 2023: main steps.

Monitoring

To monitor events in the experiment and gain insight into the interaction between components and potential issues in a disadvantaged edge environment, we created a management dashboard. The management dashboard provides a realistic tactical viewpoint by tracking the state of the federated cloud from the perspective of the user’s own cluster. The dashboard comprises three main data visualization components: (1) a network graph representation of the current state of the federated cloud; (2) a plot of the OLSR network cost over time; and (3) a timeline visualization that tracks the state of the federated cloud over time. The graph view summarizes the relations one’s own cluster has with discovered and peered foreign clusters in the federated cloud. Additionally, it annotates relations with OLSR network cost information where this metric applies (e.g., incoming peering relations). The timeline view plots federation state events (e.g., a change in a foreign cluster’s network status) and running workloads over time.

Design overview of the monitoring dashboard, including a network view and timeline view of the cloud federation. Additionally, OLSR network cost is plotted to demonstrate the effect of the TAS scheduler.
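The federation state that feeds the graph and timeline views can be read directly from Liqo’s own resources on the local cluster. A minimal sketch, assuming the ForeignCluster CRD of the Liqo release we used (resource names, columns, and status fields differ per version):

```sh
# Foreign clusters the local cluster has discovered, with their peering phase
kubectl get foreignclusters

# Full status of one peer: authentication, network, and incoming/outgoing peering
# conditions, which the dashboard turns into graph edges and timeline events
kubectl get foreignclusters <cluster-name> -o yaml
```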

Constrained networking experiment

In our experiment, we aimed to validate the effectiveness of our proof of concept (with Kubernetes, Liqo, TAS, and the OLSR custom metrics API as core functional elements) in a tactical federation with constrained network conditions. For this, we defined the various network conditions outlined in the table below.

Characteristics of the different network environments used in the experiments.
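The post does not prescribe a specific emulation tool, but impairments like these can be reproduced with Linux traffic control (tc) and netem on the gateway interfaces. A sketch using figures in the ballpark of our worst-case condition; the interface name and exact values are placeholders:

```sh
# Add latency and packet loss on the gateway interface (example values)
tc qdisc add dev eth0 root handle 1: netem delay 400ms loss 15%

# Attach a token bucket filter below netem to cap the available bandwidth
tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 650kbit burst 32kbit latency 50ms
```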

We examined the various features of Liqo: discovery of new clusters, setting up peering between clusters, offloading workloads to other clusters and unjoining the federation. The results of this experiment are shown in the table below.

Liqo features supported in the different network conditions.
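For context, the features we exercised roughly map to the following liqoctl operations. This is a sketch based on a newer liqoctl; command names and flags have changed across releases, and the exact invocations in our Liqo 0.6.0 setup differed:

```sh
# Peer with a discovered cluster (parameters are obtained from the target cluster)
liqoctl peer out-of-band cluster-b --auth-url <url> --cluster-id <id> --auth-token <token>

# Offload a namespace so its workloads may run on peered clusters
liqoctl offload namespace demo

# Unjoin: tear the peering down again
liqoctl unpeer out-of-band cluster-b
```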

As we were curious which factor (bandwidth, packet loss, or latency) would have the most impact and cause the average and worst test cases to fail, we ran some additional tests. We found that the average and worst test cases do work when the latency is lowered to 200 ms, with the other impairments still in place.

Based on that, we can conclude that somewhere between 200 and 400 ms of latency it becomes troublesome or impossible to peer, offload, and unjoin clusters with Liqo. A packet loss of 15% or a bandwidth of 650 kbit/s is still sufficient for Liqo to function.

Overall, it was an interesting experience to apply Liqo in a disadvantaged edge use case beyond what it was originally designed for, and we look forward to continuing research in this direction. If you would like to discuss or learn more about this subject, please feel free to contact one of the authors.

View the full recording of the KubeCon + CloudNativeCon talk below, including a demonstration of the dashboard.

Additional details

Liqo version = 0.6.0
Kubernetes version = 1.23.5
CNI version = 0.3.1
Calico version = 3.23.1
