Fun things to do with k3s and Rancher

Belting out edge devices with automation for fun and profit

Jun 13, 2020 · 11 min read

I wanted to do something fun to really show off the potential of k3s and Rancher. These days, most teams run a single k8s cluster for production. Most of these clusters have some kind of capacity trigger that scales the cluster up and down based on a metric. Sometimes it’s CPU usage, sometimes it’s the number of requests hitting the front-end load balancer.

In the case I’m presenting here, I’m not really dealing with scale so much as remote nodes running on the edge, which presents its own set of challenges. Let’s say I’m taking advantage of the new hotness like AWS Wavelength or AWS Outposts. We have the ability to shove compute nodes closer to the edge, then run workloads on those compute nodes. Chick-fil-A did something similar to this by pushing k8s clusters down to their POS rigs in each location. Pretty cool.

What I want to focus on here is doing something similar, but I’m going to simulate the environment using an AWS Auto Scaling group (AWS::AutoScaling::AutoScalingGroup). Instead of having actual edge nodes floating around the place, we’re going to use tiny EC2 instances and install k3s on those nodes. We could use EKS, but the point here is to emulate the deployment and management of many small edge nodes.

In the case I’m playing out here, we would do something like deliver a bare metal compute node to an edge location, or provision a compute resource via AWS EC2. In some cases we might even be installing the compute node in a vehicle. We can’t necessarily guarantee that we’ll have a smooth provisioning back plane like EC2 or GCP.

Rancher for management

The first interesting requirement is to have Rancher do all of the management for us. I also want the Rancher agent to connect to the Rancher server automatically, and the EC2 node to automatically provision the monitoring rig that Rancher can push out.

That’s actually pretty easy; I have most of that worked out already.

User workloads

Here’s where it gets really interesting. I want to emulate having a customer base where I can allow my customers to deploy helm charts to a namespace in any of my clusters.

In most cases we would just call this a deployment. You have a new service called serviceX, and you’d just deploy serviceX alongside every other service. In most production environments we might have a manifest of services, for example:

  1. NodeJS service for running the front end

However, in our use case we want to allow our customers to be able to deploy docker containers directly to the edges that we manage. Ideally we’d want to build out a UI that would wrap our Rancher commands, but for the purpose of this demo, we’ll skip that and just shortcut some of the more time-consuming parts.

The idea here is to show off one of the more interesting, and frankly, complicated aspects of some pretty sick automation techniques.

User Story

Let’s dive a little further into the user story. Let’s say we have compute nodes that run in vehicles. These compute nodes gather data about every aspect of the vehicle, including the state of the door locks. In real time, they send telemetry data over either LTE or Wi-Fi to a central service using something like MQTT or NATS.

The data ends up landing in InfluxDB or something similar and then, finally, fronted with Grafana. This allows us to view and aggregate telemetry metrics from a single node, or many nodes.

Another version of this would be something like monitoring the status of a fast food restaurant. Telemetry about the state of the fryers, how hot the oven is, how many burgers are being processed, how many customer orders are happening, their average wait time, etc…

I could also imagine a world of IoT devices doing something similar. The trick for our use case is also enabling a 3rd party to run arbitrary workloads on our rigs based on events that happen with our data.

For example, let’s take the fast food restaurant idea. I want to enable an insurance company to implement a quality control monitor to ensure that the temperature of the oven never drops below a certain threshold.
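
To make that concrete, here is a rough sketch of how small such a monitor could be. The input format (one "timestamp temperature" pair per line), the function name check_oven and the sample readings are all made up for illustration; a real workload would read from whatever telemetry feed the rig exposes.

```shell
#!/bin/sh
# Print an alert for every reading that falls below the minimum temperature.
# Usage: check_oven <min_temp> < readings
# Each input line is assumed to look like "<timestamp> <temperature>".
check_oven() {
  awk -v min="$1" '$2 + 0 < min + 0 { print "ALERT " $1 " oven at " $2 " (min " min ")" }'
}

# Hypothetical readings: two of the three are below the 350 degree floor.
printf '%s\n' '12:00 352' '12:01 341' '12:02 349' | check_oven 350
```

Packaged in a container, something of this size is the kind of workload a 3rd party would want to push to the edge.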

Another use case for this workload deployment system might be allowing a 3rd party insurance company to deploy a “driver quality” monitor. An insurance company would want to monitor things like braking, acceleration, and possibly even object detection events to determine if the driver is driving unsafely.

Most of these things could be done at a higher level with Grafana or some InfluxDB queries. However, that’s not very interesting; anyone can do that with about 5 minutes of work.

The specific thing that I’m targeting here is allowing our 3rd party customers to deploy directly to our edge spaces in a safe, secure way that gives them a Rapid Deployment Framework (RDF). That seems really difficult, and therefore interesting.

Let’s build!

Let’s kick this off by building our rancher server.

Rancher is very easy to set up

I use a tool that I wrote called Shatterdome to fire off my CloudFormation stacks to make this happen. Shatterdome is a fancy way of parsing and processing JSON blobs to create complicated stack definitions. For my workflow it’s much faster to work with than Terraform. However, it’s pure Infrastructure as Code, so it’s not for everyone.

I launch a simple ASG using my CF tools. The magic for getting the rancher server up is:

docker run -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher

Now we can do something very similar in the user data for the edge nodes. This is the rough cut of how we can connect an edge node to the Rancher server:

# Install k3s (K3S_INSTALL_URL stands in for the install script URL, which is missing from this listing)
curl -sfL "${K3S_INSTALL_URL}" | sh -
# Pull down the rancher CLI binary
aws s3 cp s3://krogebry/rancher /usr/bin/rancher
chmod 755 /usr/bin/rancher
# Fetch the Rancher credentials from AWS Secrets Manager
aws secretsmanager --region us-east-1 get-secret-value --secret-id /com/krogebry/rancher/edge/token | jq -r '.SecretString' | jq '.' > /tmp/rancher
# Log in to Rancher
rancher login "$(jq -r '.endpoint' /tmp/rancher)" --token "$(jq -r '.bearer_token' /tmp/rancher)" --skip-verify
# Create a unique ID for this cluster
export EDGE_NODE_ID=$(uuidgen)
# Create the cluster object
rancher clusters create --import "${EDGE_NODE_ID}"
# Connect the k3s host to the rancher cluster, cattle!
# The import output includes a "kubectl apply ..." line, which k3s runs through its bundled kubectl
/usr/local/bin/k3s $(rancher clusters import "${EDGE_NODE_ID}" | grep kubectl | grep -v curl)
# Find the cluster ID
export CLUSTER_ID=$(rancher clusters | grep "${EDGE_NODE_ID}" | awk '{print $2}')
# Finally, enable monitoring (RANCHER_CLUSTERS_URL stands in for the API URL prefix, which is missing from this listing)
curl -X POST -u "$(jq -r '.access_key' /tmp/rancher):$(jq -r '.secret_key' /tmp/rancher)" "${RANCHER_CLUSTERS_URL}${CLUSTER_ID}?action=enableMonitoring" -d "{}"
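
Edge nodes don’t always have a clean network path when they first boot, so steps like the S3 download or the Rancher login can fail on the first try. One option is to wrap them in a small retry helper. This is a sketch, not part of my actual user data; the function name retry and the attempt and delay values are placeholders.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# In the user data this might wrap the flakier steps, for example:
#   retry 5 10 aws s3 cp s3://krogebry/rancher /usr/bin/rancher
# Here it just wraps a command that always succeeds.
retry 3 1 true && echo "ok"
```

I would only wrap steps that are safe to repeat. Retrying the cluster create step blindly could leave duplicate cluster objects behind.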

By the time we get to the end of this ride, we have everything set up to run and manage our edge node! The shell script here is really the magic that makes things happen, so you can plug it into whatever you want. For example, this should work roughly the same in GCP or Azure.

I tested things out with a single node, and once I was ready, I started cranking out more nodes by setting the ASG to 10/10/10 (min/max/desired). I caught a couple of bugs during the spin-up, but now everything is humming along.
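
For reference, pinning an ASG to a fixed size from the command line looks roughly like this. It’s a sketch that assumes the AWS CLI is configured; the group name edge-nodes is a placeholder, not the real name of my group, and the helper defaults to a dry run that only prints the command.

```shell
#!/bin/sh
# Pin an Auto Scaling group's min, max and desired capacity to one value.
# Usage: set_asg_size <asg_name> <size>
# DRY_RUN=1 (the default here) prints the command instead of calling AWS.
set_asg_size() {
  asg_name="$1"
  size="$2"
  set -- aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$asg_name" \
    --min-size "$size" --max-size "$size" --desired-capacity "$size"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

set_asg_size edge-nodes 10   # crank out ten edge clusters
set_asg_size edge-nodes 1    # scale back down to one
```

Run it with DRY_RUN=0 to actually make the call.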

8 shiny new k3s clusters!

The Rancher server is starting to heat up under this load…

Fun stuff!

I decided to close out for the evening, so I set the ASG back to 1/1/1. Now some of the clusters are starting to fail. This is expected, of course, but it’s also nice to observe how things fail.

Ideally, we’d probably want to wire up a WebUI with some fancy tooling to make this experience fun. I’m not really much of a UI guy, so we’re going to skip that for now.

For the time being we’re going to operate off of the Rancher UI and hope for the best.

Our user story can now be played out. Enter a gentleman we’ll call Warboy Wally. Wally has been tasked by his boss to install an override governor on all vehicle rigs in the fleet. Apparently the one-armed bald chick who’s been operating the heavy fuel rig has been acting shady lately, so there’s cause for concern.

Wally’s boss is a real piece of work, so if Wally doesn’t get this done, he won’t ride shiny and chrome into Valhalla ( glory to the V-8 ). Obviously Wally is highly motivated to get this done.

The workload he has in mind will take inputs from the metrics software running on a compute node in each of the vehicles. The software stack on these vehicles has been provided by a vendor and exposes triggers for things like human detection and GPS data, and of course all of the other telemetry for the vehicle (speed, braking, number of exploding spears loaded in the launcher, etc…).

Wally is going to log in to the Rancher UI (which some day will be skinned to better fit his company’s theme) and perform a few actions:

  1. First, he will get his kubectl file from one of the clusters to test things out

Once he’s confident all of this works, he can start automating the process to deploy to his fleet. In the background, I’m going to be working both sides of this. I’ll be playing both the role of Wally and the role of the company making Wally’s workflows possible, to discover what it takes to get this done.
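
The fleet-wide step could be as simple as a loop over clusters. Here’s a sketch assuming Helm 3 and one kube context per edge cluster. The context names edge-01 and edge-02 are made up, while the release name watcher and the namespace citadel are the ones used later in this post. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Install (or upgrade) one chart release on every cluster context passed in.
# Usage: deploy_fleet <release> <chart> <namespace> <context>...
# DRY_RUN=1 (the default here) prints the helm commands instead of running them.
deploy_fleet() {
  release="$1"
  chart="$2"
  namespace="$3"
  shift 3
  for ctx in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "helm upgrade --install $release $chart -n $namespace --kube-context $ctx"
    else
      helm upgrade --install "$release" "$chart" -n "$namespace" --kube-context "$ctx"
    fi
  done
}

deploy_fleet watcher ./ citadel edge-01 edge-02
```

Using upgrade --install rather than install means the loop can be re-run after a partial failure without tripping over releases that already exist.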


I created an account for wally.warboy, so now he can log in:

Wally can log in and see the dashboard of the cluster

We like that Wally can get visibility into the cluster, but what we don’t want is for him to be able to see anything else in the cluster. We want our warboys of the Citadel to be locked down to their namespaces.

Here we go!

Right now, Wally ( and all of the Warboys ) are locked down to a single namespace called citadel. Now Wally can deploy his simple helm chart to the edge cluster in his namespace:

krogebry@cclab8-ht-esx-11:~/tmp/kube/helm/warboy$ helm install watcher ./ -n citadel
NAME: watcher
LAST DEPLOYED: Sat Jun 13 15:24:13 2020
NAMESPACE: citadel
STATUS: deployed
1. Get the application URL by running these commands:
export POD_NAME=$(kubectl get pods --namespace citadel -l "," -o jsonpath="{.items[0]}")
echo "Visit to use your application"
kubectl --namespace citadel port-forward $POD_NAME 8080:80

w00t! It works! Glory to the V-8!

Now Wally can see his active service!

As it turns out, I only needed 1 simple kubectl apply command to create the role binding:

krogebry@cclab8-ht-esx-11:~/tmp/kube$ cat rolebinding.yaml 
# This role binding gives the user with ID "u-8cxtk" the "p-lkc5t-namespaces-edit" role in the "citadel" namespace.
# You need to already have a Role named "p-lkc5t-namespaces-edit" in that namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: manage-citadel
  namespace: citadel
subjects:
# You can specify more than one "subject"
- kind: User
  name: u-8cxtk
  apiGroup: rbac.authorization.k8s.io
roleRef:
  # "roleRef" specifies the binding to a Role / ClusterRole
  kind: Role # this must be Role or ClusterRole
  name: p-lkc5t-namespaces-edit # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io

The takeaway here is that I’d want to automate pushing this binding down to all the other clusters so that this user has the proper permissions everywhere. Doing this manually the first time is fine as long as we understand the scope of automating it later.
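
A first step toward that automation is templating the binding instead of hand-editing it per user. The sketch below is just a shell function wrapped around a heredoc; the function name render_rolebinding is made up, and the apiVersion and apiGroup values are the standard Kubernetes RBAC ones.

```shell
#!/bin/sh
# Render a RoleBinding that ties one user to one Role in one namespace.
# Usage: render_rolebinding <binding_name> <namespace> <user_id> <role_name>
render_rolebinding() {
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: $1
  namespace: $2
subjects:
- kind: User
  name: $3
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: $4
  apiGroup: rbac.authorization.k8s.io
EOF
}

# Same values as the rolebinding.yaml listing above.
render_rolebinding manage-citadel citadel u-8cxtk p-lkc5t-namespaces-edit
```

The output can be piped straight into kubectl apply -f - for a given cluster.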


For now, this is going to be good enough for my demo. Our illustrious Warboy can deploy simple workloads using helm to an isolated namespace on an edge node.

This was a ton of fun to work on and not very difficult to roll out, all things considered. If we were to get creative here, I could imagine changing this use case a little and envision a world where developers are allowed to push workloads down to production machines in order to test things live in production. That’s a hell of an idea. Giving developers a safe, secure way of deploying something to an isolated “edge” in production.

There are a few things I’d call out that we’d probably want to do if we wanted to roll this out “for reals”:

  • Harden up and automate the role bindings deployment. Have some kind of “new user” trigger that would deploy the bindings down to the clusters to ensure people are set.
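
As a starting point for that trigger, the deployment side could be a loop that applies the same manifest to every cluster. This is a sketch, assuming one kube context per edge cluster; the context names are made up, and it defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Apply one manifest to every cluster context passed in.
# Usage: apply_everywhere <manifest> <context>...
# DRY_RUN=1 (the default here) prints the kubectl commands instead of running them.
apply_everywhere() {
  manifest="$1"
  shift
  for ctx in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "kubectl --context $ctx apply -f $manifest"
    else
      kubectl --context "$ctx" apply -f "$manifest"
    fi
  done
}

# In practice the context list might come from "kubectl config get-contexts -o name".
apply_everywhere rolebinding.yaml edge-01 edge-02 edge-03
```

Because kubectl apply is idempotent, the “new user” trigger could simply re-run this across the whole fleet each time.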


There are two things that I really love about using CF. First, the concept of a stack and a UI to manage those stacks:


I just absolutely love this. The way I use CF is to have a version string baked into my stack. This allows me to rapidly develop new features and improvements without having to wait for stacks to come up and down. This is a huge improvement over Terraform, even if you’re using workspaces. This also allows me to delete a stack and know for absolute certain that everything attached to that stack is removed. Again, huge improvement over what TF is capable of doing. At the very least, these two things make rapid development of infrastructure possible.

Finally, the visualization is free. Here’s a screenshot of the redge stack:
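
One way to read the version string trick is as a naming convention, where each iteration launches as its own stack and old ones get deleted whenever it’s convenient. The sketch below assumes that reading. The function name stack_name and the version numbers are made up, and Shatterdome may well do this differently.

```shell
#!/bin/sh
# Build a CloudFormation stack name with the version baked in.
# Stack names may only contain letters, digits and hyphens, so dots become hyphens.
# Usage: stack_name <base> <version>
stack_name() {
  printf '%s-%s\n' "$1" "$(printf '%s' "$2" | tr '.' '-')"
}

stack_name redge 0.3.1   # prints redge-0-3-1
stack_name redge 0.3.2   # prints redge-0-3-2
```

With names like these, a new version can come up with aws cloudformation create-stack while the previous one is still being torn down with aws cloudformation delete-stack.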

Very simple

And here’s the network stack:

Very complicated networking stack

I can build huge, complicated things with very little effort.

And now I can remove everything by deleting the stacks.

Clean up time!

Here’s the specific commit for the new bits with Shatterdome.

The Startup
