Virtual Kubelet Turns 1.0 — Deep Dive

Brian Goff
Microsoft Azure
Jul 9, 2019 · 8 min read

For the last year-ish I’ve been working on Virtual Kubelet, a pretty cool project created by some awesome people (not me). A lot has changed in the last year: several releases, over 70 contributors, and integrations with AWS Fargate, Azure Container Instances, HashiCorp Nomad, Alibaba Containers, and more. We just tagged a 1.0 release, so I figured it was time to talk about what we’ve been working on and how you can use it.

A special shout out to the people helping to make Virtual Kubelet successful: Ria Bhatia, Paulo Pires, Jeremy Rickard, Ben Corrie, Robbie Zhang. There are many others and it would be difficult to list them all here. A great big ❤️ to the whole Virtual Kubelet community.

Getting your toes wet: What is Virtual Kubelet?

Virtual Kubelet is a framework for creating and running a node in Kubernetes much in the same way that the Kubelet does, but without being tied down to machine-level semantics. Virtual Kubelet does not replace the Kubelet, nor does it work with a Kubelet; rather, it allows providers to build services that can act just like a node on your cluster in the same way that the Kubelet allows a machine to. An example of this is how Virtual Kubelet is used with Azure Container Instances (ACI) to allow scheduling pods directly to ACI and obtain instant burstability without the overhead of managing a virtual machine.

How does it work?

The Kubelet is responsible for a few things:
1. Registering the node it’s running on with the Kubernetes API server
2. Monitoring machine metrics such as RAM, CPU, and disk usage and periodically reporting the health and status of those metrics to the API server.
3. Watching the API server and ensuring that pods which were scheduled to that node are running and reporting their runtime status back to the Kubernetes API server
4. Providing access to pods for APIs like kubectl logs and kubectl exec (these requests get forwarded from the Kubernetes API server)

Fundamentally, the Kubelet is a client of the Kubernetes API server, and so is Virtual Kubelet.
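The code examples later in this post drive the cluster through a standard client-go clientset, which they refer to simply as client. Here is a minimal sketch of constructing one; the newClient name and the kubeconfig handling are just illustrative, and in-cluster config (rest.InClusterConfig) works as well:

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// newClient builds the client-go clientset that the examples below call
// `client`. Point kubeconfig at your cluster credentials.
func newClient(kubeconfig string) (*kubernetes.Clientset, error) {
    cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
        return nil, err
    }
    return kubernetes.NewForConfig(cfg)
}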

The shallow end: How can I use it?

In the past we shipped a container image and a Helm chart (still available, but deprecated) that you could run, choosing one of the compiled-in cloud providers. This proved to be a huge maintenance burden, especially managing dependencies for all the clouds. It’s also difficult to build an experience tailored to a particular service when it needs to work with all of the services.

As such, we decided to focus on building a highly reusable framework that providers can import as a library to build their own custom integrations. Splitting these providers out of tree is a recent move, but each provider that was in-tree now has its own repo under https://github.com/virtual-kubelet. Releases will be managed per provider rather than for all providers at once. For now, you can still use the existing image and Helm chart.

Off to the deep end: Library you say?

The core of this is implemented in the node package, so called because it handles all the things related to what a node does in Kubernetes (see the Kubelet responsibilities above).

In this package we have two core types:
- PodController
- NodeController

These contain the core control loops for handling scheduled pods and node registration/status updates. It is up to the user of the library to provide the interesting bits that actually do the work.

NodeController is created from a NodeProvider and a Kubernetes node spec.

The NodeProvider is an interface for interacting with the underlying node logic. We provide a NaiveNodeProvider, which is basically a no-op provider: you end up with a static node that doesn’t change for the life of the NodeController. If you supply your own NodeProvider, you can notify the NodeController of changes to node status or even actual node failures.

type NodeProvider interface {
    // Ping checks if the node is still active.
    // This is intended to be lightweight as it will be called periodically as a
    // heartbeat to keep the node marked as ready in Kubernetes.
    Ping(context.Context) error

    // NotifyNodeStatus is used to asynchronously monitor the node.
    // The passed in callback should be called any time there is a change to the
    // node's status.
    // This will generally trigger a call to the Kubernetes API server to update
    // the status.
    //
    // NotifyNodeStatus should not block callers.
    NotifyNodeStatus(ctx context.Context, cb func(*corev1.Node))
}
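For example, a provider that pushes its own status changes might look roughly like this. This is only a sketch: the myNodeProvider type and its updates channel are illustrative and not part of the library.

import (
    "context"

    corev1 "k8s.io/api/core/v1"
)

// myNodeProvider pushes node status changes to the NodeController.
type myNodeProvider struct {
    // updates is fed by whatever backend you monitor (illustrative detail).
    updates chan *corev1.Node
}

// Ping is called periodically as a heartbeat; returning an error marks the
// node as not ready.
func (p *myNodeProvider) Ping(ctx context.Context) error {
    return nil // e.g. check connectivity to the backing service here
}

// NotifyNodeStatus stores the callback handed to us by the NodeController
// and invokes it whenever the backend reports a change. It must not block.
func (p *myNodeProvider) NotifyNodeStatus(ctx context.Context, cb func(*corev1.Node)) {
    go func() {
        for {
            select {
            case <-ctx.Done():
                return
            case n := <-p.updates:
                cb(n)
            }
        }
    }()
}

Swap something like this in for node.NaiveNodeProvider in the example below to get live status updates.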

You can use it like so:

ctx, cancel := context.WithCancel(ctx)
defer cancel()

nc, err := node.NewNodeController(
    &node.NaiveNodeProvider{},
    &corev1.Node{
        // some node spec
    },
    client.CoreV1().Nodes(),
)
if err != nil {
    return err
}

go func() {
    defer cancel()
    if err := nc.Run(ctx); err != nil && err != context.Canceled {
        // handle error
    }
}()

<-ctx.Done()
return ctx.Err()

This is enough to get a node set up in Kubernetes according to the spec you defined. It makes sure that the node status is periodically updated (the period is configurable) and watches for changes from the node provider.
It also supports the new(-ish) API for node leases, custom error handling on node status update failures, and more.

NodeController manages the lifecycle of the node in Kubernetes, but it does not manage the lifecycle of pods scheduled to the node. For that you need the PodController. PodController watches the Kubernetes API for pods which have been scheduled to our node and makes sure the expected pod state matches reality. PodController does not actually run the pods, but hands that off to a backend implementation called the PodLifecycleHandler. This, like NodeProvider, is something you as a library consumer will implement.

type PodLifecycleHandler interface {
    // CreatePod takes a Kubernetes Pod and deploys it within the provider.
    CreatePod(ctx context.Context, pod *corev1.Pod) error

    // UpdatePod takes a Kubernetes Pod and updates it within the provider.
    UpdatePod(ctx context.Context, pod *corev1.Pod) error

    // DeletePod takes a Kubernetes Pod and deletes it from the provider.
    DeletePod(ctx context.Context, pod *corev1.Pod) error

    // GetPod retrieves a pod by name from the provider (can be cached).
    GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error)

    // GetPodStatus retrieves the status of a pod by name from the provider.
    GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error)

    // GetPods retrieves a list of all pods running on the provider.
    GetPods(context.Context) ([]*corev1.Pod, error)
}

This interface is focused on basic CRUD operations for a pod. There is also an optional interface, shown below, which you should probably implement if you plan to handle more than a small number of pods: it lets the PodLifecycleHandler asynchronously notify the PodController of pod status changes. Those notifications allow the PodController to update the Kubernetes API server so the scheduler can react as needed (such as re-scheduling a crashed pod), or simply so users can see what is happening with the pod.
Errors returned from this interface should satisfy the error interfaces defined in the errdefs package if you want the PodController to understand what went wrong in the case of an error and handle it accordingly.

type PodNotifier interface {
    // NotifyPods instructs the notifier to call the passed in function when
    // the pod status changes.
    //
    // NotifyPods should not block callers.
    NotifyPods(context.Context, func(*corev1.Pod))
}

Without this interface, PodController just periodically asks for the full list of pods (GetPods()) and updates anything which appears to have changed from what is stored in the Kubernetes API server.
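To make the wiring in the next snippet concrete, here is a rough sketch of a provider that implements both PodLifecycleHandler and PodNotifier. The mockProvider type and its in-memory map are purely illustrative; a real provider would talk to its backend. I’m also assuming the NotFoundf constructor from the errdefs package here, so check that package for the exact helpers it exposes.

import (
    "context"
    "sync"

    corev1 "k8s.io/api/core/v1"

    "github.com/virtual-kubelet/virtual-kubelet/errdefs"
)

// mockProvider keeps pods in memory instead of running them anywhere.
type mockProvider struct {
    mu     sync.Mutex
    pods   map[string]*corev1.Pod
    notify func(*corev1.Pod)
}

func newMockProvider() *mockProvider {
    return &mockProvider{pods: make(map[string]*corev1.Pod)}
}

func key(namespace, name string) string { return namespace + "/" + name }

func (p *mockProvider) CreatePod(ctx context.Context, pod *corev1.Pod) error {
    p.mu.Lock()
    defer p.mu.Unlock()
    p.pods[key(pod.Namespace, pod.Name)] = pod
    return nil
}

func (p *mockProvider) UpdatePod(ctx context.Context, pod *corev1.Pod) error {
    return p.CreatePod(ctx, pod)
}

func (p *mockProvider) DeletePod(ctx context.Context, pod *corev1.Pod) error {
    p.mu.Lock()
    defer p.mu.Unlock()
    delete(p.pods, key(pod.Namespace, pod.Name))
    return nil
}

func (p *mockProvider) GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error) {
    p.mu.Lock()
    defer p.mu.Unlock()
    pod, ok := p.pods[key(namespace, name)]
    if !ok {
        // A typed errdefs error lets the PodController tell "not found"
        // apart from other failures.
        return nil, errdefs.NotFoundf("pod %s/%s is not known to the provider", namespace, name)
    }
    return pod, nil
}

func (p *mockProvider) GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error) {
    pod, err := p.GetPod(ctx, namespace, name)
    if err != nil {
        return nil, err
    }
    return &pod.Status, nil
}

func (p *mockProvider) GetPods(ctx context.Context) ([]*corev1.Pod, error) {
    p.mu.Lock()
    defer p.mu.Unlock()
    pods := make([]*corev1.Pod, 0, len(p.pods))
    for _, pod := range p.pods {
        pods = append(pods, pod)
    }
    return pods, nil
}

// NotifyPods stores the PodController's callback; a real provider would call
// p.notify(pod) whenever its backend reports a status change. It must not
// block the caller.
func (p *mockProvider) NotifyPods(ctx context.Context, cb func(*corev1.Pod)) {
    p.notify = cb
}

In the snippet below, myProvider is assumed to be an instance of something like this (e.g. newMockProvider()).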

Putting this together with NodeController:

ctx, cancel := context.WithCancel(ctx)
defer cancel()

nc, err := node.NewNodeController(
    &node.NaiveNodeProvider{},
    &corev1.Node{
        // some node spec
    },
    client.CoreV1().Nodes(),
)
if err != nil {
    return err
}

pc, err := node.NewPodController(node.PodControllerConfig{
    PodClient:         client.CoreV1().Pods(),
    PodInformer:       makePodInformer(client),
    ConfigMapInformer: makeConfigMapInformer(client),
    SecretInformer:    makeSecretInformer(client),
    Provider:          myProvider,
    EventRecorder:     makeEventRecorder(client),
})
if err != nil {
    return err
}

go func() {
    defer cancel()
    if err := pc.Run(ctx); err != nil && err != context.Canceled {
        // handle error
    }
}()

// wait to start the node until the pod controller is ready
select {
case <-pc.Ready():
case <-ctx.Done():
    return ctx.Err()
case <-timeout: // e.g. a time.After channel
    return errors.New("timed out waiting for pod controller to be ready")
}

go func() {
    defer cancel()
    if err := nc.Run(ctx); err != nil && err != context.Canceled {
        // handle error
    }
}()

<-ctx.Done()
return ctx.Err()

At this point you have a fully functioning node that the Kubernetes scheduler will schedule pods to, and your provider will run those pods.
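The snippet above leans on small helper functions (makePodInformer and friends) that are not part of the library. One way to build the informers with client-go’s shared informer factories, rolled into a single hypothetical makeInformers helper so the underlying factories can be started, looks roughly like this; the node name and resync period are placeholders:

import (
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/informers"
    corev1informers "k8s.io/client-go/informers/core/v1"
    "k8s.io/client-go/kubernetes"
)

// makeInformers builds a pod informer scoped to our virtual node plus
// ConfigMap and Secret informers. Call the returned start function (e.g.
// start(ctx.Done())) before running the controllers so the caches sync.
func makeInformers(client kubernetes.Interface, nodeName string) (
    pods corev1informers.PodInformer,
    configMaps corev1informers.ConfigMapInformer,
    secrets corev1informers.SecretInformer,
    start func(stopCh <-chan struct{}),
) {
    // Only watch pods the scheduler has assigned to our node.
    podFactory := informers.NewSharedInformerFactoryWithOptions(client, time.Minute,
        informers.WithTweakListOptions(func(o *metav1.ListOptions) {
            o.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", nodeName).String()
        }),
    )
    // ConfigMaps and Secrets are not filtered by node.
    factory := informers.NewSharedInformerFactory(client, time.Minute)

    start = func(stopCh <-chan struct{}) {
        podFactory.Start(stopCh)
        factory.Start(stopCh)
    }
    return podFactory.Core().V1().Pods(), factory.Core().V1().ConfigMaps(), factory.Core().V1().Secrets(), start
}

The event recorder can be built with client-go’s record package in the usual way.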

There are also helpers to set up an HTTP server for serving the endpoints that the Kubernetes API server expects nodes to implement (these back things like kubectl logs and kubectl exec).

Logging

You can enable logging using the logging interface. By default, logging is disabled (or rather a no-op), but you can enable it by setting the logger to use in the context.Context. We provide a Logger implementation for logrus:

ctx = log.WithLogger(context.Background(), logruslogger.FromLogrus(logrus.NewEntry(logrus.StandardLogger())))

This context should get passed into the Run() methods shown above.
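Once a logger is on the context, your provider code can pull it back out inside any method that receives that context. A small sketch, assuming the log package’s G(ctx) accessor (which falls back to the global log.L mentioned below when nothing is set):

import (
    "context"

    corev1 "k8s.io/api/core/v1"

    "github.com/virtual-kubelet/virtual-kubelet/log"
)

// createPod shows the pattern inside a provider method: fetch the logger
// that WithLogger stored in the context and log with some context fields.
func createPod(ctx context.Context, pod *corev1.Pod) error {
    log.G(ctx).WithField("pod", pod.Name).Debug("creating pod")
    // ... provider-specific work ...
    return nil
}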

Tracing

Tracing is also supported in the same fashion. Logs sent to the logger will also be propagated to the configured trace span.

type Tracer interface {
    // StartSpan starts a new span. The span details are embedded into the
    // returned context.
    StartSpan(context.Context, string) (context.Context, Span)
}

We provide a Tracer implementation for OpenCensus that you can use, or you can make your own implementation.

ctx = trace.WithTracer(ctx, opencensus.Adapter{})

Both logging and tracing can also be set up globally if you prefer, by setting log.L = <your logger> and trace.T = <your tracer>, but these are only used if one is not already set in the provided context.
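Inside a provider method the pattern is to start a span off the incoming context and hand the returned context to any downstream calls. A minimal sketch, assuming the trace package exposes a StartSpan helper that delegates to whichever Tracer is configured:

import (
    "context"

    "github.com/virtual-kubelet/virtual-kubelet/log"
    "github.com/virtual-kubelet/virtual-kubelet/trace"
)

// deletePod sketches span usage inside a provider method.
func deletePod(ctx context.Context, name string) error {
    // StartSpan picks up the Tracer set via WithTracer (or the global
    // trace.T) and embeds the new span into the returned context.
    ctx, span := trace.StartSpan(ctx, "DeletePod")
    defer span.End()

    // Because the span now lives in ctx, this log line is also attached to
    // the span, as described above.
    log.G(ctx).WithField("pod", name).Debug("deleting pod")
    return nil
}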

Metrics

We do not currently have support for metrics, but I suspect these will be added in a similar way.

Prototyping

Getting started can be a daunting task, so we’ve created a framework for building a CLI for rapid prototyping and testing: https://github.com/virtual-kubelet/node-cli/
It is not intended for production use, but is a great way to get started.

What does 1.0 mean?

It means two things:

  1. We will not make breaking changes to the package level API in https://github.com/virtual-kubelet/virtual-kubelet
  2. The packages have seen enough usage and testing to say the implementation works and is stable

It does not mean that there are no limitations or bugs. For example, 1.0 does not support notifications for updates to ConfigMaps or Secrets on a running pod, though we do have this planned.
The downward API is mostly supported, but some things still need to be worked out and require cooperation with the PodLifecycleHandler, e.g. filling the pod IP into environment variables.

Contributing, Support, etc.

Virtual Kubelet is hosted on GitHub; you can file issues and pull requests there. If you’d like to discuss something on Slack, we are in #virtual-kubelet on the Kubernetes Slack.
We also have a mailing list: virtualkubelet-dev@lists.cncf.io.
We have a bi-weekly (err… every two weeks) community meeting on Zoom as well.
