The Kubernetes Node Doctor

Mickey Boxell
Published in Oracle Developers
5 min read · Aug 25, 2021

Introducing Node Doctor

We are happy to announce the availability of a new Oracle Container Engine for Kubernetes (OKE) worker node troubleshooting tool we call Node Doctor. Node Doctor now comes pre-installed on all OKE worker nodes.

Node Doctor focuses on common issues at the intersection of Kubernetes and Oracle Cloud Infrastructure (OCI), the majority of which impact the health of Kubernetes worker nodes. It runs a number of checks to ensure a worker node is operating as intended. For example, Node Doctor can indicate whether the number of pods on a node is high enough to cause issues in the kubelet, the primary node agent running on each worker node, or whether a node is running a known bad version of a dependency, such as runC, and should be recycled.

Node Doctor is intended to address common infrastructure-level issues related to your OKE cluster worker nodes. Users who see issues with their worker nodes, for example ones that manifest as a Node State other than “Active” or a Kubernetes Node Condition other than “Ready”, should use Node Doctor to troubleshoot their nodes. Node Doctor provides insights into the underlying problems so you can get your nodes back online. It can also be used to capture useful data to share with Oracle Support. We built Node Doctor to empower customers to solve issues themselves.
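To decide which nodes to point Node Doctor at, it can help to list the nodes Kubernetes does not consider Ready. The snippet below is a minimal sketch rather than part of Node Doctor: `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```shell
# Hypothetical helper, not part of Node Doctor.
# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# names of nodes whose STATUS column does not start with "Ready".
# A cordoned node ("Ready,SchedulingDisabled") still counts as Ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Any node name this prints is a candidate for a Node Doctor run.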

Node Troubleshooting with Node Doctor

To troubleshoot node issues, navigate to the node pool containing the problematic node and click Troubleshoot Nodes. This opens a dialog with multiple options for accessing nodes and running Node Doctor. We know our users follow different approaches to protecting their nodes in line with their security practices, so we chose to support multiple paths. Users with SSH access to their worker nodes can connect via SSH and run the command themselves. Users without SSH access can make use of an OCI Compute feature that allows users with the correct privileges to run commands on a node even without SSH access. For more information about running commands on an OCI Compute host, see Running Commands on an Instance.
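For the SSH path, the run can be wrapped in a small function. This is a sketch under assumptions: `run_node_doctor`, `SSH_BIN`, and `NODE_USER` are hypothetical names introduced here, and `opc` is assumed as the login user (the default on Oracle Linux images; adjust it if your nodes use a different one). The script path is the one shown later in this post.

```shell
# Hypothetical wrapper: run Node Doctor on one worker node over SSH.
#   $1     node IP address or hostname
#   $2...  flags passed to node-doctor.sh (default: --check)
# SSH_BIN and NODE_USER are illustrative overrides, not Node Doctor settings.
run_node_doctor() {
  host="$1"
  shift
  if [ "$#" -eq 0 ]; then
    set -- --check
  fi
  "${SSH_BIN:-ssh}" "${NODE_USER:-opc}@${host}" "sudo /usr/local/bin/node-doctor.sh $*"
}

# Usage:
#   run_node_doctor 10.0.10.5              # runs --check
#   run_node_doctor 10.0.10.5 --generate   # also builds a support bundle
```

Setting `SSH_BIN=echo` prints the command that would be sent instead of connecting, which is a convenient way to dry-run it.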

Core Functionality

Node Doctor has two functions: checking for node issues and generating a support bundle.

Checking for Node Issues

sudo /usr/local/bin/node-doctor.sh --check

This command first performs a handful of precondition checks to ensure the foundations of the worker node are in place: that the kubelet is active, is running the correct version, and can access the Kubernetes API server. After confirming the preconditions have been met, it checks for a variety of common issues and prints either PASS or FAIL next to each one depending on the outcome. It also saves the output in a log file for future reference. If a check fails, Node Doctor prints remediation steps and links to documentation where applicable. For example, specific networking-related issues, including inactive proxymux certificates or an inaccessible kube-apiserver, return the following output:

Network related failures have been detected. Please validate the network settings. Common mistakes include not using a service gateway, incorrect security list rules, and specifying the wrong subnet.
https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengnetworkconfig.htm
https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengnetworkconfigexample.htm

This is what the Node Doctor --check command looks like in action:

$ sudo /usr/local/bin/node-doctor.sh -c
/usr/local/bin/oke-node-doctor does not exist.
Verified OK
chmod: cannot access ‘oke-node-doctor’: No such file or directory
INFO: Successfully downloaded node doctor.
Running node doctor...
PASS node health...
PASS DNS lookup...
PASS kubelet cert rotation flag...
PASS kubelet logs...
PASS service health...
PASS instance metadata...
PASS image and instance info...
PASS yum status...
PASS flannel status...
PASS coredns status...
PASS proxymux-client status...
PASS kube-proxy status...
PASS pods in ImagePullBackOff...
PASS pods failed mounting volume...
PASS runc version...
PASS pod usage...

NODE DOCTOR REPORT
------------------
16/16 checks passed
0 Signal(s) generated

Node doctor scan is complete. Report has been saved at /var/log/oke-node-doctor/oke-node-doctor-814.log
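When running Node Doctor across many nodes, a compact summary of each run can be easier to scan than the full report. The sketch below is a hypothetical helper (`summarize_checks` is not part of Node Doctor). It assumes FAIL lines use the same `STATUS check name...` layout as the PASS lines in the sample above; the sample shows only PASS lines, so treat that as an assumption.

```shell
# Hypothetical helper: summarize Node Doctor output read from stdin.
# Counts lines starting with PASS or FAIL and lists the failed checks.
# Assumes FAIL lines look like "FAIL <check name>..." (mirrors the PASS lines).
summarize_checks() {
  awk '
    $1 == "PASS" { pass++ }
    $1 == "FAIL" {
      fail++
      sub(/^FAIL[ \t]+/, "")   # drop the status word
      sub(/\.\.\.$/, "")       # drop the trailing dots
      failed = failed "  " $0 "\n"
    }
    END {
      printf "passed=%d failed=%d\n", pass, fail
      printf "%s", failed
    }'
}

# Usage:
#   sudo /usr/local/bin/node-doctor.sh --check | summarize_checks
```

The same function can be pointed at a saved report, for example with `summarize_checks < /var/log/oke-node-doctor/<report>.log`, provided the log uses the same line layout as the console output.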

Generating a Support Bundle

sudo /usr/local/bin/node-doctor.sh --generate

This command performs the same checks as the --check command above and also generates a support bundle, a .tar file containing diagnostic information that can be shared with Oracle Support. My Oracle Support (MOS) will provide information about how to upload the bundle to a support ticket.

This is what the Node Doctor --generate command looks like in action:

$ sudo /usr/local/bin/node-doctor.sh -g
INFO: /usr/local/bin/oke-node-doctor already exists and MD5 match.
Running node doctor...
PASS node health...
PASS DNS lookup...
PASS kubelet cert rotation flag...
PASS kubelet logs...
PASS service health...
PASS instance metadata...
PASS image and instance info...
PASS yum status...
PASS flannel status...
PASS coredns status...
PASS proxymux-client status...
PASS kube-proxy status...
PASS pods in ImagePullBackOff...
PASS pods failed mounting volume...
PASS runc version...
PASS pod usage...

NODE DOCTOR REPORT
------------------
16/16 checks passed
0 Signal(s) generated

Node doctor scan is complete. Report has been saved at /var/log/oke-node-doctor/oke-node-doctor-2127.log
Generating node doctor bundle...
Generated /tmp/oke-support-bundle-2021-07-12T18-01-12.tar
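Before attaching a bundle to a support ticket, you may want to locate the most recent one and look at what it contains. The following is a minimal sketch: `latest_bundle` is a hypothetical helper, and it assumes bundles keep the `oke-support-bundle-<timestamp>.tar` naming and `/tmp` location seen in the sample output, so that sorting by name also sorts by time.

```shell
# Hypothetical helper: print the path of the newest support bundle.
#   $1  directory to search (default: /tmp, as in the sample output)
# Relies on the timestamp embedded in the file name sorting lexicographically.
latest_bundle() {
  ls "${1:-/tmp}"/oke-support-bundle-*.tar 2>/dev/null | sort | tail -n 1
}

# Usage: list the files inside the newest bundle without extracting it.
#   tar -tf "$(latest_bundle)"
```

If no bundle is found, the function prints nothing, so check for an empty result before passing it to `tar`.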

Node Pool Work Requests

We recently revisited the way we expose work requests for node pool and control plane CRUD operations. As part of this change we added detailed information for each request, including log messages, error messages, and associated resources. This provides another source of helpful diagnostics in addition to the information made available by Node Doctor. Work request details can be accessed from the console, SDK, CLI, API, and other surfaces. For example, a user whose cluster fails to create can navigate to the Work Request Details page of the console specific to their cluster and review the logs for details about the failure. For more information, see Viewing Work Requests.

Future Plans

Node Doctor will continue to be enhanced over time to include additional issues, symptoms, and solutions we uncover.

Originally published on blogs.oracle.com.

Mickey Boxell, Product Manager, OCI Container Engine for Kubernetes (OKE)