Memory/Goroutine Leak with Rancher (Kubernetes Custom Controller with client-go)

Yuki Nishiwaki (ukinau) · Apr 12, 2020 · 17 min read

Today, I would like to introduce a troubleshooting case involving a memory/goroutine leak in a Kubernetes custom controller implemented with client-go.

Nowadays there are many Kubernetes custom controller implementations that help with system operation on Kubernetes, since the Kubernetes Operator pattern became popular.

Rancher is also one of those custom controllers implemented on top of client-go. Rancher is software for managing multiple Kubernetes clusters across multiple cloud vendors, including on-premise.

Today's case comes from Rancher, but the root cause was not Rancher-specific logic; it was caused by incorrect usage of client-go.

So let me begin…

Main Target Version of Rancher in This Post

This troubleshooting case is based on the following Rancher version: rancher/rancher:v2.3.2.

Actually, as of rancher/rancher:v2.3.3 this memory leak problem has been solved, thanks to “Fix goroutine leak on failed sync() or before”.

At the end of the troubleshooting I noticed there was already a solution, merged after I had found the root cause…

But unfortunately, even after that fix was merged, I found another goroutine leak in the latest rancher/rancher that is very similar to the one the patch above solved. Because rancher/rancher started using a new controller framework called rancher/wrangler in ClusterManager, a similar goroutine leak has come up again in HEAD as of 10 Apr 2020 (rancher/rancher:v2.4.2).

In this post I focus on explaining the memory leak that happens in rancher/rancher:v2.3.2 and that is solved by “Fix goroutine leak on failed sync() or before”. If you read this post, you should be able to figure out how the memory leak happens in the latest Rancher (rancher/rancher:v2.4.2) as well, since the root cause is quite similar; that is why I could find it so easily as a side benefit :)

I also left a comment on GitHub about that other goroutine leak, which I will not explain in this post. After you read this post, please visit it as well; you should be able to understand what I meant: https://github.com/rancher/rancher/issues/21361#issuecomment-612500020.

Observation

The other day, we got an alert about memory usage and also noticed that our rancher server (aka cattle-server) had been restarted several times on our Kubernetes cluster.

$ kubectl get pod -n cattle-system | grep cattle-server
cattle-server-64584fcb4d-7dgpv 1/1 Running 11 2d
cattle-server-64584fcb4d-95p5t 1/1 Running 2 2d

It was actually killed by the OOM killer due to memory exhaustion on the node; Kubernetes then noticed the container was not responding and restarted it. That is why I observed the rancher-server being restarted several times.

$ cat /var/log/messages | grep 'Out of memory'
kernel: [6065518.465063] Out of memory: Kill process 95719 (rancher) score 1088 or sacrifice child

So how did the rancher server's memory usage grow?

rancher-server’s memory usage

For the last two weeks the rancher-server kept consuming more and more memory, until it finally consumed almost all the memory on the node and was killed by the OOM killer.

At that point the rancher-server had not created or managed any extra Kubernetes clusters, so we had no idea why it consumed so much memory, other than the possibility of a memory leak in the rancher-server itself.

This is why we started to investigate a memory leak.

1. The number of goroutines

When I suspect a memory leak in a program written in Go, the first thing I check is whether the number of goroutines is also increasing. If it is, we can assume the memory leak is caused by a goroutine leak, which makes it much easier to identify where the leak comes from.

rancher-server’s goroutine usage

Indeed, the number of goroutines in the rancher-server also kept increasing, so we could assume there is some logic that periodically spawns goroutines which loop forever.
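As a side note, if the process does not already export a goroutine metric, a minimal way to watch this trend in any Go program (my own sketch, nothing Rancher-specific) is to log runtime.NumGoroutine() periodically:

package main

import (
	"log"
	"runtime"
	"time"
)

// watchGoroutines periodically logs the number of live goroutines, so a
// monotonically growing count (the typical sign of a goroutine leak) is easy to spot.
func watchGoroutines(interval time.Duration) {
	for range time.Tick(interval) {
		log.Printf("goroutines: %d", runtime.NumGoroutine())
	}
}

func main() {
	go watchGoroutines(30 * time.Second)
	select {} // the real program would do its actual work here
}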

2. What kind of goroutines are we running? (pprof)

Fortunately, rancher-server imports pprof (https://github.com/rancher/rancher/blob/master/main.go#L8), so we can analyze it in more detail with pprof.

package main

import (
"context"
"fmt"
"log"
"net/http"
_ "net/http/pprof" // <= import pprof

However, rancher-server does not expose the pprof endpoint on an address reachable from outside the host, so we need to configure port-forwarding and access it via localhost.

$ kubectl port-forward pod/cattle-server-7dd577dbcc-995fm -n cattle-system 6060
Forwarding from 127.0.0.1:6060 -> 6060
Forwarding from [::1]:6060 -> 6060
....
$ go tool pprof -http :3000 localhost:6060/debug/pprof/goroutine
Fetching profile over HTTP from http://localhost:6060/debug/pprof/goroutine

Here is the resulting distribution of goroutines.

We can observe many goroutines running “workqueue.(*delayingType).waitingLoop”.

3. What the heap space looks like (pprof)

Now we know we are running a bunch of goroutines around the “workqueue”, but workqueues are commonly used in custom controller implementations, and rancher-server implements a bunch of controllers (https://github.com/rancher/rancher/tree/master/vendor/github.com/rancher/types/apis/management.cattle.io/v3) inside one binary.

So let us try to find more clues about exactly where the memory is being used in the heap.

$ go tool pprof -http :3000  localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap

It looks like rbac.NewAccessControl, (*roleClient) Controller, and (*clusterRoleBindingClient) Controller use a lot of memory.

So far, what we can assume is that there may be a problem around a controller implementation in rancher-server. So let us collect more data about what is happening around the controllers in rancher-server.

Rancher Internal Metrics about Controller

rancher-server gives us great metrics explaining its internal behavior, so let us check those as well. But before that, you need to know how rancher-server implements its custom controllers, so that you can understand what the metrics actually indicate.

Rancher's Kubernetes custom controllers are implemented on top of a framework developed by Rancher Labs, called rancher/norman.

This framework gives you a rather good developer experience, similar to the Aspect-Oriented Programming paradigm, when you develop a custom controller.

Quoted from lets-unbox-rancher-20-v200

Basically, we initialize one controller per monitored resource type by passing an ObjectClient that watches that specific resource type: PodController is for Pods, DeploymentController is for Deployments, and so on.

After a controller is initialized, we can register “business logic”, called a handler, into it. As a result, one controller holds several different pieces of business logic that need to be triggered when its resource changes. This is why I said it is similar to the Aspect-Oriented Programming paradigm: we can inject business logic after a controller has been created, and one controller will eventually hold multiple different handlers.

Back to the topic of the metrics rancher-server gives us: it exposes the “total count of executions for each handler” and the “total count of failures for each handler”. If the goroutine leak is caused by a failing custom controller, we may be able to find something interesting here.

And here we found a weird value: user-controllers-controller failed over 1000 times more often than the others…

This is the handler registered into ClusterController, which monitors the cluster resource type and is “responsible for running the UserControllers that are in charge of managing Rancher-provisioned Kubernetes clusters”.

Rancher Logs around user-controllers-controller

Usually when a handler fails, there are corresponding error logs.
So let us look for more information.

2020/04/08 11:15:38 [ERROR] ClusterController c-74lwc [user-controllers-controller] failed with : failed to start user controllers for cluster c-vwq2w: failed to contact server: Get https://192.168.0.2:6443/api/v1/namespaces/kube-system?timeout=30s: EOF
2020/04/08 11:15:37 [ERROR] ClusterController c-xc627 [user-controllers-controller] failed with : failed to start user controllers for cluster c-vwq2w: failed to contact server: Get https://192.168.0.2:6443/api/v1/namespaces/kube-system?timeout=30s: EOF
...

As we expected, there are a bunch of error logs indicating “failed to start user controllers” for a specific cluster.

From this we know that rancher-server keeps failing to start the UserControllers because a specific cluster has become unavailable.

Summary of Observation

Here is what we know so far.

  • The number of goroutines keeps growing along with memory usage
  • 99% of the goroutines are executing workqueue.(*delayingType).waitingLoop
  • 81% of the heap space is used around ClusterManager.toRecord, rbac.NewAccessControl…
  • The user-controllers-controller handler failed many times due to failure to connect to a specific cluster
  • rancher-server kept failing to start UserControllers for a specific cluster because it could not reach that Kubernetes cluster

In addition to the above observations, I already had background knowledge of how Rancher generally works, especially around its controller implementation (quoted from lets-unbox-rancher-20-v200).

Basically, all Rancher controller implementations can be classified into different groups; each group is triggered at a different time and monitors a different scope.

  • APIControllers are created/started per rancher-server
  • ManagementControllers are created/started per rancher cluster (you can run multiple rancher-servers with clustering mode)
  • UserControllers (including WorkloadControllers) are created/started per provisioned Kubernetes Cluster

Starting a controller spawns a bunch of goroutines. APIControllers and ManagementControllers are initialized and started when rancher-server starts, so I could not imagine they were related to the goroutine leak: their initialization logic runs only once, at rancher-server startup.

UserControllers, on the other hand, are initialized and started per provisioned cluster, so I could imagine a bug there that forgets or fails to stop goroutines that are no longer needed, because their initialization logic keeps being evaluated even after rancher-server has started.

Hypothesis

I built up the following hypotheses and assumptions from the observations and the background knowledge above, in order to decide the next steps for analyzing the root cause of the goroutine leak.

  • The user-controllers-controller handler, which is one of the APIControllers, is in charge of creating/starting UserControllers, and ClusterManager.toRecord is the representation of a set of UserControllers
  • Somehow the user-controllers-controller handler fails to stop UserControllers properly when they need to stop, leaves behind several goroutines that run forever, and then simply re-creates new controllers.

That is why I first tried to make clear how exactly user-controllers-controller starts/stops UserControllers, and indeed inside that logic there is a scenario that causes the goroutine leak.

Analyzing / Understanding the Observations

Since I suspected the UserControllers' initializing/starting/stopping logic based on the observations and background knowledge, I read through the rancher-server code around it.

This is the whole picture of initializing/starting the UserControllers.

Some parts are intentionally omitted, but the logic shown should all be correct. Let me explain each part.

  • The user-controllers-controller handler is triggered when a Cluster object changes, when a full sync is performed, or when the previous handler execution failed

Full sync and requeue are implemented inside Norman's GenericController, so let me skip those and only introduce the “change detected” part here.

https://github.com/rancher/rancher/blob/v2.3.2/pkg/api/controllers/usercontrollers/usercontroller.go#L34

24 func Register(ctx context.Context, scaledContext *config.ScaledContext, clusterManager *clustermanager.Manager) {
...
33
34 scaledContext.Management.Clusters("").AddHandler(ctx, "user-controllers-controller", u.sync)
35
...
  • ClusterManager is in charge of creating/starting/stopping UserControllers when needed; when user-controllers-controller is evaluated it lists all clusters and starts or stops their UserControllers depending on cluster status via ClusterManager.Start()/Stop()

https://github.com/rancher/rancher/blob/v2.3.2/pkg/api/controllers/usercontrollers/usercontroller.go#L107-L127

80 func (u *userControllersController) sync(key string, cluster *v3.Cluster) (runtime.Object, error) {
....
87 return nil, u.setPeers(nil)
88 }
89
90 func (u *userControllersController) setPeers(peers *tpeermanager.Peers) error {
....
100 if err := u.peersSync(); err != nil {
.....
105 }
....
107 func (u *userControllersController) peersSync() error {
108 clusters, err := u.clusterLister.List("", labels.Everything())
.....
116
117 for _, cluster := range clusters {
118 if cluster.DeletionTimestamp != nil || !v3.ClusterConditionProvisioned.IsTrue(cluster) {
## Stop UserControllers
119 u.manager.Stop(cluster)
120 } else {
## Start UserControllers
121 if err := u.manager.Start(u.ctx, cluster, u.amOwner(u.peers, cluster)); err != nil {
....
128 }
  • ClusterManager holds a collection (a sync.Map) of all the UserControllers, to maintain the state of the UserControllers for each provisioned cluster

https://github.com/rancher/rancher/blob/v2.3.2/pkg/clustermanager/manager.go#L38

33 type Manager struct {
34 httpsPort int
35 ScaledContext *config.ScaledContext
36 clusterLister v3.ClusterLister
37 clusters v3.ClusterInterface
## controllers (sync.Map) keeps the UserContext for each provisioned cluster
38 controllers sync.Map
39 accessControl types.AccessControl
40 dialer dialer.Factory
41 }
  • When user-controllers-controller calls ClusterManager.Start, which is in charge of starting the UserControllers, it tries to run them as follows

https://github.com/rancher/rancher/blob/v2.3.2/pkg/clustermanager/manager.go#L107-L131

74 func (m *Manager) Start(ctx context.Context, cluster *v3.Cluster, clusterOwner bool) error {
...
83 _, err = m.start(ctx, cluster, true, clusterOwner)
84 return err
85 }
...
107 func (m *Manager) start(ctx context.Context, cluster *v3.Cluster, controllers, clusterOwner bool) (*record, error) {
## Check if we already have UserControllers or not
108 obj, ok := m.controllers.Load(cluster.UID)
109 if ok {

## If we already have UserControllers, we additionally check whether the Cluster has changed since the UserControllers were initialized.
110 if !m.changed(obj.(*record), cluster, controllers, clusterOwner) {
## If we detect no change, make sure the UserControllers are running and return to the caller of ClusterManager.Start
111 return obj.(*record), m.startController(obj.(*record), controllers, clusterOwner)
112 }
## If we detect any change, evaluate the cancel function of the context (stop UserControllers) and delete the UserControllers from m.controllers
113 m.Stop(obj.(*record).clusterRec)
114 }
## If we don't have UserControllers yet, we create them
115
116 controller, err := m.toRecord(ctx, cluster)
117 if err != nil {
118 m.markUnavailable(cluster.Name)
119 return nil, err
120 }
121 if controller == nil {
122 return nil, httperror.NewAPIError(httperror.ClusterUnavailable, "cluster not found")
123 }
124
## After creating the UserControllers, put them into m.controllers
125 obj, _ = m.controllers.LoadOrStore(cluster.UID, controller)
## If the cluster is available, run Controller.Start.
126 if err := m.startController(obj.(*record), controllers, clusterOwner); err != nil {
## If startController returns an error, e.g. failing to access the cluster, we mark the cluster as unavailable, which runs Stop (the context cancel) and deletes the UserControllers from m.controllers inside it
127 m.markUnavailable(cluster.Name)
128 return nil, err
129 }
130 return obj.(*record), nil
131 }

Inside startController (line 126) we do not just run UserContext.Start(); we also check API availability before running UserContext.Start(): https://github.com/rancher/rancher/blob/v2.3.2/pkg/clustermanager/manager.go#L133-L155

133 func (m *Manager) startController(r *record, controllers, clusterOwner bool) error {
134 if !controllers {
135 return nil
136 }
## Check if target Cluster's Kubernetes API is responding or not...
140 if _, err := r.cluster.K8sClient.CoreV1().Namespaces().Get("kube-system", v1.GetOptions{}); err != nil && !apierrors.IsNotFound(err) {
## If it is not responding, return the error to start(), where the cancel function of the context is evaluated and the UserControllers are deleted from m.controllers, then return to the caller of ClusterManager.Start()
141 return errors.Wrapf(err, "failed to contact server")
142 }
143
144 r.Lock()
145 defer r.Unlock()
146 if !r.started {
## If responding, evaluate GenericController.Start() via UserContext.Start(); if it succeeds, return to the caller of ClusterManager.Start().
147 if err := m.doStart(r, clusterOwner); err != nil {
## If GenericController.Start() fails, evaluate the cancel function of the context (stop UserControllers), delete the UserControllers from m.controllers, and return to the caller of ClusterManager.Start()
148 m.Stop(r.clusterRec)
149 return err
150 }
151 r.started = true
152 r.owner = clusterOwner
153 }
154 return nil
155 }
Back in start(), toRecord (called at line 116) creates the UserControllers (UserContext) for a cluster:

337 func (m *Manager) toRecord(ctx context.Context, cluster *v3.Cluster) (*record, error) {
....
## Create the UserControllers (UserContext) for the cluster
343 clusterContext, err := config.NewUserContext(m.ScaledContext, *kubeConfig, cluster.Name)
344 if err != nil {
345 return nil, err
346 }
347
## At this phase we have no actual GenericController yet
348 s := &record{
349 cluster: clusterContext,
350 clusterRec: cluster,

## here, inside rbac.NewAccessControl, we generate 4 GenericController
351 accessControl: rbac.NewAccessControl(clusterContext.RBAC),
352 }
353 s.ctx, s.cancel = context.WithCancel(ctx)
354
355 return s, nil
356 }

Inside rbac.NewAccessControl we instantiate GenericControllers (https://github.com/rancher/rancher/blob/v2.3.2/pkg/rbac/access_control.go#L19):

19 func NewAccessControl(rbacClient v1.Interface) *AccessControl {
## Inside NewListPermissionStore, we generate 4 GenericControllers
20 permissionStore := NewListPermissionStore(rbacClient)
21 return &AccessControl{
22 permissionStore: permissionStore,
23 }
24 }

Inside NewListPermissionStore (https://github.com/rancher/rancher/blob/v2.3.2/pkg/rbac/list_permission_store.go#L10-L17):

10 func NewListPermissionStore(client v1.Interface) *ListPermissionStore {
## Inside newIndexes, we generate 4 GenericControllers
11 users, groups := newIndexes(client)
12 return &ListPermissionStore{
13 users: users,
14 groups: groups,
15 }
16
17 }

Here is the actual code that generates a GenericController each for ClusterRoles, Roles, RoleBindings, and ClusterRoleBindings (https://github.com/rancher/rancher/blob/v2.3.2/pkg/rbac/permission_index.go#L16-L56):

16 func newIndexes(client v1.Interface) (user *permissionIndex, group *permissionIndex) {
....
37 user = &permissionIndex{
## Generate ClusterRoles GenericController
38 clusterRoleLister: client.ClusterRoles("").Controller().Lister(),
## Generate Roles GenericController
39 roleLister: client.Roles("").Controller().Lister(),
## Generate ClusterRoleBindings GenericController
40 crbIndexer: client.ClusterRoleBindings("").Controller().Informer().GetIndexer(),
## Generate RoleBindings GenericController
41 rbIndexer: client.RoleBindings("").Controller().Informer().GetIndexer(),
42 roleIndexKey: "rbUser",
43 clusterRoleIndexKey: "crbUser",
44 }
....
55 return
56 }
  • As for ClusterManager.Stop, it is triggered when ClusterManager.Start() fails or when user-controllers-controller detects that a cluster is being deleted or is not in a provisioned state. The logic is pretty simple.

https://github.com/rancher/rancher/blob/v2.3.2/pkg/clustermanager/manager.go#L64-L72

64 func (m *Manager) Stop(cluster *v3.Cluster) {
65 obj, ok := m.controllers.Load(cluster.UID)
66 if !ok {
67 return
68 }
69 logrus.Infof("Stopping cluster agent for %s", obj.(*record).cluster.ClusterName)
## Evaluate the context's cancel function, which lets all functions that received the context run their cancellation logic (Go itself does not guarantee cleanup happens just because you passed a context and call cancel; that is the responsibility of the implementation's author)
70 obj.(*record).cancel()
## Remove the UserControllers for the target cluster from m.controllers
71 m.controllers.Delete(cluster.UID)
72 }
  • After the UserControllers are removed from m.controllers there are no references to them at all, so at that point we need to make sure all goroutines related to those UserControllers have been stopped correctly
  • When UserControllers are removed from m.controllers, they may or may not have been started yet
  • When GenericControllers are instantiated via the UserContext (UserControllers), a workqueue object is also instantiated; and when the NamedDelayingQueue that a GenericController uses as its workqueue is instantiated, a goroutine is started

https://github.com/rancher/rancher/blob/v2.3.2/vendor/github.com/rancher/norman/controller/generic_controller.go#L86-L105


86 func NewGenericController(name string, genericClient Backend) GenericController {
....
....
99
100 return &genericController{
101 informer: informer,
## This NewNamedRateLimitingQueue function internally spawns 1 goroutine, which is only stopped when the ShutDown function is evaluated
102 queue: workqueue.NewNamedRateLimitingQueue(rl, name),
103 name: name,
104 }
105 }

Let's confirm the workqueue.NewNamedRateLimitingQueue implementation
(https://github.com/rancher/rancher/blob/v2.3.2/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go#L44-L49):

44 func NewNamedRateLimitingQueue(rateLimiter RateLimiter, name string) RateLimitingInterface {
45 return &rateLimitingType{
46 DelayingInterface: NewNamedDelayingQueue(name),
47 rateLimiter: rateLimiter,
48 }
49 }

Inside NewNamedDelayingQueue, we spawn a goroutine
(https://github.com/rancher/rancher/blob/v2.3.2/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go#L41-L58)

41 func NewNamedDelayingQueue(name string) DelayingInterface {
42 return newDelayingQueue(clock.RealClock{}, name)
43 }
44
45 func newDelayingQueue(clock clock.Clock, name string) DelayingInterface {
46 ret := &delayingType{
47 Interface: NewNamed(name),
48 clock: clock,
49 heartbeat: clock.NewTicker(maxWait),
50 stopCh: make(chan struct{}),
51 waitingForAddCh: make(chan *waitFor, 1000),
52 metrics: newRetryMetrics(name),
53 }
## This is the goroutine we are actually leaking!!! It is started when NewNamedRateLimitingQueue is instantiated, and we need to call ShutDown if we want to stop it correctly
54
55 go ret.waitingLoop()
56
57 return ret
58 }
  • workqueue.ShutDown, which stops that goroutine, is triggered as a defer inside GenericController.Start; a GenericController runs workqueue.ShutDown only when context.cancel is called after GenericController.Start has been evaluated
  • Even if we call context.cancel(), the workqueue's goroutine will not be stopped if GenericController.Start has never been evaluated, because workqueue.ShutDown is never reached

https://github.com/rancher/rancher/blob/v2.3.2/vendor/github.com/rancher/norman/controller/generic_controller.go#L233-L243

233 func (g *genericController) run(ctx context.Context, threadiness int) {
234 defer utilruntime.HandleCrash()
## We shut down the queue inside the run function, which means that if we never start the controller, this defer is never evaluated
235 defer g.queue.ShutDown()
236
237 for i := 0; i < threadiness; i++ {
238 go wait.Until(g.runWorker, time.Second, ctx.Done())
239 }
240
241 <-ctx.Done()
242 logrus.Infof("Shutting down %s controller", g.name)
243 }
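One defensive pattern (a sketch of my own, not what Norman actually does; it assumes "context" and "k8s.io/client-go/util/workqueue" are imported) is to tie the queue's lifetime to its owning context at creation time, so that cancelling the context is enough to stop the waitingLoop goroutine even if the controller is never started:

// newQueueWithContext is a hypothetical helper: it creates a rate-limiting
// workqueue and guarantees that its background waitingLoop goroutine stops
// when ctx is cancelled, regardless of whether the controller ever starts.
func newQueueWithContext(ctx context.Context, name string) workqueue.RateLimitingInterface {
	q := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), name)
	go func() {
		<-ctx.Done()
		q.ShutDown() // stops waitingLoop and unblocks any pending Get() calls
	}()
	return q
}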

Root Cause

As much as possible, I have tried to explain the pieces you need to know before I explain the root cause of the goroutine leak. The analysis section above only gives you an overview of the UserController starting/stopping implementation, so you need to connect the pieces to see how the goroutine leak can happen in that implementation.

So here is the answer…

  • When we instantiate client-go's workqueue.NamedDelayingQueue, a goroutine is started immediately. That is why you need to make sure you call ShutDown() on the queue before releasing your last reference to it (a standalone sketch demonstrating this follows the list)
  • There is a scenario that releases the reference to a GenericController holding a workqueue object without calling workqueue.ShutDown(), and that is what causes the goroutine leak.
  • If ClusterManager removes UserControllers from m.controllers without ever executing UserController.Start(), the references to the GenericControllers are dropped without their workqueues being shut down, because workqueue.ShutDown() is only reached inside the Start() function as part of its cancellation logic
  • If a cluster is not reachable from rancher-server, ClusterManager evaluates context.cancel() without calling Start and then removes the UserControllers from m.controllers. After that it creates the UserControllers again, fails to access the cluster again, removes them again, over and over…
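To see the leak mechanism in isolation, here is a small standalone sketch (my own example, not Rancher code) that creates rate-limiting workqueues the same way Norman's GenericController did and then drops the references without calling ShutDown(); the goroutine count stays high because every queue's waitingLoop keeps running:

package main

import (
	"fmt"
	"runtime"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	fmt.Println("goroutines before:", runtime.NumGoroutine())

	for i := 0; i < 100; i++ {
		// Each call immediately spawns a waitingLoop goroutine.
		q := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "demo")
		_ = q // reference dropped without q.ShutDown(), like the failing Start() path in ClusterManager
	}

	// All 100 waitingLoop goroutines are still alive, and nothing can stop them
	// any more because we no longer hold a reference to call ShutDown() on.
	fmt.Println("goroutines after:", runtime.NumGoroutine())
}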

After reading this Root Cause section, you can go back to the analysis section; this time you should get a much clearer picture of what is happening.

Reproduce

After reading this post, you may want to reproduce the memory leak yourself, so let me describe how you can do that.

To reproduce the goroutine leak in your rancher-server, you need to create the following situation:

“Keep making ClusterManager call ClusterManager.Stop(cluster) and remove the UserControllers from m.controllers, without ClusterManager.doStart(), which calls UserContext.Start() internally, ever being called.”

So here is one example scenario to reproduce it.

  1. Run multiple rancher-servers in clustering mode (let's say 3)
    (I didn't explain the details, but you actually need to run multiple rancher-servers)
  2. Create as many clusters as possible (the number of clusters affects how fast goroutines leak)
  3. Make the Kubernetes API unreachable on several clusters
  4. Wait for several days
  5. Check the number of goroutines

Actually, each leaked set of UserControllers leaves behind 4 goroutines (one per GenericController created in newIndexes), so it takes a while to observe a large leak; of course it depends on how many clusters you have and how many of them are unreachable.

How did Rancher v2.3.3 fix it?

Initially I didn't plan to have this section, but after writing this post I found out there is already a patch, so let me briefly explain how exactly the patch (“Fix goroutine leak on failed sync() or before”) helps in this situation.

As I mentioned above, the problem is that we never call workqueue.ShutDown() after creating the workqueue object.

ClusterManager throws away the references to those workqueues (created as part of the GenericControllers) without ever calling Start() when the cluster is not responding.

That is why the patch delays the creation of the workqueue object instead of creating it at the same time the GenericController object is created.

After the patch, the workqueue is created when GenericController.Sync() is called. So even if ClusterManager throws away a reference to a GenericController without calling Start(), there is no problem, because the workqueue object has not been created yet.
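A minimal sketch of that “delayed creation” idea (my own simplification, not the actual patch; it assumes "sync" and "k8s.io/client-go/util/workqueue" are imported) is to create the queue lazily on first use:

type genericController struct {
	name      string
	queueOnce sync.Once
	queue     workqueue.RateLimitingInterface
}

// getQueue creates the workqueue (and its waitingLoop goroutine) only the
// first time the controller is actually used, so a controller that is
// constructed but never synced or started never spawns a goroutine that
// would need ShutDown().
func (g *genericController) getQueue() workqueue.RateLimitingInterface {
	g.queueOnce.Do(func() {
		g.queue = workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), g.name)
	})
	return g.queue
}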

Why Rancher v2.4.2 (latest) leaks goroutines again

I commented about this in https://github.com/rancher/rancher/issues/21361#issuecomment-612500020.

Rancher actually stopped using Norman for ClusterManager's RBAC cache and started using another controller framework, called rancher/wrangler.

This framework does not implement the “delayed workqueue creation” that makes sure the workqueue is only created when Controller.Sync() or Start() is called.

That is why Rancher (ClusterManager) once again creates workqueues and throws away their references without calling ShutDown().

If you understood what I explained in this post, you should be able to understand the comment I left on the GitHub issue, and even identify by yourself why Rancher v2.4.2 leaks goroutines around ClusterManager.

This problem can be solved with the same approach as “Fix goroutine leak on failed sync() or before”, but that is not the only solution, and it changes the behavior of an existing interface, so let me consider the best way after applying a workaround.

Takeaways

After writing this post, I realize there are not many takeaways for pure Kubernetes users.

So if you are not a Rancher user but just a Kubernetes user, and you implement logic that dynamically instantiates and starts controllers, you need to be aware that workqueue.NamedDelayingQueue spawns a goroutine right after the object is created. Make sure you call ShutDown() on the workqueue on every code path; otherwise you will hit the same situation Rancher faced.
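As a rule of thumb, every code path that can drop the last reference to a dynamically created workqueue should also call ShutDown(). A hedged sketch (runDynamicController and the start callback are placeholders of mine, not real Rancher or client-go APIs; it assumes "context" and "k8s.io/client-go/util/workqueue" are imported):

// runDynamicController creates a workqueue for a dynamically started
// controller and guarantees ShutDown() on every exit path.
func runDynamicController(ctx context.Context, start func(context.Context, workqueue.RateLimitingInterface) error) error {
	q := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "my-controller")
	defer q.ShutDown() // stops the waitingLoop goroutine even if start() fails before the controller runs

	return start(ctx, q)
}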

If you are a Rancher user, you may face the same goroutine leak whenever your setup satisfies the reproduction conditions, so it is better to upgrade to a future release that includes the fix, although it is not fixed yet as of April 2020. Please keep watching https://github.com/rancher/rancher/issues/21361.
