Go: Avoiding Bare Channels

Lesson learned from Cloud Foundry’s Loggregator

tl;dr

A “bare channel” is a channel that is written to outside a select statement. Such writes can leave goroutines blocked forever once nothing is consuming from the channel.

Loggregator

Back when I first started on Cloud Foundry’s Loggregator (the logging and metrics system for Cloud Foundry), the project used channels just about everywhere. It was a natural fit, as Loggregator was intended to be a “dumb pipe” for logs. However, we kept running into goroutine leaks…

For the gophers out there who haven’t encountered a goroutine leak, consider yourselves lucky. The symptoms were severe: servers completely exhausting their CPU and memory resources, and the system falling into a degraded state. When we dug in, we would find the same thing over and over again:

func (u User) WriteLog(l Log) {
	u.logs <- l
}

Looks innocent enough, right? Wrong. What happens when that user disconnects and stops consuming from logs? As it turns out… writes to logs will start to block. So whatever called WriteLog() will also block... And so on.

This is clearly bad. In fact, how did this ever work? Wouldn’t some simple tests reveal that it is broken? Well, whoever wrote this got the tests passing by invoking WriteLog in a goroutine:

go u.WriteLog(l)

This isn’t some magic fix though, is it? It simply strands that goroutine, which keeps holding a reference to u so the garbage collector can never reclaim it.

We started calling channels like logs “bare channels”: channels that are written to outside of a select statement. Essentially, you want to be able to bail out of a channel write instead of waiting on it forever. Loggregator decided early on that it would rather deliberately drop a few logs than push back on the Cloud Foundry application. Therefore we started changing the code to look more like this:

func (u User) WriteLog(l Log) {
	select {
	case u.logs <- l:
	default:
		// nobody is ready to receive and the buffer is full: drop the log
	}
}

This might look a little unforgiving, but as long as the channel has a large enough buffer, drops should not happen too often. This causes Loggregator to drop logs instead of getting jammed up and eating away at system resources. We later decided we wanted to drop the older messages (instead of the newer ones) and started using something we called a Diode, but I won’t go into that here.

In some cases dropping seems… well wrong. So then should the operation just be left to hang? Of course not! This is where a context shines:

func (u User) WriteLog(ctx context.Context, l Log) {
	select {
	case u.logs <- l:
	case <-ctx.Done():
		// cancelled
	}
}

This will wait until either a consumer receives the log from logs or the context is cancelled.

Conclusion

Be aware that a channel’s concerns extend beyond the code that directly interacts with it. If the consuming goroutine dies or slows down (e.g., channels used with network I/O), then the goroutines writing to that channel may be affected in less than desirable ways.

Go offers mechanisms to deal with this (e.g., select), but developers often don't think to reach for them.

Is this real code?

Not really… The listed examples are heavily simplified to convey the idea concisely. If this problem excites you or you want to see what the code really looked like, take a look at Loggregator’s repo!