I don’t often write “in the trenches” stories about production issues, although I enjoy reading them. One of my favorite sources for that kind of tale is rachelbythebay. I’ve been inspired by her writing in the past, and I’m always on the lookout for opportunities to write more posts like hers. As it happens we experienced an unusual incident of our own recently. The symptoms and sequence of events involved make an interesting story and, as is often the case, contain a couple of valuable lessons. Not fun to work through at the time, but perhaps fun to replay here.
Some background: where I work we run a real-time chat service that provides an important communications tool to tens of thousands of businesses worldwide. Reliability is critical, and to ensure it we have invested a lot of engineering time in monitoring and logging. For alerting, our general philosophy is to notify on root causes and to page the on-call engineer for customer-facing symptoms. The notifications we see in email or Slack channels often let us get ahead of a developing problem before it escalates to a page, and since pages mean our users are affected, this is a good thing.
On a recent Tuesday morning we began to see notifications in our monitoring channel for RabbitMQ queue backups. We use RabbitMQ extensively to distribute work among our services, and queue depth is one of our leading indicators of service health and performance. Two queues were backing up: one that handles presence events to operator groups, and one that handles chat messages from visitors to operators. A slowdown in these queues meant that chat messages were not being delivered to users promptly, and if it continued the alert would escalate. We immediately started investigating. The senior engineer responsible for our core chat architecture quickly noticed that a set of pods running consumers for these queues was crashlooping. All 40 of them. We called a downtime and hopped on Zoom.
Experienced Kubernetes users will know that “crashlooping” means our containers were failing, and when the kubelet attempted to restart them they were failing again. If the pattern continued the system would back off from restarting them, on the often reliable assumption that things were not going to improve by themselves. The question in a case like this is “what’s killing the containers?” This particular service had last been deployed over two weeks prior, so it wasn’t one of those straightforward errors that sometimes manifests right after a change. Accessing the pod logs with kubectl logs was not very helpful: the containers came up, worked fine for a moment or two, and then died without complaint.
We next checked events on the pods with kubectl describe, which revealed “Backoff restarting failed container.” That message, while not explanatory, did give us a hint: if a container process isn’t exiting on its own with an error, then something must be terminating it forcibly. One reason this can happen is that the pod is not responding to liveness probes from the kubelet. Those failures are recorded as events, however, and the absence of any told us that it wasn’t failing probes that killed these pods. That left basically one other alternative, which was that they were getting “OOM-killed”. Since the workload had been in production for over two weeks we were initially skeptical, but checking the kernel messages on a node with failing pods confirmed the theory:
[ +0.008842] Memory cgroup out of memory: Kill process 3187861 (python) score 12 or sacrifice child
[ +0.009387] Killed process 3187861 (python) total-vm:3418276kB, anon-rss:1563932kB, file-rss:45584kB, shmem-rss:0kB
[ +0.176620] oom_reaper: reaped process 3187861 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The pods were running with a cgroup memory limit of 1.5 gibibytes (Gi), and as you can see (in anon-rss) that’s right about where this one was when it got stomped. The puzzling thing for us was that the pods normally didn’t use anything like that much RAM. In fact the memory request for these pods was only 500 mebibytes (Mi), and metrics told us they normally used only about 300 Mi of that. The fact that the requests and limits were so far out of line with each other caught our eye, but we considered it something to review later, not a contributor to the current problem.
So something was causing the pods to allocate a lot more memory than usual. Could we get things stabilized by raising the limit? The node pool that runs our general workload population autoscales, so we could be confident of having enough resources, although the deploy would take longer as nodes were provisioned. We tried a limit of 3 Gi, and then 4, and in both cases the pods immediately consumed it all and died. We tried investigating the memory usage in our metrics, but they were a mess. This wasn’t totally surprising, and it points out a limitation of scraped metrics for real-time debugging: the pods were simply dying too quickly for Prometheus to gather enough samples given a 60s scrape interval. It was like being locked in a dark room where you’re only allowed to flick your flashlight on for one second once per minute.
At this point a theory percolated up that we didn’t initially give much weight: a specific message in rabbit was causing the problem, and it was getting redelivered every time the pods failed and restarted. This seemed improbable because all of the message formats had been stable for years, and as for the code... well, I should mention at this point that when we had last deployed this thing two weeks before, it had been a pretty big change. The code that consumes these messages had been migrated from a deprecated Python 2 service to an existing Python 3 service. It was the same code, but obviously not exactly. You would expect that to set off all sorts of alarms for us, and it did. We definitely suspected the new version was at fault, but since it had been running for two weeks at production volume, that itself was still a puzzle.
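If you don’t live with RabbitMQ every day, the redelivery loop is worth spelling out: a typical consumer acknowledges a message only after it has finished processing it, so a consumer that dies mid-message never acks, and RabbitMQ hands the same message to the next consumer that comes up. Here is a minimal sketch of that pattern using the pika client; the queue name, connection details, and processing step are illustrative, not our actual code:
import pika

def process_chat_message(body):
    # placeholder for the real work (parsing, fan-out, logging, ...)
    print("processing", body)

def handle(ch, method, properties, body):
    process_chat_message(body)
    # The ack is only sent if processing succeeds. If the process dies first,
    # the unacked message is requeued and delivered to the next consumer,
    # which is how one poison message can chase an entire fleet of pods.
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="visitor_messages", durable=True)
channel.basic_consume(queue="visitor_messages", on_message_callback=handle)
channel.start_consuming()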
Working through these steps took us around thirty minutes, during which we kept the service usable through a lot of manual pod herding. Since we had failed to come up with a cause, the next thing to do was revert the last change. The previous version of this service was still deployable and could happily run alongside the new version, so we hit the button… and an unexpected thing happened: not only did the old version of the service come up and run, it immediately stabilized the new version. All of the crashlooping pods became ready and their memory utilization leveled out at the expected 280–300 Mi. After monitoring for a while we scaled the old service down to zero replicas and removed it, and the new one kept chugging along as if nothing had ever been wrong.
The usual “Fixed!” jokes circulated in Slack, but dark humor couldn’t mask the fact that we had a bomb that might go off again at any time. Since things had stabilized we cleared our downtime and released the response team, but until we had a root cause the issue would not be considered closed. Fortunately we have good logging and metrics coverage across our systems, so we began digging through the event data from the affected service. It wasn’t long before we discovered that a specific rabbit message had been reprocessed over 1,000 times during the event. An engineer fed that message to the service in staging and it immediately detonated. He spun the service up in a local Docker container and the message crashed it there too. So… uh, about that unlikely theory we had discarded earlier.
The message in question contained the text of a chat line sent from an operator to a visitor, and embedded in that chat line was an Amazon product link with something like the following string in it:
some_text%2455636754%more_text
Anyone reading who happens to be an old-school Pythonista may at this point be having an “aha!” moment. I am not really an old-school Pythonista, and in Python 2 I have mostly been exposed to string formatting that looks like this:
>>> "hello {}".format('world')
'hello world'
In Python 3 I like to use so-called “f-strings”:
>>> sub = "world"
>>> f"hello {sub}"
'hello world'
But before either of those things became a thing there was “% formatting”:
>>> "hello %s" % 'world'
'hello world'
The basic idea here is that %s is a placeholder for a string, and the variable or tuple after the % operator provides the value(s). There are other types as well:
>>> "Over %d" % 9000
'Over 9000'
There is also optional padding:
>>> "Over %10d" % 9000
'Over       9000'
This style has been around a long time, and the grey-haired among you may recognize some inheritance from C’s printf formatting. The official Python documentation recommends against using % formatting because it “exhibits a variety of quirks.” This seems like a true statement. One interesting quirk has been present in various versions of the language since Python 2 days:
$ docker run -it python:2
Python 2.7.18 (default, Apr 20 2020, 19:27:10)
[GCC 8.3.0] on linux2
>>> "Over %10%" % ()
'Over          %'
That is, as far as I know, not a properly constructed format string. In current versions of Python 3 it produces an error:
$ docker run -it python:3
Python 3.8.2 (default, Feb 26 2020, 14:58:38)
[GCC 8.3.0] on linux
>>> "Over %10%" % ()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
However, for this specific service we run Python 3 on the PyPy runtime, and the incorrect behavior was present there through version 7.3.0:
$ docker run -it pypy:3.6-7.3.0
Python 3.6.9 (1608da62bfc7, Dec 23 2019, 10:50:04)
[PyPy 7.3.0 with GCC 7.3.1 20180303 (Red Hat 7.3.1-5)] on linux
>>>> "Over %10%" % ()
'Over          %'
It was “fixed” in version 7.3.1:
$ docker run -it pypy:3.6-7.3.1
Python 3.6.9 (2ad108f17bdb, Apr 07 2020, 02:59:05)
[PyPy 7.3.1 with GCC 7.3.1 20180303 (Red Hat 7.3.1-5)] on linux
>>>> "Over %10%" % ()
'Over %'
Still not the same behavior as current Python 3 on CPython, but at least it doesn’t interpret the width as padding for a literal “%”. So back to the text of that Amazon link:
some_text%2455636754%more_text
The Python logger’s message format uses % formatting, and if this string ends up as the format string in a % expression with an empty argument tuple, and you’re on one of the affected Python versions… well, congrats, you just asked for roughly 2.4 billion spaces of padding, and that is how one chat message can burn down production.
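To make the mechanism concrete, here is a minimal sketch of the failure and of the safer pattern. The chat text is the string from the link above; the logger name and calls are illustrative, not our actual code:
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chat")

chat_text = "some_text%2455636754%more_text"

# Dangerous: the untrusted text is used as the format string with an empty
# argument tuple. On an affected runtime (e.g. pypy:3.6-7.3.0) the run
# "%2455636754%" is read as a literal '%' padded to ~2.4 billion characters,
# allocating gigabytes of memory. Current CPython 3 raises TypeError instead.
try:
    formatted = chat_text % ()
except TypeError as exc:
    log.info("CPython rejected the format string: %s", exc)

# Safe: keep the format string constant and pass untrusted text as an
# argument, so its contents are never parsed for % directives.
log.info("chat line received: %s", chat_text)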
The underlying lesson, of course, is: don’t run user input through a % format string. In this case we didn’t quite realize we were doing so until it blew up while trying to log the contents of this one chat message. Why didn’t that string choke the old version of the service? The short answer is that it used an old in-house extension to the Python logger that handled bags of args to be formatted into a log message using a custom dictionary, so these were not run through the logger’s message formatting. When we migrated to the Python 3 version we cleaned that logger up and made it more idiomatic, and in doing so made ourselves vulnerable to this error.
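For the curious, the old pattern looked something like the following. This is a hypothetical reconstruction, not the real extension; the point is that arbitrary fields ride along on the log record instead of being interpolated into the message, so their contents never touch the %-formatting machinery:
import logging

logging.basicConfig(level=logging.INFO)

class FieldsAdapter(logging.LoggerAdapter):
    # Hypothetical stand-in for the old in-house extension: a dict of fields is
    # attached to the record via `extra`, where a custom formatter or handler
    # can emit it, rather than being %-formatted into the message itself.
    def process(self, msg, kwargs):
        fields = kwargs.pop("fields", {})
        kwargs.setdefault("extra", {})["fields"] = fields
        return msg, kwargs

log = FieldsAdapter(logging.getLogger("chat"), {})
# The constant message string is all the logger ever %-formats; the chat text
# stays inert inside the record's `fields` attribute.
log.info("visitor message received", fields={"text": "some_text%2455636754%more_text"})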
Another lesson, perhaps, is to trust what the system is telling you. One of the advantages of Kubernetes is that it makes it easy to set cgroup constraints for workloads and to get clear signals when they are violated. We spent some time in this incident disbelieving the system, because the workload had been running for a good amount of time and the idea of it suddenly deciding to consume multiple gigabytes of RAM per pod just didn’t fit our mental model of what might be happening. To wrap up I want to acknowledge my friend and colleague Nick MacInnis, who did the detailed log analysis that revealed the source of the problem and chased it down to our specific PyPy version. Nick gave a presentation to San Diego Python on August 27 that goes into some more detail on the error.