The first solution isn't always best

Rudy Winnacker
Operations Engineering
Feb 22, 2013

I was digging through some logs, mostly to get used to reading them. We logged latency metrics per minute, which is why nobody had noticed the issue, but on this occasion I happened to bucket a sample of latencies per second. At that resolution it was easy to see that response times were spiking during the first two or three seconds of each minute.
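To illustrate why the finer bucketing mattered, here is a minimal sketch in Python. The sample format (pairs of epoch seconds and latency in milliseconds), the function name, and the data are all assumptions made up for this example; they are not the actual logs or tooling from this incident.

```python
from collections import defaultdict
from statistics import mean


def latency_by_second_of_minute(samples):
    """Group (epoch_seconds, latency_ms) pairs by second within the minute.

    Returns a dict mapping each second (0-59) to the mean latency seen
    in that second, across all minutes in the sample.
    """
    buckets = defaultdict(list)
    for epoch_seconds, latency_ms in samples:
        buckets[int(epoch_seconds) % 60].append(latency_ms)
    return {second: mean(values) for second, values in sorted(buckets.items())}


# Synthetic data (hypothetical): three minutes of one request per second,
# slow only during the first three seconds of every minute.
samples = [(t, 250.0 if t % 60 < 3 else 20.0) for t in range(180)]

# A single average over the whole sample smooths the spikes away.
overall_avg = mean(latency for _, latency in samples)

# Bucketing by second of the minute makes them stand out.
profile = latency_by_second_of_minute(samples)
slow_seconds = [second for second, avg in profile.items() if avg > 100.0]

print(overall_avg)   # 31.5
print(slow_seconds)  # [0, 1, 2]
```

In this made-up data the overall average looks unremarkable, while the per-second view points straight at the top of the minute.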

This was bad, and the developers agreed, as did the administrators of the service. But it was not clear whose problem it was. Since it was happening regularly, this looked like a system-level issue.

When you see this kind of on-the-second pattern, the first thing to look into is your system scheduler process, crond. It is the canonical way to run jobs on a regular schedule, and because its finest granularity is one minute, its jobs start at the top of the minute.

Sure enough, when I turned the scheduler off, the latency spikes stopped. When I turned it back on, the three-second latency spikes returned, no matter how far I pared down its jobs. The conclusion I heard after sharing this empirical data with engineering was that we needed to turn off the scheduler.

Older systems didn't show this problem with the same jobs. Still, I was encouraged to turn the scheduler off since we wouldn't be using older systems much longer, and apparently on newer systems it was the problem.

This didn't feel right. The crond process is a venerable tool that has been in use for decades and it was hard to believe we would be among the first engineering organizations to find a case in which it needed to be disabled. I tried to understand better what was going on.

I found that system-level security auditing had been turned on in the newer system image. Whenever the scheduler ran a job, the operations that auditing required caused the service to hang until the audits had completed.

This kind of security isn't particularly useful for a high-availability service that is secured in other ways. So it was turned off, and we were able to keep running scheduled jobs under crond without latency spikes.

There seemed to be consensus around turning off the scheduler mainly because that was the first experiment that corrected the latency issue. The crond service is invaluable, and its absence would have created a lot of work trying to replace its functionality. It is true that, in a sense, it caused the problem, but only because of another unnecessary change that triggered this behavior.

The first solution isn't always the best. If it feels wrong, it probably is.

Operations engineer, formerly with: Twitter, Google, Blogger.