Trust the Batch Process, Part 2: Kvetching about Vetch

What happens when a server goes bad

Graham Doskoch
5 min read · Jul 23, 2019

Last Monday, I wrote about batch processing, which is how we at Nevis Labs split up large amounts of data processing between a whole bunch of machines. I might have come off as slightly . . . grumpy about the whole thing, largely because our batch management system, Condor, can be a bit hard to debug. On the whole, though, it works well. But as I found out shortly after writing that post, it had been causing a major problem.

When my group submits jobs to Condor, each job is assigned to one of three of our servers: Ged, Serret, or Vetch (our fourth server, Tehanu, is used for other things). It turned out that about one out of every three jobs we submitted was staying “idle”: not running, but not being rejected (“held”) either. Digging into Condor told us that this was happening for every job assigned to Vetch, and that those jobs were idling because Condor had decided their system requirements weren’t being met.
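If you’re curious what the digging looked like, Condor can be made to explain itself. Something along these lines is the usual first step (the job ID below is a made-up example, not one of ours):

```bash
# Ask Condor why a particular job is still idle. The job ID is a
# made-up example; -better-analyze reports how many machines in the
# pool matched (or failed to match) the job's requirements.
condor_q -better-analyze 1234.0

# Survey every slot in the pool and its state. Vetch's slots were
# the ones that never turned up as matchable.
condor_status
```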

Basically, Vetch was taking a long lunch break and was refusing to clock back in at the end of it.

Image credit: Clipart Library

That’s not great. Between the four of us, we submit anywhere from 50 to 200 jobs per day, at a guess, and having every Vetch job sit idle is terrible. We didn’t know what was going on, although we came up with some hypotheses (maybe Vetch’s disk is full, with 3 TB taken up by some super old simulations we don’t need to keep around). Over the weekend, I came up with a stopgap measure: I just hardcoded Vetch out of our submission script, which meant editing literally 4 lines out of 1109. But that’s inelegant, and it still means the script can’t access data files stored on Vetch, so in theory about one third of the data we use each time around has to be redownloaded onto the other two machines, Ged and Serret. That takes time (unless you transfer files manually in advance). None of it scales.
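For the morbidly curious, the stopgap amounted to something like the sketch below. It’s heavily simplified, with invented variable names; the real script runs to 1109 lines:

```bash
#!/bin/bash
# Heavily simplified sketch of the stopgap; variable names are invented.
#SERVERS=(ged serret vetch)   # the usual trio of batch servers
SERVERS=(ged serret)          # Vetch hardcoded out until it's fixed

for server in "${SERVERS[@]}"; do
    # Stand-in for the real condor_submit call, which constrains
    # each job to run on (and read data from) a particular server.
    echo "Submitting this batch's jobs to ${server}"
done
```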

So yesterday morning, we had an impromptu meeting with our sysadmin, Bill, to puzzle it out, and Bill put together a new hypothesis, which is mundane but makes a lot of sense. Over the past month, the research building has been undergoing roof repairs (ooh, fun, asbestos removal!). The air conditioning had to be turned off, and backup units were installed, but it turns out they couldn’t handle the load, and the VERITAS cluster crashed. Around that time, the cluster was also due for a Condor-related update, and that update seems never to have taken place. The whole thing happened just before our Vetch jobs started idling. Coincidence? Maybe not.

The upshot of all of this is that Bill might try to restart Vetch within the next day or so and maybe redo some installations. If it works, well, that’s great. Then I can change those four lines in the script back, and all of a sudden the cluster will be less overloaded when we throw a heck-ton of jobs at it, since the script will have three servers to work with instead of two. Fingers crossed.

This whole thing had been a week-long headache, and by the end of yesterday, I really needed some good news. I got it. A side issue that had been affecting me personally for a couple of weeks was an apparent inability to insert correction factors into our data analysis pipeline, VEGAS. I would edit the configuration file for a set of runs to add in those factors, then send it off through VEGAS. Out the other end popped another config file, which I ran through the final stage of VEGAS to get results; in every case where I used the corrections, what I got instead was something totally different:

I love the smell of null pointers in the afternoon.

After talking with one of the grad students in the group, I found that the culprit was a single line in one of the Bash scripts I was using, which I hadn’t updated when I installed VEGAS. I edited that line and whacked a couple of unrelated bugs, and I’ve been applying correction factors to data for almost 24 hours now. It’s actually on track to finish by Friday afternoon, which would be great.
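I won’t reproduce the actual script, but the bug was of this general species (the variable name and paths here are invented for illustration):

```bash
# Invented name and paths, for illustration only. The script still
# pointed at the old VEGAS installation from before my reinstall:
#export VEGAS_DIR=/usr/local/vegas-old    # stale; predates the reinstall
export VEGAS_DIR="$HOME/software/vegas"   # the fix: point at the new install
```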

Science time!

I’d run into those VEGAS issues when dealing with data from observations of the Crab Nebula, the primary source I’m studying this summer. However, over the weekend, I had done some analysis on BL Lacertae, an active galactic nucleus (AGN) and my secondary source. The Crab appears extremely bright in gamma rays; BL Lac — not so much. That’s not BL Lac’s fault; after all, the AGN is almost a billion light-years away from Earth. Nothing appears particularly bright when it’s that far away.

It did mean that I ran into some trouble when I performed the final stage of my BL Lac analysis, as I found out Sunday night. The observed mean fluxes were so low that VEGAS actually couldn’t find any significant energy bins. And believe me, the software told me that in no uncertain terms.

This was vexing me enough that I brought it up at our Monday group meeting, during my allotted time, and we did some joint brainstorming. Our postdoc made two good suggestions for what I could do:

  • Focus just on the two nights this spring during which the flare I’m interested in was observed; this should raise the mean values in the energy bins. VEGAS doesn’t know that I expect it to see a flare in the data. (There’s a rough sketch of this after the list.)
  • Use “soft” energy cuts rather than “medium” energy cuts. BL Lac’s gamma-ray spectrum is soft, meaning it’s approximately a power law that falls off steeply with increasing energy. Modifying the cuts will help VEGAS figure out what sort of spectrum to expect.
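The first suggestion is mechanically simple: pare the run list down to the flare nights before handing it to VEGAS. Assuming a plain-text run list with one dated entry per line (the file names and dates below are placeholders, not the real observations), it’s a one-liner:

```bash
# Keep only the runs from the two (placeholder) flare nights.
# File names and dates are invented for illustration.
grep -E '^2019-05-(14|15)' bllac_all_runs.txt > bllac_flare_runs.txt
```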

The takeaway here? If you don’t tell VEGAS what to expect, it’ll perform the data analysis poorly. An analogy might be to fitting a curve to a set of data points. You need to choose the sort of curve to fit, and you need to know which data points are important. If you let your fitting software go in relatively blind, all you’ll get are unphysical results.

Okay! That’s basically all I have for the past two days. Tomorrow, the other REU students at Nevis and I will be heading to Brookhaven National Laboratory for a tour; VEGAS will keep churning through my data while we’re away. More on that when we get back!

You can read my introduction to my research this summer here. Tomorrow’s blog post topic? A trip to Brookhaven National Laboratory!

