Trust the Batch Process: How I spent a weekend making four tiny dots

Graham Doskoch
5 min read · Jul 16, 2019

I don’t always analyze hundreds of gigabytes of data over the weekend, but when I do, I really hope it finishes by Monday morning.

At Nevis Labs, when my group — VERITAS — works with large datasets, i.e., every day, we use a technique called batch processing. Here’s what this entails for me:

  • I SSH into the (relatively powerful) Nevis servers from my (relatively wimpy) laptop and put together a list of the run numbers identifying the data files I want to analyze. I also modify a configuration file that provides all the information about how to perform the analysis.
  • When I’m ready, I run a Python script one of the graduate students wrote to interface with Condor, the batch management system we use (there’s a sketch of what that submission boils down to just after this list).
  • Condor scowls at me, then looks at all the workstations on the network — desktop PCs scattered around campus — and identifies ones that aren’t being used. It delegates batches of analysis tasks to them.
  • Eventually, the resulting files — log files, some plots, ROOT files, etc. — are saved into my Nevis directories, and I can dig into the results.
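
I never write the Condor submission by hand (that’s the Python script’s job), but to give a sense of what it boils down to, here’s a minimal sketch of an HTCondor submit description for this kind of workflow. Every file name here is made up for illustration:

    # crab.sub: hypothetical submit description; the real one is
    # generated by the group's Python wrapper
    executable = run_stage.sh          # illustrative wrapper around one analysis stage
    arguments  = $(runnum)             # each job gets one run number
    log        = logs/$(runnum).log    # Condor's record of the job's lifecycle
    output     = logs/$(runnum).out    # the job's stdout
    error      = logs/$(runnum).err    # the job's stderr
    queue runnum from runlist.txt      # one job per line of the run list

Handing that to Condor is then a single condor_submit crab.sub, with the run list being exactly the list of run numbers from the first step.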

This extremely repetitive work is what I’ve been telling Condor to do for most of the past 96 hours. Nonstop. Which is why at this point I think it wants to kill me.

Everything’s going fine. For now.

I of course don’t casually process hundreds of gigabytes of data each weekend for the fun of it. The reason I was willing to risk a war with the machines is that every Monday morning at 10:30 AM, we have a two-hour group meeting where we each discuss our work from the past week. Going into this morning, I was really excited about the PowerPoint presentation I’d put together. I’m so close to being done with my analysis of the Crab Nebula, which involves checking whether applying correction factors to VERITAS observations successfully accounts for the drops we’ve been seeing in the telescopes’ gain and throughput — decreases in how well they detect and transmit signals. It’s an issue many people in the collaboration are worried about.

If I could get Condor to like me for just 96 hours, I could get half of my Crab analysis done by the meeting. Those results would reveal whether individual telescopes in the VERITAS array are behaving badly, by looking at subarrays — ignoring data from one telescope in the array at a time and analyzing only the signal from the remaining three. If one or two telescopes are acting up (and previous work had suggested this could be the case), then it should show up really quickly. That’s not a result we want. I went to bed on Sunday hoping I could get enough data processed to disprove it.

The great thing about batch processing is that it lets analysis run in parallel, in a decentralized way. Condor can figure out which workstations and nodes on the network are free and delegate the right amount of work to each of them. And if one particular bit of analysis — a job — fails or hits an error, that failure doesn’t take down the whole batch, just that one job.
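
Concretely, that isolation means a misbehaving job only costs me that job. With HTCondor’s standard commands (the job IDs below are made up), I can inspect the casualties and set them running again while the rest of the batch churns on:

    condor_q -hold          # list any of my jobs Condor has put on hold, with reasons
    condor_release 1234.7   # let job 7 of cluster 1234 try again
    condor_release -all     # or release everything of mine in one go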

Plus, Condor gives me the ability to manage everything in real time. condor_status tells me about the machines on the network, while condor_q tells me about the jobs currently in the queue; adding my username to the latter shows only the ones I’m personally running. The thing about condor_q, of course, is that it’s surprisingly addictive. Checking my batches is even more addictive than checking, say, my email. Which I already seem to do incessantly.
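
For the record, the loop I can’t stop running looks roughly like this (the username is illustrative):

    condor_status                     # which machines on the network are free or claimed
    condor_q                          # everything currently sitting in the queue
    condor_q gdoskoch                 # just my own jobs
    watch -n 60 condor_q gdoskoch     # the dangerous part: auto-refresh every minute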

I can also remove jobs, either individually or en masse, as needed. I have, regrettably, become intimately familiar with condor_rm, which can, in an instant, get rid of dozens of CPU hours of work. I’ve invoked it on three separate occasions this weekend after realizing I’d modified a file incorrectly. The good news? I saved myself half a day of worthless analysis. The bad news? Condor definitely hates me for waking it up early on Sunday for nothing.
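
For anyone tempted to follow me down this path, condor_rm is flexible about what it kills (the cluster ID and username are again made up):

    condor_rm 1234       # remove an entire cluster of jobs
    condor_rm 1234.5     # remove a single job within that cluster
    condor_rm gdoskoch   # remove every job I own: the weekend's nuclear option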

I went into Monday’s meeting with four data points. That’s it. Four data points. Here they are, in all their glory — three from the 2017–2018 observing season and one from the 2018–2019 observing season. Each dot represents the average flux from the Crab as measured by a given subarray, counting only photons with energies above 300 billion electron volts (300 GeV, or 0.3 TeV):

Yes, I spent a weekend generating a mere four data points. And yes, that top-left one is wrong; I know because I ran the analysis a second time this morning. Here’s the left-hand plot again — fixed, with error bars and units, and with the three data points spread out a bit:

Basically, there’s no variation among the three subarrays I’ve looked at in the 2017–2018 data, and I’ll bet my summer stipend that the same will be true of the 2018–2019 data.

Four data points sounds like very little, and it sounds like very little only because I’ve been spinning it that way. In reality, I’ve generated dozens of data files containing more numbers than I could count, with uncertainties, error bars, upper limits, fits, and more. Those four data points were the only ones I presented at our meeting, but there were over 40 output files behind them, some with thousands of lines of information.

Condor may sometimes take its sweet time getting my data processed, and I’m only a day ahead of schedule instead of two, and I’m still pushing gigabytes more data through our servers, and I almost titled this post “Bitching about Batch Processing”, but if I’m being fair, none of what I’ve been doing for seven weeks would be possible without Condor. Yes, it can be annoying, and at times it’s hard to see what’s going on behind the scenes. Yes, its methods for allocating resources can be opaque and nonsensical.

But at the end of the day (i.e. literally right now)? I don’t always analyze hundreds of gigabytes of data over the weekend, but when I do, I use batch processing. And I have never felt so good about four tiny points on a plot.

The Crab Nebula in gamma rays — also a tiny point, from the point of view of VERITAS. Image credit: NASA/DOE/Fermi LAT/R. Buehler, CC BY-SA 3.0. Image cropped.

You can read my introduction to my research this summer here. Tomorrow’s blog post topic? My increasingly severe addiction to shell scripting with Bash.


Graham Doskoch

PhD student in radio astronomy. Pulsars, pulsar timing, radio transients, gravitational waves, and the history of astronomy.