Beyond Cargo Cult troubleshooting

Diego Doval
Dec 12, 2018


We need an upgrade for how we think about troubleshooting in the Information Age.

You’re trying to download an update and something fails. Your device complains that “the update can’t be validated” or some such. You reach out to a couple of technical friends and they start giving you instructions: reboot your device, turn things on and off, disable packet flooding on your router, port scan detection, reboot your modem…. and the list goes on.

This is what I call “cargo-cult troubleshooting”.

It’s a behavior that is all too common when solving problems with complex systems. I’ve seen it before, I’ve done it before (I think we all have at one point or another), and it seems worth figuring out why we do it and how we can improve.

Why “cargo cult troubleshooting”? Wikipedia has a good article on cargo cults. Briefly (and avoiding the religious overtones), we could say that cargo cults attempt to reproduce the observed conditions under which something happened, in the belief that those conditions, and not an external factor outside the practitioners’ control, made it happen.

A walk down troubleshooting lane

To look at why this happens, let’s start with this particular example — let’s go through this specific problem-solving process and then look at some possible root causes.

Problem

  • Downloads for software updates from Apple fail repeatedly to validate (and therefore install) from every machine in your house.

Diagnosis

What we know with a high degree of certainty is:

  • The files must be either incomplete (i.e., a broken download) or complete but somehow corrupted, and therefore failing validation against their signature.
  • As a first step, you verify that downloading from a different geographic location produces no errors. Therefore the problem is location-specific. This rules out a widespread Apple problem.
  • Because you also have multiple machines at home, you can verify that it’s not machine-specific. This rules out problems in one machine, or a failing disk drive.
  • Rebooting router, machines, etc., has no effect, so the problem isn’t related to the state of the machines on your end.
  • It’s specific to Apple-signed binaries. Downloads of any other type work fine for you, including other content from Apple such as their website (in essence, a download of HTML, CSS, JS, images and other data). I am assuming that apple.com and other Apple properties work, and this is an important clue. Moreover, the problem seems to be Mac-specific: I am also assuming, probably correctly, that iOS app installs and updates were unaffected, even when using the same network. iOS apps are also signed and distributed by Apple using the same infrastructure, so this is another important clue.
  • Searching online shows many other people, in several different locations, reporting apparently the same problem. The “apparently” will be something we’ll come back to later, but for the moment let’s assume it is the same problem.

The easy part of the diagnosis is over. Before starting to fiddle with all settings everywhere, let’s see how much farther we can go in identifying the cause just by considering options instead of pressing buttons. We’re left with a few components that could be the root cause:

  • Apple’s servers/process related to Mac downloads,
  • Apple’s CDN (probably Akamai)
  • more broadly, the route between your house and the servers
  • your local network
  • your ISP
  • your router

Start with the possibility that the downloaded binary is complete but corrupted/invalid. We know that TCP, which underlies HTTP connections, has error detection and retransmission built in. A healthy TCP connection will deliver the same data at the receiving end that was sent at the sending end, so the file that arrives at your machine (if complete) will be what the server sent (short of a highly sophisticated, and thus uncommon, man-in-the-middle attack). Additionally, since we know that other downloads work, in particular other non-signed-binary downloads from Apple, the network route is fine end to end, and so is your ISP, at least as far as full downloads are concerned.

So if the file is downloaded fully but still broken it means that the server is sending a full, but corrupted or incorrectly signed file. This is the first possible cause.

Now, as far as the download being broken or incomplete — Is it possible that due to bizarre settings or a bug your router is bailing out (or your ISP blocking traffic) after downloading some amount of data, therefore leaving you with an incomplete file that is reported as complete? Unlikely, perhaps, but not impossible. The fact that it only happens with Apple binaries makes it even less likely (a bizarre Apple-server-specific setting in your router perhaps? ISP rate-limiting?). Similarly the likelihood that this could be a widespread problem and be a characteristic of some modem that somehow only affects Apple’s CDN servers is also low to nonexistent. However, let’s say this is the second possible cause.
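The distinction between an incomplete file and a complete but corrupt one can be checked mechanically. Here is a minimal sketch, assuming we somehow know the expected size (say, from the server’s Content-Length header) and an expected SHA-256 digest. Both inputs and the function name are hypothetical, and a plain hash is only a stand-in for Apple’s actual signature validation, which works differently.

```python
import hashlib


def classify_download(data: bytes, expected_size: int, expected_sha256: str) -> str:
    """Tell an incomplete download apart from a complete but corrupt one.

    expected_size and expected_sha256 are assumed to come from a trusted
    source; a plain SHA-256 comparison stands in for real signature checks.
    """
    if len(data) < expected_size:
        # Fewer bytes than promised: the transfer was cut short.
        return "incomplete"
    if len(data) > expected_size:
        # More bytes than promised: not the file we expected.
        return "corrupt"
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256.lower():
        # Right length, wrong contents.
        return "corrupt"
    return "ok"
```

If the size matches and the digest doesn’t, the file arrived whole and was already bad when it left the server, which is the first possible cause above.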

The rest of the scenarios involve either Apple’s signing process failing or one of Akamai’s or Apple’s servers involved in the storage having a corrupt image, disk, or software problems that then serves out invalid binaries to a location for a specific time. This is the third possible cause.

So we have whittled it down to three possible causes:

  1. Apple or the CDN is serving a complete but corrupt (or incorrectly signed) binary. In the case of an incorrectly signed binary, this can be only Apple’s, and not the CDN’s, problem
  2. Your ISP or router is consistently interrupting only signed binaries, and only from Apple. (Keep in mind we already decided that the possibility of a complete file being corrupted en route, short of an attack taking place, was nonexistent: even in the rare case of a bizarre, or even unheard-of, router malfunction, the probability that the bug would affect just Apple-signed binaries is similar to that of an elephant suddenly levitating due to quantum fluctuations.)
  3. Apple or the CDN are serving incomplete (and therefore broken) downloads

Let’s look just at the second possible cause vs. the other two as a unit for a moment. Occam’s razor comes to our aid: (paraphrasing) the simplest explanation is most likely the correct one. Is it more likely that some bizarre setting in your router (or ISP) is corrupting binaries only of a particular type from a particular company in a particular location, or is it more likely that everything works fine (as it does for all other cases) up to the server source, and that it’s Apple or the CDN that is just serving a broken file?

The latter is more likely, and a “simpler” explanation — although it may seem unlikely for reasons I’ll touch on later (the “I broke it” vs “it broke” issue). For the moment, suffice it to say that we automatically assume that Apple would not let this happen, but if you remove that assumption (which is incorrect, again, more later) then things become more clear.

While rate-limiting from your ISP or router, or some other router configuration related exclusively to Apple’s servers, is possible, if unlikely, it is pretty much impossible that “normal” content would remain unaffected.

This is a critical point that I mentioned above: None of the reports mention problems navigating to Apple’s site, or downloading any other type of content from Apple (such as trailers, movies, music, etc.). Even more, many if not most of these people are likely to have iPhones, iPads, and iPods, all of which also require signed content downloads, and are served from the same infrastructure, and therefore under the same conditions, as other Apple updates. If all data coming from Apple, including its website, was failing to load, that would be a much simpler (and fundamental) problem, which perhaps could, for example, involve DNS settings.
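The elimination so far can be written down as a small table of hypotheses and the observations each one would predict. This is only a sketch of the reasoning, not a diagnostic tool: the hypothesis names, observation keys and predictions are my own simplification of the argument above.

```python
# What we observed (simplified from the diagnosis list above).
OBSERVATIONS = {
    "fails_on_every_home_machine": True,
    "works_from_other_location": True,
    "other_apple_content_works": True,
}

# For each hypothesis, the observations it would predict differently.
# An empty dict means the hypothesis is compatible with everything observed.
HYPOTHESES = {
    "single machine or disk fault": {"fails_on_every_home_machine": False},
    "widespread Apple outage": {"works_from_other_location": False},
    "router/ISP interfering with Apple traffic": {"other_apple_content_works": False},
    "Apple/CDN serving bad Mac binaries to this location": {},
}


def surviving(hypotheses, observations):
    """Keep only the hypotheses whose predictions match every observation."""
    return [
        name
        for name, predicted in hypotheses.items()
        if all(observations[key] == value for key, value in predicted.items())
    ]


print(surviving(HYPOTHESES, OBSERVATIONS))
# prints ['Apple/CDN serving bad Mac binaries to this location']
```

Nothing here required touching a single router setting: every elimination comes from an observation we already had.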

This leaves us with the first and third options, specifically pointing at Apple and not merely the CDN component. Why? CDN storage could be the culprit, but only if this were a rare, random, and quickly fixed situation. Akamai, and all CDNs, have sophisticated infrastructure that will take “bad” machines out of rotation quickly. Apple (which as far as I’m aware uses Akamai for many things, along with their own infrastructure) no doubt has that type of infrastructure too.

This leaves us with the most probable cause: a repeatable problem, persisting for a single location over a span of time, in Apple’s signing process or in the file generation/copying that surely follows it. This would point to a bug in custom software on the server side, in which Apple signs binaries and intermittently corrupts them, producing a complete file that doesn’t pass signature checks. Given that the problem seems to be consistently limited to some locations, we could also guess that something about those locations, by themselves or in interaction with the signing or copying process, is breaking the binaries: perhaps an older part of the infrastructure that hasn’t been migrated to new systems, or some difference in the environment (network time issues are common) that creates locally valid but globally invalid signatures.

The result of the analysis says that this isn’t your problem, therefore there’s no need to fiddle with settings or call your ISP, since that won’t solve the problem. You can only wait for Apple to fix it (perhaps report it to them) and in the meantime get the binaries from another location, like Dan is doing.

Is this absolutely the right diagnosis? I don’t know, of course. Based on the data we have so far, this seems reasonable, and I do know that if I was faced with this problem, I would either download the software from the location that works, or just sit quietly and fume (yeah, more likely the first option :)). I wouldn’t waste a minute fiddling with router settings. Maybe, if I was feeling somewhat desperate, I would reboot the broadband modem, hoping to be assigned another IP by the ISP and, with it, perhaps be mapped by Apple to a slightly different geographic location where things may be working.

In any case, the specifics of this case aren’t what interests me. What interests me is how solutions that are highly unlikely to affect the true root cause of a problem are accepted, and then spread, online and offline.

Where does cargo-cult troubleshooting come from?

Cargo cult troubleshooting leads to solutions that are closer to “stand on one foot and whistle quietly” than something that actually goes at the root cause of the problem, that is, they don’t actually fix the problem at all. But if so, how do these things get started, and then spread, in the first place?

As for how they get started, the most likely source is variables out of your control.

Let’s look at the example again: Update fails repeatedly. You start trying to fix it and as long as you keep trying things, the likelihood that (if the problem is in Apple’s end) it will be fixed by them increases significantly. So you do thing #785 and suddenly it works! Only you didn’t fix it. Apple did.

Because there’s a giant variable (or more accurately set of variables) that you don’t control on Apple’s side along with all the infrastructure in between, you can never really know what fixed it, especially if you’re trying things for a long enough amount of time (say, 1–2 hours at least). Unless you show it is repeatable, which we almost never do.
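To make the point concrete, here is a toy model of who gets the “credit”. It assumes, purely for illustration, that you try one action every few minutes, that none of your actions has any effect, and that the remote fix lands at some minute you can’t see. The function name and the numbers are made up.

```python
def credited_action(actions, minutes_per_attempt, remote_fix_minute):
    """Return the first action attempted at or after the remote fix.

    None of the actions does anything in this model; the only thing that
    changes the outcome is the fix landing on the other side.
    """
    for index, action in enumerate(actions):
        attempt_time = (index + 1) * minutes_per_attempt
        if attempt_time >= remote_fix_minute:
            return action
    return None  # the fix had not landed by the time we gave up


actions = [
    "reboot router",
    "change MTU",
    "disable port scan detection",
    "reboot modem",
]
print(credited_action(actions, minutes_per_attempt=10, remote_fix_minute=25))
# prints disable port scan detection
```

Move the fix to minute 5 and “reboot router” becomes the miracle cure instead. The winner is decided entirely by timing you don’t control.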

That is: propose that switching feature X breaks Y. Switch X off. Show that Y now works. Switch X on. Show that Y now doesn’t work. Do this three or four times. But that’s not what we usually do. We usually just get something working, are happy that the pain is over, and move on.
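As a sketch, the toggle experiment looks something like this. The two callables are hypothetical hooks: set_feature stands for whatever flips feature X, and works stands for whatever check tells you that Y is functioning.

```python
from typing import Callable


def toggle_confirms_cause(
    set_feature: Callable[[bool], None],
    works: Callable[[], bool],
    rounds: int = 3,
) -> bool:
    """Check repeatedly that Y works with X off and fails with X on.

    Returns True only if every round matches the hypothesis "X breaks Y".
    """
    for _ in range(rounds):
        set_feature(False)
        if not works():
            return False  # Y is still broken with X off: X is not the cause
        set_feature(True)
        if works():
            return False  # Y works even with X on: X is not the cause
    return True
```

A single success after flipping X is exactly what this loop refuses to accept as proof. If someone else fixed the problem in the meantime, Y keeps working when X is switched back on and the experiment reports False.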

There’s an interesting aside to this in terms of why we assume that the problem is on our end first, rather than the other. It’s what I call the “I broke it vs. It’s broken” mindset, of which I’ll say more in another post, but that in essence says that with computer systems we tend to look at ourselves, and what is under our control, as the source of the problem, rather than something else.

This is changing slowly in some areas, but in a lot of cases, with software in particular, we don’t blame the software (or in this case, the internet service). We blame ourselves. As opposed to nearly everything else, where we don’t blame ourselves. We say “the car broke down,” not “I broke the car.” We say “The fridge isn’t working properly” as opposed to “I wonder what I did to the fridge that it’s no longer working”. And so on. We tend to think of Google, Apple, and pretty much anyone else as black boxes that function all the time, generally ignoring that these are enormously complex systems run by non-superhuman beings on non-perfect hardware and software. Mistakes are made. Software has bugs. Operational processes get screwed up. That’s how things are, they do the best they can, but nothing’s perfect.

The propagation of a cargo-cult solution

So that’s perhaps a valid theory for how non-solution solutions get started, but then they have to spread. Wouldn’t the fact that hundreds of people are saying in forums “this works” mean that it does? Not necessarily.

First, other people trying to solve the problem are also affected by variables out of their control, and they may experience similar results when trying multiple things in sequence.

Second, the people involved in first trying to identify the solution (let’s call them “Patient Zero”) are usually geeks/nerds. Take me, for example. I may have already been tinkering with my equipment, and perhaps in a rare case or two mucking around with, say, the MTU settings or blocking filters leads me to “unbreak” something that I actually broke… but that I don’t remember changing. But we tend to forget that most people don’t look at a router settings console in their entire lives. So then I post my “solution” as something to try, someone tries it, and it works. It seems to confirm what I said, but it worked either because of external variables, or because of… rebooting.

Yep. This is the third way in which “solutions” propagate as valid — just by rebooting. There’s a joke that’s been bouncing around the Internet for decades (I first heard it in the 90s) that goes like this:

Three engineers are riding in a car. One is a mechanical engineer, one is an electrical engineer, and one is a computer engineer.

The car breaks down and coasts to the side of the road.

“Hang on,” says the mechanical engineer. “The problem is probably the engine, let me have a look at it and I’ll have us on the road again in no time.”

“Wait,” says the electrical engineer. “The way it just stopped like that, I think it’s the electrical system. Let me have a look and I’ll get us going again in a minute or two.”

“That could work,” says the computer engineer. “But first, why don’t we all just get out of the car, lock it, unlock it, and get in again, and then see if it starts?”

Many if not most of the “solutions” that are flying around the web for everything from routers to servers to phones involve rebooting. Rebooting your machine, the router, disconnecting and reconnecting things, reinstalling OSes or firmware.

Rebooting/Reinstalling/Powercycling is like the utility knife of Cargo Cult medicines, and one that in many cases in fact works.

Why does it work? The reasons are myriad: Low memory, dead sockets hanging out for some reason, leaks, subtle bugs, you name it, there is still a need to reboot devices. In a certain percentage of cases, rebooting actually does fix a problem, frequently only on a temporary basis if there’s an underlying cause that the rebooting process is just resetting.

I, as Nerd Patient Zero, know this, and it was probably the first thing I tried. But this is not true of everyone, and the least technically sophisticated people are the least likely to just start restarting things for apparently no reason, because they don’t know that there’s a possible correlation between how long something has been running and possible corruption, misuse of resources that leads to resource starvation, leaks, etc.

There’s a reason tech support starts by asking if you have rebooted something. They’re not trying to be obnoxious, they just know that often this is enough to solve state-related problems, and a lot of people don’t think of trying that. The fridge, after all, doesn’t have to be rebooted to be happy, and even the original “Windows 95 Experience” (By which I mean not some fancy Microsoft Marketing term, but “Reboot every day at least, reinstall every 3 months if you want to have a speedy machine”) is not something that normal people remember to do all that often.

Enter the Internet

The fourth way in which things propagate is through the game of telephone that is Internet forums. A person may think they have a corrupt binary problem but they actually have another problem. Perhaps the download can’t start at all, instead of failing to authenticate. No matter. “This sounds kind of like what I’m seeing”. Even if “kind of like” is not really something that should apply when debugging this type of problem, they don’t know that. They change the setting (and in the process reboot the router, which perhaps was really the problem) and boom, it works! Or they try a number of things in sequence, then Apple fixes the problem on their side, and presto! In flood the reports of success with the cargo-cult solution.

Fifth, and finally (and this is perhaps a major reason), we share a strong cultural memory of mechanical and electrical devices in which seemingly ridiculous solutions actually worked. For example, the Apple III was infamously so poorly designed that in some cases when there were issues people were advised to lift the machine an inch or two from the desk and let it fall, which would solve the problem. This was because the action would re-seat the cards, which had been loose. Similarly, hitting some older TV sets on the side would fix the problem, for more “mechanical” reasons such as loose components.

THIS IS HOW WE FIX PROBLEM IN RUSSIAN SPACE STATION!!!

One of my favorite moments from Armageddon is when they are trying to restart the engines of the shuttle to get off the asteroid, and Andropov, the Russian astronaut they picked up at the space station, gets frustrated with the lack of progress, goes down to some kind of engine room where Watts (Shuttle co-pilot) was frantically and apparently randomly pushing buttons, shoves her to the side and as he shouts “This is how we fix problem in Russian space station!” he starts banging on some pipes with a wrench. This being a typical Michael Bay movie, the solution works and everyone’s happy ever after. With complex software and hardware systems, however, the equivalent of hitting the equipment with a wrench can’t really solve the problem.

We will only, occasionally, just think it does.

Diego Doval

Avid reader, occasional writer. Drexel CS Alum, Trinity College Dublin PhD CS. Previously CTO & CPO at Ning, Inc. Now building n3xt! (http://bit.ly/whatsn3xt)