Hardware Fail Tales

Software peeps: here are a few quick hardware fail stories to cleanse your mind.

Jeff Enderwick
Mar 27, 2014

“The algorithm doesn’t converge …”

1988 — I was a junior hardware/firmware grunt at Gould. I was the dumbest guy at a company that made multi-processor mini-supercomputers. One day the chief architect came to me and said “I run this convergent numerical algorithm on CPU #1 — it works. I run the same code on CPU #2 — it never converges.”

After chasing many wild geese, I noticed that the schematics showed 7ns RAMs, and that the chips on the board were 10ns RAMs. Back then, CPU registers were actually implemented with RAM chips.

It turned out that some purchasing weasel had bought the wrong speed chips, and then he couldn’t return them. So the bureaucrats “tested” the 10ns RAMs and “proved” that they were just as fast as 7ns RAMs. They then built the computers using the slower RAMs. When the RAM is too slow, the wrong value gets latched off its outputs. CPU #2's RAMs were just slow enough to corrupt the floating-point values as they were read out of the registers, but only sometimes.

Corrupt floating-point values were enough to wreck the algorithm without causing a crash.
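To make that concrete, here is a rough Python sketch of the effect. Everything in it is a stand-in I made up for illustration: the fixed-point iteration x = cos(x) plays the part of the architect's algorithm (which I am not reproducing), `FlakyRegister` is a hypothetical model of a register that returns a value with one mantissa bit flipped on every third read, and the bit position, tolerance, and iteration limit are arbitrary.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its IEEE-754 double encoding inverted."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


class FlakyRegister:
    """Hypothetical register that corrupts every `period`-th read (0 = never)."""

    def __init__(self, period: int, bit: int = 40):
        self.period = period
        self.bit = bit  # mantissa bit to flip; an arbitrary choice
        self.reads = 0

    def read(self, value: float) -> float:
        self.reads += 1
        if self.period and self.reads % self.period == 0:
            return flip_bit(value, self.bit)  # slow RAM: wrong value latched
        return value


def iterate_cos(register: FlakyRegister, tol: float = 1e-10, max_iter: int = 500):
    """Iterate x = cos(x); return (x, iterations), or (x, None) if it never settles."""
    x = 1.0
    for i in range(1, max_iter + 1):
        nxt = math.cos(register.read(x))
        if abs(nxt - x) < tol:
            return nxt, i
        x = nxt
    return x, None


good_x, good_iters = iterate_cos(FlakyRegister(period=0))  # "CPU #1"
bad_x, bad_iters = iterate_cos(FlakyRegister(period=3))    # "CPU #2"
print(good_x, good_iters)
print(bad_x, bad_iters)
```

In this model the clean run settles, while the flaky run keeps getting knocked off course by a tiny amount and never meets the tolerance. Nothing raises an exception and the value it ends on still looks plausible, which is why this kind of fault is so hard to spot.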

“Sometimes the modems don’t train …”

1996 — Back in the era of modems, we built high-density modem cards for ISP or remote-access use. 24 modems on an ISA/PCI card. These were made from Rockwell modem chipsets, with each modem requiring seven different chips.

One day, manufacturing was having a cow: very few boards were shippable. I shut the production line down. Now the CFO is sweating bullets (we were public). The clock is ticking, and the problem needs to get fixed ASAP or we miss the quarter. It took me almost two weeks. It would probably have gone faster if I were an EE or just smarter.

It wasn’t the analog section, it wasn’t noise, it wasn’t any of a long list of other suspects. In the end, I correlated failure with date codes on one of the chip types combined with low (but still in-spec) supply voltage. I had my smoking gun. We went to Rockwell with the data. Turns out they knew about the problem but didn’t tell anyone! A multiplier in that chip would emit an incorrect result (sometimes) when run at the lower voltage.

CFO made a financial shit sandwich, and fed it to Rockwell. Rockwell ate every morsel.

“The register values are impossible!”

1999 — I tend to be anal about unexplained crashes — I don’t like to let them slide. I was working at a router start-up, and I saw a crash. Not unusual, since 30 people were constantly hacking the kernel. But this one didn’t make sense. I looked at the register values and the disassembly at the crash site. The values were impossible given the code that supposedly executed right before the crash.

A smart hardware guy (the best system designer I’ve ever met) suggested that the interrupt/exception handler did the dirty deed. No, he insisted that was the problem. A software buddy and I combed over that handler code. There was nothing wrong with that handler code.

As luck would have it, I somehow managed to make it happen again and again — about every 30 min. My software buddy and the hardware guy hooked up a logic analyzer and watched. They caught it!

This was an NEC MIPS CPU, with an NEC PCI/accessory chip attached. When a timer interrupt fired, the CPU would run a small number of instructions out of an 8-bit wide ROM at the beginning of the exception handling process. The accessory chip was responsible for reading four sequential bytes out of that ROM and then assembling them into a 32-bit instruction word for the CPU. Under special circumstances, a bug in the accessory chip made it read the same location four times in a row (rather than sequential locations). So the instruction word fed to the CPU was the same byte concatenated with itself four times, and by chance this was a legal instruction for the MIPS! Executing this “synthetic” instruction is what mangled the CPU registers.
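For the software peeps, the failure mode can be sketched in a few lines of Python. The ROM contents, the big-endian byte order, and the function names are all assumptions for illustration; I don't have the real ROM bytes or the chip's actual byte ordering. The only part taken from the MIPS instruction set is the I-type field layout and the fact that opcode 0x09 is ADDIU.

```python
def assemble_word(rom: bytes, addr: int) -> int:
    """Build a 32-bit word from four sequential ROM bytes (big-endian assumed)."""
    return int.from_bytes(rom[addr:addr + 4], "big")


def assemble_word_buggy(rom: bytes, addr: int) -> int:
    """Model the fault: the same ROM location gets read four times in a row."""
    return int.from_bytes(bytes([rom[addr]]) * 4, "big")


def decode_i_type(word: int) -> dict:
    """Split a 32-bit word into MIPS I-type fields: opcode, rs, rt, immediate."""
    return {
        "opcode": word >> 26,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": word & 0xFFFF,
    }


# Hypothetical ROM contents: addiu $k0, $zero, 1
rom = bytes([0x24, 0x1A, 0x00, 0x01])

good = assemble_word(rom, 0)
bad = assemble_word_buggy(rom, 0)

print(hex(good), decode_i_type(good))
print(hex(bad), decode_i_type(bad))
```

With this made-up first byte of 0x24, the repeated-byte word is 0x24242424, which decodes as ADDIU with rs = 1, rt = 4, and immediate 0x2424. That is a perfectly legal instruction that overwrites register 4 with a value the real code never asked for. Since all four bytes are identical, the byte order assumption doesn't affect the faulty word.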

So in the end, the hardware guy was right, but it wasn’t the code after all.

Jeff Enderwick

Has-been wanna-be glass artist. Co-Founder & CTO at Nacho Cove, Inc.