How to deal with random freezes, reboots and shutdowns?

A dual-GPU setup for developing AI models

This high-end dual-GPU setup was rebooting randomly, can you guess why?

Hint: Two components were bad and causing similar symptoms.

Troubleshooting hardware-related issues is not as straightforward as it seems. Random freezes, reboots and shutdowns are the hallmarks of hardware failures, and most of the time they occur randomly, when they are least expected, e.g. while the system is idle. They can also occur during gaming and heavy CPU/GPU compute, which makes troubleshooting slightly easier.

Answer: The PSU (850W) was underpowered for the system and an ECC RAM stick was faulty.

Tale of a new rig

And then, out of nowhere, BOOM!

It just rebooted. Worse, it turned off. Or perhaps you got a BSOD or a kernel panic. What the heck? Retailers say that less than 2% of computer parts are faulty. Surely this is a software error, or is it?

Many things can go wrong when testing a new system, and it can be quite difficult to find out which components do not work well. This is because most faulty parts will not be fully dead; instead, they will appear to work well for some time and then fail, either randomly or with increased system load and temperature.

I have dealt with tons of hardware failures in the past, including ‘esoteric’ failures that would randomly occur once a month, and things that you would never guess, such as a brand-new, reinforced SATA cable that works flawlessly with some HDDs but not others, or even a bad USB hub or mouse. It is not always possible to determine exactly which parts are faulty, and the process of determining what is wrong in a system almost always involves some form of guesswork. This uncertainty often makes it difficult to get parts replaced or refunded by retailers, who will happily charge you a fee to send your goods back to you when they find that they work correctly. The truth is that, depending on the manufacturer and type of hardware, an average of 20 to 25% of computer parts (one in every four to five) actually exhibit a hardware fault of some sort, which is not always obvious at first. These parts, which normally passed manufacturer tests and quality control checks, will still give you an ocean of random crashes and a few headaches. Perhaps manufacturer tests are incomplete, or parts get damaged too often during transit. Who is to blame?

In this post, I describe how I deal with all sorts of hardware failures and how I identify faulty components. This will also give us hints about the context of a failure, in other words, what makes a particular component fail, so that we can tell RMA departments how to reproduce the faults when we return computer parts.

The lucky, less lucky and unlucky outcomes

  1. The lucky outcome: It works flawlessly. You can stop reading. Disclaimer: This has never happened to me yet.
  2. The slightly less lucky (but still lucky) outcome: It does not work at all, the system is dead-on-arrival (DoA). Just return all the parts for a full refund, or remove each part one by one until you find out what works and what does not. Return what does not work. Get replacement parts. End of story.
  3. The unlucky outcome: It appears to work but not all the time. Freezes, reboots or shutdowns occur randomly. If you’re not too unlucky, crashes will coincide with the use of a specific component (e.g. the GPU) or with a specific use case (e.g. gaming or heavy compute).

Are hardware faults really random?

Determine when faults occur

Truly (seemingly) random faults

Faults that depend on a component

Faults that depend on system load

Common culprits: PSU, temperatures, RAM, motherboard, CPU, storage components, USB controllers

He gets his first reply 5 minutes later from another guy who obviously didn’t read his post fully. That’s a one-liner: “Have you tried with a different PSU?”

As irritating as this answer sounds, it is the right question to ask. Testing the PSU is almost always the right thing to do when a system does not work well.

Important: Before trying any of the steps below, UNPLUG everything unnecessary from the machine (webcam, headset, additional storage controllers, etc.), test your keyboard and mouse on a different computer (a laptop is fine), and re-plug them into a different USB port. Load BIOS defaults (‘factory defaults’ or ‘optimized defaults’; it does not matter which one you choose, as in both cases it should almost always work). Do not alter any BIOS settings if there is no good reason for it. You may have to disable integrated devices and increase voltages later on.

The PSU

Step 1: PSU. Make sure that your PSU is good. Have a spare ‘known-good’ PSU in your drawer (seriously). Testing a system with a good PSU is a no-brainer and solves a lot of issues. If you don’t have a spare PSU, order two PSUs right from the beginning. If all works well, return the PSU that you did not use. The extra return fees will be worth the trouble if you have bad hardware. Do not trust vendor recommendations for PSU wattage. If a vendor recommends a 600W PSU, get an 800–1000W PSU. High-end GPUs are known to have huge power consumption peaks that routinely exceed their wattage specifications by a fair amount. This causes a lot of confusion for people who think that they can just add up 150W for the CPU, 250W for the GPU and ~200W for the rest of the system and conclude that they only need a ~600W PSU. Get much more than what you think your system will need, seriously.
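The arithmetic behind that advice can be sketched in a few lines of Python. The nominal figures are the ones quoted above; the 1.5 headroom factor is purely my assumption, chosen so that the result lands in the 800–1000W range, and it is not a vendor or measured figure.

```python
# Nominal figures quoted in the text above.
NOMINAL_WATTS = {"cpu": 150, "gpu": 250, "rest_of_system": 200}


def recommended_psu_watts(nominal_watts, headroom=1.5):
    """Return a PSU wattage that leaves room for transient peaks.

    `headroom` is a hypothetical safety factor (an assumption for this
    sketch), not a value taken from any specification.
    """
    if headroom < 1.0:
        raise ValueError("headroom must be >= 1.0")
    return sum(nominal_watts.values()) * headroom


if __name__ == "__main__":
    naive = sum(NOMINAL_WATTS.values())
    print(f"Naive sum of nominal figures: {naive} W")
    print(f"With headroom: {recommended_psu_watts(NOMINAL_WATTS):.0f} W")
```

Adjust the headroom to your own hardware; a dual-GPU system would simply add a second GPU entry to the dictionary.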

The temperatures

Step 2: Temperatures. Check the temperatures from all available sensors. If they are obviously too high, get a visual confirmation that the fans are spinning. If they are still too high, visually inspect the heatsink and its surface for dirt and irregularities. Return any heatsink that looks crooked or that has irregularities on its surface. Was the thermal paste applied homogeneously, without gaps? If you find that the pad left by the thermal paste is narrower on one side than the other after heatsink removal, check for loose screws, make sure that the mounting holes are sound, and have a good look at the overall fit of the heatsink. The surface of the heatsink may be slanted, which may not be obvious at first. Search for images of your heatsink model, and compare your installation with what you find on the internet to see if there is any deviation.
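On Linux, one way to read all available sensors is through the kernel's hwmon files under /sys/class/hwmon, which report temperatures in millidegrees Celsius. The sketch below assumes that interface is present on your machine; the 85 °C threshold is an arbitrary example, not a limit for any particular part.

```python
from pathlib import Path


def read_hwmon_temperatures(root="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees_celsius} for every readable sensor."""
    readings = {}
    for temp_file in sorted(Path(root).glob("hwmon*/temp*_input")):
        chip_file = temp_file.parent / "name"
        if chip_file.exists():
            chip = chip_file.read_text().strip()
        else:
            chip = temp_file.parent.name
        label = temp_file.name.replace("_input", "")
        try:
            # The kernel reports millidegrees Celsius.
            value = int(temp_file.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # Some sensors are exposed but cannot be read.
        readings[f"{chip}/{label}"] = value
    return readings


def too_hot(readings, limit_c=85.0):
    """Return the sensors at or above `limit_c` (an example threshold)."""
    return {name: temp for name, temp in readings.items() if temp >= limit_c}


if __name__ == "__main__":
    for name, temp in read_hwmon_temperatures().items():
        print(f"{name}: {temp:.1f} C")
```

Running it in a loop while the system is under load gives a rough picture of which sensor climbs first.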

The RAM

Step 3: RAM. Remove all RAM sticks but one, and use the same RAM slot every time you swap a RAM stick. Let your system run for at least a few days. Try again with a different RAM stick. Repeat until you find out whether it makes a difference. If this procedure gives you an unstable system for more than one RAM stick, repeat the procedure with a different RAM slot. If the system remains unstable no matter which RAM stick and RAM slot you used, and it does not make a difference at all after a few days, then the RAM is most likely fine. You may find that RAM sticks fail only in specific slots or when combined together in a particular way. If this is the case, the RAM may still be good; read on.
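Since each run takes days, it helps to write down every stick/slot combination and its outcome. The sketch below is my own simplification of the reasoning above, not a complete diagnostic; the stick and slot names are hypothetical labels.

```python
def interpret_ram_tests(results):
    """Summarise single-stick test runs.

    `results` maps (stick, slot) -> True if the system stayed stable for
    the whole test period, False if it crashed.
    """
    if not results:
        return "no data"
    if all(results.values()):
        return "no fault reproduced with single sticks"
    if not any(results.values()):
        return "unstable with every stick and slot: RAM is most likely fine"

    sticks = {stick for stick, _ in results}
    slots = {slot for _, slot in results}
    # A stick (or slot) is suspect if it never produced a stable run.
    bad_sticks = sorted(
        s for s in sticks
        if not any(ok for (stick, _), ok in results.items() if stick == s)
    )
    bad_slots = sorted(
        s for s in slots
        if not any(ok for (_, slot), ok in results.items() if slot == s)
    )
    if bad_sticks:
        return "suspect stick(s): " + ", ".join(bad_sticks)
    if bad_slots:
        return "suspect slot(s): " + ", ".join(bad_slots)
    return "failures depend on the stick/slot combination: keep testing"


if __name__ == "__main__":
    log = {("stick_A", "slot_1"): True, ("stick_B", "slot_1"): False}
    print(interpret_ram_tests(log))
```

A suspect slot points towards the motherboard rather than the RAM, which is why the next step looks at the board.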

The motherboard

Step 4: Motherboard. Update the BIOS to the latest version, or downgrade to a previous version if it’s already up to date. Unplug everything else, including the chassis cables for USB and audio connections (just keep a graphics card if you don’t have an integrated video output, and the power button cable from the chassis). Check the voltage sensors in the BIOS and make sure that the reported values look coherent (they will vary due to automatic voltage regulation). Disable all integrated devices in the BIOS and use a clean operating system without any manually installed drivers, unless there is a specific known issue with the motherboard that requires a particular driver to work properly. This is a sensible precaution, as we have seen buggy motherboard drivers cause random crashes too many times, including drivers for integrated storage controllers. Check that there are no screws stuck between the chassis and the board. Make sure that there aren’t any metallic objects around the surface of the board either. Verify that the motherboard is seated properly, does not bend and that there is no significant pressure point. For this, visually inspect the motherboard surfaces and keep an eye out for bad solder joints by comparing similar components with each other. As a rule of thumb, if a fault still occurs randomly at this stage, and you know that the RAM and the PSU are good, then it is likely an issue with the power circuitry of the motherboard. It may also be a problem with the CPU.

The CPU

Step 5: CPU. If you have an Intel CPU, run the Processor Diagnostic Tool from Intel (https://downloadcenter.intel.com/download/19792/Intel-Processor-Diagnostic-Tool). If all tests pass, or if you have an AMD CPU, I recommend increasing the IMC (integrated memory controller) voltage setting (generally found in the BIOS or through motherboard-specific software) by ~10% (e.g. from 1.1V to 1.2V). If it does not make a difference, also try increasing the CPU voltages (there will be many different settings). For this, browse overclockers’ forums to see how they achieve stability with their systems. But don’t overclock your system yet, and avoid increasing any of the voltage settings by more than 10% to stay on the safe side. If increased voltages make your system stable, we still can’t tell if the issue comes from the CPU or from the power circuitry of the motherboard (or it could be both), but at least you know that the rest is most likely fine. It may be a faulty voltage sensor, resulting in incorrect voltage regulation. You will need to swap the CPU or the motherboard to find out which one of the two is truly faulty.
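To keep track of the 10% ceiling while trying different settings, a small helper is enough. This is only a sketch of the rule of thumb above; the function names are mine, and the stock voltage must come from your own BIOS.

```python
def max_safe_voltage(stock_volts, max_increase=0.10):
    """Upper bound for a voltage setting under the 10% rule of thumb."""
    return round(stock_volts * (1 + max_increase), 3)


def within_limit(stock_volts, proposed_volts, max_increase=0.10):
    """True if `proposed_volts` does not exceed stock by more than the cap."""
    # A tiny tolerance absorbs floating-point rounding.
    return proposed_volts <= stock_volts * (1 + max_increase) + 1e-9


if __name__ == "__main__":
    stock = 1.1  # Example from the text above.
    print(f"Ceiling for {stock} V: {max_safe_voltage(stock)} V")
    print("1.2 V within limit:", within_limit(stock, 1.2))
```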

Storage components

Step 6: Storage components. Test different data cables and different data ports, and unplug any non-essential HDDs or SSDs. If you are just using the motherboard storage controller and you have done Steps 1–5, disable the storage controller in the BIOS and boot your machine from a USB drive.

USB controllers

Step 7: USB controllers. Make sure that you are using a clean operating system without any manually installed drivers for the motherboard. Get a ‘known-good’ device such as a mouse or a USB dongle, and test all USB ports one at a time. Determine which ones are faulty. USB ports may be color-coded if your motherboard includes multiple USB controllers. Check in the BIOS or motherboard manual which USB ports correspond to which USB controllers, and see if that is consistent with the ports that you found faulty. It is quite rare to get multiple faulty USB controllers. If you found multiple faulty USB ports from different controllers, then you most likely have faulty power circuitry or a bad motherboard connector block. If you found that all faulty USB ports belong to the same USB controller, then you have pretty much nailed it down. Disable this particular USB controller in the BIOS and test again. Embedded devices on the motherboard may also be connected through USB internally, so it could help to disable these devices from the BIOS when you test the USB subsystem. Although it is not impossible, it is really hard to believe that a motherboard with a faulty USB subsystem would pass manufacturer tests.
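The port-to-controller reasoning above can be written down as a short sketch. The port and controller names are hypothetical; the real mapping has to come from your BIOS or motherboard manual.

```python
def diagnose_usb(port_to_controller, faulty_ports):
    """Interpret which USB ports failed the known-good device test.

    `port_to_controller` maps each port name to its controller name;
    `faulty_ports` lists the ports that misbehaved.
    """
    controllers = {port_to_controller[port] for port in faulty_ports}
    if not controllers:
        return "no faulty port found"
    if len(controllers) == 1:
        (controller,) = controllers
        return f"single controller suspected: {controller}"
    return ("ports from several controllers fail: "
            "suspect power circuitry or connector block")


if __name__ == "__main__":
    # Hypothetical layout for illustration only.
    layout = {"rear_1": "ctrl_A", "rear_2": "ctrl_A", "front_1": "ctrl_B"}
    print(diagnose_usb(layout, ["rear_1", "rear_2"]))
```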

Practical example: System crashing when using the GPU

Original issue (‘Machine restarts when running TensorFlow with GPU’): https://github.com/tensorflow/tensorflow/issues/8858.

This is how I would proceed to determine the cause of the machine reboots.

Machine rebooting or shutting down without any logs

Most probable cause: Underpowered PSU cleanly shutting off to protect the system during overload.
Symptoms: Random reboots or shutdowns soon after the GPU starts to run.
Solution: Replace with a higher wattage PSU.

Second most probable cause: Underpowered or bad PSU delivering unstable voltage due to high current transients, overload, or both.
Symptoms: Random freezes (more likely) in addition to ‘clean’ reboots or shutdowns described above. This could also occur outside of TensorFlow, e.g. during CPU and RAM use.
Solution: Replace with a higher wattage PSU from a reputable brand. Check cables and connectors, they may have started to melt.

Third most probable cause: Faulty GPU, RAM or motherboard.
Symptoms: Random freezes, reboots or shutdowns. Bad RAM would cause these issues outside of TensorFlow if used heavily. It may leave some log entries from time to time. Bad power circuitry on the motherboard would likely cause additional issues when using the CPU and RAM.
Solution: Move the GPU to a different PCIe slot. Try a slower (x8) slot. Test the RAM (see Step 3 in the first part of this post) without the GPU. If you have a multi-GPU setup, test one GPU at a time. Test a game under a different operating system. Test the GPU with a power limit and a lower frequency (see the nvidia-smi command on Linux). If it works with a different PCIe slot, or with a power limit and lower frequency, and you are sure that the PSU is good, you may have a bad motherboard. If the system remains unstable, the GPU should be tested in a different machine, or a different GPU should be tried.
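For the power limit and lower frequency test, the sketch below builds the nvidia-smi invocations rather than running them blindly. It assumes a driver recent enough to support the -pl (power limit) and -lgc (lock GPU clocks) options, the latter being available only on some GPU generations; changing either usually requires root. The wattage and clock values are placeholders, not recommendations.

```python
import shutil
import subprocess


def query_command():
    """Read-only query of power draw, power limit, SM clock and temperature."""
    return [
        "nvidia-smi",
        "--query-gpu=index,name,power.draw,power.limit,clocks.sm,temperature.gpu",
        "--format=csv,noheader",
    ]


def power_limit_command(gpu_index, watts):
    """Cap board power for one GPU (usually requires root)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def lock_clocks_command(gpu_index, min_mhz, max_mhz):
    """Restrict the GPU clock range (not supported on every GPU)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


if __name__ == "__main__":
    # Placeholder values; pick limits that your own card accepts.
    for cmd in (power_limit_command(0, 200), lock_clocks_command(0, 300, 1200)):
        print("would run:", " ".join(cmd))
    if shutil.which("nvidia-smi"):
        result = subprocess.run(query_command(), capture_output=True, text=True)
        print(result.stdout)
```

If the crashes disappear once the limits are applied, that points at power delivery (PSU or motherboard) rather than at the software stack.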

Other possible causes: ‘Esoteric’ PSU faults such as an issue with the overcurrent (OCP) / overload (OLP) protection circuitry causing the PSU to shut off too soon for no good reason during fast current transients, a bad power cable (would be fixed by replacing the PSU and the cable set), a driver issue (this can be ruled out by upgrading the driver, or by finding other users with similar hardware who don’t experience the issue at all), or a bad CPU (rare).

Machine freezing without any logs

Machine crashing (freezes, reboots or shutdowns) with logs

Logs that you can check include:

  • BIOS Event Log (server boards only; it is located in the BIOS and may have a different name depending on the board model)
  • System Event Log (SEL; server boards only; it can be queried via IPMI, e.g. ipmitool sel list)
  • Linux regular logs (e.g. /var/log/kern.log)
  • pstore (in Linux, /sys/fs/pstore/; these are stored on the motherboard)
  • (Have you got any suggestions for Windows and macOS? Please share!)
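For the Linux entries in this list, a first pass can be automated. The sketch below filters a kernel log for a few strings that I would look for first (kernel panics, machine check and hardware error reports, NVIDIA Xid messages, I/O errors) and lists whatever is present in pstore. The pattern list is my own assumption and far from exhaustive, and reading these paths may require root.

```python
import re
from pathlib import Path

# Example patterns only; extend them for your own hardware and drivers.
PATTERNS = re.compile(
    r"kernel panic|hardware error|machine check|NVRM: Xid|I/O error",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the log lines that match one of the example patterns."""
    return [line.rstrip("\n") for line in lines if PATTERNS.search(line)]


def scan_file(path):
    """Scan one log file; return [] if it is missing or unreadable."""
    try:
        with Path(path).open(errors="replace") as handle:
            return suspicious_lines(handle)
    except OSError:
        return []


def pstore_entries(root="/sys/fs/pstore"):
    """List crash records preserved across reboots, if any."""
    directory = Path(root)
    try:
        return sorted(entry.name for entry in directory.iterdir())
    except OSError:
        return []


if __name__ == "__main__":
    for line in scan_file("/var/log/kern.log"):
        print(line)
    print("pstore entries:", pstore_entries())
```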

Concluding remarks

Good luck!

Originally published at sebastien.behuret.net on December 28, 2018.

Computational Neuroscientist & Technology Enthusiast