Update (2016/06/09): A product engineer from Lenovo contacted me after he came across this article. His team had also identified this problem and posted a fix on May 25th.
We have a root cause for this issue, and it will be resolved in our next UEFI BIOS release. Please take a look at this Solutions article HT500295. We have also posted the problem with solution in the Lenovo Enterprise Forums, as that is a common place that people go for help from the community.
I want to thank the Lenovo product engineer for reaching out to me about these issues. In addition to the BIOS bug, he also shed some light on other issues I’ve brought up in this post:
- The BIOS update won’t overwrite your existing setting. We only had 1 motherboard that we updated (the rest came pre-baked with the new BIOS) and it’s very possible that we did a factory reset on it. So, please disregard my statement below that it overwrites the setting.
- The CPU temperature thing is “a feature, not a bug.” Despite the fact that all our other Lenovo / IBM servers report the actual CPU temperature, this particular CPU reports an offset from its maximum temperature. So -30C means it’s running 30C cooler than its maximum temperature. This is, quite frankly, a terrible answer.
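
To make the offset explanation concrete, here is a minimal sketch of how a relative reading could be turned back into an absolute temperature. The function name and the 90C maximum are hypothetical placeholders for illustration; the real maximum for a given CPU has to come from the vendor's documentation, and this is not Lenovo's actual calculation.

```python
def absolute_cpu_temp_c(offset_c: float, tj_max_c: float) -> float:
    """Convert a reading relative to the CPU's maximum temperature
    into an absolute temperature in degrees Celsius."""
    if offset_c > 0:
        # An offset-style reading should never exceed the maximum itself.
        raise ValueError("offset readings are expected to be <= 0")
    return tj_max_c + offset_c


# Hypothetical maximum temperature, NOT the real value for this CPU.
ASSUMED_TJ_MAX_C = 90.0

# A reported -30C would mean 30C below the assumed maximum.
print(absolute_cpu_temp_c(-30.0, ASSUMED_TJ_MAX_C))  # 60.0
```

Under that assumption, a reading anywhere between -30 and 0 simply means the CPU is within 30C of its limit, not that it is below freezing.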
Update (2016/05/24): In what is most assuredly a bug, Intel Trusted Execution Technology has been changed from the default state of enabled to a default state of disabled in BIOS 3.34.
What’s very peculiar is that you should still be able to boot live environments and installers while Intel Trusted Execution Technology is disabled, but something is obviously glitched with this BIOS. For now, you can fix this issue by going into the BIOS and changing System Security > TPM Settings > Intel Trusted Execution Technology to Enabled.
When doing a BIOS update from 1.53 to 3.34, the update overwrites the existing enabled setting with a disabled setting. I have a feeling this article will be getting a lot of traffic from frustrated technicians over the next few weeks.
Unfortunately, the CPU temperature reading between -30 and 0 Celsius has not been corrected. I will patiently wait for Lenovo to acknowledge the issue and deploy an update.
Original article below…
When a plane wrecks, it’s never just one or two parts that failed; it’s the culmination of the perfect storm of problems. When you are trying to troubleshoot multiple simultaneous failures, process of elimination simply doesn’t work. Whether you are trying to fly a plane or fix a computer, it’s much harder to solve these types of failures than if only a single system was down.
I’m writing this to potentially save others the time and effort we invested in this problem. Between my engineers and IBM’s technicians, there are over 100 man-hours tied up in this.
On May 4, 2016, Lenovo released a new BIOS for the TD350 (BIOS TB5TS334). It’s a hell of a version jump: 1.53 to 3.34. Our server shipped with this faulty BIOS installed from the factory. If you attempt to downgrade, it will brick the motherboard. Lenovo has still not pulled this bad update from their website.
Our initial problem may be particular to the Intel processor that is installed. This server has a single Intel Xeon E5-2640 v3 and a 720ix SAS RAID controller.
Users with the bad BIOS may experience the following problem when trying to install VMware ESXi: the installer hangs at “Relocating modules and starting up the kernel.”
We researched and found that some people resolved this by either 1) identifying memory issues in their machine or 2) adding a headless option to their installation. Neither of these solutions worked for us.
We are running multiple TD350 servers right now with VMware ESXi 6.0.0 U2 on BIOS version 1.53. All of them work fine. Furthermore, this particular server and processor are listed as compatible on VMware’s website.
I’m afraid booting issues aren’t isolated to just VMware. We also tried to boot a Linux live CD, a Linux live USB, and two different Windows installers. In all cases, the installer begins to boot, lets you pick a language and keymap, but once the installer starts to load actual data into memory (e.g. ramfs), the server freezes. The error code that sometimes appears in the TSM log is an “Unspecified Error (0d).”
A Series of Unfortunate Events
When we first received the server, we couldn’t boot into the TDM. The TDM is a tool built into IBM/Lenovo servers that lets you do a variety of things, including configuring the RAID array. After some phone troubleshooting, IBM dispatched a tech to replace the 720ix RAID card. After the card had been replaced, the TDM started working.
The tech left and I configured the RAID6 array. I wanted to be sure there were no further issues with the machine, so I let it do a full RAID6 initialization, which took about 12 hours.
The following morning, I went to load an operating system. This is when I discovered that I couldn’t boot into anything.
These servers come with a utility called the TSM. This is a web-based management tool that runs independently of the server (it functions even when the server is off or bricked). I noticed, through the TSM, that the CPU temperature was reporting in the negatives. Which is, quite frankly, not possible.
IBM dispatched a technician to replace the motherboard. This fixed the booting issue, but it didn’t fix the CPU temperature issue. We tried 3 different CPU temperature monitoring tools and they all reported in the negatives.
What we didn’t realize at the time was that the new motherboard they had installed was still on the old 1.53 BIOS. This is why booting worked and we were able to install an operating system.
The server was working, but I still wanted to fix this CPU temperature issue. Fan speeds and other important metrics, such as CPU throttling, all depend on an accurate reading of the CPU temperature. I opened yet another ticket with IBM.
Unsure at this point what else to try, IBM sent a tech out to replace the CPU. He replaced it, but unfortunately, the temperature problem persisted. So, we checked the BIOS, TDM and TSM versions. At that time, we had the following:
- BIOS: 1.53.0
- TDM: 1.2.10
- TSM: 3.31.84
The TDM and TSM were up to date, but the BIOS was still on the old 1.53.0. Not realizing that our boot issue was solved by going back to BIOS 1.53 with the new motherboard, the tech upgraded the BIOS to 3.34.0. Afterwards, as you can imagine, ESXi wouldn’t boot.
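
In hindsight, a simple check of the recorded version strings against a list of known-bad releases would have flagged the problem before anyone flashed anything. Here is a minimal sketch of that idea. The function names and the `KNOWN_BAD_BIOS` set are hypothetical, and the version strings are assumed to be typed in by hand from the BIOS or TSM screens; this does not query the server.

```python
# Hypothetical list of BIOS releases (major, minor) to avoid.
KNOWN_BAD_BIOS = {(3, 34)}


def parse_version(text):
    """Turn a dotted version string such as '3.34.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_known_bad_bios(version_text):
    """Return True if the major.minor part matches a known-bad release."""
    return parse_version(version_text)[:2] in KNOWN_BAD_BIOS


# Versions from this article: the board we started with and the upgrade target.
for version in ("1.53.0", "3.34.0"):
    status = "known bad, do not flash" if is_known_bad_bios(version) else "ok"
    print(f"BIOS {version}: {status}")
```

Comparing only the major and minor numbers means that 3.34 and 3.34.0 are treated as the same release, which matches how the versions are written in this post.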
My systems engineer, the IBM tech, and I all sat around a table trying to figure out what the hell had happened. Suddenly, everything started to make sense: the only common denominator was the bad BIOS version!
The tech attempted to downgrade the BIOS. Unfortunately, this ended up bricking the server. During POST, if your server locks up at “Initializing PCI devices,” the BIOS is most likely corrupt.
The TSM web management tool was still functional. We tried to check the BIOS version through the TSM, but it reported as “N/A.” We even tried to force a BIOS update through the TSM, but it would not install.
The tech then attempted to use the jumper on the motherboard to recover the BIOS, but it also failed. He made the decision at that point to swap out the motherboard. Fortunately, he had one in his truck.
The new motherboard (the third one we’ve had so far) was flashed with the bad 3.34 BIOS from the factory. As you can imagine, nothing would boot. So, we once again attempted the downgrade, which failed and bricked another board.
That was around 11pm last night. I’m currently waiting to hear back from IBM while they work to find a motherboard with the old BIOS installed. They still have no answer or explanation for why the CPU temperature is reading in the negative. Despite their technician agreeing that the BIOS is the problem, Lenovo has not officially admitted the BIOS is bad. I will update this post if that changes.
The story above is an abbreviated version of events. If you are a representative of Lenovo or IBM, please email email@example.com from your work email address for the full explanation of the situation.