Today I ran into a strange issue on several new Gen9 servers from HP. I had six HPE BL460c Gen9 systems running ESXi from a flash device. To be able to boot from the internal USB port, we had been forced to switch to legacy boot mode instead of UEFI. I already wrote about that in the past.
Today was general firmware upgrade and patch day, so we downloaded the latest SPP from HPE (2015.10) and booted the systems from this DVD one after the other. The ISO was mounted via the iLO Remote Console in ActiveX mode. The server ran all the preflight checks and started booting the SPP. During the first 10 seconds it displayed the progress bar at the bottom of the screen up to 91%, then the screen went blank and the server rebooted without warning.
The next time the BIOS initialization ran, it reported an "Uncorrectable check exception" on CPU0. As these systems had been running for several months without any problems, it was very unlikely that the CPU was really broken. As a test we took the 2015.06 version of the SPP and voila, the SPP booted and we were able to update the firmware on all components.
But why didn't the 2015.10 version boot? We had already updated other Gen9 systems (mainly ML350) that didn't show this error, so the SPP 2015.10 itself couldn't be the problem.
I thought about it, googled a bit, and found a similar problem with G7 servers. There, the solution was to set the power control to "Maximum static performance", which disables C-states. I tried that but got the same error.
The next step was to change back to UEFI boot, as all the other Gen9 systems that had been upgraded successfully use UEFI. No sooner said than done, and voila, the SPP booted. With UEFI there is a special step while booting the SPP: the SPP image is transferred to the internal flash area of the iLO. Just before this transfer I got a new error telling me something about "The BIOS Has Corrupted Hw-PMU Resources". It is displayed only briefly during the SPP boot, before the image is transferred to the iLO flash area. Luckily the error didn't force the system to reboot, and the SPP boot continued.
Googling for this error turned up another bug that affects Gen8 and Gen9 servers. There is a new function introduced by Intel called "Processor Power and Utilization Monitoring". This function causes Red Hat based Linux kernels to crash during boot. Guess what Linux kernel the SPP is based on...
There is a simple solution. Reboot the system, enter the RBSU (press F9 during startup), press CTRL+A (a new technical submenu appears), select "Service Options" -> "Processor Power and Utilization Monitoring" and disable it. Reboot the system once again and the error should be gone.
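To check whether a Linux system still reports the message after the setting has been changed, one could scan the kernel log for it. The following is only a minimal sketch: the function name `find_pmu_warnings` is made up for illustration, and it assumes the log contains the same wording as the message quoted above (matched case-insensitively), which may differ between kernel versions.

```python
import re

# Assumption: the kernel log uses the same wording as the message quoted
# in this post. The match is case-insensitive to tolerate small variations.
PMU_PATTERN = re.compile(r"bios has corrupted hw-pmu resources", re.IGNORECASE)


def find_pmu_warnings(log_text):
    """Return all lines of log_text that mention the hw-PMU message."""
    return [line for line in log_text.splitlines() if PMU_PATTERN.search(line)]
```

The text passed in could come from `dmesg`, for example via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`. An empty result means the message was not found in that log.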
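The errors mentioned above are probably fixed in the latest firmware for the BIOS and the power microcontroller, but if you are on older versions you have a chicken-and-egg problem: you need the SPP to install that firmware, and the SPP won't boot until the problem is worked around.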