Fun with the latest HPE SmartArray controllers

With the latest Gen9 systems, HPE introduced a new generation of SmartArray controllers. These x4x controllers, such as the P440, P441 or P841, are the fastest, most feature-rich and most flexible SmartArray controllers so far. With a maximum performance of up to 1 million IOPS and support for up to 200 disks (P441), these controllers are well suited for software-defined storage systems or any other DAS-related solution.

I personally use these controllers in DataCore environments where performance and scalability are key factors and where flash devices are common. Most older RAID controllers are not well suited for SSDs or flash in general: their CPU or overall bandwidth becomes a bottleneck or, most often, they simply can't handle flash traffic very well.

The x4x controller series was developed for use with flash, so these bottlenecks and limitations should be a thing of the past.

With this in mind, I was quite surprised to see occasional massive performance drops in the I/O backend during my last DataCore projects. The drops were so severe that all I/O stopped for between 2 and 20 seconds. Fortunately, most of these drops occurred during DataCore pool initialization, when there was no other I/O on the system except the initialization traffic itself. For context: during pool initialization the DataCore software simply writes zeros to the disks to prepare them for use within DataCore. This is a 100% sequential I/O stream without any special requirements. So it was pretty clear that nothing could cause these drops except the controller itself.
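One way to make drops like these visible is to sample the backend throughput at a fixed interval and flag every run of near-zero samples. The following is only a minimal sketch in Python, assuming one throughput sample per second in MB/s; the function name, the threshold and the sample data are hypothetical and do not come from DataCore or HPE tooling.

```python
from typing import List, Tuple


def find_stalls(samples_mb_s: List[float],
                interval_s: float = 1.0,
                threshold_mb_s: float = 1.0,
                min_duration_s: float = 2.0) -> List[Tuple[float, float]]:
    """Return (start_time_s, duration_s) for each run of samples below the threshold."""
    stalls = []
    run_start = None  # index of the first sample in the current low-throughput run
    # The trailing "inf" sentinel closes a run that lasts until the end of the data.
    for i, value in enumerate(samples_mb_s + [float("inf")]):
        if value < threshold_mb_s:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            duration = (i - run_start) * interval_s
            if duration >= min_duration_s:
                stalls.append((run_start * interval_s, duration))
            run_start = None
    return stalls


# Hypothetical data: steady zero-fill traffic with one 3-second stall.
samples = [850.0, 870.0, 0.0, 0.0, 0.0, 860.0, 855.0]
print(find_stalls(samples))  # [(2.0, 3.0)]
```

The minimum duration of 2 seconds mirrors the shortest drop mentioned above; shorter dips are ignored on purpose so that normal jitter is not reported as a stall.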

vSphere random "lost access to volume due to connectivity issues" messages

Dealing with connection issues in an FC SAN environment isn't my favorite thing, and dealing with random errors logged in vCenter's event log definitely belongs to the things I don't like at all. These errors, especially if they occur randomly and without any pattern, are quite difficult to troubleshoot because you can't predict when they will occur next. So observing them is hard, and checking whether your changes had a positive effect is even harder.

This time a customer called me and told me he was getting several "lost access to volume due to connectivity issues" messages, each followed a few seconds later by the corresponding "Successfully restored access to the volume following connectivity issues". The errors popped up in vCenter's event log and mainly occurred during the night. So my first guess was a storage problem related to high load during backup sessions. Before I could even set up some test monitoring systems, the customer did some "self-service" and moved some VMs to another host, because only three datastores were mentioned in the error logs and all VMs on these datastores ran on the same vSphere host. So vMotioning the VMs was a valid option. After the move the errors kept popping up, but this time not only during the night but throughout the whole day. The time between the errors was random, with no pattern. But the interesting thing was that after the VMs were moved to another host, the errors stopped occurring on the former host and started to pop up on the destination host. So the error seemed to move with the VM.
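To look for a pattern in such events, it helps to pair each "lost access" message with its matching "restored access" message per datastore and compute how long the outage lasted. The following is a minimal sketch in Python, assuming the events have already been exported as (ISO timestamp, datastore, message) tuples; the export format, the datastore name and the function name are hypothetical and this is not a vCenter API.

```python
from datetime import datetime
from typing import Dict, List, Tuple

LOST = "lost access to volume"
RESTORED = "successfully restored access to the volume"


def pair_outages(events: List[Tuple[str, str, str]]) -> List[Tuple[str, str, float]]:
    """Pair lost/restored events per datastore.

    Returns (datastore, time access was lost, outage duration in seconds).
    """
    open_outages: Dict[str, datetime] = {}  # datastore -> time access was lost
    outages = []
    # ISO timestamps in the same format sort chronologically as plain strings.
    for stamp, datastore, message in sorted(events):
        when = datetime.fromisoformat(stamp)
        text = message.lower()
        if LOST in text:
            open_outages.setdefault(datastore, when)
        elif RESTORED in text and datastore in open_outages:
            started = open_outages.pop(datastore)
            outages.append((datastore, started.isoformat(),
                            (when - started).total_seconds()))
    return outages


# Hypothetical export of two events for one datastore.
events = [
    ("2016-03-01T02:14:12", "datastore1",
     "Successfully restored access to the volume following connectivity issues"),
    ("2016-03-01T02:14:07", "datastore1",
     "Lost access to volume due to connectivity issues"),
]
print(pair_outages(events))  # [('datastore1', '2016-03-01T02:14:07', 5.0)]
```

Sorting the resulting list by host or by time of day makes it easier to see whether the errors follow a backup window, a host or, as in this case, a VM.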

System reboot during SPP boot on HPE Gen9 servers

Today I was faced with a strange issue on several new Gen9 servers from HP. I had six HPE BL460c Gen9 systems running ESXi from a flash device. To be able to boot from the internal USB port we were forced to switch to legacy boot mode instead of UEFI. I already wrote about it in the past.

Today was general firmware upgrade and patch day, so we downloaded the latest SPP from HPE (2015.10) and booted the systems one after the other from this DVD. The ISO was mounted via the iLO Remote Console in ActiveX mode. The server ran all the preflight checks and booted the SPP. During the first 10 seconds it displayed the progress bar at the bottom of the screen up to 91%, then blanked the screen and rebooted without warning.

The next time the BIOS initialization completed, it showed an "Uncorrectable check exception" on CPU0. As these systems had been running for several months without any problems, it is very unlikely that the CPU is really broken. As a test we took the 2015.06 version of the SPP and voila, the SPP booted and we were able to update the firmware on all components.

But why didn't the 2015.10 version boot? We already had other Gen9 systems (mainly ML350) that didn't show this error, so the SPP 2015.10 itself couldn't be the problem.

HP Power Management - better leave it to the OS

Recently we had two customers seeing hard-to-pin-down performance problems on recent Gen8 and Gen9 systems running either VDI under VMware vSphere or simple file copy jobs under Windows Server 2012 R2.

The VDI problem was quite curious, as the full processing power of the Gen8 systems was never used but storage latencies rose to as much as 500 ms (the storage system itself reported <1 ms response time since it was an all-flash array). The customer tried everything to solve the problem, but in the end the power management of the HP ProLiant systems was the cause. The reason is that all new ProLiant systems default to a "balanced power mode" that throttles server components when they are not fully used. In this mode the BIOS has to decide when to throttle, but the BIOS often doesn't know, or cannot decide on a reliable basis, whether performance can be reduced to save power or not. Only the OS can do this, no matter whether it is a hypervisor like ESXi, Windows or Linux. Therefore, the power management can be set to balanced, static low, static high or OS control mode. As already said, balanced is the default value.

Static low will save a lot of power but will turn your super fast, super expensive server into a "laptop-performing system". Static high will ignore power savings entirely and will always keep the system at peak performance. OS control will hand control over power management to the OS.

VMware did an excellent benchmark of the various power settings and the results are clear: BIOS-handled power management (balanced mode) is the worst you can do from a performance perspective and will result in unpredictable performance. It is best to use either full power mode or OS control, especially with the latest versions of vSphere and Windows. See the full report here. Additional information for vSphere 5.5 and above can be found here and here.

HP ProLiant/NVIDIA GPU limitations

Using GPUs for arithmetic-intensive workloads is becoming a more and more common setup. With NVIDIA GPUs like Tesla or Kepler-based boards, the performance of GPU-based computing is getting extremely high and is therefore becoming more and more interesting for customers.

Besides the high power draw and the cooling these cards need inside modern server systems, there is another hard limitation I wasn't aware of until today.

For HP ProLiant servers (and probably those of all other vendors too, as it is a general problem rather than a vendor-specific one) there is a hard limit on the amount of memory one can use inside a server system when using NVIDIA-based GPUs. Because of memory addressing limitations inside the NVIDIA GPU, the maximum amount of usable RAM is limited to less than 1 TB.

1 TB sounds like quite a lot and in fact, it is. But with memory capacities rising steadily and costs dropping, the time for systems with more than 1 TB is near. Even a standard HP ProLiant DL380 Gen9 system is capable of up to 1.5 TB of RAM. Additionally, in HPC environments where GPUs are used, memory is often a key factor.
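The arithmetic is simple enough to check for a planned configuration. The following is a minimal sketch in Python, assuming 1 TB is counted as 1024 GB and that the limit is strict ("less than 1 TB"); the DIMM counts and sizes in the example are illustrative assumptions, not values taken from the HP advisory.

```python
GPU_RAM_LIMIT_GB = 1024  # 1 TB limit described above, taken here as 1024 GB


def total_ram_gb(dimm_count: int, dimm_size_gb: int) -> int:
    """Total installed RAM for a uniform DIMM population."""
    return dimm_count * dimm_size_gb


def exceeds_gpu_limit(ram_gb: int) -> bool:
    """True if the RAM size is not strictly below the 1 TB limit."""
    return ram_gb >= GPU_RAM_LIMIT_GB


# Illustrative configurations (assumed DIMM counts and sizes).
for dimms, size in [(24, 32), (16, 64), (24, 64)]:
    ram = total_ram_gb(dimms, size)
    print(f"{dimms} x {size} GB = {ram} GB, over limit: {exceeds_gpu_limit(ram)}")
```

Under these assumptions, 24 x 64 GB gives 1536 GB (the 1.5 TB mentioned above) and is over the limit, while 24 x 32 GB gives 768 GB and stays below it.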

So don't be fooled into thinking that this amount of RAM is only seen in very rare and special configurations. It isn't.

There is an advisory from HP linking to some NVIDIA documentation for all of you who want to get a closer look at this limitation.

Installing VMware ESXi on SD card on HP ProLiant Gen9

A few days ago one of our customers got some brand new BL460c Gen9 systems which will be used as VMware hypervisor hosts. As boot medium, a 32 GB SD card was configured and installed in the internal SD card slot that every ProLiant since G7 offers by default.

During setup the ESXi installer was unable to detect the SD card and thus couldn't find any suitable target for installation. Quite strange, as we were 100% sure that the SD card was installed and properly detected. Furthermore, we used the latest HP-optimized ESXi 5.5 installer ISO, so drivers shouldn't have been the problem.

A few minutes of searching on the internet turned up an HP advisory stating that this is probably caused by a bug in the iLO4 firmware and is solved by upgrading to version 2.03. Unfortunately, this firmware was already installed, and the SD card was listed among the possible boot media in the UEFI BIOS.

Simply because I had no other idea, I switched from UEFI to legacy BIOS to make the Gen9 systems behave like former Gen8 servers and voila, the SD card was detected by the installer. As ESXi 5.5 doesn't benefit in any way from a UEFI BIOS, this isn't a problem at all. We cross-checked this on three other hosts (same model, same config, same blade enclosure) and could reproduce the behavior every time. So it seems that UEFI in combination with the internal SD card reader is a "no-go" config for ESXi 5.5 installations.

3PAR StoreServ peer persistence in iSCSI environments

The peer persistence feature of the 3PAR StoreServ family brings transparent failover to the mid-range storage systems from HP, a feature absolutely fundamental for the European, and especially the German, market. Nearly two thirds of all our storage projects are based on a two-datacenter concept using synchronous mirroring between the sites and fully automated failover in case of a site failure. Transparent failover on the storage side is one of the key concepts behind these environments.

A second evolving technology is iSCSI. For quite a long time, iSCSI was called the FC successor, but it never caught up with FC if you ask storage admins. They like the ease of FC, the reliability, the small overhead, the pure performance. In terms of raw bandwidth, however, iSCSI has caught up with and outpaced FC. While FC is still limited to 16 Gbit, Ethernet is available at 100 Gbit and still evolving. Don't ask whether a 40 Gbit or even 100 Gbit SAN makes sense in the mid-range market, but if you want to spend a lot of money and have a storage system that is capable of saturating 40 Gbit and more, feel free to do so. With FC you are still at 16 Gbit, which is more than enough for 99% of all mid-range companies.

HP SmartArray issue

Happy New Year to everyone. A bit late, but since this is my first post in 2015, I'm taking the first chance I get.

Unfortunately, my first post is about a support document HP released a few days ago regarding SmartArray controllers and servers that hang when you run some very common check tools.

The advisory lists the following adapters as affected: HP SmartArray Px2x series, Px3x series, B120i and B320i series. So roughly all ProLiant Gen8 servers are affected, as they had these controllers built in by default. The new Px4x controllers are not affected, so Gen9 servers are okay.

The problem occurs in combination with these driver versions for Windows:

  • hpsa2 version 6x.8.0.xx
  • Hpsa version 6x.10.0.xx
  • Hpsa3 version 6x.0.0.xx

The problem causes Windows-based servers to freeze and stop responding if you run the Windows tool chkdsk with the options /r, /b or /f. With these flags set, the chkdsk command uses SCSI to verify volumes, and this causes the SA controllers to hang.
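On affected servers it can be useful to check maintenance scripts for these options before they run. The following is a minimal sketch in Python, assuming the command line is available as a plain string with whitespace-separated options; the function name is hypothetical, and the check only looks for the three options named above.

```python
from typing import Set

RISKY_FLAGS = {"/r", "/b", "/f"}  # chkdsk options named in the text above


def risky_chkdsk_flags(command_line: str) -> Set[str]:
    """Return the risky options used if the command line invokes chkdsk."""
    tokens = command_line.lower().split()
    # Only inspect command lines whose first token is chkdsk (with or without path).
    if not tokens or not tokens[0].endswith(("chkdsk", "chkdsk.exe")):
        return set()
    return RISKY_FLAGS.intersection(tokens[1:])


for cmd in ["chkdsk C: /F /R", "chkdsk C:", "dir /b"]:
    print(cmd, "->", sorted(risky_chkdsk_flags(cmd)))
```

A read-only run such as `chkdsk C:` passes the check, while any call with /r, /b or /f is reported so that it can be skipped until the driver update is installed.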

For the B120i and B320i a driver update is already available. The P controllers, which surely have a wider installation base, still suffer from this problem as there is currently no driver update available. HP's suggested workaround is simply not to use the chkdsk command. Well, a quite basic workaround, but it works 100%.

HP classifies the driver update as critical, so as soon as the new version is available (as it already is for the B controllers), everyone should install it immediately.

Links to the updated drivers can be found in the advisory; the link to the advisory is here.

2017 v-strange.de