Fun with the latest HPE SmartArray controllers

With the latest Gen9 systems, HPE introduced a new generation of SmartArray controllers. These x4x controllers, such as the P440, P441 or P841, are the fastest, most feature-rich and most flexible SmartArray controllers so far. With a maximum performance of up to 1 million IOPS and support for up to 200 disks (P441), these controllers are well suited for software-defined storage systems or any other DAS-related solution.

I personally use these controllers in DataCore environments where performance and scalability are key factors and where flash devices are common. Most older RAID controllers are not well suited for SSDs or flash in general: their CPU or overall bandwidth becomes a bottleneck or, most often, they simply can't handle flash traffic well.

The x4x controller series was developed for use with flash, so these bottlenecks and limitations should be a thing of the past.

With this in mind, I was quite surprised to see occasional massive performance drops in the I/O backend during my last DataCore projects. The drops were so severe that all I/O stopped for between 2 and 20 seconds. Fortunately, most of them occurred during DataCore pool initialization, when there was no I/O on the system other than the initialization traffic itself. By way of explanation: during pool initialization the DataCore software simply writes zeros to the disks to prepare them for use within DataCore. This is a 100% sequential I/O stream without any special requirements. So it was pretty clear that nothing but the controller itself could be producing these drops.

 

Here is what I did to produce these drops: I installed the OS and the DataCore software and then created some RAID5 and RAID6 arrays on the P840 controller with the Smart Storage Administrator (SSA) software. SSA is a Windows-based tool that provides a GUI for configuring SmartArray controllers. The P840 was on the latest firmware (3.56) and the SSA software came from the SPP bundle of Oct. 2015. Not the latest version, but that shouldn't matter.

After creating the first RAID sets I added the new disks to the DataCore pool to let the initialization start. The process was extremely fast, with up to 600MB/s for EACH RAID5 array. Each array consisted of 5x900GB 10k SAS drives, so I was pretty impressed by the power of that controller.

During the initialization I created another array and suddenly the initialization process nearly stopped. Throughput dropped from 600MB/s to a few bytes/sec on all arrays I had in the pool.

[Image: SA drop SSYV]

In the picture above you can see the drop to zero for about 20s (20:27:52 until 20:28:15). It affected all disks served by the P840 controller. After that the I/O resumed at normal speed and we were back at 600MB/s. So what caused this drop?
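If the throughput graph is exported as a series of per-second samples, stalls like this one can be located automatically. The following is only a minimal sketch: the function name find_stalls, the 1MB/s threshold and the sample data (including the calendar date) are illustrative assumptions, not output from DataCore or the controller.

```python
from datetime import datetime, timedelta


def find_stalls(samples, threshold_mb_s=1.0):
    """Return (start, end, seconds) for each run of samples below the threshold.

    samples: list of (datetime, throughput_mb_s) tuples sorted by time.
    A stall starts at the first slow sample and ends at the first sample
    where throughput is back at or above the threshold.
    """
    stalls = []
    start = None
    last_low = None
    for ts, mb_s in samples:
        if mb_s < threshold_mb_s:
            if start is None:
                start = ts
            last_low = ts
        elif start is not None:
            stalls.append((start, ts, (ts - start).total_seconds()))
            start = None
    if start is not None:
        # The trace ended while throughput was still low.
        stalls.append((start, last_low, (last_low - start).total_seconds()))
    return stalls


# Hypothetical trace shaped like the graph: 600MB/s, then zero from
# 20:27:52 until 20:28:15. The date itself is a placeholder.
base = datetime(2016, 1, 1, 20, 27, 50)
samples = [
    (base + timedelta(seconds=i), 0.0 if 2 <= i < 25 else 600.0)
    for i in range(30)
]

for start, end, seconds in find_stalls(samples):
    print(f"stall from {start:%H:%M:%S} to {end:%H:%M:%S} ({seconds:.0f}s)")
```

The threshold is deliberately far below normal throughput so that ordinary fluctuation is not reported as a stall.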

The simple answer is: any disk config change made on the P840 controller via the SSA causes such a drop. I tested creating a new array -> drop! Deleting an unused array -> drop! Adding or removing a spare on a used or even an unused array -> drop! I tested with SAS, MDL SAS and SSD arrays -> drop!

So no matter which config change I made, as long as it involved reconfiguring any disk on the controller, the drop occurred. Changing general controller parameters such as background scrub priority, transformation priority, cache settings or anything else you can change on the controller itself had no effect on performance.

To take DataCore out of the equation I decided to copy some data on the OS array (same SmartArray controller, RAID1, SAS, 2x300GB, Windows Server 2012R2) to another directory on the same drives. So I stopped DataCore completely and copied the c:\Windows folder to c:\temp\windows. The copy started at ~16MB/s because of the many small files in the Windows folder, but after a few seconds it stayed constant at ~20MB/s. Then I repeated my test and started by adding a spare disk to the RAID1 array... DROP! This time the Windows copy progress bar dropped to 30 bytes/s and stayed there until the config change in the SSA completed. The copy speed then instantly came back to 20MB/s. So even natively on the Windows OS these performance impacts were seen. This is clear evidence of a general misbehaviour of the controller that is not related to any software or the OS itself.
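A test like this can also be done without watching a progress bar. The sketch below is an assumption-laden illustration, not the method used above: it writes zeros sequentially to a file, forces each chunk to disk, and records every chunk that took suspiciously long. The name write_probe, the 1MB chunk size and the 1s stall limit are all made up for the example. To test a controller you would point the path at a volume it serves and trigger the config change while the probe runs.

```python
import os
import tempfile
import time


def write_probe(path, total_bytes, chunk_bytes=1024 * 1024, stall_seconds=1.0):
    """Write zeros sequentially to path and report slow chunks.

    Each chunk is flushed and fsync'ed so the timing reflects the storage
    backend rather than the OS file cache. Returns (bytes_written, slow),
    where slow is a list of (offset, seconds) for every chunk that took
    at least stall_seconds to complete.
    """
    chunk = bytes(chunk_bytes)  # a buffer of zeros
    slow = []
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            t0 = time.monotonic()
            f.write(chunk[:n])
            f.flush()
            os.fsync(f.fileno())
            elapsed = time.monotonic() - t0
            if elapsed >= stall_seconds:
                slow.append((written, elapsed))
            written += n
    return written, slow


# Demo against a temporary directory; replace the path with a file on the
# array under test for a real measurement.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "probe.bin")
    demo_written, demo_slow = write_probe(demo_path, 4 * 1024 * 1024)
    print(f"wrote {demo_written} bytes, {len(demo_slow)} slow chunk(s)")
```

Syncing every chunk costs throughput, so the probe is meant for spotting stalls, not for measuring peak speed.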

How long the REAL SLOW PERFORMANCE lasts depends on how long the config change takes to apply on the controller. During my tests I saw significant differences in how long a config change needed to finish. Sometimes creating an array was done within 3-5s, sometimes creating the same array again took 20s, so I can't say how long you should expect I/O to be stopped. The answer here is always: it depends...

Nevertheless, this is nothing you should expect from a next-gen enterprise-grade controller. Older controllers like the P812 or P410 never showed such behaviour and I never saw anything like this on genuine LSI controllers, so it seems to be some kind of problem with the latest ASICs of the Px4x series.

During normal operations these cards are super fast and an absolute recommendation for anyone who needs power in the DAS area, but for now you should always keep in mind that changing the config on these cards will probably lead to I/O disruption, so plan these changes carefully.

For our DataCore installations this isn't a big problem. We simply put the affected DCS in maintenance mode and do our changes. I/O interruption then isn't a problem at all. But doing online upgrades or general changes is a NO-GO until HPE solves this problem.

We raised a call with HPE to get an explanation for this behaviour and I will keep you up to date as soon as we get feedback from them.

People in this conversation

  • Guest - Elmar

    Hello,
    we see high latency on the pools when a disk in a RAID fails. Can you confirm this?

    Thank you

  • Guest - Oliver Krehan

    Hi Elmar,

    this is normal behavior. As long as the RAID rebuild is ongoing, the volumes served by this array in the pool are pretty slow, especially with a degraded RAID6 but also with RAID5. You can set up your SA controller to handle defective RAID arrays in an alternate way, so perhaps this will reduce the impact, but I haven't tested it yet. The other way to deal with this is to have SSY-V at PSP4 or later; then you can "isolate" badly performing disks for as long as they rebuild. Check out this article to see how this technique works: http://www.v-strange.de/index.php/datacore/14-sansymphony-v/216-what-s-new-in-sansymphony-v-10-psp3-4

    Regards,
    Oliver

  • Guest - G.F.

    hi Oliver,
    We also see this with the P812 and P822 controllers, especially when the controller has high CPU usage (which can be seen in the System Management Homepage). It looks like the controller flushes the whole cache to disk, temporarily switches to write-through mode and then handles the config change. We see that the first change (e.g. adding a RAID1 array) takes the most time. When adding more arrays, the change takes less time and the period of slow performance (latency on the DataCore disk pools) is therefore shorter.
    With the new Service Pack for ProLiant servers we also see this behaviour when just starting the HP Smart Storage Administrator! I am curious what the outcome of your call with HPE will be.

  • Guest - Falk

    Hi,
    this has always been the case with all SA controllers. It only has an effect under heavy I/O, though, when the cache fills up too quickly.
    The configuration of every disk is always written to all disks on the controller, and during that time the disks cannot accept any other I/O. The more disks you have on the controller, the stronger the effect.
    Regards, Falk

2017 v-strange.de