Published: Thursday, 23 June 2016 16:03
A few days ago I was at a customer that uses Veeam to back up his vSphere environment. Nothing special in the configuration. We have run this setup for a few years now and it has been bulletproof.
Recently we upgraded to Veeam v9 and vSphere 6 to be on the latest major versions both vendors offer. Since then we have had a strange problem that was overlooked for a while.
The customer's security requirements deny access from any system on the internal network to the DMZ, especially from the backup server. To get fully consistent backups of VMs running in the DMZ we use Veeam VAAIP agents uploaded to the VMs via VIX because RPC (admin share access via CIFS) is not allowed. This worked perfectly even with Veeam v9 and vSphere 6. Perfectly, that is, until the customer upgraded the VMware Tools on his DMZ VMs. In the weeks before the problem arose the VMs ran on vSphere 6 but with VMware Tools from 5.5. The moment he upgraded to VMware Tools 10.x the VIX upload stopped working. The problem only hits the few VMs that cannot use RPC as an alternative upload method.
Published: Monday, 06 June 2016 08:23
Microsoft released a convenience rollup update for Windows 7 and Server 2008 R2 in late May 2016. This rollup update brings your installation to the latest patch level with only a few installations, but it has a bad side effect. All VMs that run on vSphere (it seems that ALL vSphere versions are affected) and use the VMware proprietary VMXNET3 virtual network card will probably be hit by network issues. The reason is that after applying the update, the OS creates a new vNIC while the settings for the old vNIC still reside in the registry. This leads to network problems.
You can get some more details at the link above or here. VMware currently recommends delaying installation of this rollup package until there is a resolution from Microsoft.
Published: Tuesday, 26 April 2016 16:32
With vSphere 6 Update 2 VMware fixed several bugs but also introduced a new one. Update 2 includes a critical bug in the VMXNET3 vNIC that can produce PSODs (purple screens of death). The problem occurs if:
- the VM is running virtual hardware version 11
- the VM is configured with a VMXNET3 virtual NIC
- Large Receive Offload (LRO) for VMXNET3 NICs is enabled
Currently there is no fix, only a workaround. All details are published in KB2144968.
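To find out quickly which VMs in an inventory export fall under the three conditions listed above, a small check like the following can help. This is only a sketch: the `VmInfo` record and its field names are hypothetical and not part of any VMware API, and it assumes that all three conditions must hold at the same time. Please take the authoritative details from the KB article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VmInfo:
    """Hypothetical inventory record; field names are illustrative only."""
    name: str
    hardware_version: int                      # e.g. 11 for virtual hardware version 11
    nic_types: List[str] = field(default_factory=list)
    lro_enabled: bool = True                   # assumption: LRO state is known per VM


def matches_psod_conditions(vm: VmInfo) -> bool:
    """Return True only if all three conditions from the list hold."""
    has_vmxnet3 = any(t.lower() == "vmxnet3" for t in vm.nic_types)
    return vm.hardware_version == 11 and has_vmxnet3 and vm.lro_enabled


def affected_vms(inventory: List[VmInfo]) -> List[str]:
    """Names of all VMs that match the conditions."""
    return [vm.name for vm in inventory if matches_psod_conditions(vm)]
```

Feeding it with a list built from your own inventory export gives you the names of the VMs to look at first.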
Published: Wednesday, 20 April 2016 13:09
With the latest Gen9 systems HPE released the new generation of SmartArray controllers. These x4x controllers like P440, P441 or P841 are the fastest, most feature-rich and most flexible SmartArray controllers ever. With an impressive maximum performance of up to 1 million IOPS and a maximum number of up to 200 disks (P441) these controllers are pretty well suited for Software Defined Storage systems or any other DAS-related solution.
I, personally, use these controllers in DataCore environments where performance and scalability are key factors and where flash devices are pretty common. Most older RAID controllers are not very well suited for SSDs or flash in general, since their CPU or general bandwidth becomes a bottleneck or, most often seen, they simply can't handle flash traffic very well.
The x4x controller series was developed to be used in conjunction with flash, so these bottlenecks and limitations should be a thing of the past.
With this knowledge in mind I was pretty surprised to see some massive performance drops in the I/O backend from time to time during my last DataCore projects. These drops were so severe that the complete I/O stopped for a period of between 2 and 20 seconds. Fortunately most of these drops occurred during the DataCore pool initialization, when there was no other I/O on the system except the initialization traffic itself. By way of explanation, during pool initialization the DataCore software simply writes zeros to the disks to prepare them for use within DataCore. This is a 100% sequential I/O stream without any special requirements. It simply writes zeros. So it was pretty clear that nothing could produce these drops except the controller itself.
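If you have recorded the backend throughput at a fixed interval, drops like the ones described above can be picked out with a few lines of code. The following is a minimal sketch and not the method used in the projects: it assumes one throughput sample per second and treats every run of zero-throughput samples lasting at least 2 seconds as a stall. The function name and parameters are made up for illustration.

```python
from typing import List, Tuple


def find_stalls(mb_per_s: List[float], interval_s: float = 1.0,
                min_duration_s: float = 2.0) -> List[Tuple[int, float]]:
    """Return (start_index, duration_in_seconds) for every run of zero samples
    that lasts at least min_duration_s."""
    stalls = []
    run_start = None
    for i, value in enumerate(mb_per_s):
        if value == 0:
            if run_start is None:
                run_start = i              # a new run of zero samples begins
        elif run_start is not None:
            duration = (i - run_start) * interval_s
            if duration >= min_duration_s:
                stalls.append((run_start, duration))
            run_start = None
    # Handle a stall that is still ongoing at the end of the recording.
    if run_start is not None:
        duration = (len(mb_per_s) - run_start) * interval_s
        if duration >= min_duration_s:
            stalls.append((run_start, duration))
    return stalls
```

A single zero sample is ignored with these defaults, so short measurement gaps do not show up as stalls.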
Published: Monday, 18 April 2016 09:02
For all DCIE the livestop command is a very powerful and often used command to bring the SSY-V software back on track. The livestop command restarts the management services of SSY-V without affecting the storage virtualization layer, so I/O can still pass through the system. Formerly, with SANmelody or SANsymphony, the virtualization layer and the management layer were backed by two different services, so one was able to restart one of them without the other. With SSY-V DataCore merged both functions into a single service, making the restart of the management layer a bit harder to do.
Livestop is normally requested by DataCore support whenever you open a case and something is wrong on the management layer. Unfortunately, in some of our installations livestop is rather common because we make heavy use of PowerShell to gather information about the DataCore environment and use it in ICINGA/Nagios. Although DataCore offers tons of perf counters, the ones we need aren't natively available, so we have to calculate them by getting data with PowerShell and combining it to get what we want.
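To illustrate the idea of combining raw counters into a derived value for ICINGA/Nagios, here is a minimal sketch. It is written in Python rather than PowerShell and does not reflect our real scripts: the counter names and the derived metric (an average latency computed from a total time and an operation count) are purely hypothetical. Only the plugin conventions are standard, that is exit codes 0, 1 and 2 for OK, WARNING and CRITICAL, and performance data after the `|` character.

```python
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def derive_avg_latency_ms(total_time_ms: float, operations: int) -> float:
    """Combine two raw counters (hypothetical names) into one derived value."""
    if operations <= 0:
        return 0.0                      # no operations, nothing to average
    return total_time_ms / operations


def nagios_result(label: str, value: float, warn: float, crit: float) -> Tuple[int, str]:
    """Build the exit code and the plugin output line including perfdata."""
    if value >= crit:
        code, status = CRITICAL, "CRITICAL"
    elif value >= warn:
        code, status = WARNING, "WARNING"
    else:
        code, status = OK, "OK"
    line = f"{status} - {label}={value:.2f} | {label}={value:.2f};{warn};{crit}"
    return code, line
```

A wrapper script would print the returned line and exit with the returned code so that the monitoring system can pick up both the state and the graphable value.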
The PowerShell implementation seems to have some kind of memory bug because after a while (and this can be only a few days), the management service denies access via PowerShell. I already wrote about this problem here.
Published: Wednesday, 13 April 2016 21:02
Two of my customers complained about an error that, after recent DataCore SSY-V updates, fills up their event logs in the DCS GUI.
The error is:
WARNING: A time out has occurred while recording performance data. This could be a result of insufficient processing resources on the selected recording server. Reduce the amount of counters configured in the recording session or select a recording server with sufficient processing resources
I checked all settings but no performance recording was active. To be more specific, not even a recording server was configured. In one environment I'm pretty sure there was never such a recording configured.
Concerned about unneeded I/O on the DCS or any buffer overflow I talked to DataCore support and they told me that this is a known issue with recent SSY-V versions and that the problem probably comes from formerly configured recordings that were not properly imported into the new config. These error messages are annoying but not harmful at all, so you can simply ignore them until the problem is fixed in a new PSP. By the way, even with PSP4 Update 1 the problem still exists.
Hopefully the problem will be solved in PSP5, which is planned to be available within the next few days.
Published: Wednesday, 30 March 2016 21:55
Actually I'm on vacation this week but the following two bugs should be taken care of:
- VMware once more has a critical bug in vSphere 6 that causes the storage subsystem to not switch over to an alternate path when a PDL (permanent device loss) is encountered on the active path. A PDL isn't a very common thing, especially one that affects only some paths of a LUN, but it can happen, so please update to vSphere 6 Update 2 to solve this issue. Check KB2144657 for further details.
- Windows Server 2012 R2 is affected by a critical iSCSI bug that could lead to data corruption in the event of path failover and recovery. A hotfix is available and should be installed as soon as possible. More information here. Although only iSCSI environments should be affected, this could also be an issue in FC SANs, so our recommendation is to install the hotfix on all SAN-attached systems.
Published: Tuesday, 08 March 2016 10:24
It's been a while since my last blog entry on this page. Don't worry, I'm still alive and will keep this blog updated, but other things were more important in the first weeks of the year so I neglected my blogging a bit.
Back on track, I will give you a short note about a "new" best practice guide from DataCore. For a long time DataCore was a bit close-lipped regarding iSCSI configuration recommendations. Obviously they focused on FC installations and handled iSCSI as a "should work right out of the box" setup. I remember opening a case and asking support about jumbo frames two or three years ago. This case is still open.....
But in the meantime DataCore did their homework and created a perfect guide for iSCSI implementations. The main statement still is "everything should work out of the box" but what follows it goes into a lot of detail. They talk about jumbo frames, receive side scaling, latencies etc. in this 10-page document and it is worth reading for all DCIE who are about to install iSCSI-based SSY-V solutions.
For everyone else, such as admins, this guide is also a must-read and gives you some valuable background information on the iSCSI protocol.
You can get the guide at https://datacore.custhelp.com/app/answers/detail/a_id/1626 (you will need a support login to view or download this guide).