Dealing with connection issues in an FC SAN environment isn't my favorite thing, and chasing random errors logged in vCenter's event log definitely belongs to the things I don't like at all. These errors, especially when they occur at random without any pattern, are quite difficult to troubleshoot because you can't predict when they will occur next. So observing them is hard, and checking whether your changes had a positive effect is even harder.
This time a customer called me and told me he was getting several "lost access to volume due to connectivity issues" errors, each followed a few seconds later by the corresponding "Successfully restored access to the volume following connectivity issues". The errors popped up in vCenter's event log and mainly occurred during the night, so my first guess was a storage problem related to high load during backup sessions. Before I could even set up some test monitoring, the customer did some "self-service" and moved some VMs to another host, because only three datastores were mentioned in the error logs and all VMs on these datastores ran on the same vSphere host. So vMotioning the VMs was a valid option. After the move the errors kept popping up, but this time not only during the night but throughout the whole day. The time between the errors was random, with no pattern. The interesting thing was that after the VMs were moved, the errors stopped occurring on the former host and started to pop up on the destination host. So the error seemed to move with the VMs.
What was so special about these VMs? Well, they all ran on the same datastores, but there were other datastores as well and none of them caused such error messages. The next thing to keep in mind was that the error only occurred on vSphere hosts that actively used the datastores.
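To check whether the errors really follow the VMs and their datastores, it helps to pair each "lost access" event with its "restored" event and count them per host and datastore. The following is only a minimal sketch: the input layout (timestamp, host, datastore, message), the host and datastore names and the timestamps are assumptions made for illustration and are not what vCenter exports by default. Only the two message texts come from the events described above.

```python
from collections import defaultdict
from datetime import datetime


def pair_outages(events):
    """Pair 'lost access' with 'restored access' events.

    events: iterable of (iso_timestamp, host, datastore, message) tuples.
    This layout is a hypothetical export format, not a vCenter API.
    Returns a list of (host, datastore, start_time, duration_seconds).
    """
    open_since = {}  # (host, datastore) -> time the access was lost
    outages = []
    # ISO timestamps in the same format sort chronologically as strings.
    for ts, host, datastore, message in sorted(events):
        when = datetime.fromisoformat(ts)
        text = message.lower()
        key = (host, datastore)
        if "lost access to volume" in text:
            # Keep the first loss if several are logged before a restore.
            open_since.setdefault(key, when)
        elif "restored access to" in text and key in open_since:
            start = open_since.pop(key)
            outages.append((host, datastore, start, (when - start).total_seconds()))
    return outages


def count_by(outages, index):
    """Count outages by one tuple field (0 = host, 1 = datastore)."""
    counts = defaultdict(int)
    for outage in outages:
        counts[outage[index]] += 1
    return dict(counts)


# Hypothetical sample data, only to show the call pattern.
sample_events = [
    ("2015-03-01T02:10:05", "esx01", "msa-ds1",
     "Lost access to volume due to connectivity issues"),
    ("2015-03-01T02:10:09", "esx01", "msa-ds1",
     "Successfully restored access to the volume following connectivity issues"),
    ("2015-03-01T14:30:00", "esx02", "msa-ds1",
     "Lost access to volume due to connectivity issues"),
    ("2015-03-01T14:30:07", "esx02", "msa-ds1",
     "Successfully restored access to the volume following connectivity issues"),
]

outages = pair_outages(sample_events)
print(count_by(outages, 0))  # outages per host
print(count_by(outages, 1))  # outages per datastore
```

If the per-datastore counts stay stable while the per-host counts shift after a vMotion, that supports the observation that the problem belongs to the datastores and not to a single host.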
A quick Google search turned up some possible explanations, but none of them seemed to be the real cause. HA datastore heartbeating was one of the possible troublemakers, but excluding the affected datastores from HA heartbeating didn't solve the problem. Other possible causes like broken cabling, SFP problems or HBA errors could also be ruled out, as everything seemed to be fine in the SAN.
The next step was to find the difference between the datastores affected by the problem and those not affected. That turned out to be a good idea: the affected datastores all came from two new MSA2040 storage systems, whereas the unaffected datastores were all served by older P2000 systems. Checking the path policies and everything else around the MSA2040 datastores showed no problems. So what is the difference between an MSA2040 and a P2000? Both are FC models, but the P2000 runs at 4GBit and the MSA2040 at 8GBit. The SAN switches are all Brocade 300 models running at 8GBit line speed. The vSphere hosts are all HP ProLiant servers with 8GBit FC HBAs.
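The comparison above can be sketched as a small helper that lists every attribute whose values never overlap between the affected and the unaffected group. This is just an illustration of the reasoning, not a tool I used on site. The datastore names and the attribute keys are made up, and the values are the ones mentioned in this post.

```python
def distinguishing_attributes(affected, unaffected):
    """Return attributes whose values differ completely between two groups.

    affected / unaffected: dicts mapping a datastore name to a dict of
    attributes. Names and attribute keys are hypothetical.
    """
    groups = list(affected.values()) + list(unaffected.values())
    # Only compare attributes that every datastore actually has.
    shared_keys = set.intersection(*(set(attrs) for attrs in groups))
    result = {}
    for key in sorted(shared_keys):
        values_affected = {attrs[key] for attrs in affected.values()}
        values_unaffected = {attrs[key] for attrs in unaffected.values()}
        # An attribute is a candidate only if the two groups share no value.
        if values_affected.isdisjoint(values_unaffected):
            result[key] = (values_affected, values_unaffected)
    return result


# Values taken from the environment described in the post;
# the datastore names are invented for the example.
affected = {
    "msa-ds1": {"array": "MSA2040", "array_speed_gbit": 8,
                "switch": "Brocade 300", "hba_speed_gbit": 8},
    "msa-ds2": {"array": "MSA2040", "array_speed_gbit": 8,
                "switch": "Brocade 300", "hba_speed_gbit": 8},
}
unaffected = {
    "p2000-ds1": {"array": "P2000", "array_speed_gbit": 4,
                  "switch": "Brocade 300", "hba_speed_gbit": 8},
}

for name, (bad, good) in distinguishing_attributes(affected, unaffected).items():
    print(name, "affected:", bad, "unaffected:", good)
```

With this data only the array model and the array link speed remain as candidates, because switch model and HBA speed are identical in both groups.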
Whenever you see such a combination and run into problems in the SAN, one word should always hammer through your brain: FILLWORD.
Fillwords on a Brocade switch are primitive signals (not real data frames) that are sent on the link during idle times when no traffic occurs. With the upgrade from 4GBit to 8GBit, the default fillword sent during these idle times changed from IDLE to ARB, a more suitable fillword for the increased speed. Unfortunately some major storage systems had massive problems with the new fillword, so Brocade changed the default fillword back to IDLE. This mode was known to cause much less trouble in 4GBit environments. The problem here is that newer storage systems do a better job with ARB than with IDLE. Using the default IDLE fillword with modern 8GBit storage systems can cause problems, and that is the reason for the vCenter errors mentioned above. The MSA2040 works better with ARB as fillword, whereas the P2000 works with the IDLE fillword. As the default fillword is IDLE, you have to switch it to ARB on all ports your MSA2040 is connected to. Changing the fillword on the ports of the vSphere host HBAs isn't necessary, but it isn't problematic either.
You can change the fillword on a specific port by connecting to the SAN switch via SSH as admin or root (the console is the only supported way to change the fillword settings, there is no way to do it in the GUI today) and sending the command "portcfgfillword portnumber 3". Replace portnumber with the number of the port you want to change. The 3 is the fillword mode: 0 is IDLE, and 3 tries ARB first and falls back to IDLE for link initialization with ARB as fillword if the link doesn't come up. Modes 1 (ARB only) and 2 (IDLE for link initialization, ARB as fillword) are rarely used.
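If the MSA2040 is connected to several switch ports, typing the command for each port gets tedious. Here is a minimal sketch that only builds the command lines for a list of ports, so you can review them before pasting them into the SSH session. The function name and the port numbers are assumptions for illustration. Only the "portcfgfillword portnumber mode" syntax comes from the paragraph above.

```python
def fillword_commands(ports, mode=3):
    """Build one 'portcfgfillword <port> <mode>' line per switch port.

    ports: iterable of port numbers the storage system is connected to.
    mode:  fillword mode, 0 (IDLE) to 3 (ARB with fallback), as described above.
    This only generates text; it does not connect to any switch.
    """
    if mode not in (0, 1, 2, 3):
        raise ValueError(f"unknown fillword mode: {mode}")
    commands = []
    # De-duplicate and sort so the output is stable and easy to review.
    for port in sorted(set(ports)):
        if not isinstance(port, int) or port < 0:
            raise ValueError(f"invalid port number: {port!r}")
        commands.append(f"portcfgfillword {port} {mode}")
    return commands


# Hypothetical example: the MSA2040 controllers hang on ports 4 and 5.
for line in fillword_commands([4, 5]):
    print(line)
```

Remember that every switch in the fabric with an MSA2040 port needs its own run. Checking the result with portcfgshow on the changed ports afterwards is a sensible follow-up.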
After the fillword change, all errors were gone.