In January this year I wrote about the problems we ran into while implementing a Wide Area SAN. The steps mentioned in that article seemed to resolve them, but unfortunately they hit us again after a while. This time we were faced with intermittent performance problems, hang-ups and unplanned resyncs in the DataCore SAN.
Looking at the software did not reveal any misconfiguration or other factor that could lead to such an unstable environment. Nevertheless we reconfigured the storage behind the DataCore software to favor maximum performance over capacity. Things got a bit better, but every now and then we still had these terrible performance hits.
At the SAN switch level we did see some CRC errors as well as enc_in/enc_out errors, so we asked the provider to test the lines again. They ran a 2-hour test, but nothing showed up: speed was perfect, latency was extremely low and there were no errors at all. After we reconnected our equipment and waited 1-2 weeks, multiple errors showed up again. As we can only control our own environment, we stopped and restarted the FC ports used as ISLs, but that didn't have any positive effect. Error counters suddenly climbed to several thousand and rendered our SAN nearly unusable. (Don't focus on the absolute number: trouble can start with a low error count too, as these are the first signs of upcoming problems.) Sometimes deactivating one fabric and using only the second one solved the problem for a short time, but permanently running without redundancy can't be the solution.
During one of these "problematic phases" we talked to the provider again and, as always, got the answer that they did not see any errors on their side. Nevertheless we asked them to reset the line card in their WDM equipment. Almost instantly the performance went back to normal. Crazy: resetting the ports on our Brocade switches did not do anything, but doing the same on their WDMs solved the problem right away. We suspected a defect in the WDM line card, so we asked the provider to replace it. After a short discussion they agreed and swapped the line card. Things ran smoothly for several weeks. Then the errors on our ports suddenly started to grow again and once more hit the performance of our SAN (we are highly dependent on a stable and fast ISL between the two DataCore nodes). Having learned from the past, we asked the provider to reset the line card, and again the problem was solved. But this was the line card the provider had already replaced, so how likely is it that two line cards suffer from the same defect?
During one of the escalation conference calls we got a hint from one of the provider's specialists that these symptoms could be caused by incompatible SFP modules in the WDM and the Brocade SAN switches. As Brocade has a list of qualified SFPs you are allowed to use, we decided to change the SFPs in the WDMs. The WDMs take any SFP, so we used exactly the same SFPs for the WDM as for the SAN switches. Since the day we changed these SFPs, no errors AT ALL have shown up in our switch logs. So it seems that these environments heavily depend on a proper interaction between the SFPs. It seems that if you use "incompatible" SFPs, things will work, but every now and then some problem occurs that forces one of the two components to inject errors into the data stream without logging ANYTHING. As far as these SFPs are concerned, everything is fine, but the data stream is quite unusable.
This example also shows that testing the lines for 1-2 hours is useless. These errors only appear after several days (or more) and several million or billion packets. You should test the lines for several days before taking them into production, just as you would with a critical server's memory. A memory test that runs for only a few hours and only a single pass is no evidence of error-free operation.
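To put a rough number on this, here is a minimal sketch in Python. It assumes, purely for illustration, that error bursts arrive independently at a constant average rate (a Poisson process) of one burst every 14 days, which loosely matches the 1-2 weeks we had to wait. Real link errors may well cluster differently, and the function name and the rate are hypothetical.

```python
import math


def detection_probability(test_hours: float, mean_hours_between_events: float) -> float:
    """Chance of seeing at least one error event during a test window.

    Assumes events follow a Poisson process. This is an illustrative
    assumption, not a measured property of any real link.
    """
    if test_hours < 0 or mean_hours_between_events <= 0:
        raise ValueError("test_hours must be >= 0 and the mean gap must be > 0")
    # P(at least one event) = 1 - P(no event) = 1 - exp(-t / mean_gap)
    return 1.0 - math.exp(-test_hours / mean_hours_between_events)


if __name__ == "__main__":
    mean_gap = 14 * 24  # assumed: one error burst every 14 days, in hours
    for label, hours in [("2 hours", 2), ("1 day", 24), ("7 days", 168), ("14 days", 336)]:
        print(f"{label:>8}: {detection_probability(hours, mean_gap):.1%}")
```

Under these assumptions a 2-hour test catches the problem in well under 1% of runs, and even a test that lasts as long as the average gap between bursts still misses it more than a third of the time.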
One more word on the forum entries that discuss what an acceptable error count on ISL ports is. No matter whether you talk about CRCs, enc_in, enc_out, discards or even low buffer credits, don't get fooled. The only acceptable number of errors in a production environment is ZERO! Everything else will hit you sooner or later.
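The zero-tolerance rule is easy to turn into a check. The following is only a sketch in Python: it assumes you already collect per-port counter snapshots somehow (how you read them from your switches is out of scope here), and the function name, port labels and dictionary keys are hypothetical. It flags every counter that increased between two snapshots, no matter how small the increase.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot layout: {port label: {counter name: value}}
Snapshot = Dict[str, Dict[str, int]]


def find_new_errors(previous: Snapshot, current: Snapshot) -> List[Tuple[str, str, int]]:
    """Return (port, counter, increase) for every counter that grew.

    The threshold is deliberately zero: any increase is reported.
    A counter that went down (for example after the statistics were
    cleared) is ignored rather than reported.
    """
    findings = []
    for port, counters in current.items():
        old = previous.get(port, {})
        for name, value in counters.items():
            delta = value - old.get(name, 0)
            if delta > 0:
                findings.append((port, name, delta))
    return sorted(findings)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    before = {"isl-a": {"crc": 0, "enc_in": 0, "enc_out": 0, "discards": 0}}
    after = {"isl-a": {"crc": 3, "enc_in": 0, "enc_out": 1, "discards": 0}}
    for port, name, delta in find_new_errors(before, after):
        print(f"ALERT {port}: {name} increased by {delta}")
```

Run against two consecutive snapshots, this reports even a single new CRC error instead of waiting for some "acceptable" threshold to be crossed.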
We have learned our lessons; perhaps these articles can help you avoid the same problems.