DataCore with direct attached storage - take care!

Using DataCore's storage virtualization software together with direct attached storage (DAS) is a very cost-effective and fairly easy way to build a reasonably fast SAN.

I love this kind of setup myself, but there is a problem you should be aware of.

When using direct attached storage, you normally use some kind of RAID controller to get redundancy at the hardware level. On top of this redundancy you put SANsymphony-V (SSY-V) to mirror all your data. This way you can be quite sure you will almost always have access to your data.

Well, you can't!


The problem is that SSY-V does not deal with hardware failures. It relies on the underlying hardware to function correctly and does no hardware error checking of its own. In that sense the storage hypervisor is fully hardware agnostic.

All error checking has to be done by the hardware itself. Besides using RAID to handle complete hard disk failures, modern RAID controllers use features like "Dynamic Sector Repair" (DSR, as HP calls it on its Smart Array controllers) or "disk scrubbing" to continuously check hard disks for bad blocks and other creeping errors that can render them inoperable.

In redundant configurations (like RAID1/5/6), if DSR finds bad blocks on one of the array disks, it rebuilds the data from the other disk(s) in a new area on the affected disk. If DSR doesn't work for whatever reason, bad blocks can go undetected and data will become corrupted over time. This can happen even in a redundant configuration. RAID alone will not help you here!
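To illustrate why redundancy lets a check like DSR repair a bad block, here is a toy sketch in Python of the XOR parity idea behind RAID 5. It is not how any particular controller implements this in firmware; the block contents and the `xor_blocks` helper are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One RAID 5 stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Assume the check finds that the block on disk 1 is unreadable.
bad_disk = 1
survivors = [blk for i, blk in enumerate(data) if i != bad_disk] + [parity]

# XOR of all surviving blocks in the stripe yields the lost block,
# which a controller could then write to a fresh area of the disk.
rebuilt = xor_blocks(survivors)
print(rebuilt == data[bad_disk])
```

Note that this only works while every other block in the stripe is still readable. If a second block of the same stripe has silently gone bad as well, there is nothing left to rebuild from.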

If SSY-V accesses an undetected bad block on a hard disk, the vdisk on the DataCore Server (DCS) that accessed the bad block is immediately set offline to prevent further data corruption. This is not a problem in a fully mirrored environment, as the second DCS will take over. But what about non-mirrored vdisks, or hitting the bad block during a mirror recovery? Exactly: your only valid data source will be set offline.

This can't happen to you? Well, I always thought the same until I ran into this problem twice within a few weeks. In both cases bad blocks had gone undetected, and one of the DCS shut down uncleanly, which triggered a full recovery. During the recovery, several vdisks containing bad blocks were set offline on the surviving node.

By the way, the only way to reset a vdisk that was set offline due to bad blocks is to reboot the DCS that holds it. This will "erase all knowledge" of the former error on this DCS, and it will set the vdisk online again. Nevertheless, the problem remains. As soon as you access this bad block again (and you definitely have to if you want to restore your mirror), the game starts over. And as long as some of your vdisks are still resyncing, you can't afford to reboot your only working DCS...

To cut a long story short: if SSY-V ever denies access to a vdisk because of bad blocks, you have no way to reliably restore access to this data on this DCS. You can try to reboot the DCS, regain access to the vdisk and move the data to a new location. If you are lucky, the data on that bad block isn't used anymore or is unimportant. But say goodbye to an easy and fast way to recover or remirror all your data from that LUN.

Why does this happen? Do you remember the feature called DSR I mentioned a few lines above? This feature is supposed to make sure such situations never happen. The problem is that it is quite performance intensive: all blocks on all disks have to be checked regularly, and this takes time and resources.

That's why nearly all vendors activate this feature by default but set it to run only in periods of low activity on the system. To be more specific, HP's default is to start DSR only after the disk has seen no activity for 3-15 sec. And the check stops immediately as soon as I/O is sent to this disk.

In a DataCore environment, where all disks are pooled in some kind of RAID0, there is probably never such a period of inactivity on any disk. That's why DSR effectively never starts.
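A rough way to see this is to simulate an idle-triggered check. The sketch below is a simplified model, not HP's actual firmware logic: the `scrub_seconds` function, the 3-second threshold (the low end of the range mentioned above) and both I/O patterns are assumptions chosen purely for illustration.

```python
def scrub_seconds(io_times, idle_threshold, horizon):
    """Return how many seconds an idle-triggered check gets to run.

    io_times: sorted timestamps (seconds) at which host I/O hits the disk
    idle_threshold: seconds of quiet required before the check starts
    horizon: length of the observation window in seconds
    """
    total = 0.0
    last_io = 0.0  # assume the disk saw I/O right at the start of the window
    for t in list(io_times) + [horizon]:
        gap = t - last_io
        # The check only runs for the part of the gap beyond the threshold
        # and stops as soon as the next I/O arrives.
        total += max(0.0, gap - idle_threshold)
        last_io = t
    return total


# A busy pooled disk: one I/O every second for an hour.
busy = [float(s) for s in range(1, 3600)]
# A quiet disk: one I/O every minute for an hour.
quiet = [float(s) for s in range(60, 3600, 60)]

print(scrub_seconds(busy, 3.0, 3600.0))   # 0.0
print(scrub_seconds(quiet, 3.0, 3600.0))  # 3420.0
```

In this model the quiet disk gets 57 minutes of checking per hour, while the busy disk gets none at all, no matter how long you wait.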

It is absolutely normal for hard disks to develop bad blocks. You can't avoid that. That's why all hard disks have some reserved space to remap a fair number of bad blocks. But this relies on check programs like DSR. You should also never think you are safe just because you use RAID1/5/6. If no DSR is running, it is quite possible that every redundant copy of some piece of data ends up on bad blocks. Believe me, this happened to me twice, so I assume the risk is quite high (or I'm really out of luck...).

So what can you do? The answer is quite simple: set your check program to run at a high priority. This can affect your system's performance, but it is the only way to avoid such a deadlock.

Why is only direct attached storage affected? Because none of the SAN storage systems I know rely on periods of inactivity. They simply run their checks on a schedule, regardless of the impact on performance. Data integrity is more important than performance!

So keep that in mind whenever you use DAS in a DataCore system!

2017 v-strange.de