DataCore vDisk design

With server virtualization becoming more and more mainstream and new hypervisor versions supporting LUN sizes of several TB, one could think that vdisk design is becoming less and less important. This should be especially true with software-defined storage like DataCore's SANsymphony-V, as these products pool all available physical disks into a central pool from which all vdisks are created. At a high level a pool is just a simple RAID0 across all available disks, so all data is distributed evenly across them. So why not simply create one single huge vdisk and keep the maintenance effort at a very low level? Besides the fact that this would put all your data at risk if you ever have a problem with that single vdisk, there are some more facts to consider.

 

In DataCore SANsymphony-V there is a global read cache for all vdisks. It uses most of the DataCore Server's (DCS) RAM to store recently accessed data in memory and serve it very quickly to hosts requesting the same data again. The size of the read cache can be configured; by default SSY-V uses 80% of the server's RAM as read cache.
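To put a number on it, here is a minimal sketch of what that default works out to; the 80% ratio is the default mentioned above, the RAM sizes are just example values.

# Rough read cache sizing for a DataCore Server (DCS).
# 80% is the default ratio mentioned above; the RAM sizes are example values.
READ_CACHE_RATIO = 0.80

for ram_gb in (64, 128, 256):
    read_cache_gb = ram_gb * READ_CACHE_RATIO
    print(f"{ram_gb} GB RAM -> ~{read_cache_gb:.0f} GB read cache by default")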

Looking a bit deeper and searching for a write cache, there is only a note that SSY-V will use RAM for write caching, but there is no option to configure its size. You can disable it on a per-vdisk basis, but you can't see how much RAM is used for write caching. The reason is that SSY-V of course uses RAM as write cache, but if users could change the amount, most would probably dedicate a considerable share of the available RAM to it. Think of a standard RAID controller with 1, 2 or 4 GB of BBWC cache: normally you spend 20-60% of it on write caching, especially in a RAID5 or RAID6 scenario. Doing the same with SSY-V and its tens or hundreds of GB of RAM would lead to tens of GB of write cache. Sounds cool, but what happens on a power failure? RAID controller cache modules are backed by batteries or flash; a server can only be backed by a UPS, and that is sometimes hard to handle reliably. So if you raise the amount of write cache on a DCS, you have to make sure all the data in the cache can be written to physical disk as soon as there is a problem. Writing 100 MB to disk isn't a problem, a GB probably isn't either, but 10 GB could be. And even without any problem, the backend storage has to destage all that cached data at some point. If you manage to fill a 10 GB write cache, your storage system is probably not fast enough to flush it... not a scenario you want to be in.
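To make the flush-time argument a bit more tangible, here is a small back-of-the-envelope sketch in Python; the backend write throughput of 500 MB/s is a purely hypothetical example value, not a DataCore figure.

# Back-of-the-envelope: how long does it take to destage a full write cache
# to physical disk? The 500 MB/s backend throughput is a hypothetical example.

def flush_time_seconds(cache_mb: float, backend_write_mb_s: float) -> float:
    """Time needed to write the whole cache content to the backend."""
    return cache_mb / backend_write_mb_s

BACKEND_WRITE_MB_S = 500.0  # assumed sustained backend write throughput

for cache_mb in (100, 1024, 10240):
    seconds = flush_time_seconds(cache_mb, BACKEND_WRITE_MB_S)
    print(f"{cache_mb:>6} MB cache -> ~{seconds:.0f} s to flush at {BACKEND_WRITE_MB_S:.0f} MB/s")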

So DataCore decided to use write caching but to hard-code the amount of RAM used for it. With SSY-V versions below 10 the write cache PER VDISK was fixed at 64 MB; with version 10 this limit was raised to 128 MB. With this in mind you can see why it makes sense to create several smaller vdisks instead of only a few huge ones: write cache scales only with the number of vdisks.
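Because the per-vdisk amount is fixed, the total write cache grows linearly with the number of vdisks. A minimal sketch using the 64 MB (pre-v10) and 128 MB (v10) figures from above:

# Total write cache scales linearly with the number of vdisks because the
# per-vdisk amount is hard-coded (64 MB before SSY-V 10, 128 MB with v10).
PER_VDISK_CACHE_MB = {"SSY-V < 10": 64, "SSY-V 10": 128}

for version, per_vdisk_mb in PER_VDISK_CACHE_MB.items():
    for vdisks in (1, 10, 50):
        total_mb = vdisks * per_vdisk_mb
        print(f"{version}: {vdisks:>2} vdisks -> {total_mb:>5} MB total write cache")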

Besides the write cache there is another important fact to know. We recently ran some performance tests on two customer installations. One installation used only SAS and MDL SAS disks, the other additionally used PCIe flash accelerators. We used IOmeter on a physical box to test read and write performance with 4k blocks, 100% write and 100% read, a single worker and a single, unmirrored vdisk. The first installation showed about 22,000 IOPS with the SSY-V cache enabled and 11,000 IOPS in write-through mode. That was okay for a SAS configuration.

Switching to the second installation we wanted to see the more powerful PCIe card in action. Same test, but the results were almost identical: nearly exactly the same numbers for read and write, no matter whether we pinned the vdisk to the PCIe card or rolled it down to the MDL SAS archive tier. We couldn't believe it, so we carved one of the PCIe cards out of the DCS pool and released it to the OS. We formatted the card with default NTFS settings and reran the tests. This time the card showed 120,000 IOPS in exactly the same tests. That was what we expected from the card.
The reason behind this difference is a bit hard to explain technically, so I will try to put it in a more "human readable" form. DataCore uses a poller on the frontend ports that constantly asks for new requests. These requests are taken and handed to a - let's call it - "storage background worker". This background worker is responsible for the communication with the backend storage and is currently limited to roughly 20,000-25,000 IOPS. There is a separate background worker for each vdisk, but spawning more than one worker per vdisk is currently not supported. In real life that translates into a performance limit of about 25,000 IOPS per vdisk.
That shouldn't be a problem with a pool backed by magnetic disks, but as flash becomes more and more attractive, this puts a massive limit on the power of flash if you choose the wrong vdisk layout. Once again, you can only scale by using more vdisks.
DataCore is aware of this limit and there is currently a change request to support more than one background worker per vdisk in order to fully use the power of flash. Until then, you won't be able to get the full performance out of a single vdisk.
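Put differently, the aggregate IOPS ceiling is roughly the per-vdisk worker limit multiplied by the number of vdisks you spread the load over. A small sketch using the ~25,000 IOPS per-vdisk limit and the ~120,000 IOPS the PCIe card delivered natively in our tests:

# Rough model of the per-vdisk background worker limit described above.
# ~25,000 IOPS per vdisk and ~120,000 IOPS for the raw PCIe card are the
# figures from the tests above; the rest is simple arithmetic.
import math

PER_VDISK_IOPS_LIMIT = 25_000
FLASH_CARD_IOPS = 120_000

for vdisks in (1, 2, 4, 8):
    ceiling = vdisks * PER_VDISK_IOPS_LIMIT
    print(f"{vdisks} vdisk(s) -> aggregate ceiling of ~{ceiling:,} IOPS")

# Number of vdisks needed before the worker limit stops hiding the card's speed:
needed = math.ceil(FLASH_CARD_IOPS / PER_VDISK_IOPS_LIMIT)
print(f"~{needed} vdisks needed to expose the full ~{FLASH_CARD_IOPS:,} IOPS of the card")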

Unfortunately these two facts never showed up in any design guide from DataCore, so perhaps you just followed VMware's or Microsoft's best practices and created a few huge datastores. You could probably get more performance by switching to a "more, smaller vdisks instead of a few huge ones" setup.

 

To cut a long story short: rather use more, and therefore smaller, vdisks than only a few huge ones. A huge vdisk is still the way to go if you need contiguous space for a single VM, but to scale performance, splitting your VMs across several smaller vdisks is the better choice.
