Software Defined Storage Reference Architecture: VMware VSAN
This year will be a year of growth for Software Defined infrastructures. In particular, we will see many Software Defined Networking (SDN) and Software Defined Storage (SDS) deployments. Key players around the SDN and NFV concepts expect heavy consolidation and acquisitions. In 2015 we will see a clear ranking of them with a top player, and as a result the direction of development for these concepts will settle. SDS, in its turn, thanks to capabilities such as its shared-nothing design, flexible multilevel replication, and use of commodity hardware, will finally break up the traditional SAN paradigm.
Like every distinctly new concept, SDS needs a potentially large investment in explanation before the market adopts the new paradigm. Let me present reference architectures for several products that are already available for production use. In this first article of the series, attention is paid to the VMware VSAN solution.
For a proper design of any storage system, it is highly desirable to know the behavior of future workloads in detail: required IOps, write/read ratio, working set size, and how many IOps fall on that working set. Knowing these values allows you to calculate the appropriate percentiles of hot blocks and correctly determine the required number and size of cache SSDs and storage HDDs, their interface speeds, and their write performance and endurance classes. For cases when the behavior of future workloads cannot be determined, VMware recommends assuming a working set size of 10% of the total required usable capacity as a general rule.
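The 10% rule of thumb is easy to express as a calculation. Below is a minimal sketch; the function name and the default fraction are illustrative, with the fraction exposed as a parameter so it can be replaced with measured workload data.

```python
def estimate_working_set_gb(usable_capacity_gb, working_set_fraction=0.10):
    """Estimate the hot working set when real workload data is unavailable.

    VMware's general guidance is to assume the working set is about 10%
    of the required usable capacity; the fraction is a parameter so a
    measured value can be substituted when one exists.
    """
    return usable_capacity_gb * working_set_fraction

# Example: 25 TB (25000 GB) of required usable capacity
print(estimate_working_set_gb(25000))  # 2500.0 GB
```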
VSAN is a VM-centric system, and the replication level, or failures to tolerate number, can be set with per-VMDK granularity. For simplicity, in this example the design will tolerate 1 failure for all hundred VMs. Note that in SDS terms this means the projected storage must tolerate the failure of one failure domain while keeping the VM and its data available. In VSAN 5.5 a failure domain is always a node. Thus, each data block is written twice to local drives on different nodes, and the failure of one node results in an outage of just one replica of the stored data. UPDATE: since the VSAN 6.0 release the failure domain can be configured, and it can be, for example, a rack.
All generated IOps are first cached on the SSD drive. Then, via the proximal IO algorithm, writes that are approximately close to each other are destaged from the SSD cache to magnetic drives in a sequential manner, improving performance. Determining the right amount of SSD space allows the most frequently used hot blocks to be retained in cache while only cold blocks are destaged. UPDATE: a VSAN 6.0 cluster can now be one of two types: the existing hybrid type, which comprises magnetic and SSD drives, and all-flash, which applies the same principle but with high-endurance, ultra-fast SSDs and cost-capacity SSDs. Another important distinction is that the cache layer (high-endurance, ultra-fast SSDs) is used only for writes, since all read requests are served directly by the capacity layer (cost-capacity SSDs).
VSAN puts each disk file into an object. The object is then replicated into components as many times as the failures to tolerate number requires. For 1 failure to tolerate, the object is duplicated, forming a pair of components plus one additional meta component, the witness. Witnesses are small components, just about 2MB, intended to deal with possible split-brain cases.
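The relationship between the failures to tolerate number and the component count can be sketched as follows. This is a simplified model of mirrored objects only (the function name is illustrative): an object needs FTT+1 data replicas, plus at least FTT witnesses so that a quorum of votes survives FTT failures.

```python
def mirror_components(ftt):
    """Components of a mirrored VSAN object tolerating `ftt` failures.

    A mirrored object needs ftt + 1 data replicas, plus enough witness
    components to keep a quorum (more than half of all votes) available
    after ftt failures -- at least ftt witnesses, i.e. 2*ftt + 1
    components in total.  Simplified model: VSAN may place more
    witnesses depending on component distribution.
    """
    replicas = ftt + 1
    witnesses = ftt
    return replicas, witnesses

print(mirror_components(1))  # (2, 1): two replicas and one witness, as above
```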
Further, the VSAN algorithm places these three components on different nodes. Note that one object cannot be larger than 255GB, so disk files larger than 255GB are split evenly into objects limited to 255GB each. It is also possible to forcibly split one disk object evenly into multiple sequential components even if they are smaller than 255GB. This behavior is illustrated in the drawing below.
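The even-split behavior can be sketched in a few lines. This is an illustrative model, not VSAN's actual placement code: a disk larger than the 255GB limit is divided into the smallest number of equal pieces that each fit under the limit.

```python
import math

MAX_OBJECT_GB = 255  # VSAN 5.5 object size limit

def split_disk(disk_gb, max_gb=MAX_OBJECT_GB):
    """Split a disk evenly into pieces no larger than max_gb.

    The split is even: a 500 GB VMDK becomes two 250 GB pieces
    rather than 255 GB + 245 GB.
    """
    pieces = math.ceil(disk_gb / max_gb)
    return [disk_gb / pieces] * pieces

print(split_disk(500))  # [250.0, 250.0]
print(split_disk(600))  # [200.0, 200.0, 200.0]
```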
This reference architecture is just an example; it was prepared under the condition of placing 100 VMs with the following parameters each.
Virtual CPU: 2
Memory: 10 GB
Disk space: 250 GB
Disk working set size: 10 GB
Required IOps: 150
IOps in working set: 135
Write/Read Working Set ratio: 1:3
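The per-VM figures above can be aggregated with simple arithmetic. Note this sketch computes only raw totals; the final figures quoted in the article (27TB usable, 1532GB of cache, 19000 hot IOps) additionally account for overheads and safety margins that are not modeled here.

```python
# Per-VM parameters from the example workload above
NUM_VMS = 100
DISK_GB = 250
WORKING_SET_GB = 10
IOPS = 150
HOT_IOPS = 135

total_usable_gb = NUM_VMS * DISK_GB              # 25,000 GB before overheads
total_working_set_gb = NUM_VMS * WORKING_SET_GB  # 1,000 GB of hot data
total_iops = NUM_VMS * IOPS                      # 15,000 IOps overall
total_hot_iops = NUM_VMS * HOT_IOPS              # 13,500 IOps on hot blocks

print(total_usable_gb, total_working_set_gb, total_iops, total_hot_iops)
```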
To host the workload given above, the designed system must provide more than 27TB of usable capacity and 1532GB of cache, contain forty 7.2K RPM drives, and handle the 19000 IOps that fall on the hot blocks area. A reference architecture for this design can be four nodes, each with the following configuration:
Processor: 2 x 6-core CPU
Memory: 24 x 16GB
Disk controller: 1 x LSI SAS 9207–8i controller
Magnetic drives: 10 x 2TB 7.2K Near-Line SAS 6Gbps
Solid State drives: 2 x 200GB SSD SATA eMLC 6Gbps
Network controller: 1 x Intel X520 DP 10Gb
Boot flash drive: 1 x Internal 8GB SD Card
It is also assumed that two 10Gbps leaf switches are available to connect these nodes into one distributed cluster. You can read more about how to build a large VSAN cluster in my article here.
Of course, the given workload is just an example and cannot be mapped onto every specific implementation; each real task requires its own proper design. So, I am very pleased to invite you to discuss the details of VSAN planning and sizing during the online #LearnSDS Talk scheduled for Saturday, February 7. We will discuss, step by step, a general methodology for calculating hot block percentiles, the number of SSD and HDD drives, their write performance and endurance classes, and their allocation across hypervisors.