Building an All Flash SAN with ScaleIO: The Quest for 800,000 IOPS
The goal of this post is to share my experience building a ScaleIO platform designed specifically for high-performance MS-SQL operations. This is a summary of work done over a four-month period by Jason Pappolla (Network Administrator), the EMC L3 support team, and myself.
The company I work for has some history with EMC products, including a CX4-120 and our current production SAN, a VNX-5400. When we started maxing out the storage processors of the VNX, we wanted a solution that would scale easily and do away with the "forklift" upgrades associated with traditional SANs. EMC introduced us to their ScaleIO platform and gave us access to their support team for our test build. The ScaleIO software is unique in that EMC lets you buy either pre-built turn-key nodes or just the software while you supply the hardware. After looking at the performance specs of their all-flash nodes, we concluded that we could build a node that was faster and better suited to our applications. That said, EMC's all-flash nodes are built like tanks. They carry ten 1.6TB SAS SSDs with an endurance rating of 10 DWPD (that's 29.2 PB!). However, they have a single E5-2650 v3 (10-core) CPU and are rated for only 100,000 IOPS. We knew from our test platform that we could easily hit 200,000 IOPS if we used NVMe SSDs.
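The 29.2 PB figure falls straight out of the DWPD rating. As a sketch (assuming the five-year warranty window typical of enterprise SAS SSDs, which is not stated above):

```python
# Sketch: deriving total write endurance from a DWPD rating.
# The five-year warranty window is an assumption, not from the build notes.
capacity_tb = 1.6            # per-drive capacity
dwpd = 10                    # drive writes per day
warranty_days = 5 * 365      # assumed 5-year warranty period

endurance_pb = capacity_tb * dwpd * warranty_days / 1000
print(f"{endurance_pb:.1f} PB")  # 29.2 PB per drive
```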
Understanding the Limits of the Base OS
One of the biggest strengths of the software is that it is platform independent. The flexibility of using ESXi (hyper-converged), CentOS, or Windows Server as the underlying OS is great so long as you understand the pros and cons of each. I believe ScaleIO performs best on a minimal install of CentOS 7 (two-layer), but for ease of deployment (such as a test environment), ESXi is the way to go. In practice, though, the base operating system is driven mostly by the environment it will be deployed into. We are primarily a Windows VM environment running on ESXi 6.0 hosts, so the decision was made to use Windows 2012 R2 as the host operating system for the storage nodes and connect them to the ESXi cluster over a dedicated SFP+ storage network. We learned a hard lesson that I will cover in greater detail in a later post, but for now I just want to stress the importance of fully understanding the default configuration of a Windows 2012 server.
Hardware Limits vs. Software Limits
The first thing to take into account is the hard per-node limit set by the ScaleIO software: each node tops out at 260,000 IOPS. Because this build was going to be 100% flash, we knew we could easily hit this limit and end up wasting performance. Finding the right combination of PCIe flash and SSDs that maxes out the software will dictate what type of enclosure you need. The ScaleIO software introduces additional variables that must be considered in order to build a balanced system.
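To see how quickly flash runs into that software cap, here is a rough sketch. The ~90,000 IOPS per-drive figure is an illustrative assumption for an enterprise SATA SSD at 4K random reads, not a measured number from this build:

```python
# Sketch: estimating how many SSDs saturate ScaleIO's per-node cap.
# per_drive_iops is an assumed ballpark, not a measurement.
import math

node_iops_limit = 260_000   # ScaleIO software limit per node
per_drive_iops = 90_000     # assumed 4K random read, enterprise SATA SSD

drives_to_saturate = math.ceil(node_iops_limit / per_drive_iops)
print(drives_to_saturate)  # 3 -- a handful of drives already reach the cap
```

That is the argument for many small nodes: past a few flash drives per node, extra drives add capacity but no usable performance.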
Mainboard and Enclosure
The key point to remember here is that it is better to spread the same number of SSDs across many smaller nodes than to use fewer nodes with more drives. Enter Supermicro, with their FatTwin multi-node line. These enclosures allow up to 8 server nodes in only 4U of rack space. In the end, we decided on the 4-node enclosure with X10DRFR-NT motherboards.
The nodes came with integrated 10Gbit ports and LSI-3008 HBAs that are ideal for software-defined storage systems. So what does all this get us overall? Four nodes with a total of 24 SAS/SATA bays, 8 hot-swappable NVMe bays, and eight E5-2600 v3 processors. With the right SSDs and PCIe flash we will be able to reach the upper limits of each node.
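The enclosure totals above break down per node as follows (a simple tally, assuming the bays and CPUs are split evenly across the four nodes):

```python
# Sketch: per-node breakdown of the 4-node FatTwin configuration,
# assuming an even split of the enclosure totals quoted above.
nodes = 4
sas_bays_per_node = 6     # 24 SAS/SATA bays across the enclosure
nvme_bays_per_node = 2    # 8 hot-swappable NVMe bays total
cpus_per_node = 2         # dual-socket E5-2600 v3 boards

totals = {
    "sas_sata_bays": nodes * sas_bays_per_node,
    "nvme_bays": nodes * nvme_bays_per_node,
    "cpus": nodes * cpus_per_node,
}
print(totals)  # {'sas_sata_bays': 24, 'nvme_bays': 8, 'cpus': 8}
```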
Picking the Drives
The operating system resides on a pair of SATADOMs in RAID1 using the on-board controller. The decision of which SSD to use for storage took a lot of research. The test environment we built used consumer-grade 2TB Samsung 850s and 1.2TB Intel 750 PCIe flash. The performance of the 850s was impressive, hitting 140,000 IOPS with only 6 drives (not bad for consumer grade). Since we got great results with the 850s on the test platform, we decided to stick with Samsung but stepped up to their enterprise drives for the obvious reason of write endurance.
We narrowed it down to either the high-capacity 3.84TB PM863 or the higher-endurance 1.92TB SM863. We did look at IOPS and throughput performance between the drives, but due to the software limits of each node this became almost a non-factor. Because ScaleIO is software-defined storage, it lets us see the workload placed on an individual SSD and then put performance limits on the associated volumes. This is a huge advantage with SSDs because we can now easily control DWPD on the fly through the user interface. This kind of feedback on drive performance, along with the ability to control it, was a major factor in the decision to go with the 3.84TB PM863.
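The endurance math behind that choice is easy to sketch: convert an observed write rate into drive writes per day and compare it against the drive's rating. The 200 MB/s workload below is illustrative, not a number from the build:

```python
# Sketch: converting an observed write rate into DWPD, to decide
# whether a volume needs a throttle. The 200 MB/s figure is illustrative.
def dwpd(write_mb_per_sec: float, drive_capacity_tb: float) -> float:
    """Drive writes per day implied by a sustained write rate."""
    tb_written_per_day = write_mb_per_sec * 86_400 / 1_000_000  # MB/s -> TB/day
    return tb_written_per_day / drive_capacity_tb

# A steady 200 MB/s landing on one 3.84 TB PM863:
print(round(dwpd(200, 3.84), 2))  # 4.5 DWPD
```

The larger PM863 wins here: the same write stream spread over more capacity yields a lower DWPD, and volume-level limits in the UI can cap anything that creeps above the drive's rating.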
Network Design
This was the biggest learning curve of the build. The ScaleIO team built a custom protocol for SDS/SDC communication that has some unique characteristics. We initially approached the network topology and tuning as if it were a derivative of iSCSI: we isolated the SDS back-end traffic from the front-end and segmented it into different VLANs. My understanding of the ScaleIO protocol could not have been more wrong. After some research, I discovered that it is a type of inter-node protocol (also known as a gossip protocol). These protocols are common in large distributed systems as an efficient way to communicate with every node in a cluster. We didn't run into this issue during testing because we used a flat network. The EMC team explained that the segmented approach can be used but is usually reserved for large-scale deployments using a leaf-spine topology. Since we were deploying only 4 nodes and the storage switches were physically isolated, a flat network was the ideal layout. This is the single best document that will help you with your network layout design.
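To give a feel for why gossip-style protocols want any-to-any reachability rather than segmented VLANs, here is a toy round-based simulation. This is purely a conceptual illustration, not ScaleIO's actual wire protocol:

```python
# Toy gossip dissemination: each round, every node that knows an update
# pushes it to one random peer. Conceptual sketch only -- not ScaleIO's
# real SDS/SDC protocol.
import random

def rounds_to_spread(num_nodes: int, seed: int = 0) -> int:
    """Rounds until every node has heard the update."""
    random.seed(seed)
    informed = {0}          # node 0 starts with the update
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        # Every informed node tells one randomly chosen peer.
        for _ in range(len(informed)):
            informed.add(random.randrange(num_nodes))
    return rounds

print(rounds_to_spread(16))
```

The update reaches all nodes in roughly O(log n) rounds, but only if every node can talk directly to every other node, which is why a flat network suits a small cluster like this one.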
We determined that a node operating at maximum performance, utilizing all SSDs and PCIe flash, would require four 10Gbit ports to prevent a bottleneck. To calculate how many ports each node requires, multiply the number of drives by each drive's sequential throughput; this number should closely match the total throughput of all ports (be sure to convert from bits to bytes). Supermicro makes a quad-port 10Gbit SFP+ NIC that allowed us to keep one PCIe slot free for future expansion. The quad-port NICs are available with either Broadcom or Intel chipsets. We didn't see any real benefit of one over the other and went with what our vendor had in stock. This decision would come back to haunt us. We started having random connection losses across all nodes. After scratching our heads for a week, we discovered there is a bug with the Broadcom chipset and Windows 2012 R2.
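The port-count rule of thumb above can be sketched as follows. The 520 MB/s figure is the PM863's published sequential read rate; the six-drive count assumes a half-populated node:

```python
# Sketch of the sizing rule: total drive sequential throughput should not
# exceed total NIC throughput. Six drives per node is an assumption here.
import math

drives_per_node = 6
drive_seq_mb_s = 520                 # PM863 sequential read, MB/s
port_gbit = 10

drive_throughput_mb_s = drives_per_node * drive_seq_mb_s   # 3120 MB/s
port_mb_s = port_gbit * 1000 / 8                           # 1250 MB/s per 10Gbit port

ports_needed = math.ceil(drive_throughput_mb_s / port_mb_s)
print(ports_needed)  # 3 ports for six SSDs; add PCIe flash and four is safer
```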
The bug had to do with how Windows NIC teaming and the Broadcom driver interact. Apparently this is a fairly well-known bug within the Hyper-V community. I have no experience with Hyper-V, so if someone can shed some light on the issue I'd love to hear it. We ended up replacing the Broadcom NICs with the dual-port Intel SFP+ cards out of our test server, which solved the problem, and had our vendor overnight us eight Intel NICs (four X520s and four CTG-i2S). The only downside to this fix was that we had to use the second PCIe slot.
Software Licensing
It never ceases to amaze me how much software licensing can affect the end design of a system. EMC offers the ScaleIO software free on their website, but with only community support and for non-production use. Since this SAN will be running a tier 1 application, enterprise support was an absolute must. The ScaleIO licensing scheme is based on the amount of raw storage, so if you are doing an all-flash build like this, overestimating your storage needs will add considerable cost in both licensing and hardware. In our case we knew we needed 25TB for our reporting servers, but if the system performed to our expectations, we would start migrating our live production environment onto it. Moving our production servers would require an additional 10TB of SSD storage along with a 3-4TB pool of PCIe flash for extreme performance. When you look at the cost breakdown of the hardware for this design, you will see that the drives account for 69% of the cost. So we built the nodes to handle all the drives but populated only half of the bays. This reduced the upfront hardware cost by 47%, and we only had to license 62TB as opposed to 95TB. We can now grow in increments of 6TB by simply adding drives.
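A quick sketch of the licensing side of that decision, using only the 62TB vs. 95TB split quoted above (per-TB pricing is deliberately left out, since no quote appears here):

```python
# Sketch: raw-capacity licensing saved by populating half the bays.
# Only the 62 TB / 95 TB figures come from the build; no pricing assumed.
licensed_tb_partial = 62   # licensed now, half the bays populated
licensed_tb_full = 95      # what a fully populated build would license

license_savings = 1 - licensed_tb_partial / licensed_tb_full
print(f"{license_savings:.0%} less raw capacity to license up front")
```

Since licensing scales with raw TB, deferring drive purchases defers license cost by the same fraction, which is what makes the grow-by-6TB-increments approach attractive.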
Production Build Bill of Materials
Test Platform Build and Performance
The ScaleIO test platform consisted of three R710s that were almost five years old, LSI 9300-8i HBAs, six consumer Samsung 850s, and three Intel NVMe PCIe cards. We used ESXi as the host operating system with a dedicated 10Gbit SFP+ storage network.
After doing some standard tweaks to the network and adjusting the queue depths of the LUNs to optimize them for flash, we saw major increases in performance. We had three Windows 2012 test VMs running IOMeter. Each test VM consistently posted 65,000-70,000 IOPS, reaching a combined maximum of 212,805 IOPS.
I didn't think this post would get this long, so I have decided to split it into multiple parts. Deploying ScaleIO on a Windows 2012 host is not done very often, and information on it was scarce at best. In Part 2, I'm going to go into detail on all the issues we encountered that are unique to a Windows deployment. I will also post all the PowerShell commands/scripts that helped us with the troubleshooting.