Common MS SQL cluster migration problems from bare metal to AWS

porokh.sergey
DevOops World … and the Universe
10 min read · Jun 24, 2016

For the last couple of weeks I was helping a colleague with a task that was unusual for me. He had an environment hosted on Rackspace and wanted to migrate it to the AWS cloud, because he'd had problems with Rackspace support and was unhappy about it. The infrastructure was pretty simple: a web-based application that connects to an MS SQL database. Yes, you heard that right. It was a Windows-based environment with IIS as the web server and MS SQL as the database. The most interesting part is that he invited me onto the project because he wasn't familiar with AWS, while I, in turn, hadn't worked with Windows in a long time. So basically this is the story of the problems we faced during the migration, where each of us knew only his own part.

Initial planning

Before we agreed on the final infrastructure to launch on AWS, I dug into the details of the environment on Rackspace. Basically, it was something similar to this:

  1. A web server based on IIS on a separate host, with a pretty simple configuration.
  2. Two hosts running Windows Server 2008 R2 with SQL Server Standard edition, working as an SQL cluster with shared storage.
  3. A separate domain and Active Directory.

So I proposed quite a common scheme on AWS: a separate VPC with one public subnet (for the web server) and two private subnets in different availability zones (for the SQL cluster and AD) for reliability. Moreover, we could use AWS Directory Service for Active Directory, Route 53 for domain management, and RDS for the database cluster. We agreed that I would take care of AWS and my colleague would configure everything related to Windows. This plan looked really great, but …

“Everybody has a plan until they get punched in the mouth.” (c) Mike Tyson

1st problem — compliance

When you work every day with bleeding-edge IT technologies, you always forget that there is IT security compliance out there that may destroy your perfect deployment plan. In our case, it was HIPAA compliance. The real news for me was that RDS was not HIPAA compliant with MS SQL (only with the Oracle and MySQL engines).

image taken from AWS compliance page

So that was the first punch in the face for me. Considering this, we had no option except to change our infrastructure to EC2 instances running MS SQL and configure the SQL cluster manually. I was a little disappointed about it, but my thoughts were: "Okay AWS, not a big deal, we'll have our own SQL cluster with blackjack and Ops engineers who will manage it". Our plan remained the same, except we agreed to launch two more instances in the private subnets to serve as the SQL cluster.

I will not describe here the process of creating the VPC, connecting our domain to Route 53, launching the AWS directory, granting user policies on AWS, etc. I suppose you guys already know how to do it, or at least can find some guides about it.
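That said, for orientation, here is roughly what that VPC skeleton looks like in boto3. Every name, CIDR, and availability zone below is an illustrative placeholder, not necessarily what we used:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # One VPC to hold everything.
    vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

    # A public subnet for the web server...
    public = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24",
                               AvailabilityZone="us-east-1a")["Subnet"]

    # ...and two private subnets in different AZs for the SQL nodes and AD.
    ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.10.0/24",
                      AvailabilityZone="us-east-1a")
    ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.11.0/24",
                      AvailabilityZone="us-east-1b")

    # An internet gateway and a default route, associated only with the
    # public subnet; the private subnets stay internal.
    igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
    rt_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
    ec2.create_route(RouteTableId=rt_id, DestinationCidrBlock="0.0.0.0/0",
                     GatewayId=igw_id)
    ec2.associate_route_table(RouteTableId=rt_id, SubnetId=public["SubnetId"])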

2nd problem — shared storage

Since I wasn’t involved in configuring the SQL cluster, I did my job on AWS: launched two DB instances based on Windows 2008 R2 with SQL Server Standard edition, defined the security rules, and provided all the needed information to my colleague. After a couple of hours he asked me quite a simple question: “Could you create a shared storage on Amazon? I need it for my cluster”. Damn, another punch in the face:

Standalone SQL Server instances usually store their databases on local disk storage or nonshared disk storage; clustered SQL Server instances store data on a shared disk array. Shared means that all nodes of the Windows Failover Cluster are physically connected to the shared disk array, but only the active node can access that instance’s databases. To ensure the integrity of the databases, both nodes of a cluster never access the shared disk at the same time.

As you know, Amazon launched the Elastic File System (Amazon EFS) service, which should have been the perfect solution for our cluster, but here comes the problem: if you don’t read the references carefully, you may miss that Windows is not supported by EFS.

And EBS volumes can be attached to only a single instance, so we couldn’t use them as shared storage for the cluster either (a quick sketch of that limitation follows below). That was when I really started searching for solutions, and I didn’t find much except:

  1. Use Microsoft SQL Server 2012/2014/2016 with AlwaysOn (it doesn’t require shared storage, though it’s recommended)
  2. Use a 3rd-party solution such as SIOS (recommended by both AWS and Azure)

We were not happy about using a 3rd-party solution, so we decided to stay with the Microsoft-native one: AlwaysOn.
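About that EBS limitation: a tiny boto3 sketch is enough to see it for yourself. All IDs here are made up; the point is that attaching a volume that is already in use on one instance to a second one simply fails:

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # vol-... is already attached to the first SQL node; now try node two.
    try:
        ec2.attach_volume(VolumeId="vol-0123456789abcdef0",
                          InstanceId="i-0fedcba9876543210",
                          Device="xvdf")
    except ClientError as err:
        # An EBS volume can be attached to one instance at a time, so
        # instead of a shared disk you get a VolumeInUse error.
        print(err.response["Error"]["Code"])  # -> "VolumeInUse"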

3rd problem — Enterprise

The 3rd problem probably appeared because we didn’t have much time to read everything carefully, which matters when we’re talking about Microsoft and its features. I found that AlwaysOn is supported on Windows Server 2012 or later, and that WSFC (Windows Server Failover Clustering) has to be set up first.

So we didn’t waste any more time and launched Windows Server 2012 with SQL Server 2014 Standard edition. We spent a couple of hours setting it all up again, only to find out that AlwaysOn Availability Groups are not supported on Standard edition. Microsoft, are you for real?

As a guy who has worked mostly with open-source solutions for the last 6 years, I was really confused by this. But we were backed into a corner, and we saw the easiest way out: use SQL Server Enterprise edition.
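A cheap sanity check we should have run before any setup: ask the instance itself which edition it has and whether AlwaysOn is available. A sketch using pyodbc (the driver name, server, and credentials are placeholders):

    import pyodbc

    # Placeholder connection string; adjust the driver, server, and login.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 11 for SQL Server};"
        "SERVER=db1.test.local;UID=sa;PWD=secret;"
    )

    # Edition distinguishes Standard from Enterprise; IsHadrEnabled shows
    # whether AlwaysOn Availability Groups are enabled on this instance.
    row = conn.execute(
        "SELECT SERVERPROPERTY('Edition'), SERVERPROPERTY('IsHadrEnabled')"
    ).fetchone()
    print(row)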

4th problem — no AMI?

Those who work with AWS often can recall the EC2 pricing page even if someone wakes them up in the middle of the night. So can I; that page, with its Windows-plus-SQL-Server options, looks very familiar to me.

So I was very confused when I pressed the “Launch Instance” button and didn’t find Windows with SQL Server Enterprise. Come on, AWS, another joke? Nope, it wasn’t. After spending some time on it, I found that the Enterprise edition could be launched only via the AWS Marketplace. I don’t know the reason for this, but it was another thing I didn’t know about Amazon. The whole Windows licensing part looks like a nightmare to me.
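You can confirm the gap from the API as well. A boto3 sketch that searches the Amazon-owned AMI catalog (the name pattern is my guess based on how Amazon names its Windows images):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Look for Amazon-owned Windows AMIs bundled with SQL Server Enterprise.
    images = ec2.describe_images(
        Owners=["amazon"],
        Filters=[{"Name": "name",
                  "Values": ["Windows_Server-2012-R2*SQL_2014*Enterprise*"]}],
    )["Images"]

    # For us this search came back empty: Express, Web, and Standard were
    # in the regular catalog, while Enterprise lived in the Marketplace.
    for image in images:
        print(image["ImageId"], image["Name"])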

5th problem — joining AD

This problem is quite a minor one, but it can be a splinter if you’re a perfectionist. When you have your AWS Active Directory up and running, you can launch your instances directly into your domain, without adding them manually. This is a good thing from AWS, and I’m glad they made it work, but when my Windows Server launched, AWS gave the machine some random name (like MVK812316) and automatically added it to the AD domain. So if you have more than one instance, it becomes hard to find them in the domain. Not what we planned. So we joined every instance to the domain manually, first changing the Windows name to something relevant like “DB1”, “DB2”, or “Web”.
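One way to avoid that manual renaming step, at least in theory, is to rename the machine from user data on first boot, before it ever joins the domain. A hedged sketch; the AMI, subnet, and computer name are all placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # The EC2 Windows agent runs the <powershell> block on first boot,
    # so the box comes up with a sensible name instead of MVK812316.
    user_data = (
        "<powershell>\n"
        'Rename-Computer -NewName "DB1" -Restart\n'
        "</powershell>"
    )

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",      # hypothetical Windows AMI
        InstanceType="m4.large",
        MinCount=1, MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",  # one of the private subnets
        UserData=user_data,
    )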

6th problem — internal DNS records

I had worked a lot with internal DNS records on Amazon before, and it had always been straightforward for me. My mistake was that I didn’t have much experience with AWS Directory Service, and that punched me in the face again. So, as usual, I created an internal hosted zone (for example, test.local) and attached it to our VPC. I set up a couple of records (like db1.test.local, db2.test.local, web.test.local) that should point to my instances, but with no success: when we tried to resolve those names using nslookup in the Windows CMD, we got “non-existent domain”. I read a decent number of AWS manuals, and all of them said that DNS requests from the AD DNS servers would be forwarded to the AmazonProvidedDNS (which is 10.0.0.2 if your VPC is 10.0.0.0/16), which is connected to my private hosted zone. But it wasn’t working. The problem was that Amazon forgot to mention that requests are forwarded only if the private hosted zone is different from the AD domain name.

Any request sent to these IP addresses is forwarded to the VPC-provided DNS service and Route 53. Setting up DNS forwarders in your on-premises DNS service to these IP addresses for your Route 53 domain names is an easy way to realize immediate DNS resolution from on-premises hosts into your AWS VPC. Be sure to check the security group created for your directory to ensure DNS traffic is allowed from your on-premises networks. Also, make sure the Route 53 domain name is different than the Simple AD domain name. If they are the same or if the Route 53 domain is a subdomain of the Simple AD domain, Simple AD does not forward the request.

The solution was as simple as it could be. We just needed to connect to the AWS AD directly from one of our Windows instances using the standard DNS Manager and configure all the DNS records there (so a private hosted zone is not required at all).
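For reference, this is roughly what we had tried on the Route 53 side. The calls themselves are fine (the IDs are placeholders); the zone just never gets consulted when its name matches the Simple AD domain:

    import boto3

    r53 = boto3.client("route53")

    # A private hosted zone with the SAME name as the AD domain: this is
    # exactly the setup Simple AD refuses to forward queries for.
    zone_id = r53.create_hosted_zone(
        Name="test.local",
        CallerReference="sql-migration-001",
        VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0123456789abcdef0"},
        HostedZoneConfig={"PrivateZone": True},
    )["HostedZone"]["Id"]

    r53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{
            "Action": "CREATE",
            "ResourceRecordSet": {"Name": "db1.test.local", "Type": "A",
                                  "TTL": 300,
                                  "ResourceRecords": [{"Value": "10.0.10.100"}]},
        }]},
    )
    # nslookup db1.test.local from a domain-joined box still fails,
    # because Simple AD only forwards names outside its own domain.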

7th problem — private IPs

As you may know, a WSFC configuration requires three IP addresses on each node to operate normally:

  • One IP address will be used as the primary IP address for the instance.
  • A second IP address will act as the WSFC IP resource.
  • A third IP address will be used to host the AlwaysOn Availability Group listener.

When I launched our DB instances I didn’t create or set up any private IPs, intending to do it later through the EC2 network interfaces. We agreed that we wanted to use meaningful IPs on both servers (like 10.0.10.101 and 10.0.11.101 for the WSFC cluster IPs, and 10.0.10.102 and 10.0.11.102 for the AG listeners). Too bad we agreed on it too late. It was a discovery for me that you can’t change the primary IP address on an EC2 network interface. You can add and remove secondary addresses, but it doesn’t allow you to do anything with the primary one.

Moreover, you can’t even detach the primary network interface from the instance (even when it’s stopped) to attach another one with the correct IP. Yep, it hurts.

I guess we could have left it as it was, but we wanted to make it nice and clean, so we recreated those instances, this time setting the private IPs during launch.
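The takeaway: the primary private IP is fixed for the life of the interface, so set it at launch. A sketch of the relaunch for the first node; the AMI and subnet IDs are placeholders, and the .100 primary address is my example:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # The primary private IP can only be chosen here, at launch time.
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",      # hypothetical SQL Server AMI
        InstanceType="r3.xlarge",
        MinCount=1, MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",  # the 10.0.10.0/24 subnet
        PrivateIpAddress="10.0.10.100",       # primary IP, fixed from now on
    )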

We faced another problem with private IPs, but this time it was mostly related to Windows. If you configure secondary IPs on AWS but do not switch your Windows TCP/IP settings to static addressing, WSFC configuration fails with this kind of error:

Cluster IP address resource 'Cluster IP Address 10.0.11.101' cannot be brought online because a duplicate IP address '10.0.11.101' was detected on the network.  Please ensure all IP addresses are unique.

But this one is quite logical: the IPs that you are going to use for your cluster should not already be present in the system.
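On the AWS side, the two extra addresses are just secondary IPs on the node’s network interface; the trick is that Windows itself must not hold them as live addresses. A sketch, with a placeholder interface ID:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Reserve the WSFC cluster IP and the AG listener IP on the first
    # node's ENI. Leave them OUT of the Windows TCP/IP configuration:
    # the cluster brings them online itself, and a duplicate produces the
    # "Cluster IP address resource ... cannot be brought online" error above.
    ec2.assign_private_ip_addresses(
        NetworkInterfaceId="eni-0123456789abcdef0",
        PrivateIpAddresses=["10.0.10.101", "10.0.10.102"],
    )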

8th problem — listener

I had a feeling that we were almost done, painful as the road had been. But that wasn’t the last problem we faced. The last one was creating the listener. The Amazon WSFC guide makes it sound straightforward, yet from our side it simply refused to work. The problem is that DNS records are not created automatically when you use Simple AD; yet another discovery for me was that DNS dynamic updates are not supported in Simple AD domains at all.

You can either add the records manually using DNS Manager, or change your Active Directory to Microsoft AD, which is more expensive. After we created the record manually through DNS Manager, it worked. Here is how it is supposed to look:

image taken from AWS WSFC configuration guide (example)
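Once the listener’s A record exists, it is easy to verify the whole chain from any domain-joined machine. A sketch; the listener name, port, and credentials are our examples, not anything AWS provides:

    import socket

    import pyodbc

    # The listener name must resolve before clients can use it.
    print(socket.gethostbyname("aglistener.test.local"))

    # Connect through the listener; WSFC routes the session to whichever
    # node currently holds the primary replica.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 11 for SQL Server};"
        "SERVER=aglistener.test.local,1433;UID=sa;PWD=secret;"
    )
    print(conn.execute("SELECT @@SERVERNAME").fetchone())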

Summary

In the end, we made it work, even though I hadn’t believed in Microsoft. It was a bit of a surprise for me that it actually ran smoothly with no errors. We connected our application to the SQL cluster and did some testing of WSFC and AlwaysOn. Requests were forwarded correctly even when we stopped one of the DB instances. A production test showed that it takes close to 13 seconds to switch requests over to the other instance if the primary availability group fails.

It was a really interesting experience for me, because I ran into issues I hadn’t known about before. I hope it helps you avoid the face-punches I collected over the last week if you suddenly decide to migrate your MS SQL cluster to Amazon.
