Successfully Upgrading Both Windows & SQL Server Without Downtime

Saurabh Vegda
4 min read · Nov 16, 2018

Online shopping is a 24-hour business, and our top priority in technology is to provide a quality shopping experience for our customers on Nordstrom.com. So what do you do when you’re asked to upgrade your database servers by migrating to a different Windows domain?

“Yes, We're Open signage” by Alexandre Godreau on Unsplash

To solve the problem, we are going with a phased approach, first upgrading our current stack to Windows Server 2016 and SQL Server 2017 with no downtime, and then migrating out to another Windows domain using a Distributed Availability Group.

Our team is responsible for managing database servers that support Nordstrom.com, by ensuring data availability, integrity, and consistency, as well as ensuring database performance is fast enough for the website. When we looked at this challenge, our stack used Windows Server 2012 R2 and SQL Server 2014 SP2. The figure below shows the setup we had for Windows Server Failover Cluster (WSFC) and the Always On availability group to achieve high availability.

Pre-upgrade State

In our infrastructure, we have seven sets of systems using WSFC and Always On availability groups. Each set has two nodes, except for one set with four nodes. Along with this, we have a dedicated distributor and 12 additional servers acting as subscribers behind a load balancer, which take read-only traffic. All of the servers mentioned above were upgraded at the same time, with minimal downtime resembling a failover from one node to another.

Windows Server 2016 introduced a feature called Cluster Operating System Rolling Upgrade, which lets us upgrade the OS of the cluster nodes while the cluster keeps running. It also means that we can add a Windows Server 2016 node to an existing Windows Server 2012 R2 cluster and still use the existing cluster objects in Active Directory.

In the past this was not possible, which meant we could only perform an upgrade by creating a new cluster and its objects in parallel, then replicating data using a method like log shipping. At cut-over time we would have to plan for downtime while we brought the new system online. This would normally take 3-4 hours, an unacceptable length of time.

Using the cluster upgrade feature allows us to set up new Windows Server 2016 nodes in existing clusters ahead of time without impacting existing production systems.

We started by installing SQL Server 2017 on the new Windows Server 2016 nodes. Once those nodes were added to the cluster, we extended the existing Always On availability group by adding a new node as an additional replica. Note: use caution with the availability group settings here, so that an existing SQL Server 2014 replica won’t accidentally fail over to a SQL Server 2017 replica.
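One way to guard against an accidental failover is to add the new replica with manual failover mode. The sketch below is illustrative only: it builds the T-SQL `ALTER AVAILABILITY GROUP ... ADD REPLICA` statement as a string, and the availability group name, server name, endpoint URL, and the choice of synchronous commit are all assumptions rather than our actual settings.

```python
def add_replica_sql(ag_name: str, server: str, endpoint_url: str) -> str:
    """Build T-SQL that adds a replica which can only be failed over to manually."""
    return (
        f"ALTER AVAILABILITY GROUP [{ag_name}]\n"
        f"ADD REPLICA ON N'{server}'\n"
        "WITH (\n"
        f"    ENDPOINT_URL = N'{endpoint_url}',\n"
        # Assumption: synchronous commit, so a planned failover loses no data.
        "    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,\n"
        # MANUAL keeps the cluster from failing over to this replica on its own.
        "    FAILOVER_MODE = MANUAL\n"
        ");"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(add_replica_sql("ProductAG", "SQL2017-NODE1",
                          "TCP://sql2017-node1.example.com:5022"))
```

The generated statement would be run on the current primary replica. The new replica then still has to join the availability group from its own side.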

Result? We had our new systems fully running prior to the scheduled upgrade. The figure below shows our transient state, with a mix of Windows Server 2012 R2 and Windows Server 2016 nodes and an Always On availability group spanning SQL Server 2014 and SQL Server 2017.

Transient State, Immediately Before Upgrade

Next steps: our upgrade was scheduled during a period with few product-related changes, to minimize stale product information on our website during the upgrade window. Our upgrade path included disabling replication to remove any reference to Windows Server 2012 R2 metadata, and completing the failover to the upgraded versions of Windows Server and SQL Server. After a successful failover, replication must be re-created to get all the subscribers upgraded to the newer version.

During the SQL Server upgrade, it’s important to fail over from the SQL Server 2014 replica to the SQL Server 2017 replica in the same way a regular failover occurs between two nodes. Once that is successful, all post-upgrade tasks related to consistency checks and index rebuilding can be completed. Failover might take longer depending on how many transactions are in flight. Most of our failovers completed in 10–30 seconds.
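The failover only works in one direction, because a database that has been upgraded to a newer SQL Server version cannot be brought back online on an older one. The sketch below captures that check and builds the T-SQL failover statement, which is run on the target secondary replica. The function names and the availability group name are hypothetical. The version numbers are SQL Server's major versions (12 for SQL Server 2014, 14 for SQL Server 2017).

```python
SQL_2014 = 12  # major version of SQL Server 2014
SQL_2017 = 14  # major version of SQL Server 2017


def can_fail_over(primary_version: int, target_version: int) -> bool:
    """The target replica must run the same or a newer major version."""
    return target_version >= primary_version


def failover_sql(ag_name: str, primary_version: int, target_version: int) -> str:
    """Return the planned manual failover statement, or refuse a downgrade."""
    if not can_fail_over(primary_version, target_version):
        raise ValueError("cannot fail over to an older SQL Server version")
    return f"ALTER AVAILABILITY GROUP [{ag_name}] FAILOVER;"


if __name__ == "__main__":
    # Hypothetical availability group name.
    print(failover_sql("ProductAG", SQL_2014, SQL_2017))
```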

After the SQL Server upgrade, cluster node ownership fails over to the Windows Server 2016 node. This is the transient state for WSFC, in which node ownership can still fail back to the Windows Server 2012 R2 node. Failing back remains possible until the cluster functional level is upgraded, after which only Windows Server 2016 nodes can be used.
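The fail-back rule can be expressed as a small model. This is a simplified sketch, not a cluster API: it assumes the cluster functional level values 8 (Windows Server 2012 R2) and 9 (Windows Server 2016), and the helper name is hypothetical. On a real cluster the level is raised with the `Update-ClusterFunctionalLevel` PowerShell cmdlet, which is a one-way operation.

```python
# Cluster functional levels (assumed): 8 = Windows Server 2012 R2, 9 = Windows Server 2016.
LEVEL_2012_R2 = 8
LEVEL_2016 = 9

# Highest functional level each node OS supports.
NODE_LEVEL = {"2012R2": LEVEL_2012_R2, "2016": LEVEL_2016}


def can_own_cluster(node_os: str, functional_level: int) -> bool:
    """A node can own the cluster only if its OS supports the functional level."""
    return NODE_LEVEL[node_os] >= functional_level


if __name__ == "__main__":
    # Mixed mode: the cluster still runs at the 2012 R2 level, so fail-back works.
    print(can_own_cluster("2012R2", LEVEL_2012_R2))  # True
    # After the functional level upgrade, only 2016 nodes qualify.
    print(can_own_cluster("2012R2", LEVEL_2016))     # False
    print(can_own_cluster("2016", LEVEL_2016))       # True
```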

Once all the upgrade work is complete, the cluster functional level can be updated to use the new features of Windows Server 2016. The image below shows the state of our system post-upgrade.

Post-Upgrade State

In closing, to perform a successful upgrade on cloud infrastructure with no downtime, we:

  1. Create new Windows Server 2016 nodes with Failover Cluster enabled.
  2. Install and configure SQL Server 2017 on the new nodes.
  3. Add the new Windows Server 2016 nodes to the existing cluster of Windows Server 2012 R2 nodes.
  4. Add new Windows 2016/SQL Server 2017 nodes to the existing Availability Group as an additional replica. At this time, there will be a mixed-mode config.
  5. Perform the database failover, upgrading the database to SQL Server 2017.
  6. Remove existing Windows Server 2012 R2/SQL Server 2014 nodes from the Availability Group and Failover Cluster.
  7. Verify that failover settings at database level and at cluster level match expectations.
  8. Upgrade the cluster functional level and complete the upgrade work successfully.
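The order of these steps matters, since the failover and the functional level upgrade cannot be undone. As a rough sketch, the list can be encoded as an ordered runbook that refuses to continue if steps were completed out of order. The step identifiers and the `next_step` helper are hypothetical names chosen for illustration.

```python
from typing import List, Optional

# The eight steps above, in the order they must run (identifiers are illustrative).
STEPS = [
    "create_2016_nodes",
    "install_sql_2017",
    "join_existing_cluster",
    "add_replica_to_ag",
    "fail_over_database",
    "remove_old_nodes",
    "verify_failover_settings",
    "upgrade_functional_level",
]


def next_step(completed: List[str]) -> Optional[str]:
    """Return the next step to run, or None when the runbook is finished."""
    if completed != STEPS[:len(completed)]:
        raise ValueError("steps were completed out of order")
    if len(completed) == len(STEPS):
        return None
    return STEPS[len(completed)]


if __name__ == "__main__":
    done: List[str] = []
    step = next_step(done)
    while step is not None:
        print(f"running: {step}")
        done.append(step)
        step = next_step(done)
```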
