Why I have high hopes for the quality of SQL Server 2016 release

SQL Server 2016 will be the first Cloud First release. This is not just a buzz-word fest. I’ve been through many SQL Server production cycles, and the 2016 one is different. Beyond all marketing smoke screen there are fundamental changes in SQL Server 2016 code, changes in the way the dev team operates, changes in the way features are designed.

The Azure SQL DB v12 was a radical shift in the way the Azure SQL DB product was developed and operated. Prior to this the Azure SQL code base was a derivative of the core SQL Engine code and was based on a rowset level replication technology, specific to Azure SQL DB. This is why you had restriction in Azure SQL DB that all tables must have a clustered index (heaps were not supported). With Azure SQL DB v12 the service has changed to a new technology based on physical database replication (you may know this technology as ‘mirroring’, or ‘HADRON’, or ‘AlwaysOn’, or ‘Always ON’). The Azure SQL DB service became based on the same code as the standalone product. We’re not talking even branching. The impact was immediate and profound.

With Azure SQL DB v12 the development branch of the SQL Engine core is snapped periodically and deployed as the Azure SQL DB service. As developers were working on new features targeting SQL Server 2016 release, any check-in they made was actually alive in productionAzure SQL DB service in about 2 months after the check in. And if this check-in would create any problem in production, the developer was on the hook for mitigating the issue. Azure SQL DB has no ‘ops’ team. The dev team owns the service. Prior SQL Server development cycles favored ‘feature’ branches, a code branching strategy in which new features were developed in separate branches and reverse integrated into ‘main’ branch when the feature was complete. This strategy would ‘shield’ the main branch from the effect of incomplete, partial developed features. The draw back was that features all ‘came together’ at the end of the development cycle and inevitable issues would arise. I’m not talking merge conflict, I’m talking fundamental incompatibilities of how newly developed feature X would work together with feature Y. With SQL Server 2016 this model was abandoned in favor of everybody working together on the same ‘main’ branch. This mean that every single check-in for a partial, unfinished feature had to keep the code quality at the service production code quality. Furthermore, as the code was being checked in, it had to be testable by the development team, but not visible in the live production service (which, remember, runs the very same code). Configuration ‘feature switches’ exists in Azure SQL DB which allow for selectively enabling features in production at the desired granularity (single database, account, cluster, data center, everywhere etc), but at the code level this implies strong discipline from the developers as they had to carefully choose where to place if statements in the source code to check for the feature switch value and based on that enable or disable a specific behavior.

So now you see why I’m saying that SQL Server 2016 is truly Cloud First. Every single feature in SQL Server 2016 has been in production in Azure SQL DB for likely more than a year by the time the product will be available. Perhaps some features were not ‘live’ in Azure SQL DB, but make no mistake: even those features that are not officially available today in Azure SQL DB where, very likely, tested and tried out in Azure in private mode.

A culture shift in the dev team also had to occur. Previously the dev team would work hard for 2 years, wrap up a release and ship a shrink wrapped box labeled perhaps “Microsoft SQL Server 2008R2 Enterprise Edition”. You would start deploying this in production some 6–9 months later and start reporting issues via support channels. By that time, the team moved on. You will be complaining about something a developer had worked on perhaps 3 years ago. With Azure SQL DB things are different. Not only that new features are available immediately in the service, but the dev team is also the team operating and monitoring the service. The team that developed the feature is the same team that drives the process of making the feature available in Azure, identifies possible customers for the ‘private preview’ phase, reacts to early feed back, makes the feature available in public preview mode and then general availability in Azure SQL DB. The dev team must have the capability to mitigate any problems that the feature might cause. The dev team must have the capability to understand a problem occur, preferably before the customer reports the problem. The dev team must be able to quickly shut down a feature that is causing problems, without affecting other features of the service. And the dev team must respond in minutes to incidents in production. Before this shift the development team was effectively done with a feature some months before the release of a new version of SQL Server. In the service world the development team owns the feature for as long as is available in Azure. Many SQL Azure DB features where grandfathered in from the existing SQL Server product features and have no current owning team (think things like “XML datatype”, already existing in the product long before Azure came along), but all new features are owned and operated by their development teams.

Product inefficiencies are affecting Microsoft

With the on-premise product all inefficiencies were affecting directly you, the user, and not Microsoft. In Azure SQL DB all inefficiencies are costing, very directly, Microsoft. Microsoft can charge so much for the service, and is in its best interest to make the best use of the available resource (CPU, RAM, IO). Like any business, the SQL Server development process has to strike a balance between what can be done and what has to be done, in the time available. When making the case to allocate time for a performance improvement the case is a slam dunk if the improvement affects directly density of database nodes that can be allocated on a data center. And no longer can fixes be dismissed by claiming ‘best practices’ workaround.

This story goes even further back, to the days before Azure SQL DB v12. An engineer in the database file allocation team (the team responsible with SGAM, GAM, PFS and allocation chains) has been investigating what appeared to be a leak of extents in tempdb. This caused tempdb allocations to fail and the only mitigation was to failover an entire database node, which in those days could mean hundreds of actual individual customer databases. After months of investigation the cause was tracked down to a bad practice: not all tempdb files had equal size. This is a an issue quoted as far as SQL Server 2005 Optimizing tempdb Performance: Make each data file the same size; this allows for optimal proportional-fill performance. There was no actual extent leak, it was just that the performance of the proportional-fill algorithm was so bad when the files were imbalance sized that allocations were failing! Without having the service to keep running, such an issue would had (and actually has been) dismissed as a ‘bad practice’. After seeing the bad practice actually deployed in production by none other than the SQL Server team, the fix was ‘allowed’. The proportional-fill algorithm was fixed as to no longer cause allocation failure when one file was significantly smaller than all the other files.

Azure SQL DB Query Store story

The public announcement of the Query Store availability in Public Preview for Azure SQL DB came on May 28, 2015. By that time we had already on-boarded in private preview mode some big Azure accounts, for many months prior. The team had to make some estimates about the expected workload and impact of enabling Query Store. There where some parameter values we had to tune, like how big should the Query Store storage size should be declared. We analyzed the workload and we settled on a set of values we said it will be enough to store about 2 months of query performance data with minimal impact on the workload. We enabled the feature switches and the real world hit us like a speeding train. The customers filled up their Query Store allocated size in 2 minutes. Keep in mind that the feature, by then, had been in development for years, had many unit, functional, integration, acceptance tests (all passing) and we tested it with simulated and captured workloads (replays). Yet the difference between what we planned vs. reality was 2 months vs. 2 minutes! Turns out we weren’t that bad engineers, what happened was that we hit upon an unknown bug in the SQL Server optimizer: the workload was compiling every execution of a specific statement and for each execution was generating the same plan, but with a different plan hash (signature). And caching each generated plan. 160000 plans per query (at 160k there’s a hard limit and SQL Server will no longer cache plans for that particular query). Each plan dutifully captured and saved by Query Store. So the team fixed the bug (the bug was, if I remember correctly, in how the plan hash was being computed) and give it another go. We discovered bugs in Query Store. We figured out that performance problems that we considered acceptable were crippling in real life. We hit retail assert and cause service interruption, had to mitigate at 3am. By May 2015 the team was already battle hardened and ready for public preview mode. Other teams lighted up their features and we discovered more problems. This is why today the Best Practice with the Query Store page says this: When using Elastic Pool, enable Query Store for only the small subset of databases. Enabling Query Store for all databases in an Elastic Pool can lead to excessive resource usage and could make your system unresponsive. AFAIK the problem is fixed now and the fix will soon hit the production data centers, and this note will disappear.

I don’t say that all these bugs would not have been found in the normal product beta period. Some would for sure. But I have never seen so many problems found and fixed as with the SQL Server 2016 release. And the majority of these problems were found in production, in Azure SQL DB.

And the important theme here is this: when you will purchase and install a copy of SQL Server 2016, you are actually deploying the code Microsoft has been running in production in Azure SQL DB for almost 2 years. This has never been the case with prior releases. And this is why I have high hopes that SQL Server 2016 will be of high quality.