When I investigate a bug, I much prefer its behavior to be consistent and repeatable. If I apply the same inputs again, I get the same results. I know that any changes in the output can only have come from changes that I have made. That gives me confidence, because I can understand the failure and develop a strategy to fix it.
Imagine, then, that there is a problem that only seems to manifest on one day of the week. Wouldn’t that be weird? Let me tell you a story about a set of circumstances that led to some of our tests failing, but only on Tuesdays.
A timeline of events
One Tuesday evening, we saw that some of our tests were failing when run against a particular SQL Server instance. We knew that this instance was due to be rebuilt later that night, so we planned to check again in the morning to see if the tests passed. Brilliant! They did! The rebuild seemed to have fixed the problem. It must have been a transient fault, we assumed.
One week later, the mystery deepened. The same tests were failing again. Once again, it was a Tuesday. It might have been a coincidence, but we weren’t so sure. It seemed less like a random hiccup and more like an actual problem that needed fixing. A hypothesis should enable one to make predictions, and we predicted that the failure would recur on the following Tuesday. Yep, it happened again! Everyone in the team now agreed there was a problem.
Our tests run against SQL Server instances, and they run a script beforehand to create some objects that are necessary for the tests to function properly. These tests are run simultaneously against several SQL Server instances, so we can’t hard-code the machine name into the script. Instead, we get the machine name by querying the MachineName SQL Server property.
The problem was that SQL Server returned the MachineName property in upper case, except on Tuesdays, when it returned it in lower case.
The SQL Server instance in question is case sensitive. Because of the casing difference, our script decided that some logins didn’t exist, then failed when it tried to create them, because in reality they did exist.
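The failure mode can be illustrated with a minimal Python sketch. It does not talk to SQL Server and is not our actual script: the machine name `SQLTEST01`, the login `TestUser`, and the helper function are all hypothetical, and a plain set lookup stands in for an existence check on a case-sensitive instance.

```python
# Hypothetical stand-in for the logins that already exist on the instance.
existing_logins = {"SQLTEST01\\TestUser"}


def login_exists_case_sensitive(login: str) -> bool:
    """Mimic an existence check that treats differently cased names as different."""
    return login in existing_logins


# Wednesday to Monday: the machine name comes back in upper case, so the check finds the login.
print(login_exists_case_sensitive("SQLTEST01" + "\\TestUser"))  # True

# Tuesday: same machine, but the name comes back in lower case, so the check misses it.
print(login_exists_case_sensitive("sqltest01" + "\\TestUser"))  # False
```

When the check misses, the script goes on to create a login that is already there, and that is the step that fails.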
What caused the changing case?
We discovered that it was down to a single security update for SQL Server, KB4583459. Once it was installed, our SQL Server instance started returning the MachineName server property in lower case rather than in upper case.
But why Tuesdays?
We use Puppet to manage our SQL Server instances, and the instance in question is automatically rebuilt in the early hours of Wednesday morning. Somehow, automatic updates had been configured to install on Tuesday mornings, and these included the update that caused the problem.
The environment remained unchanged between Wednesday morning and Monday evening, so the behavior was consistent. We then got one day of installed updates before the environment reset, and thus one day of lower case before normal service resumed on Wednesday.
It doesn’t matter whether the MachineName property is returned in lower case or upper case, as long as the behavior is consistent. So the fix was trivial: force the case in our test code. Tuesday is now just a regular day of the week!
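As a rough sketch of what forcing the case can look like (in Python, and not the actual test code), the machine name can be normalized to a single casing before it is used anywhere. The function name and the example machine name are hypothetical, and choosing upper case is an assumption; lower case would work just as well, provided it is applied everywhere.

```python
def normalize_machine_name(machine_name: str) -> str:
    """Force one casing so later comparisons do not depend on what the server returns."""
    return machine_name.upper()


# Both casings of the same (hypothetical) machine name now produce the same value.
assert normalize_machine_name("SQLTEST01") == normalize_machine_name("sqltest01")
```

The point is that the normalization happens once, at the place where the name is read, so nothing downstream ever sees the raw value.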
How baffling was this?
Very! But I have to acknowledge the IT Operations team at Redgate, who helped immensely. It would have taken a lot longer without their knowledge and insights.