Unknown unknowns

Leon Fayer
homo technologicus’ asylum
5 min read · Apr 10, 2020


Havana, Cuba © Leon Fayer

Perhaps one of the all-time favorite talks I’ve ever given was Production Testing through Monitoring (for those who understand Russian, there is a longer variation of this talk focused on measuring business success). The talk’s central theme is a promotion of business-first monitoring, with supporting technical metrics correlated upwards, but I’ve talked plenty about it over the years and that’s not what prompted today’s rant. Today, I want to revisit the first part of the talk, the part where I make the argument that no matter how “complete” your test suite is, you will never catch 100% of bugs.

Last weekend, we had a major product deployment scheduled. It was a multi-month effort to rebuild the existing platform (more on the challenges and successes later), with multiple levels of QA validation (from manual testing to automated tests) running for weeks prior to deployment, validating both the incremental changes and the full regression. There are a lot of interesting (read: amusing/horrifying) stories that can be told about that project, but this particular story begins about 12 hours before the start of the go-live sequence.

T-11 hours: To count or not to count

Anyone who has ever deployed a major product knows that the last day is pure chaos. Branches are reconciled, testers are frantically finishing their tasks, last-minute issues are being force-merged and readied to go. In the midst of that chaos we received a report from a different product team that a very isolated, yet very prominent, piece of functionality stopped working. The function is really straightforward: INSERT INTO some_table VALUES (….) RETURNING table_id . Mind you, the table (and the code touching the table) was not used by the new product. To double and triple check, the team looked at every file committed in the last day, and nothing was even remotely related to this functionality. After reverting to the first non-working push, the team huddled up and went OG, commenting out one code block at a time to isolate where the issue was coming from. Good news: the root cause was found relatively easily. Bad news: it made no sense.

The issue was traced to a function which also performed a query, one that was in no way related to the database objects in the now-broken function. The gut instinct, as always, was to make sure the transaction was honored, but transactional integrity was maintained. On top of that, both queries ran just fine by hand. That’s it. There was nothing else to comment out.

It did take a bit for someone to take a shot in the dark and ask to comment out SET NOCOUNT ON in the query itself. For those who don’t know, that command controls whether the count of rows affected by a statement is sent back along with the results, i.e. 12345 rows affected . It is used for performance reasons, since it cuts the network traffic spent returning that information for queries that don’t need it. It should not have had any impact on a query from a different part of the system. But it did, because NOCOUNT ON suppressed all return data, including the values coming back from RETURNING, which, in this case, was an id used further in the code.

But how did setting a flag in one part of the code impact a query in another? Apparently, the Java-based framework we used set the NOCOUNT flag on the connection, not the transaction. And since we’re using connection pooling, as soon as a new query starts executing, its connection inherits the NOCOUNT flag. :facepalm:
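To make the scoping problem concrete, here is a minimal sketch of the mechanism in Go’s database/sql (our stack was Java, so this is an illustration of the failure mode, not the actual code; the function names and SQL are made up). A pool-level Exec flips a session flag on whichever physical connection it happens to grab, and that connection carries the flag back into the pool for unrelated code to inherit. Pinning a single connection, and undoing the setting before releasing it, avoids the leak.

package dbstate

import (
	"context"
	"database/sql"
)

// leakySessionFlag shows the failure mode from the story: db.Exec runs on
// *some* pooled connection, sets a session-level flag on it, and releases it
// back to the pool. Whatever code checks that connection out next inherits
// the flag, even though it never asked for it.
func leakySessionFlag(ctx context.Context, db *sql.DB) error {
	_, err := db.ExecContext(ctx, "SET NOCOUNT ON")
	// ... the query that actually wanted NOCOUNT may run on a different
	// connection entirely, while the tainted one goes back into the pool.
	return err
}

// pinnedSessionFlag is the safer pattern: take one connection out of the
// pool, set the flag, run the work, and restore the session state before
// the connection is returned.
func pinnedSessionFlag(ctx context.Context, db *sql.DB) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close() // hands the connection back to the pool

	if _, err := conn.ExecContext(ctx, "SET NOCOUNT ON"); err != nil {
		return err
	}
	defer conn.ExecContext(ctx, "SET NOCOUNT OFF") // undo before releasing

	// ... run the query that needed NOCOUNT on this same connection ...
	return nil
}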

T-2 hours: There are never enough hours in a day

At about 8:10 pm, as developers were scrambling to cross all the t’s and dot all the i’s and testers were running final smoke tests, one of the testers noticed that an error is thrown when trying to log out. Keep in mind, logout had been tested thoroughly up to that point (including the same day) and no changes to that code path had been made in a while. Strange, but wait, we have logs!

{
  "dt": "2020-04-03T20:53:51",
  "http": {
    "method": "GET",
    "script": "/logout",
    "querystring": ""
  },
  "user_id": 123456,
  "message": "Application onError",
  "extended": {
    "referral": "/abc",
    "error": {
      "type": "IdentityManager.logout",
      "detail": "An exception occurred in an Application event handler: onRequest",
      "message": "Could not logout completely",
      "errorcode": "",
      "extendedinfo": "{\"userID\":123456,\"effectiveOn\":\"2020-04-04T24:53:50Z\"}"
    },
    "name": "onRequest"
  },
  "level": "ERROR"
}

At first glance everything looks kosher. Except, under really close inspection, one thing. The date. 2020-04-04T24:53:50Z.

Go ahead, I’ll wait while you figure it out.

I’ll give you the benefit of the doubt and assume you’ve noticed that that particular day has 25 hours in it….

Looking at the code, the format string looked like this: var t = timeFormat(utc, "kk:mm:ssZ"); Since I had never seen the kk syntax in any of the dozen programming languages I had worked with, I went to the source (i.e. the docs).

[hh] Hour in day (0-23)
[kk] Hour in day (1-24)
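The same distinction shows up in other date-format mini-languages (Java’s SimpleDateFormat, for one, uses H for hour 0-23 and k for hour 1-24). The practical consequence is that a 1-24 hour turns midnight into hour 24 and produces timestamps that standards-compliant parsers reject. A quick Go illustration, using the timestamp from the log above (a sketch for demonstration, not our actual code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// The "kk"-style hour turned 00:53:50 into 24:53:50. RFC 3339 only allows
	// hours 00-23, so anything downstream that parses the value chokes on it.
	_, err := time.Parse(time.RFC3339, "2020-04-04T24:53:50Z")
	fmt.Println(err) // parsing time "2020-04-04T24:53:50Z": hour out of range

	// What the format string should have produced (hour in day, 0-23):
	utc := time.Date(2020, time.April, 4, 0, 53, 50, 0, time.UTC)
	fmt.Println(utc.Format("15:04:05Z07:00")) // 00:53:50Z
}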

Given that all the timestamps are UTC, this issue could only be observed between the hours of 8 and 9pm ET (after the switch to daylight saving time), a time when (most) testers are home, enjoying an adult beverage. We lucked out that, the night before, everyone was testing into the evening. Granted, we would’ve likely caught it with monitors, but it certainly wouldn’t have been caught by any reasonable test suite.

T+25 hours: the return of Little Bobby Tables

Exploits of a Mom — https://xkcd.com/327/

The launch went well. Surprisingly well. But I am always worried when things work the first time, so the whole team was glued to the monitors the next day, checking all the vitals. That’s why the lone error did not go unnoticed in a stream of production data.

{
  "error": "cannot unmarshal bool into Go struct field .firstName of type string",
  "time": "2020-04-04T22:06:12-04:00",
  "message": "error getting view"
}

The comic above was a clue. Apparently, there are users in the database whose first name is "True", which, somewhat expectedly, gets interpreted as a boolean in code. This, of course, has the potential for more serious implications, but the point illustrated by this pearl is that it’s difficult to plan for all data cases upfront.
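The error message itself comes from Go’s encoding/json, and it is easy to reproduce. A minimal sketch (the User struct and the payload are made up to match the error, not taken from the actual codebase): somewhere upstream the name had already been serialized as a JSON boolean, so unmarshaling it into a string field fails.

package main

import (
	"encoding/json"
	"fmt"
)

// User mirrors the shape of the view payload; hypothetical, for illustration.
type User struct {
	FirstName string `json:"firstName"`
}

func main() {
	// Upstream, a first name of "True" got coerced into an actual boolean.
	payload := []byte(`{"firstName": true}`)

	var u User
	err := json.Unmarshal(payload, &u)
	fmt.Println(err)
	// json: cannot unmarshal bool into Go struct field User.firstName of type string
}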

As I emphasized in my talk, in no way am I preaching for less testing. But I am trying to raise awareness, if you will, that not everything can (and will) be caught by testing, especially on the first go-around. Supplement your test suites with deep monitoring to detect anomalies, and continuously improve your test coverage. Do both. Do them well.
