Using Cobra to detect the infamous sscanf bug

Thomas DuBuisson
Published in Musings
3 min read · Mar 9, 2021

A wildly popular reverse-engineering write-up on GTA showed how a poorly written JSON parser had quadratic complexity, costing minutes of extra load time. Not long after, Matt Keeter discussed the same bug in their own project, with a message about staying humble because “it could happen to you”. But you can do better than being humble: you can use the Cobra tool to detect this pattern in your codebase and prevent it from ever reaching master.

Performance? Wadaya gonna do with those extra minutes anyway?

The first blog post is a great piece of spelunking that tracks down a performance bug. The author even hacks in a fix to prove the point; it’s just a great article. At its core, the issue is a for loop that steps through a buffer and calls strlen on the rest of the buffer at every step, so the computational cost is quadratic. If that seems too obvious to miss, replace strlen with sscanf. Yes, sscanf actually computes the length of its input (at least in the C library implementations involved here), and that was the problem.

Boiling down the original problem to the key components: there is a buffer, a loop, and a call to sscanf on the buffer. That is:

function() {
    type *data = init();
    for (..) {
        sscanf(data, ...);   /* re-measures everything left in data */
        data = data + n;     /* advance past what was just parsed */
    }
}

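Each iteration calls sscanf and then advances the data pointer. Every call checks the length of whatever remains in data. If the original length is 10 and the pointer advances by just one byte per iteration, the calls compute the lengths of strings of 10, 9, 8, 7 … bytes, for 55 bytes measured in total. In general that is about n²/2 work for an n-byte buffer, which gets time consuming fast (err, slow?).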
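
To put a number on that, here is a minimal sketch that models the cost rather than measuring it. The function name bytes_measured and the step parameter are made up for illustration, and a plain strlen stands in for the hidden length check inside sscanf, so treat it as a model of the pattern and not as what any particular C library does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: total number of bytes measured when every
   iteration re-measures the remaining buffer and then advances by
   `step` bytes. strlen stands in for sscanf's hidden length check. */
size_t bytes_measured(const char *data, size_t step) {
    assert(step > 0);                     /* a zero step would never finish */
    size_t total = 0;
    for (;;) {
        size_t remaining = strlen(data);  /* the repeated O(n) scan */
        if (remaining == 0) {
            break;
        }
        total += remaining;
        /* advance, but never past the terminating NUL */
        data += remaining < step ? remaining : step;
    }
    return total;
}
```

Calling bytes_measured("0123456789", 1) returns 55, the 10 + 9 + … + 1 from the paragraph above. Scale the same arithmetic to a 10 MB buffer and the total lands around 5 × 10^13 bytes measured, which is where the extra minutes come from.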

Beyond “Being Careful”

Keeter tells us to be humble and careful. As programmers, let’s also automate. You might not remember the particulars of sscanf, or notice the loop pattern while refactoring someone else’s code, but an automatic tool can save you. You can look for this pattern in C code using the Cobra tool, which essentially lets you grep through C code in a syntax-aware way. Cobra tokenizes the code and runs regular expressions over the stream of tokens. For the sscanf bug you can use the pattern:

{ .* x:@ident .* ( .* ) { .* sscanf ( :x , .* ) .* } .* }

Breaking it down:

  1. Within a code block skip any number of tokens ({ .* … })
  2. Match an identifier ( x:@ident )
  3. Skip tokens ( .* )
  4. Match parentheses with any number of tokens in between ( ( .* ) )
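  5. Implicitly look for loops by matching a code block immediately after the parentheses ( { … } ); an if or switch block has the same shape, so expect the occasional false positive
  6. Look for sscanf called on the same identifier inside the new code block ( .* sscanf ( :x , .* ) )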
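
For a concrete target, here is a small C sketch with the shape described in the steps above, next to a variant that avoids sscanf entirely. Both function names are invented for this example. Going by the breakdown, the first function should match the pattern (p is the identifier, the while condition supplies the parentheses, and sscanf is called on p inside the block) and the second should not, but this has not been checked against Cobra itself.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Sums whitespace-separated integers. The sscanf call sits inside the
   loop and reads from the advancing pointer: the shape the pattern is
   meant to flag. */
long sum_ints_sscanf(const char *text) {
    assert(text != NULL);
    const char *p = text;
    long sum = 0;
    int value, used;
    while (*p != '\0') {
        /* %n stores how many characters this call consumed */
        if (sscanf(p, "%d%n", &value, &used) != 1) {
            break;              /* no further integer to read */
        }
        sum += value;
        p += used;
    }
    return sum;
}

/* Same result using strtol, which reports where parsing stopped and
   does not need to measure the rest of the buffer. Overflow handling
   is left out to keep the sketch short. */
long sum_ints_strtol(const char *text) {
    assert(text != NULL);
    const char *p = text;
    long sum = 0;
    for (;;) {
        char *end;
        long value = strtol(p, &end, 10);
        if (end == p) {
            break;              /* nothing was converted */
        }
        sum += value;
        p = end;
    }
    return sum;
}
```

Both functions return 6 for the input "1 2 3". Saving them to a file gives you something small to point the pattern at before turning it loose on a real codebase.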

To try it manually, just install Cobra and run the following, adding the C files you want checked:

cobra -json -pe \
'{ .* x:@ident .* ( .* ) { .* sscanf ( :x , .* ) .* } .* }'

I don’t happen to have the GTA source sitting around, but Keeter’s code, from the second blog post, is available. Let’s do it!

$ git clone https://github.com/TomMD/erizo
$ cd erizo
$ git checkout introduce-stl-support
$ cobra -json -pe \
'{ .* x:@ident .* ( .* ) { .* sscanf ( :x , .* ) .* } .* }' \
$(find . -name '*.c')

Bingo!

[
{ "type" : "{ .* x:@ident .* ( .* ) { .* sscanf ( :x , .* ) .* } .* }",
"message" : "lines 52..86",
"file" : "./src/loader.c",
"line" : 52
}
]

With a bit of plumbing you can set up CI to run this check and make this sscanf nightmare a thing of the past.

Thomas DuBuisson
Musings

With a background in security, cryptography, and protocol correctness I’m now working (with muse.dev) to make static analysis more mainstream and accessible.