From A to W: Character Conversion in Windows API

What For?

If you’ve ever written, read or reversed a Windows application, you probably know that many Windows API functions have both an ANSI version (SomeFunctionA) and a Unicode version (SomeFunctionW).

Not too long ago, I read that many Windows “A” functions end up calling their corresponding “W” versions, after converting the function’s textual parameters from ANSI to Unicode. For example, CopyFileA ends up calling CopyFileW, CreateNamedPipeA calls CreateNamedPipeW, etc.

I got curious and wanted to see this process under the hood. I decided to dive into it and ended up (as I usually do) writing about it. I chose the MessageBox API to serve as my case study.

Please note:

  • Some (but not a lot of!) reverse-engineering background may help in understanding what I’m talking about.
  • I will not go as low as seeing a one-byte character converted into a two-byte character. We’ll stop at the call to MBToWCSEx and treat it as a black-box conversion function.

Preparations

My plan was to do a mix of static and dynamic analyses and therefore I needed two things:

  1. IDA’s disassembly of user32.dll (which has the MessageBox functions); and
  2. OllyDbg running a simple MessageBox program.

To accomplish the latter requirement, I compiled an extremely simple C program which calls MessageBoxA:

Figure 1: The source and output of my simple MessageBox program

I opened the executable with OllyDbg, found the call to MessageBoxA and stepped into it. The function looked different from what I saw in IDA when I opened user32.dll.

I didn’t quite understand what the problem was until I figured out: I compiled the program as a 32-bit executable but ran it on a 64-bit architecture. In these cases, the program uses the WoW64 (Windows32 on Windows64) DLLs. So I opened SysWOW64/user32.dll in IDA and this time I saw identical code. 😏

Now I could finally start staring at Assembly.

The Function Call Flow

As it turns out, every version of MessageBox calls an undocumented function named MessageBoxTimeout (you can read more about it here and here). The latter receives two additional arguments, a language ID and a timeout, and limits the lifetime of a Windows message box to the provided number of milliseconds.

Therefore, the body of MessageBoxA (seen in figures 2 and 3) can be summed up in one line of code:

return MessageBoxTimeoutA(hWnd, lpText, lpCaption, uType, 0, 0xFFFFFFFF);

which basically means “call MessageBoxTimeoutA with a neutral language and a timeout of about 49 days”.
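The "about 49 days" figure follows from reading 0xFFFFFFFF as a plain count of milliseconds. Here is a minimal sketch of that arithmetic; the function name timeout_ms_to_days is my own and is not part of any Windows API.

```c
#include <stdint.h>

/* Convert a timeout given in milliseconds into (fractional) days.
 * Hypothetical helper, used only to check the arithmetic in the text. */
double timeout_ms_to_days(uint32_t ms)
{
    const double ms_per_day = 24.0 * 60.0 * 60.0 * 1000.0; /* 86,400,000 ms */
    return (double)ms / ms_per_day;
}
```

Calling timeout_ms_to_days(0xFFFFFFFFu) gives a value a little under 50, which is where the "about 49 days" comes from.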

What I was looking for in the first place was the ANSI-to-Unicode conversion, which does not appear in this code snippet. Therefore, my next stop was MessageBoxTimeoutA.

Static Analysis of MessageBoxTimeoutA

I will walk through the function’s basic blocks in execution order, as far as possible.

Block 1 (0x69E78080)

Figure 4: Block 1

The first block contains:

  1. The function’s prologue;
  2. Allocation of stack space for two variables (done by pushing ecx twice);
  3. Storing callee-saved registers (ebx, esi, edi); and
  4. Setting the two local variables (var_4, var_8) to zero.

Next, the lpText pointer is tested for null. Let’s assume it isn’t a null-pointer and move to the next block.

Blocks 2 (0x69E7809B) and 3 (0x69E780B6)

Figure 5: Block 2

In the second block we see a call to MBToWCSEx, which stands for multi-byte to wide-character string (extended). This seems to be the string-converting function I was looking for. Yay!

As I stated before, I will not reverse this function in this blog post. However, I would still like to draw its general outline:

  1. MBToWCSEx receives, among other parameters, the source ANSI string and a stack address (leftmost illustration in figure 6).
  2. It converts this string to Unicode and stores the result in an allocated buffer on the heap (middle illustration in figure 6).
  3. It writes the address of this heap buffer into the provided stack address.
    In our case, these stack addresses are those of the local variables var_4 and var_8. So eventually, these variables hold the heap addresses where the Unicode buffers reside (rightmost illustration in figure 6).
  4. MBToWCSEx returns the number of characters converted, or zero in case of failure. You can either trust me on that or reverse the function yourself (which could be a good exercise!).
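To make this outline concrete, here is a small portable model of the same contract. It is a sketch under loud assumptions, not the real MBToWCSEx: the name widen_to_heap is made up, the "conversion" simply widens each byte (which is only right for plain ASCII text), and I chose to count the terminating null so that an empty string is not mistaken for a failure.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for the contract outlined above (NOT the real MBToWCSEx).
 * - src: the narrow source string
 * - out: address of the caller's local variable that receives the heap pointer
 * Returns the number of wide characters written, terminator included
 * (an assumption of this sketch), or 0 on failure. */
size_t widen_to_heap(const char *src, wchar_t **out)
{
    size_t len, i;
    wchar_t *buf;

    if (src == NULL || out == NULL)
        return 0;

    len = strlen(src);
    buf = malloc((len + 1) * sizeof *buf);   /* step 2: buffer on the heap */
    if (buf == NULL)
        return 0;

    for (i = 0; i < len; i++)                /* naive byte-to-wide widening */
        buf[i] = (wchar_t)(unsigned char)src[i];
    buf[len] = L'\0';

    *out = buf;                              /* step 3: write the heap address back */
    return len + 1;                          /* step 4: nonzero means success */
}
```

The caller owns the buffer and has to free it, which matches the cleanup we will see at the end of MessageBoxTimeoutA.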

Now let’s finish analyzing MessageBoxTimeoutA. Take a look at the end of Block 2 (figure 5) where the return value of MBToWCSEx is tested. If it’s zero, we jump to a block which sets eax to zero and leaves the function. Otherwise, we move on to block 3 where the pointer to our newly-converted lpText string is stored in edi.

Figure 7: Block 3

Blocks 4 (0x69E780B9), 5 (0x69E780BE) and 6 (0x69E780E5)

Figure 8: Blocks 4, 5, and 6

In blocks 4–6, we see the exact same pattern, only this time with lpCaption instead of lpText. First, lpCaption is checked for a null pointer. We’ll once again assume a valid pointer. Then, MBToWCSEx is called with lpCaption and var_8 as parameters. Based on what we know, we can now rename var_8 to lpCaptionUnicode. If MBToWCSEx fails, eax is set to 0 and the function returns. Otherwise, the newly-converted lpCaption is stored in esi (block 6) and we proceed to the last part of the function.

Block 7 (0x69E780E8) — The Finale

Figure 9: Block 7

Now, all that is left is to call MessageBoxTimeoutW with the Unicode strings instead of the original ANSI ones. Notice how the other parameters (dwMilliseconds, wLanguageId and uType) are pushed as is. Next, esi and edi are pushed, holding:

  • The addresses of the heap buffers in which the new Unicode strings reside, if the conversion took place; or
  • Zero — as initialized in block 1 — otherwise.

The next instructions free the allocated heap memory and restore callee-saved registers. Finally, the function returns.
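Putting blocks 1–7 together, the whole function can be approximated in C. This is my own reconstruction from the walkthrough, not decompiler output: convert_stub and wide_stub are hypothetical stand-ins for MBToWCSEx and MessageBoxTimeoutW so that the sketch compiles anywhere, and freeing the text buffer when the caption conversion fails is an assumption I added to keep the sketch leak-free.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for MBToWCSEx: naive ASCII widening into a heap buffer.
 * Returns the character count (terminator included) or 0 on failure. */
static size_t convert_stub(const char *src, wchar_t **out)
{
    size_t len = strlen(src), i;
    wchar_t *buf = malloc((len + 1) * sizeof *buf);
    if (buf == NULL)
        return 0;
    for (i = 0; i < len; i++)
        buf[i] = (wchar_t)(unsigned char)src[i];
    buf[len] = L'\0';
    *out = buf;
    return len + 1;
}

/* Hypothetical stand-in for MessageBoxTimeoutW: instead of showing a dialog,
 * it returns how many wide characters it received, so the flow is observable. */
static int wide_stub(const wchar_t *text, const wchar_t *caption,
                     unsigned type, unsigned short lang, unsigned long timeout_ms)
{
    (void)type; (void)lang; (void)timeout_ms;
    return (int)((text ? wcslen(text) : 0) + (caption ? wcslen(caption) : 0));
}

int message_box_timeout_a_sketch(const char *lpText, const char *lpCaption,
                                 unsigned uType, unsigned short wLanguageId,
                                 unsigned long dwMilliseconds)
{
    wchar_t *text_w = NULL;     /* var_4, zeroed in block 1 */
    wchar_t *caption_w = NULL;  /* var_8 (lpCaptionUnicode), zeroed in block 1 */
    int result;

    /* Blocks 1-3: convert lpText unless it is a null pointer */
    if (lpText != NULL && convert_stub(lpText, &text_w) == 0)
        return 0;

    /* Blocks 4-6: same pattern for lpCaption */
    if (lpCaption != NULL && convert_stub(lpCaption, &caption_w) == 0) {
        free(text_w);           /* assumption: not shown in the walkthrough */
        return 0;
    }

    /* Block 7: other arguments pass through unchanged, then cleanup */
    result = wide_stub(text_w, caption_w, uType, wLanguageId, dwMilliseconds);
    free(text_w);
    free(caption_w);
    return result;
}
```

With these stubs, message_box_timeout_a_sketch("Hi", "Cap", 0, 0, 0xFFFFFFFF) returns 5 (two characters of text plus three of caption), and passing a null pointer for either string simply forwards a null pointer to the wide version.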

Wrapping It Up

In this hopefully-not-too-exhausting analysis, we saw the call-flow of the MessageBox API. We analyzed MessageBoxTimeoutA in order to see the place where ANSI strings are turned into Unicode strings and we learned how the output string is passed back to the caller.

If you have any questions on things I did or did not cover — don’t hesitate to leave them in the comments section below.

Written by

@ophirharpaz on Twitter. Security researcher at Guardicore. Reverse engineering enthusiast. Author of https://begin.re.
