From A to W: Character Conversion in Windows API

Ophir Harpaz
Nov 21, 2018 · 6 min read

What For?

If you’ve ever written, read or reversed a Windows application you probably know that many Windows API functions have both an ANSI version (SomeFunctionA) as well as a Unicode version (SomeFunctionW).

Not too long ago, I read that many Windows “A” functions end up calling their corresponding “W” versions, after converting the function’s textual parameters from ANSI to Unicode. Namely, CopyFileA ends up calling CopyFileW, CreateNamedPipeA calls CreateNamedPipeW, etc.

I got curious and wanted to see this process under the hood. I decided to dive into it and ended up (as I usually do) writing about it. I chose the MessageBox API as my case study.

Please note:

  • Some (but not a lot of!) reverse-engineering background may help in understanding what I’m talking about.
  • I will not go as low as seeing a one-byte character converted into a two-byte character. We’ll stop at the call to MBToWCSEx and treat it as a black-box conversion function.

Preparations

My plan was to do a mix of static and dynamic analyses and therefore I needed two things:

  1. IDA’s disassembly of user32.dll (which has the MessageBox functions); and
  2. OllyDbg running a simple MessageBox program.

To accomplish the latter requirement, I compiled an extremely-simple C program which calls MessageBoxA:

Figure 1: The source and output of my simple MessageBox program

I opened the executable with OllyDbg, found the call to MessageBoxA and stepped into it. The function looked different from what I saw in IDA when I opened user32.dll:

Figure 2: MessageBoxA looks different on OllyDbg (left) and IDA (right).

I didn’t understand what the problem was until I realized that I had compiled the program as a 32-bit executable but was running it on a 64-bit system. In such cases, the program loads the WoW64 (Windows 32-bit on Windows 64-bit) DLLs. So I opened SysWOW64\user32.dll in IDA and this time I saw identical code. 😏

Figure 3: Looks similar this time. Phew

Now I could finally start staring at assembly.

The Function Call Flow

As it turns out, every version of MessageBox calls an undocumented function named MessageBoxTimeout (you can read more about it here and here). MessageBoxTimeout takes two additional arguments, a language ID and a timeout, and limits the lifetime of the message box to the given number of milliseconds.

Therefore, the body of MessageBoxA (seen in figures 2 and 3) can be summed up in one line of code:

return MessageBoxTimeoutA(hWnd, lpText, lpCaption, uType, 0, 0xFFFFFFFF);

which basically means “call MessageBoxTimeoutA with a neutral language and a timeout of about 49 days” (0xFFFFFFFF milliseconds).
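The “about 49 days” figure is plain arithmetic on the timeout argument. Here is a small sketch of that calculation; the helper names `ms_to_days` and `check_default_timeout` are mine and purely illustrative, not part of any Windows API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a timeout in milliseconds to days. */
double ms_to_days(uint32_t milliseconds)
{
    const double ms_per_day = 24.0 * 60.0 * 60.0 * 1000.0; /* 86,400,000 ms */
    return (double)milliseconds / ms_per_day;
}

/* Usage check: the timeout pushed by MessageBoxA is a bit under 50 days. */
void check_default_timeout(void)
{
    double days = ms_to_days(0xFFFFFFFFu); /* 4,294,967,295 ms */
    assert(days > 49.7 && days < 49.8);
}
```

0xFFFFFFFF is the largest value a 32-bit timeout can hold, so in practice the message box stays up until the user dismisses it.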

The thing I was interested in to begin with was finding the ANSI-to-Unicode conversion, which did not seem to appear in this code snippet. Therefore, my next stop was MessageBoxTimeoutA.

Static Analysis of MessageBoxTimeoutA

I will walk through the function’s basic blocks in order, as far as that is possible.

Block 1 (0x69E78080)

The first block contains:

  1. The function’s prologue;
  2. Allocation of stack space for two variables (done by pushing ecx twice);
  3. Storing callee-saved registers (ebx, esi, edi); and
  4. Setting the two local variables (var_4, var_8) to zero.

Next, the lpText pointer is tested for null. Let’s assume it isn’t a null-pointer and move to the next block.

Blocks 2 (0x69E7809B) and 3 (0x69E780B6)

In the second block we see a call to MBToWCSEx, which stands for multi-byte to wide-character string (extended). This seems to be the string-converting function I was looking for. Yay!

As I stated before, I will not reverse this function in this blog post. However, I would still like to draw its general outline:

  1. MBToWCSEx receives, among other parameters, the source ANSI string and a stack address (leftmost illustration in figure 6).
  2. It converts this string to Unicode and stores the result in a buffer allocated on the heap (middle illustration in figure 6).
  3. It writes the address of this heap buffer into the provided stack address.
    In our case, these stack addresses are those of the local variables var_4 and var_8. So eventually, these variables hold the heap addresses of the Unicode buffers (rightmost illustration in figure 6).
  4. MBToWCSEx returns the number of characters converted, or zero in case of failure. You can either trust me on that or reverse the function yourself (which could be a good exercise!).

Figure 6: Address-saving mechanism in MBToWCSEx. In our case, Local Variable is either var_4 or var_8.
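The address-saving mechanism in the outline above is the classic C “out-parameter” pattern. Below is a minimal sketch of it. To be clear about the assumptions: `mb_to_wcs_sketch` is a hypothetical stand-in with a simplified signature, not the real MBToWCSEx. It widens each byte naively (correct only for 7-bit ASCII), and it counts the terminating null in its return value so that an empty string does not look like a failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical stand-in for MBToWCSEx (simplified signature, an assumption).
 * On success: *out receives a heap buffer that the caller must free, and the
 * number of wide characters written (including the terminator) is returned.
 * On failure: *out is left untouched and 0 is returned.
 */
int mb_to_wcs_sketch(const char *src, wchar_t **out)
{
    assert(src != NULL && out != NULL);

    size_t len = strlen(src);
    wchar_t *buf = malloc((len + 1) * sizeof *buf);
    if (buf == NULL)
        return 0;

    /* Naive widening: only valid for 7-bit ASCII input. */
    for (size_t i = 0; i < len; i++)
        buf[i] = (wchar_t)(unsigned char)src[i];
    buf[len] = L'\0';

    *out = buf; /* write the heap address into the caller's stack slot */
    return (int)(len + 1);
}
```

A caller would declare a local `wchar_t *` initialized to NULL (playing the role of var_4 or var_8), pass its address as `out`, and free the buffer once it is done with it.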

Now let’s finish analyzing MessageBoxTimeoutA. Take a look at the end of Block 2 (figure 5) where the return value of MBToWCSEx is tested. If it’s zero, we jump to a block which sets eax to zero and leaves the function. Otherwise, we move on to block 3 where the pointer to our newly-converted lpText string is stored in edi.

Figure 7: Block 3

Blocks 4 (0x69E780B9), 5 (0x69E780BE) and 6 (0x69E780E5)

Figure 8: Blocks 4, 5, and 6

In blocks 4–6, we see the exact same pattern, only this time with lpCaption instead of lpText. First, lpCaption is checked for a null pointer. We’ll once again assume a valid pointer. Then, MBToWCSEx is called with lpCaption and var_8 as parameters. Based on our knowledge, we can now rename var_8 to lpCaptionUnicode. If MBToWCSEx fails, eax is set to 0 and the function returns. Otherwise the newly-converted lpCaption is stored in esi (block 6) and we proceed to the last part of the function.

Block 7 (0x69E780E8) — The Finale

Figure 9: Block 7

Now, all that is left is to call MessageBoxTimeoutW with the Unicode strings instead of the original ANSI ones. Notice how the other parameters (dwMilliseconds, LanguageID and uType) are pushed as is. Next, esi and edi are pushed, holding:

  • The address of the heap-buffers in which the new Unicode strings reside, if the conversion took place; or
  • Zero — as initialized in block 1 — otherwise.

The next instructions free the allocated heap memory and restore callee-saved registers. Finally, the function returns.
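Putting the blocks together, the control flow of MessageBoxTimeoutA can be approximated in portable C. This is a sketch based only on the behavior described above, not a decompilation. Every name in it is hypothetical: `to_wide` stands in for MBToWCSEx (naive ASCII-only widening), and `message_box_timeout_w_stub` stands in for MessageBoxTimeoutW, recording its arguments instead of showing a dialog. Freeing the text buffer when the caption conversion fails is my choice for the sketch; the article does not cover that path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

wchar_t last_text[64];    /* what the stub last received as text    */
wchar_t last_caption[64]; /* what the stub last received as caption */

/* Hypothetical converter: returns chars written (incl. terminator), 0 on failure. */
static int to_wide(const char *src, wchar_t **out)
{
    assert(src != NULL && out != NULL);
    size_t len = strlen(src);
    wchar_t *buf = malloc((len + 1) * sizeof *buf);
    if (buf == NULL)
        return 0;
    for (size_t i = 0; i <= len; i++)          /* copies the terminator too */
        buf[i] = (wchar_t)(unsigned char)src[i];
    *out = buf;
    return (int)(len + 1);
}

static void remember(wchar_t *dst, const wchar_t *src)
{
    dst[0] = L'\0';
    if (src != NULL) {
        wcsncpy(dst, src, 63);
        dst[63] = L'\0';
    }
}

/* Stub for MessageBoxTimeoutW: records the strings and pretends OK was clicked. */
static int message_box_timeout_w_stub(const wchar_t *text, const wchar_t *caption,
                                      unsigned type, unsigned short lang_id,
                                      unsigned long timeout_ms)
{
    (void)type; (void)lang_id; (void)timeout_ms;
    remember(last_text, text);
    remember(last_caption, caption);
    return 1; /* IDOK */
}

int message_box_timeout_a_sketch(const char *text, const char *caption,
                                 unsigned type, unsigned short lang_id,
                                 unsigned long timeout_ms)
{
    wchar_t *text_w = NULL;    /* var_4: stays zero if text is NULL    */
    wchar_t *caption_w = NULL; /* var_8: stays zero if caption is NULL */

    if (text != NULL && to_wide(text, &text_w) == 0)
        return 0;                              /* blocks 1-3 */
    if (caption != NULL && to_wide(caption, &caption_w) == 0) {
        free(text_w);                          /* sketch-only cleanup choice */
        return 0;                              /* blocks 4-6 */
    }

    /* Block 7: forward to the wide version, then free the heap buffers. */
    int result = message_box_timeout_w_stub(text_w, caption_w, type,
                                            lang_id, timeout_ms);
    free(text_w);
    free(caption_w);
    return result;
}
```

Calling `message_box_timeout_a_sketch("Hello", "Demo", 0, 0, 0xFFFFFFFFul)` leaves the wide strings in `last_text` and `last_caption`, which mirrors how the original ANSI arguments reach the W function only in converted form.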

Wrapping It Up

In this hopefully-not-too-exhausting analysis, we saw the call-flow of the MessageBox API. We analyzed MessageBoxTimeoutA in order to see the place where ANSI strings are turned into Unicode strings and we learned how the output string is passed back to the caller.

If you have any questions on things I did or did not cover — don’t hesitate to leave them in the comments section below.

Written by Ophir Harpaz: security researcher at Guardicore, reverse engineering enthusiast, and author of https://begin.re.
