A Summary of x86 String Instructions

I have never managed to memorize all of x86 Assembly’s string instructions — so I wrote a cheat sheet for myself. Then I thought other people may find it useful too, and so this cheat sheet is now a blog post.

This is what you’ll find here:

  1. The logic behind x86 string instructions.
  2. All the information from (1) squeezed into a table.
  3. A real-life example.

Let’s go.

Note: to follow this post, you need basic knowledge of x86 Assembly. I do not explain what registers are, how a string is represented in memory, etc.

The Logic

The Prefix + Instruction Combo

First, let’s make the distinction between string instructions (MOVS, LODS, STOS, CMPS, SCAS) and repetition prefixes (REP, REPE, REPNE, REPZ, REPNZ).

Repetition prefixes are meaningful only when preceding string instructions. They cause the specified instruction to repeat as long as certain conditions are met, and they decrement ECX after each iteration. The pointers themselves are updated by the string instruction, which moves them by the operand size on every execution, with or without a prefix.

The possible combinations of prefixes and instructions are described in the following figure.

Possible combinations of repetition prefixes (dark blue) and string instructions (light blue).
Note: I exclude the INS and OUTS string instructions as I have rarely seen them.

Termination Conditions

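  • REP: repeat until ECX equals 0. ECX is decremented after every iteration.
  • REPE, REPZ: repeat until ECX equals 0 or the zero flag is cleared, that is, keep going as long as the zero flag is set. The two prefixes mean exactly the same thing.
  • REPNE, REPNZ: repeat until ECX equals 0 or the zero flag is set, that is, keep going as long as the zero flag is cleared. The two prefixes mean exactly the same thing.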
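
The termination rules above can be modeled in a few lines of Python. This is only a sketch of the semantics, not a description of how the CPU implements them. The names `rep_loop`, `prefix` and `step` are hypothetical and introduced for illustration; `step` stands for one execution of the string instruction and returns the resulting zero flag.

```python
def rep_loop(prefix, ecx, step):
    """Model a repetition prefix wrapped around one string instruction.

    prefix: "rep", "repe" (same as "repz") or "repne" (same as "repnz").
    ecx:    initial value of the counter register.
    step:   hypothetical callback that performs one iteration of the
            string instruction and returns the zero flag (True/False).
    Returns the final value of ECX.
    """
    while ecx != 0:              # the counter is checked before every iteration
        zf = step()              # execute the string instruction once
        ecx -= 1                 # ECX is decremented after each iteration
        if prefix == "repe" and not zf:
            break                # REPE/REPZ stops once the zero flag is cleared
        if prefix == "repne" and zf:
            break                # REPNE/REPNZ stops once the zero flag is set
    return ecx
```

Note that in this model ECX is decremented even on the iteration that ends the loop, which is why the leftover count can be used to work out how many elements were processed.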

String Instructions

The instruction’s first three letters tell us what it does. The “S” in all instructions stands for (how surprising) “String”. Each mnemonic then ends with a letter giving the operand size: ‘B’ for byte, ‘W’ for word (2 bytes) and ‘D’ for double-word (4 bytes).

Some string instructions operate on two strings: the string pointed to by the ESI register (source string) and the string pointed to by the EDI register (destination string):

  • MOV moves data from the source string to the destination string.
  • CMP compares data between the source and destination strings (in x86, comparison is basically subtraction which affects the EFLAGS register).
Strings pointed to by the ESI, EDI registers.
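
As a rough illustration of the two-string instructions, the following Python sketch models one iteration of MOVSB and of CMPSB on a flat `bytearray` that stands in for memory. The function names and the `mem` buffer are assumptions made for this example, and the direction flag is assumed to be clear, so both pointers move forward.

```python
def movsb(mem, esi, edi):
    """One MOVSB: copy the byte at [ESI] to [EDI], then advance both pointers."""
    mem[edi] = mem[esi]
    return esi + 1, edi + 1


def cmpsb(mem, esi, edi):
    """One CMPSB: subtract [EDI] from [ESI] only to set flags, then advance both.

    Returns the new ESI, the new EDI and the zero flag.
    """
    zf = ((mem[esi] - mem[edi]) & 0xFF) == 0
    return esi + 1, edi + 1, zf


# Model of REP MOVSB with ECX = 3: copy "abc" from offset 0 to offset 4.
mem = bytearray(b"abc\x00____")
esi, edi, ecx = 0, 4, 3
while ecx != 0:
    esi, edi = movsb(mem, esi, edi)
    ecx -= 1
```

After the loop, both pointers sit one byte past the last element they touched, which is typical of how string instructions leave ESI and EDI.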

Other string instructions operate on only one string:

  • LOD loads data from the string pointed to by ESI into EAX¹.
  • STO stores data from EAX¹ into the string pointed to by EDI.
  • SCA scans the data in the string pointed to by EDI and compares it to EAX¹ (again, along with affecting EFLAGS).
Notes
1. I use EAX to refer to AL for byte operations, AX for word operations and EAX for double-word operations.
2. After each iteration, ESI and EDI are incremented if the direction flag is clear (DF = 0), and decremented if it is set (DF = 1). The step equals the operand size: 1, 2 or 4 bytes.
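
To make the operand size and the pointer update concrete, here is a small Python sketch of STOSB/STOSW/STOSD. On x86, a clear direction flag (DF = 0, as after CLD) moves the pointer forward, and a set flag (DF = 1, as after STD) moves it backward. The names `SIZES` and `stos`, and the `bytearray` used as memory, are assumptions for illustration.

```python
SIZES = {"b": 1, "w": 2, "d": 4}   # mnemonic suffix -> operand size in bytes


def stos(mem, edi, eax, suffix, df):
    """One STOSB/STOSW/STOSD: store AL/AX/EAX at [EDI], then move EDI.

    df is the direction flag: 0 moves EDI forward, 1 moves it backward.
    Returns the new value of EDI.
    """
    size = SIZES[suffix]
    value = eax & ((1 << (8 * size)) - 1)            # keep only AL, AX or EAX
    mem[edi:edi + size] = value.to_bytes(size, "little")
    return edi - size if df else edi + size


mem = bytearray(8)
edi = stos(mem, 0, 0x11223344, "d", df=0)     # STOSD forward: EDI 0 -> 4
edi = stos(mem, edi, 0x11223344, "w", df=0)   # STOSW forward: EDI 4 -> 6
```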
REPE CMPSB for Trump’s Rescue.

Cheat Sheet

Cheat sheet for x86 Assembly’s string instructions.

A Real-Life Example

Lately, we started doing CTFs at work (Trusteer, IBM Security). I stumbled upon a crack-me challenge from reversing.kr which contained the following function. Try to think about what this function is while we reverse engineer it together.

The function receives three arguments and puts the third (arg_8) in ECX. If arg_8 equals zero, the function returns.

Otherwise, we prepare the other registers for a string instruction: the first argument, arg_0, is moved into EDI and EAX is set to zero.

Now, we have a REPNE SCASB:

  • The string pointed to by EDI is scanned and each character is compared to zero, held by AL.
  • This happens until ECX equals zero or until a null-terminator is scanned.

Practically speaking, this instruction finds the length of the destination string, capped at arg_8.
If ECX ends up being zero (meaning a null terminator was not encountered), then ECX simply receives its original value back: arg_8.
Otherwise, if the loop terminates due to a null character, ECX is set to the destination string’s length (including the null character).
In other words, ECX is set to min(arg_8, len(destination_string)), where the length counts the null character.
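
This counting trick can be checked with a small Python model. It is a sketch under assumptions: `scan_length` is a hypothetical name, the string is a Python `bytes` object that is assumed to contain a null terminator (or at least `n` bytes), and the result is computed as the original count minus what is left in ECX, which is one way to arrive at the value described above.

```python
def scan_length(s, n):
    """Model REPNE SCASB with AL = 0 and ECX = n over the bytes in s.

    Returns the number of bytes scanned: n if no null byte was found in the
    first n bytes, otherwise the string length including the null byte.
    """
    ecx, edi = n, 0
    while ecx != 0:
        zf = s[edi] == 0     # SCASB: compare AL (zero) with the byte at [EDI]
        edi += 1             # direction flag assumed clear
        ecx -= 1
        if zf:               # REPNE stops once the zero flag is set
            break
    return n - ecx           # original count minus what is left in ECX
```

For example, `scan_length(b"abc\x00", 10)` gives 4 (three characters plus the terminator), while `scan_length(b"abcdef\x00", 3)` gives 3 because the counter runs out first.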

Now EDI is set to arg_0 and ESI is set to arg_4, and we have REPE CMPSB:

  1. Each character pointed to by EDI is compared to the corresponding one pointed to by ESI.
  2. This happens until ECX equals zero (namely, the destination string has been fully consumed) or until the zero flag is unset (namely, until a difference between the strings is detected).

Then, the last pair of characters that was compared is examined. Since both pointers have already moved past it, these are the bytes at [EDI-1] and [ESI-1]:

  • If they are equal — the function returns zero (ECX XORed with itself).
  • If the character in [ESI-1] has a higher ASCII value than the one in [EDI-1] — the function returns 0xffffffff, or -1. This happens when the source string is lexicographically bigger than the destination string.
  • Otherwise, the function returns NOT 0xfffffffe, which is 1.

I reverse-engineered the function at work and then went to a colleague to see how he was doing. To my surprise, his IDA recognized this function as strncmp. My version didn’t. Argh.

strncmp makes good use of string instructions, which makes it a nice function to practice on. In any case, now you know how strncmp is implemented.
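
For reference, the whole walkthrough can be condensed into a short Python model. This is a sketch of the logic described in this post, not the disassembled code itself: `strncmp_model` is a hypothetical name, `dst` plays the role of arg_0 (EDI), `src` plays arg_4 (ESI), `n` plays arg_8, and both strings are assumed to be null-terminated `bytes` objects.

```python
def strncmp_model(dst, src, n):
    """Follow the walkthrough step by step and return 0, -1 or 1."""
    if n == 0:                       # arg_8 is zero: return immediately
        return 0

    # REPNE SCASB with AL = 0: scan dst for the null terminator.
    ecx, edi = n, 0
    while ecx != 0:
        zf = dst[edi] == 0
        edi += 1
        ecx -= 1
        if zf:
            break
    ecx = n - ecx                    # min(n, length of dst including the null)

    # REPE CMPSB: compare src and dst byte by byte while they are equal.
    esi = edi = 0
    while ecx != 0:
        zf = src[esi] == dst[edi]
        esi += 1
        edi += 1
        ecx -= 1
        if not zf:
            break

    # Examine the last compared pair, [ESI-1] and [EDI-1].
    a, b = src[esi - 1], dst[edi - 1]
    if a == b:
        return 0
    if a > b:
        return -1                    # src is bigger, so dst < src
    return 1
```

With these assumptions, `strncmp_model(b"abc\x00", b"abd\x00", 10)` returns -1 and `strncmp_model(b"abcX\x00", b"abcY\x00", 3)` returns 0, which matches what one would expect from strncmp.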