Reading from a file, pipe or socket is not the same

Benoît Sevens
May 29, 2019 · 10 min read

I was playing with one of many awesome CTF challenges made by j00ru, called “antipasto”, when I noticed some behavior I could not explain.

If you would like to solve this challenge without any spoilers, then stop reading right now, because this blog post will spoil it a tiny bit.

A big thanks goes out to j00ru for (besides creating the challenge) motivating and helping me find the explanation for the weird behavior I am about to describe.

I have learned interesting stuff about the Linux kernel (it was my first time looking at it), and I hope you will too!

Context

The challenge can be solved by knowing that reading from a file descriptor into memory that is not large enough to accomodate the number of requested bytes, the read will return -1.

Let’s clarify this with an example. Suppose somewhere in a C program you have this:

read(0, p, 1000);

This line tries to read 1000 bytes from STDIN (i.e. file descriptor 0) into a pointer p. Now suppose also that p points to mapped memory, but memory is unmapped starting at, let’s say, p+500. In this case, read will return -1. Knowing this, you can exploit a bug in the program and solve the challenge (which I won’t detail here).

However, there is one caveat to add. read only returns -1 if the file descriptor (first argument) you are reading from, represents a socket!

j00ru clearly states in his setup instructions for the challenge that the challenge is supposed to run via netcat or something else, but in any case over the network. I, however, tried to exploit the binary via a pipe. While doing this, I noticed read did not return -1. I wondered why.

As you will see, the problem will seem simple, but the explanation for me was not straightforward. I learned a lot along the way about file descriptors, the Linux kernel source code, strace and Systemtap.

Simplified problem statement

Suppose you have a program like this:

which you compile into a program called test:

# g++ -o test test.cpp

As you can see, this program tries to read 0x3000 (or 12288 in decimal) bytes from STDIN (file descriptor 0) into a pointer p and prints how many bytes were read. However, memory is unmapped at p+0x1800. So, maximum 0x1800 (or 6144 in decimal) bytes can actually be read.

Now let’s feed this program the bytes it requests, via 3 different ways:

  • Via a pipe:
# python3 -c 'print("A"*0x3000)' | ./test
4096
  • Via a file:
root@debian:~/j00ru# python3 -c 'print("A"*0x3000)' > input.bin; ./test < input.bin
6144
  • Or via a socket:
# nc -l -p 1337 -e ./test &
[1] 9098
# python3 -c 'print("A"*0x3000)' | nc localhost 1337
-1

In each case, the result is different! Only in the last case (of a socket), does read effectively return -1. In the case of a pipe, read returns 4096 and in the case of a file, read returns 6144. Hm…

Let’s study each case separately and try to explain the why.

And oh yeah, the OS version I am playing on is:

# uname -a
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux

The file descriptor represents a pipe

First we strace the process:

  • A pipe is created for the IPC between python and test
  • bash clones itself twice, once for python and once for test
  • The read end of the pipe is connected to the STDIN of the test clone
  • The test clone loads the test image with an execve
  • The write end of the pipe is connected to the STDOUT of the python clone
  • test performs its read call, which is blocking, because the pipe is empty. Indeed, man 7 pipe states this:

If a process attempts to read from an empty pipe, then read(2) will block until data is available.

  • python writes 0x3000 bytes to the pipe
  • read returns with a value of 0x1000

Now that we have a good idea of what happens in userland, we can turn to kernelland to search for the answer. Let’s see how the read system call is performed in the kernel sources.

The definition of this system call can be found in fs/read_write.c:

We see that it calls vfs_read, defined in the same file:

The most important part is where it calls the read member of the f_op member of the file. So the actual read function implementation is stored in a file_operations structure. The structure is separately defined for each file type.

This means that each file type will have its own implementation and code. As we will see, this is the basis for explaining why we got different results.

In this case, for pipes, we find its definition in fs/pipes.c:

We see that for reading, it calls into new_sync_read, which is defined in fs/read_write.c:

This function calls the read_iter function of the file_operations structure, which for pipes is pipe_read, defined in fs/pipe.c :

Finally, pipe_read calls copy_page_to_iter , which will copy the data to userland.

To summarize:

  • It starts with aread system call
  • Which calls into vfs_read
  • Which calls into new_sync_read
  • Which calls into pipe_read
  • Which calls into copy_page_to_iter to copy the data to userland

How can we trace these function calls? For this, Systemtap is an excellent and popular tool. As the project page states:

SystemTap provides a simple command line interface and scripting language for writing instrumentation for a live running kernel plus user-space applications.

By the way, if you wonder why we do not just use a debugger, ask Linus.

A very good introduction on how to set Systemtap up and write your first scripts can be found here. Let’s write a Systemtap script that will hook the entry and exit of each one of these functions, and display some arguments and return values:

When running this script in one console and executing the program in another console:

# python -c 'print("A"*0x3000)' | ./test

we get the following output:

>>> read (0, 0x7fdc8af48800, 12288)
>>> vfs_read ()
>>> new_sync_read ()
>>> pipe_read ()
>>> copy_page_to_iter (bytes=0x1000)
<<< copy_page_to_iter = 0x1000
>>> copy_page_to_iter (bytes=0x1000)
<<< copy_page_to_iter = 0x800
<<< pipe_read = 0x1000
<<< new_sync_read = 0x1000
<<< vfs_read = 0x1000
<<< read = 0x1000

Interesting! copy_page_to_iter is called twice. The second time it returns 0x800, as it can only write 0x800 bytes before hitting unmapped memory. Then pipe_read returns 0x1000 (and not 0x1800). Let’s look at the source code of pipe_read :

This eternal forloop calls copy_page_to_iter repeatedly. If the number of bytes written is less than the number of bytes requested to be written, the loop exits. The first time this is not the case, so ret is incremented with chars (i.e. 0x1000). The second time, written (i.e. 0x800) is less thenchars (i.e. 0x1000), so the loop exits. However, ret is not incremented!

This is why 0x1000 is returned, and not 0x1800. However, if we understand this correctly, in reality 0x1800 bytes are written (although read will return 0x1000 to us). Let us check this to confirm we got things right with the following little program:

When running this code:

# python -c 'print("A"*0x3000)' | ./test2
read return value: 0x1000
number of bytes written to memory: 0x181a

Yep, we can confirm! read returns 0x1000, but actually 0x1800 bytes are written to memory.

The attentive reader might be wondering why the strlen function:

  • Doesn’t crash (because it reads further into unmapped memory, you’d think)
  • Returns a little bit more than 0x181a

What happens is that in my “test2” program, the first printf call calls somewhere an mmap. This is the call stack in a debugger at that moment:

(gdb) bt
#0 mmap64 () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007ffff727853c in __GI__IO_file_doallocate (fp=0x7ffff75b52a0 <_IO_2_1_stdout_>) at filedoalloc.c:128
#2 0x00007ffff728518c in __GI__IO_doallocbuf (fp=fp@entry=0x7ffff75b52a0 <_IO_2_1_stdout_>) at genops.c:418
#3 0x00007ffff7284658 in _IO_new_file_overflow (f=0x7ffff75b52a0 <_IO_2_1_stdout_>, ch=-1) at fileops.c:829
#4 0x00007ffff7283951 in _IO_new_file_xsputn (f=0x7ffff75b52a0 <_IO_2_1_stdout_>, data=<optimized out>, n=21) at fileops.c:1324
#5 0x00007ffff725528d in _IO_vfprintf_internal (s=0x7ffff75b52a0 <_IO_2_1_stdout_>, format=0x400888 “read return value: 0x%x\n”, ap=ap@entry=0x7fffffffeb38)
at vfprintf.c:1323
#6 0x00007ffff725fd89 in __printf (format=<optimized out>) at printf.c:33
#7 0x00000000004007b0 in main () at test2.cpp:12

We can also see this in thestrace:

munmap(0x7fb91a31d000, 4096) = 0
read(0, “AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA”…, 12288) = 4096
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), …}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb91a31d000
write(1, “read return value: 0x1000\n”, 26read return value: 0x1000) = 26
write(1, “number of bytes written to memor”…, 42number of bytes written to memory: 0x181a) = 42

This mmap reallocates the previously unmapped memory. This is why strlen does not crash! Why does printf call mmap? This memory is used by printf to construct the filled in format string “read return value ….”, which (after being filled in) is 26 bytes long.

So when calling strlen on p, in memory we have 0x1800 “A”’s followed by 26 bytes, which gives a total of 0x181a bytes.

The file descriptor represents a file

We will speed things up slightly here, because the analysis is very similar to the previous case.

The command we will be studying is:

# python -c 'print("A"*0x3000)' > input.bin; ./test < input.bin
6144

If you do an strace on this command, you will see that:

  • bash opens the file on disk input.bin
  • clones itself for the ./test program
  • duplicates the file descriptor of the open file to STDIN of the cloned process
  • loads the ./test program with an execve

So here, STDIN comes from a file, not a pipe or a socket.

When analysing the kernel sources in a similar way as for the previous case, we see that:

  • It starts with aread system call
  • Which calls into vfs_read
  • Which calls into new_sync_read
  • Which calls into generic_file_read_iter
  • Which calls into do_generic_file_read
  • Which calls into copy_page_to_iter to copy the data to userland

We make a new Systemtap script to trace these functions and run it:

The output is:

>>> read (0, 0x7fef31422800, 12288)
>>> vfs_read ()
>>> new_sync_read ()
>>> generic_file_read_iter ()
>>> do_generic_file_read ()
>>> copy_page_to_iter (bytes=0x1000)
<<< copy_page_to_iter = 0x1000
>>> copy_page_to_iter (bytes=0x1000)
<<< copy_page_to_iter = 0x800
<<< generic_file_read_iter = 0x1800
<<< new_sync_read = 0x1800
<<< vfs_read = 0x1800
<<< read = 0x1800

(There is a caveat here: do_generic_file_read can not be traced on its return because it is inlined. This is a limitation of Systemtap).

So here, generic_file_read_iter (and thus do_generic_file_read ) returns 0x1800. Let’s look at the source code to find out why. This snippet is from the do_generic_file_read function in the for(;;) loop:

Again, we test if the number of bytes copied (i.e. ret ) is smaller than the number requested to be copied (i.e. nr ). If so, we exit the for loop. If not, we continue the loop. Now before testing this, ret is added to written and it is written that gets returned at the end of the function.

So the difference with the pipe case, is that in this case all bytes written are returned. This explains why we had a return value of 0x1800!

The file descriptor represents a socket

Again, we will go fast here because the analysis is very similar.

We start with an strace:

In summary:

  • nc listens on a socket
  • When a client connects, it accepts the connection.
  • The file descriptor of the socket is duplicated to the STDIN and STDOUT of the process.
  • nc replaces itself via an execve with ./test

So we see that in this case, STDIN corresponds to a socket.

Again, we have to find the functions that are called in the kernel, which is a little bit more difficult than in the previous case. This is left as an exercise to the reader. The functions can be found in our new Systemtap script:

The execution of this Systemtap script gives the following output:

So the read call is failing with -EFAULT because it tries to copy 0x2000 bytes to userland (by calling 2 times memcpy_toiovec ) in skb_copy_datagram_iovec.

Although the system call returns -1, 0x1800 bytes have actually been written to the userland buffer!

Conclusion

What we see from userland as a single system call (here: read) is actually implemented in different functions and (very) different ways depending on the underlying file type (file, pipe or socket) the file descriptor is referring to.

We also noticed that read in some cases does not return the actual number of bytes written to user land, which was in all 3 cases 0x1800.

Interestingly, in the case of the pipe, it returned a smaller number than the number of bytes actually written to memory. I think this is something a lot of programmers probably don’t know. If a programmer makes false assumptions based on the return value of the read call, could bad things happen from this?

@j00ru, an inspiration for a new challenge?

Other interesting fact

If you use ncat instead of nc.traditional, you will not get -1 returned, but 4096. This is because ncat does not connect a socket to the STDIN of the program, but a pipe. You can find more on this in the source code of the ncat program.

This means that if you want to play with the antipasto challenge, do not use ncat, but nc.traditional!

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade