What happens when you don’t return from a function
I’ve been writing a hobby operating system for a while now, and I’m at the point where I can type to the screen. However, there was this intermittent bug where random cells on the screen would turn BLUE!
How could that possibly happen!? And different cells would turn blue depending on how I changed seemingly unrelated parts of my code.
I did a good-old-fashioned binary search, commenting out chunks of my very small codebase until I found the function call that caused it to happen:
The implementation of this function is in assembly:
It literally issues a single instruction, which does exactly what the name describes. What could be wrong here? Well, there’s no `ret` instruction. This function does not return to its caller. The way function calls work in x86 assembly is like so:
- The calling function issues the `call` instruction
- This pushes the return address onto the stack
- The eip register (instruction pointer) is updated to the first address of the callee
- The callee does its work.
- The callee issues the `ret` instruction
- This pops the return address off the stack and into the eip register
- eip now contains the return address, so the next instruction to be executed will be the one right after the `call` in the caller
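The steps above can be modelled in a few lines of C. This is only a toy sketch, not real x86 state: the `Cpu` struct, `sim_call`, and `sim_ret` are names I made up for illustration, and the simulated stack grows upward for simplicity, whereas the real x86 stack grows downward.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Toy model of the state that matters for call/ret (hypothetical, not real x86). */
typedef struct {
    uint32_t eip;                 /* address of the instruction being executed */
    uint32_t stack[STACK_WORDS];  /* simulated stack memory */
    int      sp;                  /* number of words currently pushed */
} Cpu;

/* call: push the address of the instruction after the call, then jump.
 * call_len is the encoded length of the call instruction in bytes. */
static void sim_call(Cpu *cpu, uint32_t target, uint32_t call_len)
{
    assert(cpu->sp < STACK_WORDS);
    cpu->stack[cpu->sp++] = cpu->eip + call_len; /* return address */
    cpu->eip = target;
}

/* ret: pop the return address off the stack and into eip. */
static void sim_ret(Cpu *cpu)
{
    assert(cpu->sp > 0);
    cpu->eip = cpu->stack[--cpu->sp];
}
```

For example, starting with `eip` at a made-up call site `0x1013f0` and calling `sim_call(&cpu, 0x100ab8, 5)` (a near `call` with a 32-bit offset is 5 bytes long) leaves `0x1013f5` on the stack, and `sim_ret` puts that value back into `eip`.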
Without the `ret` instruction, nothing loads the return address back into eip, so eip simply advances past the instruction that just executed (as per usual). The next instruction to execute is whatever happens to be next in memory, which was determined when I linked my program.
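To make the fall-through concrete, here is a small sketch that treats memory as a flat array of instructions and keeps executing until it hits a `ret`. The `Insn` type and `run_until_ret` are hypothetical names, and the layout is a simplification (real instructions vary in length and there may be padding between functions).

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_WORK, OP_RET } Op;

/* One simulated instruction, tagged with the function it belongs to. */
typedef struct {
    const char *owner;
    Op          op;
} Insn;

/* Execute from index `start`, recording the owner of every instruction run,
 * and stop after the first ret (or at the end of memory / trace capacity).
 * Returns how many instructions were executed. */
static size_t run_until_ret(const Insn *mem, size_t len, size_t start,
                            const char **trace, size_t trace_cap)
{
    size_t n = 0;
    assert(mem != NULL && trace != NULL);
    for (size_t ip = start; ip < len && n < trace_cap; ip++) {
        trace[n++] = mem[ip].owner;
        if (mem[ip].op == OP_RET)
            break; /* only a ret hands control back to the caller */
    }
    return n;
}
```

If the first function in `mem` has no `OP_RET`, the trace runs straight on into whichever function the linker placed after it, and only that function's `ret` ends the run.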
To see what instruction was next, I used `greadelf` (that’s what it’s called on OS X; on Linux it’s just called `readelf`). It’s a program that lets you examine ELF executable files, which my kernel happens to be. I ran:
greadelf kernel.elf --symbols
Which outputted a lot of lines like:
303: 001058a8 2 OBJECT GLOBAL DEFAULT 4 cursor_pos
304: 00100620 0 NOTYPE GLOBAL DEFAULT 1 interrupt_handler_160
305: 00100a34 0 NOTYPE GLOBAL DEFAULT 1 interrupt_handler_249
306: 00100974 0 NOTYPE GLOBAL DEFAULT 1 interrupt_handler_231
307: 001013d0 157 FUNC GLOBAL DEFAULT 1 kmain
308: 00100794 0 NOTYPE GLOBAL DEFAULT 1 interrupt_handler_191
309: 0010046a 0 NOTYPE GLOBAL DEFAULT 1 interrupt_handler_122
This is the symbol table. Symbols include functions (like kmain and interrupt_handler_160), variables (like cursor_pos), and a couple other things. Each line lists the memory address of the symbol. I sorted the symbols by memory address, and saw this:
00100ab2 0 NOTYPE GLOBAL DEFAULT 1 interrupt
00100ab8 0 NOTYPE GLOBAL DEFAULT 1 enable_hardware_interrupt
00100ac0 101 FUNC GLOBAL DEFAULT 1 fb_write_cell
00100b30 93 FUNC GLOBAL DEFAULT 1 clear_screen
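The "sort by address, then look at the neighbour" step can also be done programmatically. Below is a rough sketch using `qsort` from the C standard library. The `Symbol` struct and `next_symbol` are names invented for this example, and it assumes the symbol table has already been parsed into an array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Minimal view of one symbol-table entry (hypothetical type). */
typedef struct {
    uint32_t    addr;
    const char *name;
} Symbol;

/* qsort comparator: ascending by address, without overflow-prone subtraction. */
static int by_addr(const void *a, const void *b)
{
    uint32_t x = ((const Symbol *)a)->addr;
    uint32_t y = ((const Symbol *)b)->addr;
    return (x > y) - (x < y);
}

/* Sort syms in place by address and return the symbol placed directly after
 * `name`, or NULL if `name` is missing or is the last symbol. */
static const Symbol *next_symbol(Symbol *syms, size_t n, const char *name)
{
    assert(syms != NULL && name != NULL);
    qsort(syms, n, sizeof syms[0], by_addr);
    for (size_t i = 0; i + 1 < n; i++) {
        if (strcmp(syms[i].name, name) == 0)
            return &syms[i + 1];
    }
    return NULL;
}
```

Fed the addresses from the listing above, asking for the symbol after `enable_hardware_interrupt` gives back `fb_write_cell` at `0x00100ac0`.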
The next thing in memory after `enable_hardware_interrupt` is `fb_write_cell`. This is a function I wrote to write text and colors to the screen. The signature of the function looks like this:
void fb_write_cell(unsigned int cell, char c, unsigned char fg, unsigned char bg)
`cell` is the index of the cell (the screen is 80 columns by 25 rows). `c` is the ASCII character to display in that cell. `fg` and `bg` are the foreground and background colors of the cell.
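For a sense of what a function with this signature might do, here is a minimal sketch. It is not my actual implementation: it assumes the standard VGA text-mode layout (two bytes per cell, character first, then an attribute byte with the background in the high nibble and the foreground in the low nibble), and it writes to an ordinary array called `fb` instead of real video memory so it can run anywhere. The name `fb_write_cell_sketch` is made up to keep it distinct from the real function.

```c
#include <assert.h>
#include <stdint.h>

#define FB_COLS  80
#define FB_ROWS  25
#define FB_CELLS (FB_COLS * FB_ROWS)

/* Stand-in for video memory (assumption). In VGA text mode the real
 * framebuffer is memory mapped at physical address 0xB8000. */
static uint8_t fb[FB_CELLS * 2];

/* Write character c with the given colors into cell number `cell`. */
static void fb_write_cell_sketch(unsigned int cell, char c,
                                 unsigned char fg, unsigned char bg)
{
    assert(cell < FB_CELLS);                 /* the sketch refuses bad indices */
    fb[cell * 2]     = (uint8_t)c;           /* character byte */
    fb[cell * 2 + 1] = (uint8_t)(((bg & 0x0F) << 4) | (fg & 0x0F)); /* attribute */
}
```

In the standard VGA palette, color 1 is blue, so under these assumptions any call that ends up with a `bg` of 1 paints a blue cell, whatever the other arguments are.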
When a function executes, it assumes the stack is laid out in a certain order. `fb_write_cell` assumes the stack will look like this:
There is a return address where it expects one, so when `fb_write_cell` returns, it goes back to the caller of `enable_hardware_interrupt`. However, it also looks for its arguments in the stack slots next to that return address, where a caller would normally have pushed them. Since `enable_hardware_interrupt` doesn’t take any arguments, that memory probably holds local variables from its caller. So `fb_write_cell` writes to the screen using garbage parameters, then returns.
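Here is a sketch of how the callee ends up with garbage. It assumes the usual 32-bit cdecl convention, where on entry the return address sits at `[esp]` and each argument occupies one 32-bit slot at `[esp+4]`, `[esp+8]`, and so on. The `CellArgs` struct and `read_args_at_entry` are hypothetical names used only to show the interpretation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four parameters fb_write_cell expects to find on the stack. */
typedef struct {
    unsigned int  cell;
    char          c;
    unsigned char fg;
    unsigned char bg;
} CellArgs;

/* Interpret the words at and above esp the way a cdecl callee would:
 * esp[0] is the return address, esp[1..4] are the argument slots.
 * Narrow arguments use only the low byte of their 32-bit slot. */
static CellArgs read_args_at_entry(const uint32_t *esp, size_t words_available)
{
    CellArgs a;
    assert(esp != NULL && words_available >= 5);
    a.cell = esp[1];
    a.c    = (char)(esp[2] & 0xFF);
    a.fg   = (unsigned char)(esp[3] & 0xFF);
    a.bg   = (unsigned char)(esp[4] & 0xFF);
    return a;
}
```

If nobody pushed real arguments, `esp[1]` through `esp[4]` are just whatever the caller happened to keep there, such as its local variables. Change the caller's code and those slots hold different values, which matches different cells lighting up after unrelated edits.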
I was confused about how I could see such strange side effects, and why the results varied so much whenever I changed my code. The likely explanation is that the stack was laid out differently each time, so the garbage parameters passed to `fb_write_cell` were different. For example, if I commented out two lines in kmain, I got this:
I was also confused about how my program could continue without crashing if I forgot the `ret` instruction. But if the next piece of memory is also a function, and nothing extra has been pushed onto the stack, it will use the saved return address as though it were its own!