Bugs In My Skin

November 28, 2008 12:09 AM CST by psilord in category Codecraft

Someone I know asked me for an example of the kind of bug that made me write the previous post. Here is what one bug of that caliber looked like:

I was debugging why a Fortran program would segfault at address 0x0
the moment it was run. You couldn't even start it in the debugger,
meaning the debugger wouldn't even get the signal that a segfault had
happened, and instead said there wasn't a process to debug.

It sorta looked like this (recreating it on a machine):         

(gdb) r
Starting program: /tmp/a.out

Program terminated with signal SIGSEGV, Segmentation Fault.
The program no longer exists.             
You can't do that without a process to debug.
(gdb)

What a pisser, eh?

An strace showed something like this:

Linux black > strace ./a.out              
execve("./a.out", ["./a.out"], [/* 78 vars */] <unfinished ...>
+++ killed by SIGSEGV @ 0x0 +++

The problem was that Linux kernels (a while ago) had an undocumented
and unlogged internal limit on how big the bss section of a program
could be. The Fortran program had a gigantic global array whose size
exceeded that limit. So when the kernel tried to allocate the pages that
the bss ELF section requested, it sent a segfault to a process that
wasn't even set up in memory yet! By then exec() had gone far enough to
destroy the initial process, so it couldn't return failure. Since this
happened before exec() returned, you couldn't attach a debugger to it
either: the process died before the trace signal could be sent to the
debugged process, and no core would be dropped since there wasn't
anything in memory to constitute a core.
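
For anyone who wants to check a binary for this up front, here is a
minimal sketch of the idea. In the ELF program headers, zero-filled
memory such as the bss shows up as a PT_LOAD segment whose p_memsz is
larger than its p_filesz, so the gap between the two is roughly how much
the kernel has to allocate on top of what is in the file. The function
name zero_fill_sizes is made up for this example, and it only handles
little-endian ELF files.

```python
import struct

PT_LOAD = 1  # loadable segment type, from the ELF specification


def zero_fill_sizes(data):
    """Return p_memsz - p_filesz for every PT_LOAD segment in an ELF image.

    `data` is the raw bytes of the file. A large value in the result means
    the kernel has to set up that many zero-filled bytes (bss and friends).
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF")

    is64 = data[4] == 2  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    if is64:
        (e_phoff,) = struct.unpack_from("<Q", data, 32)
        e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    else:
        (e_phoff,) = struct.unpack_from("<I", data, 28)
        e_phentsize, e_phnum = struct.unpack_from("<HH", data, 42)

    gaps = []
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        (p_type,) = struct.unpack_from("<I", data, off)
        if p_type != PT_LOAD:
            continue
        # p_filesz and p_memsz sit next to each other in both layouts,
        # just at different offsets inside the program header entry.
        if is64:
            p_filesz, p_memsz = struct.unpack_from("<QQ", data, off + 32)
        else:
            p_filesz, p_memsz = struct.unpack_from("<II", data, off + 16)
        gaps.append(p_memsz - p_filesz)
    return gaps
```

Calling zero_fill_sizes(open("a.out", "rb").read()) on a program with a
huge global array should show one segment with a gap about the size of
that array. What limit to compare it against depends on the kernel, so
the sketch only reports the numbers.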

I debugged it by changing the *first* instruction of the executable
to an illegal instruction and noticing that the segfault still happened
instead of the expected SIGILL. This told me the fault was happening
before even the first instruction of the linker/loader got executed,
which meant the kernel was doing it and didn't like something about the
program. From that point on, I kept bisecting the program into smaller
and smaller pieces by removing or stubbing out ELF sections until I found
it was the bss definition that caused it. Then I looked at the bss,
noticed it wanted a 300 Meg chunk of space, and figured it out.
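
The illegal-instruction trick can be sketched roughly like this. It is
not the tool used at the time, just one way to do it, and it makes some
loud assumptions: the binary is little-endian x86 (where the two bytes
0F 0B encode UD2, which always faults), and the function names
entry_file_offset and patch_entry are invented for the example. It maps
the ELF entry point back to a file offset and overwrites the bytes there.

```python
import struct

PT_LOAD = 1
UD2 = b"\x0f\x0b"  # x86 only: an opcode defined to raise invalid-opcode


def entry_file_offset(data):
    """Map the ELF entry point (e_entry) to an offset in the file."""
    if data[:4] != b"\x7fELF" or data[5] != 1:
        raise ValueError("expected a little-endian ELF file")

    is64 = data[4] == 2  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    if is64:
        e_entry, e_phoff = struct.unpack_from("<QQ", data, 24)
        e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    else:
        e_entry, e_phoff = struct.unpack_from("<II", data, 24)
        e_phentsize, e_phnum = struct.unpack_from("<HH", data, 42)

    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        if is64:
            p_type, _flags, p_offset, p_vaddr, _paddr, p_filesz = \
                struct.unpack_from("<IIQQQQ", data, off)
        else:
            p_type, p_offset, p_vaddr, _paddr, p_filesz = \
                struct.unpack_from("<IIIII", data, off)
        # The entry point must fall in the file-backed part of a segment.
        if p_type == PT_LOAD and p_vaddr <= e_entry < p_vaddr + p_filesz:
            return p_offset + (e_entry - p_vaddr)
    raise ValueError("entry point is not inside a file-backed PT_LOAD segment")


def patch_entry(data):
    """Return a copy of the ELF image with UD2 written at the entry point."""
    patched = bytearray(data)
    off = entry_file_offset(data)
    patched[off:off + len(UD2)] = UD2
    return bytes(patched)
```

Write the result of patch_entry() to a copy of the binary, never the
original. If the copy dies with SIGILL, the process made it to user
code; if it still dies with SIGSEGV, the failure is earlier. Note that
for a dynamically linked program the loader runs before e_entry, so a
SIGSEGV there does not by itself rule out the loader.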

Modern kernels have a much higher limit (985 Megs instead of (IIRC)
~100 Megs) and use a different killing signal, so gdb and strace now
tell you the process got a SIGKILL when this happens. Nice.
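
You don't need gdb or strace to see which signal did the killing. Any
parent process can read it from the child's wait status. A small sketch,
assuming a POSIX system, where Python's subprocess module reports a
child killed by signal N as return code -N (the helper names here are
made up for the example):

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short readable string."""
    if returncode < 0:
        # POSIX only: -N means the child was terminated by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def run_and_describe(argv):
    """Run a command and report how it ended."""
    return describe_exit(subprocess.run(argv).returncode)
```

On a binary that hits the limit, run_and_describe(["./a.out"]) would be
expected to report SIGSEGV on the old kernels described above and
SIGKILL on newer ones.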

End of Line.