November 28, 2008 12:09 AM CST by psilord in category Codecraft
Bugs In My Skin
Someone I know asked me for an example of the kind of bug that made me write the previous post. Here is what one bug of that caliber looked like:
I was debugging why a Fortran program would segmentation fault at address 0x0 as soon as it was run. You couldn't even start it in the debugger, meaning the debugger never got the signal that a segfault had happened and instead said there wasn't a process to debug. It sorta looked like this (recreating it on a machine):
    (gdb) r
    Starting program: /tmp/a.out
    Program terminated with signal SIGSEGV, Segmentation Fault.
    The program no longer exists.
    You can't do that without a process to debug.
    (gdb)
What a pisser, eh? An strace showed something like this:
    Linux black > strace ./a.out
    execve("./a.out", ["./a.out"], [/* 78 vars */] <unfinished ...>
    +++ killed by SIGSEGV @ 0x0 +++
The problem was that Linux kernels (a while ago) had an undocumented and unlogged internal limit on how big the bss section of a program could be. By the time the kernel hit that limit, exec() had already gone far enough to destroy the initial process image, so it couldn't return failure to the caller. The Fortran program had a gigantic global array whose size exceeded the internal kernel limit. So when the kernel tried to allocate the pages that the bss ELF section requested, the kernel sent a segfault to the process, which wasn't even set up in memory yet!
Since this happened before exec() returned, you couldn't even attach a debugger to it: the process died before the trace signal could be sent to the debugged process, and no core would be dropped since there wasn't anything in memory to constitute a core.
I debugged it by modifying the *first* instruction of the executable to be an illegal instruction and noticed that the segfault still happened instead of the expected SIGILL. This told me that the fault was happening even before the first instruction of the linker loader got executed, which meant the kernel was doing it and didn't like something about the program.
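For the curious, the illegal-instruction trick can be sketched roughly like this. This is not the tool I used back then, just a minimal Python illustration under some loud assumptions: the file is a little-endian ELF64 x86/x86-64 executable, the ELF entry point really is the first user-space instruction to run (true for a static binary; for a dynamic one the loader runs first), and the helper names `entry_file_offset` and `patch_entry` are made up for this example. It overwrites the first two bytes at the entry point with `ud2`, the x86 opcode reserved to raise an invalid-opcode fault.

```python
import struct

# ELF64 little-endian layouts (field order follows the ELF specification).
ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")   # file header, 64 bytes
PROG_HEADER = struct.Struct("<IIQQQQQQ")          # program header, 56 bytes
PT_LOAD = 1
UD2 = b"\x0f\x0b"  # x86 "ud2": architecturally guaranteed illegal instruction


def entry_file_offset(image):
    """Return the file offset that holds the instruction at the ELF entry point."""
    fields = ELF_HEADER.unpack_from(image, 0)
    ident, entry, phoff = fields[0], fields[4], fields[5]
    phentsize, phnum = fields[9], fields[10]
    # Assumption: only 64-bit (class 2), little-endian (data 1) images are handled.
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    for i in range(phnum):
        p_type, _, p_offset, p_vaddr, _, p_filesz, _, _ = PROG_HEADER.unpack_from(
            image, phoff + i * phentsize
        )
        # The entry address must fall inside the file-backed part of a PT_LOAD segment.
        if p_type == PT_LOAD and p_vaddr <= entry < p_vaddr + p_filesz:
            return p_offset + (entry - p_vaddr)
    raise ValueError("entry point is not inside a loadable segment")


def patch_entry(image):
    """Return a copy of the image whose entry point starts with ud2."""
    patched = bytearray(image)
    off = entry_file_offset(image)
    patched[off:off + len(UD2)] = UD2
    return bytes(patched)
```

To use it, read the executable with `open(path, "rb").read()`, write the result of `patch_entry` to a new file, mark it executable, and run it. If the process reaches user space you should see SIGILL; if you still see the original fault, the kernel killed it first.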
From that point on, I kept bisecting the program into smaller and smaller pieces by removing or stubbing out ELF sections until I found that the bss definition was what caused it. Then I looked at the bss, noticed it wanted a 300 Meg chunk of space, and figured it out. Modern kernels fixed this: the limit is much higher (985 Megs instead of, IIRC, ~100 Megs) and the killing signal changed, so gdb and strace now tell you the process got a SIGKILL when it happens. Nice.
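If you want to check how much space the bss of a binary is asking for without bisecting anything, something like the following sketch would do. Again this is only an illustration with assumptions stated up front: it handles little-endian ELF64 files only, it treats every section of type SHT_NOBITS (the type used for .bss) as zero-fill space, and the function name `nobits_sections` is invented for this example. Tools such as `readelf -S` or `size` report the same information.

```python
import struct

# ELF64 little-endian layouts (field order follows the ELF specification).
ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")   # file header, 64 bytes
SECTION_HEADER = struct.Struct("<IIQQQQIIQQ")     # section header, 64 bytes
SHT_NOBITS = 8  # section occupies memory but no file space (e.g. .bss)


def nobits_sections(image):
    """Return [(section_name, size_in_bytes)] for every SHT_NOBITS section."""
    fields = ELF_HEADER.unpack_from(image, 0)
    ident, shoff = fields[0], fields[6]
    shentsize, shnum, shstrndx = fields[11], fields[12], fields[13]
    # Assumption: only 64-bit (class 2), little-endian (data 1) images are handled.
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    if shnum == 0:
        return []  # section headers stripped; nothing to report
    headers = [
        SECTION_HEADER.unpack_from(image, shoff + i * shentsize)
        for i in range(shnum)
    ]
    # Section names live in the string table section indexed by e_shstrndx.
    strtab_offset = headers[shstrndx][4]
    found = []
    for sh_name, sh_type, _, _, _, sh_size, _, _, _, _ in headers:
        if sh_type == SHT_NOBITS:
            start = strtab_offset + sh_name
            end = image.index(b"\x00", start)  # names are NUL-terminated
            found.append((image[start:end].decode("ascii"), sh_size))
    return found
```

Calling `nobits_sections(open("a.out", "rb").read())` on a binary like mine would have shown a .bss entry of roughly 300 Megs, which is the number that gave the game away.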
End of Line.