May 21, 2005 1:44 AM CDT by psilord in category Idiocy

G Plus Plus Minus One

I hate compilers.

I'm responsible for porting Condor to many different flavors and revisions of OS. It is a challenging job in ways that Sisyphus would understand, though I do love it since it hones my technical skills for use in other areas of my life. I spend a lot of time with different revisions of the GNU Compiler Collection and with the system programming APIs of a lot of OSes (especially Linux), and I know a fair amount about how vendor compilers and C preprocessors do their job. The one pervading lesson that I have learned is that people who write compilers probably don't use them.

For example, good old GNU g++ likes to put -lstdc++ (among other things) at the end of the compile line like this (on a Redhat 7.2 x86 box while compiling "Hello World"):

Linux rh7.2 > g++ -v hello.C -o hello

[ snip tangential garbage ]

/usr/lib/gcc-lib/i386-redhat-linux/2.96/collect2 -m elf_i386 \
-dynamic-linker /lib/ld-linux.so.2 \
-o hello \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crt1.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crti.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/crtbegin.o \
-L/usr/lib/gcc-lib/i386-redhat-linux/2.96 \
-L/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../.. \
/tmp/ccnZj0aB.o \
-lstdc++ -lm -lgcc -lc -lgcc \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/crtend.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crtn.o

One might think this is exactly what you want since you shouldn't have to figure out and supply that crap at the end of the line for symbol resolution. And you'd be right, you shouldn't have to figure it out.

But here we teeter on the brink of idiocy. This is pretty much a catastrophic failure when you deal with binary compatibility between, as if it matters, Linux distributions (which I'll assume for the rest of this post). First off, my stupid little hello.C program requires both the gcc and C++ runtimes as shared libraries, which ties the executable to the libraries of one specific compiler revision. (In the rh7.2 example there is no shared gcc runtime, but it shows up later in gcc's evolution.) See:

Linux rh7.2 > ldd ./hello
        libstdc++-libc6.2-2.so.3 => /usr/lib/libstdc++-libc6.2-2.so.3 (0x40033000)
        libm.so.6 => /lib/i686/libm.so.6 (0x40076000)
        libc.so.6 => /lib/i686/libc.so.6 (0x40099000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

The libraries as they stand implement all sorts of goo (to make things like dynamic_cast function) and for the most part have completely opaque implementations to the end user. However, those implementations have functions in them, and functions are the domain of the linker loader during executable runtime. This is where the problem begins to show itself.

It turns out that if you take the above program and run it on a different Linux distribution, say SuSE 8.1, it'll work just fine.

Or does it?

If my little C++ program uses a tiny subset of C++, say no exceptions, run time type information, or STL, it'll probably work just fine. However, suppose I make my C++ program a little more complicated by adding in a correct use of dynamic_cast and recompile it on the rh7.2 box. What happens when I move it to the SuSE box?

Linux SuSE 8.1 > ./hello
./hello: relocation error: ./hello: undefined symbol: __dynamic_cast_2

Uh oh! What happened? The opaque runtime layer blew up. The internal function __dynamic_cast_2 changed between the revision of the stdc++ library the program was linked against and the revision it found at execution time on the other machine, so the dynamic linker loader couldn't resolve it at runtime. That's right, my program could have been happily running for days until it decided to do a dynamic_cast and BAM, it gets shot right between the eyes. This implies that the rest of the program might be subtly producing incorrect information, or not; it is undefined. And I only noticed this after adding a slightly more complex feature of C++ which turned on a mishmash of internal behavior.
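The hello.C source isn't shown in this post, so the following is only a hedged sketch of what a "correct use of dynamic_cast" of this kind might look like. The names Animal, Dog, Cat, and describe are invented for illustration. The point is that the cast is typically not compiled inline: the compiler emits a call to a helper in its C++ runtime support code, which is presumably where a symbol like __dynamic_cast_2 enters the picture.

```cpp
// Hypothetical stand-in for the hello.C discussed above; every name here
// is invented for illustration and is not taken from the original program.
#include <string>

struct Animal {
    virtual ~Animal() {}   // a virtual member makes the type polymorphic,
                           // which dynamic_cast requires
};

struct Dog : Animal {
    std::string speak() const { return "woof"; }
};

struct Cat : Animal {
    std::string speak() const { return "meow"; }
};

// Returns the dog's sound if `a` really points at a Dog, else "not a dog".
// The pointer form of dynamic_cast yields a null pointer when the cast
// fails; the type check itself is done by the C++ runtime at execution time.
std::string describe(const Animal* a) {
    if (const Dog* d = dynamic_cast<const Dog*>(a)) {
        return d->speak();
    }
    return "not a dog";
}
```

Calling describe from main with the address of a Dog and then of a Cat exercises both the successful and the failed cast. Nothing in the source mentions the runtime library at all, which is why the dependency is so easy to pick up without noticing.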

So, how do we fix this to achieve binary compatibility? Four options: 1) remove the dynamic_cast, 2) produce a statically linked executable, 3) statically link in only the gcc and C++ runtime libraries while leaving everything else dynamically linked, or 4) recompile with a new compiler. I definitely know option 2 is stupid, since you can kiss goodbye to NSS lookups beyond 'files'. Option 1 is appealing to me, but due to some strange twist of fate it isn't chosen. Option 4 is out of the question, since it would mean porting not only 400,000 lines of often deeply magical code to a new compiler, but also the 9+ million lines of external third party libraries (like kerberos), on 28+ different architectures. Option 3 becomes the winner, mostly through forfeit of the other options.

So, let's try the obvious:

Linux rh7.2 > g++ -v hello.C -o hello -Wl,-Bstatic -lstdc++

[ snip extraneous junk ]

/usr/lib/gcc-lib/i386-redhat-linux/2.96/collect2 -m elf_i386 \
-dynamic-linker /lib/ld-linux.so.2 \
-o hello \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crt1.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crti.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/crtbegin.o \
-L/usr/lib/gcc-lib/i386-redhat-linux/2.96 \
-L/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../.. \
/tmp/ccUrafey.o \
-Bstatic -lstdc++ -lstdc++ -lm -lgcc -lc -lgcc \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/crtend.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crtn.o

Linux rh7.2 > ldd ./hello
	not a dynamic executable

Oops. What happened? Well, if you look carefully, the stdc++ I added after the -Wl,-Bstatic is present, but so are the compiler-supplied libraries after it. Since -Wl,-Bstatic is a stateful flag, it turns off dynamic linking for everything after it, so not only do I get my requested static linkage of stdc++, I also get unrequested static linkage of libc and libm. Kiss NSS goodbye.
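The "stateful flag" behavior can be sketched as a toy model. This is an illustration of the left-to-right switch semantics only, not how ld is actually implemented, and the LibChoice and classify names are made up for the example.

```cpp
// Toy model only: mimics how -Bstatic / -Bdynamic act as a running switch
// over the rest of the link line. Not taken from ld's real implementation.
#include <cstddef>
#include <string>
#include <vector>

struct LibChoice {
    std::string name;   // library name without the "-l" prefix
    bool is_static;     // mode in force when this -l was reached
};

// Walk the arguments left to right, remembering the current mode, and
// record the mode each -lNAME was seen under. Other arguments are ignored.
std::vector<LibChoice> classify(const std::vector<std::string>& args) {
    std::vector<LibChoice> out;
    bool static_mode = false;                 // dynamic is the default
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& a = args[i];
        if (a == "-Bstatic") {
            static_mode = true;               // stays on until switched back
        } else if (a == "-Bdynamic") {
            static_mode = false;
        } else if (a.size() > 2 && a.compare(0, 2, "-l") == 0) {
            LibChoice c;
            c.name = a.substr(2);
            c.is_static = static_mode;
            out.push_back(c);
        }
    }
    return out;
}
```

Fed the library portion of the link line above (-Bstatic -lstdc++ -lstdc++ -lm -lgcc -lc -lgcc), every entry comes back marked static, including m and c, because nothing ever flips the switch back.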

Ok, what if I get smart and turn dynamic linking back on at the very end of the link line? I would do this with the fool notion in my head that since I'm resolving all dependencies on libstdc++ statically with the object files beforehand, the compiler wouldn't bring in the dynamic version of libstdc++ since it wouldn't be needed. Let's see what happens:

Linux rh7.2 > g++ -v hello.C -o hello -Wl,-Bstatic -lstdc++ -Wl,-Bdynamic

[ snip extraneous junk ]

/usr/lib/gcc-lib/i386-redhat-linux/2.96/collect2 -m elf_i386 \
-dynamic-linker /lib/ld-linux.so.2 \
-o hello \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crt1.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crti.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/crtbegin.o \
-L/usr/lib/gcc-lib/i386-redhat-linux/2.96 \
-L/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../.. \
/tmp/ccUrafey.o \
-Bstatic -lstdc++ -Bdynamic -lstdc++ -lm -lgcc -lc -lgcc \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/crtend.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crtn.o

Linux rh7.2 > ldd ./hello
        libstdc++-libc6.2-2.so.3 => /usr/lib/libstdc++-libc6.2-2.so.3 (0x40033000)
        libm.so.6 => /lib/i686/libm.so.6 (0x40076000)
        libc.so.6 => /lib/i686/libc.so.6 (0x40099000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

Um. WTF! This is totally unexpected, and now I have no idea what is actually in my executable. Do I have two competing versions of libstdc++? How do they interact while running in a binary compatible situation (two different versions of libstdc++ playing in the same process)? This is a catastrophe. This is about 50% of the insidious idiocy of which I speak.

Ok, I've figured out this is terrible, so I figure I need to turn off the automatic inclusion of the compiler-defined libraries. I find an option: -nostdlib. Jeez. I hope you didn't need crt1.o or anything like that, since not only does this option get rid of the appended libstdc++ and friends, it gets rid of everything else supplied by the compiler as well. In short, as far as I can tell there is no method of turning off the stdc++ and gcc runtime inclusion while still keeping enough low level objects (like crtn.o) there to produce an executable.

This leaves two options: 1) Only use gcc to link, or 2) write our own ld script which does the right thing.

Option 1 is laughable from a user's point of view. "You mean to tell me I cannot use g++ to link my objects when I not only compiled all of my software with it, but all of the documentation I have says to do it that way? How do I know I'm supplying the right libraries? Which libraries do I use for which revision of the compiler?"

Option 2 is laughable from a system programmer's point of view. "You mean I have to dig around in how the compiler revisions on 28+ different architectures interact with the (potentially vendor) linker, with an eye to the C++ features currently being used in a codebase constantly modified by 40 people, and ensure I get the options correct? Oh, and it has to be non-fragile in our build system and maintainable by someone who isn't me?"

That damned if you do, damned if you don't is the other 50% of the idiocy. There is no good solution.

It gets even better. Since it was obvious to me that the stdc++ library tried to resolve that __dynamic_cast_2 symbol at runtime, if I manage to link the stdc++ statically through manually specifying the ld link line, what happens when it hits it at runtime? Let's try it:

Call the linker by hand fixing up the static linking of the stdc++ library but leaving dynamic libc and libm:

/usr/lib/gcc-lib/i386-redhat-linux/2.96/collect2 -m elf_i386 \
-dynamic-linker /lib/ld-linux.so.2 \
-o hello \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crt1.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crti.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/crtbegin.o \
-L/usr/lib/gcc-lib/i386-redhat-linux/2.96 \
-L/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../.. \
hello.o \
-Bstatic -lstdc++ -Bdynamic -lm -lgcc -lc -lgcc \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/crtend.o \
/usr/lib/gcc-lib/i386-redhat-linux/2.96/../../../crtn.o

Linux rh7.2 > ldd ./hello
        libm.so.6 => /lib/i686/libm.so.6 (0x40033000)
        libc.so.6 => /lib/i686/libc.so.6 (0x40056000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

That looks promising. The dynamic cast appears to function on the machine it was compiled on when linked in this fashion, which is a tad bit surprising. Let's see what happens when we move it to the SuSE 8.1 machine:

Hmm, it worked on the SuSE 8.1 machine. That's definitely surprising. It sure beats the hell out of me why without serious time investment.

Here is my line of reasoning which makes me not understand why it works: If the dynamic linker was wanting to load the __dynamic_cast_2 function at runtime before, it implies that the function wasn't there in the original link pass to create the executable and so therefore wouldn't be brought into the executable at all--which is why the dynamic linker loader was trying to find it at runtime. I was pretty sure the link pass to create the executable would not bring in the required object files and the program would segfault since there wasn't a fancy linker loader telling it something was wrong. So, why didn't it segfault?

Obviously the problem that started this whole thing was that the C++ ABIs changed radically a few times between gcc 2.96, found on the Redhat 7.2 machine, and gcc 3.2.2, found on the SuSE 8.1 machine. Evolution of the compiler yadda yadda yadda. However, the thing that pisses me off is that the runtime of the language, and other compiler internals, are shared libraries at all. Sure, from the point of view of sharing text when running multiple programs it makes sense, but from a binary compatibility point of view it is a disaster. Why isn't it made easier to package the runtime statically into the binary? Why would I have to hand invoke the linker to do something that any reasonable person would have desired from the beginning?

This is the example of the insidious idiocy. More and more time is being spent to understand how to do something that should be simple or shouldn't have to be done at all.

I'm sure I'll get an itch and figure out the exact mechanism for why it ended up working (I already started poking it) and do another post explaining it in the future. But for now, the post traumatic stress disorder episode has passed and I am resting comfortably. The booze helps. It helps a lot.

I hate compilers.

End of Line.
