ACM Intermediate UNIX Tutorial - Compilers

What is a compiler, really?

I already know that!

Well, everyone knows what it is basically - it turns your source code into executables. However, there's slightly more to it than that, which I'll now go over. If you've ever read pages 1-3 of any book on C, you can probably skip over the following section.

So, what else?

A compiler's job is really to transform your code from the human-readable form (source code) to a form which means something from the computer's standpoint. This doesn't necessarily mean "executable". To be specific, your C compiler is actually supposed to turn C source code into object files. An object file is an intermediate form, where all the code has been understood and processed by the compiler, but it's not ready to run yet because it's only the code from one file. If your program is made up of multiple files, each file gets built into an object file, and there are calls between the object files that have not been resolved yet.

Hence, the object files still need to be linked together before all the symbols make sense. For example:

Suppose you have two files, main.c and spoo.c (plus a header for spoo), where there's a function in spoo.c called, oh, shall we say, spoo() which is called from the main.c file. Well, when you go to compile main.c and spoo.c into their respective object files (main.o and spoo.o) the compiler isn't going to look at the other source file, so when it builds main.o, it doesn't know what to make of the spoo function. It knows the spoo function exists, because the header declared "Hey, there's this function void spoo(void);" but it doesn't know what spoo actually does, just that there is a function called spoo. So, it basically puts in a placeholder for spoo. E.g., it puts "I'm calling void spoo(void) here, but I don't know what it is" in main.o when it compiles it.

So, how does one actually get a program that runs? The linker to the rescue! The linker is a program that, given a bunch of object files, resolves all the unresolved symbols (those placeholders) in them and produces your actual executable.

What's up with g++ actually producing executables then? Well, g++ is technically a group of programs, one of which is the compiler, one of which is the linker, etc. When you invoke it on a source file, it typically assumes you want an executable, so it calls the linker for you. Nice of it, no?

Ok, obfuscation master, what else are we supposed to know, HMM?

Ok, the C/C++ compiler actually works in a series of stages, more or less like this:
  1. The pre-processor - strips out comments, expands #define macros and #include lines, etc.
  2. The compiler - parses your source code and builds assembly output from it
  3. The assembler - takes that assembly code and builds an object file out of it
  4. The linker - takes some of those object files and such, and builds an executable out of them
  5. The loader - not generally part of the compiler suite, but part of the OS. Takes your executable and actually tries to load and run it. This may be more complicated than it sounds - for instance, your program might still have unresolved symbols in it after the linker goes over it if you used a shared library (e.g. .so on UNIX or .dll on Windows), which is supposed to be linked in at run time.

Now, you might be wondering, why would you ever want to build just an object file? Well, there are a lot of reasons. The first one is, linking together a bunch of objects is fast, but compiling them is slow! So, if you had, say, 100 source files, you could compile them all, and then link them. Then, if you changed one of them, you could recompile only that one, and then relink them all quickly without having to rebuild the other 99 of them. Much easier to program that way than waiting an hour to rebuild everything each time!

Another reason, which isn't really as good, is that you could do something like write some special, super-secret copyrighted code, build it to an object file, and then sell that object to people, along with the header file. They'd never be able to see your original source code, since the only thing there is some post-compilation machine code. But, they could still use your code, by calling the functions you put in the header file. This is basically how most proprietary, secret software toolkits get distributed (in essence - they might build the objects into a shared library, but it's the same idea). Of course, it might be kind of hard to debug that sort of thing...

All that is really complex - Microsoft Visual C++ doesn't do that!

Yes it does.

Ok then, how does gcc/g++ work?

Ok. Basically, you invoke the compiler by typing "gcc" or "g++", and make it do those special compiler things (or specifically, restrict it to only do part of it) by giving gcc or g++ an argument.

A quick, but important note: You have to use "gcc" to build C programs and "g++" to build C++ programs. The whole suite is called GCC, and there's even a slightly misleading note in some documentation saying that "gcc" should be able to build C++ programs by the file extension. It will compile a .cc file as C++, but it won't link in the C++ standard library for you, so you need to use g++. Trust me on this.

Ok, [drumroll!] the command line arguments to gcc/g++ which you're probably most interested in are:

gcc / g++
-g
include debugging symbols - always use this or your debugger doesn't work!
-Wall
Warn all - always use this too, to make the compiler really picky (and thus more helpful)
-c
only Compile, don't link. If you just want the object file (see makefiles), use this. You don't need to give a -o or anything if you use this, since it just takes your file.c and turns it into a file.o
-o <file>
Output file - this tells it to use <file> as the output. If you don't give a -o, it uses "a.out" as the default for some reason.

You probably won't need the following options, but on the off chance you do, I've included them.

-I <directory>
Include path - this tells it where to find headers (.h files). Useful if you need to use some weird library, or would just like to be able to use the #include <file.h> syntax instead of #include "file.h" syntax for your very own header files.
-L <directory>
Library path - this tells it where to find the libraries you link against (usually together with a -l<name> option naming the library itself). You may need to make use of this if you use some package of software like ImageMagick with your program.
-O or -O2
Optimize or Optimize level 2. The compiler will try to make your program run faster, sometimes a lot faster. You don't want to use this when you're debugging etc., because it may make debugging effectively impossible due to modifying how your program works underneath!
-pg
Profile Generate - combined with a tool called gprof you can use this to find out how much time each part of your program is using, thus letting you figure out which part is slow and needs to be improved. It's unlikely you'll ever need this, but it's possible!
-ansi
Become super-picky, and complain if it thinks your code doesn't meet the ANSI C/C++ specs.
-pedantic
Become way too picky, to the point that it may reject perfectly correct code
-S
Only generate aSsembly code - the compiler doesn't generate a .o file, but instead actually spits out sort of human readable assembly code. Instead of making a .o file, it makes a .s file. This can be quite helpful if you were to, for instance, use the MIPS version of gcc - you could see how gcc generates some MIPS assembly code from your C program, and compare it with what you might write in 378.
-E
Only prE-process - instead of compiling, it just outputs the program after the pre-processor has run. Can be useful for debugging macros and #includes, etc.

Ok, that was probably more than you needed, or wanted, to know. Now on to...

Some examples

Compiling some source files into an executable, "program"

gcc -g -Wall -o program file1.c file2.c file3.c

or perhaps

g++ -g -Wall -o program *.cc

Compiling to an object file, but not linking

gcc -g -Wall -c file1.c file2.c file3.c

or perhaps

g++ -g -Wall -c *.cc

Linking those object files together

gcc -g -Wall -o program file1.o file2.o file3.o

or perhaps

g++ -g -Wall -o program *.o

Building a fully optimized, Pentium-specific, seriously fast program (because your code is slow and you hope the compiler can fix it)

gcc -O2 -mpentium -o program file1.c file2.c file3.c
strip program

(Ok, I threw a -mpentium flag in there, which I never talked about. There are lots of flags like that in the man page. Plus, the strip was unnecessary since there was no -g and stuff :-)

Just getting the x86 assembly output for the following program:

$ cat helloworld.c
#include <stdio.h>

int main(void) {
        printf("Hello, world!\n");
        return 0;
}

$ gcc -S helloworld.c

$ cat helloworld.s
        .file   "helloworld.c"
        .version        "01.01"
.section        .rodata
.LC0:
        .string "Hello, world!\n"
        .align 4
.globl main
        .type    main,@function
main:
        pushl %ebp
        movl %esp,%ebp
        pushl $.LC0
        call printf
        addl $4,%esp
        xorl %eax,%eax
        jmp .L1
        .align 4
.L1:
.Lfe1:
        .size    main,.Lfe1-main
        .ident  "GCC: (GNU)"

Cool, huh?

Justin Husted
Last modified: Tue May 16 14:13:15 PDT 2000