repz ret and the empty functions

Introduction

Let's say that one day you decide that you should examine the assembly of some programs. Because hey, why not. You notice that a function is frequently being called, but that it contains only one instruction: "repz retq" (potentially followed by a bunch of nops). This seems slightly odd, so you figure out where it's coming from. Here's an example:

typedef struct {           /* standard complex number declaration for single- */
   double real;            /* precision complex numbers                       */
   double imag;
} complex;


complex cmplx( double x, double y )  {
    complex c;
    c.real = x; c.imag = y;
    return(c);
}

If you compile this with gcc and optimizations enabled (at least -O1), then run objdump on the binary, you see:

file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <cmplx>:
   0:	f3 c3			repz retq 

How can this be?

repz ret

So, repz is a prefix that repeats the following instruction until the count register (RCX) reaches 0 or the zero flag is cleared. Also, it is only defined for string instructions; on anything else the behavior is undefined. So what on earth is gcc doing generating a repz retq?

Luckily, some poor soul had already run into this problem and has an explanation at their website, aptly named repzret.org. It has to do with the AMD K8, collisions in a branch predictor, and decoding "nop; ret" being more expensive than "repz ret". So that's fascinating, and explains why the function contains the repz. However, it doesn't explain why gcc is generating an empty function. What happened to the code in it?

Function calling conventions for x86_64

This actually confused me for a while -- the function cmplx was getting called all over the place in the program, and it seemed like really bad behavior to have large numbers of call cmplx / repz retq pairs all over the place. (In fact, the single most common cause for a data cache access in the program was this function call. By FAR. Although at least they were all hits.) At first I thought there was some interprocedural optimization going on, but then why not eliminate the call entirely? And how could that even work when I was using separate compilation?

The answer, of course, is that the function doesn't actually do much, even as written. It constructs a struct and then returns it -- by value. And how do function arguments and return values get passed between caller and callee? Well, actually, that turns out to not be simple. There's an excellent reference at this site about calling conventions, which includes excellent adjectives for the x86_64 Unix ABI calling convention, such as "insanely overcomplex" and "byzantine". If you want to pass arguments, you put them on the call stack by value, or perhaps by address, or in the registers RDI, RSI, RDX, RCX, R8, and R9 -- in that order, or maybe in XMM(0)-XMM(7), or else in YMM(0)-YMM(X). Also, combinations of these. In our case, the arguments are doubles, and are passed in xmm0 and xmm1.

The rules for return values are also "insanely overcomplex", but similar. We want to return a struct (note that this is not a pointer to a struct, but the struct itself). And apparently structs are classified by what fields they contain, in this case, two doubles. And so, we pass the return values in.... xmm0 and xmm1. Hmm! And that's where they are, when you call the function! So all it has to do is return.
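To make the register argument concrete, here is a small sketch that puts the original function next to a variant with the parameters swapped. The names complex_t and cmplx_swapped are mine, not from the original program (I renamed the struct only to stay clear of <complex.h>), and the comments about generated code describe what I would expect from gcc at -O1 on x86-64 Linux, not something guaranteed on every compiler or target.

```c
#include <assert.h>

/* Hypothetical name; the original code calls this type "complex". */
typedef struct {
    double real;
    double imag;
} complex_t;

/* Two doubles and nothing else: the struct is exactly two eightbytes,
 * which is what lets the x86-64 Unix ABI return it in registers. */
static_assert(sizeof(complex_t) == 2 * sizeof(double),
              "complex_t should be exactly two doubles");

/* x arrives in xmm0 and y in xmm1; c.real is returned in xmm0 and
 * c.imag in xmm1. Nothing has to move, so the optimized body is
 * expected to be just a return. */
complex_t cmplx(double x, double y) {
    complex_t c;
    c.real = x;
    c.imag = y;
    return c;
}

/* Same result, but the parameters are declared in the other order.
 * Now y arrives in xmm0 and x in xmm1, so the compiler has to
 * exchange the two registers before returning. */
complex_t cmplx_swapped(double y, double x) {
    complex_t c;
    c.real = x;
    c.imag = y;
    return c;
}
```

Compiling both with optimizations and comparing the objdump output should show the difference: the first function stays empty apart from the return, the second picks up a few register moves.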

Conclusion

I'm not quite sure what the correct response here is -- dcache hits are cheap, but it still seems pretty silly to call this function. (Although I would like to note that the behavior is much better than what you would get if you switch the order of arguments x and y....). Could you use a macro instead? This case probably wouldn't happen with a C++ constructor, since those are in the header and thus visible to the compiler. I mean, from the perspective of clear, well-written code what they have may be correct; however, almost 20% of dcache accesses were due to this call!
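For what it's worth, here is roughly what the macro idea could look like, along with a static inline function as a second option. This is only a sketch: complex_t, cmplx_inline and CMPLX_MAKE are hypothetical names I made up (I avoided CMPLX because C11's <complex.h> already defines a macro by that name), and it assumes the definitions live in a header that every caller includes, so the compiler can see the body and skip the call. I have not measured whether it changes the dcache numbers.

```c
/* Hypothetical header contents; the names are not from the original program. */
typedef struct {
    double real;
    double imag;
} complex_t;

/* Option 1: a static inline function. The body is visible in every
 * translation unit that includes the header, so an optimizing compiler
 * is free to inline it instead of emitting a call. */
static inline complex_t cmplx_inline(double x, double y) {
    complex_t c = { x, y };
    return c;
}

/* Option 2: a macro built on a C99 compound literal. There is no
 * function at all, so there is nothing to call. */
#define CMPLX_MAKE(x, y) ((complex_t){ (x), (y) })
```

Either one is used just like the original, for example complex_t z = cmplx_inline(1.0, 2.0); or complex_t z = CMPLX_MAKE(1.0, 2.0);. The inline function keeps type checking on the arguments, which is the main reason I would reach for it before the macro.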

To be honest, I don't care too much, outside of idle curiosity, but since this issue confused me for an hour or so, and I figured it out thanks to other people's pages, I figured I'd put this out there for anyone googling for why gcc is generating empty functions (or functions that only contain what looks like a nonsensical instruction).