A Survey of Garbage Collection Techniques
ECE752 Advanced Computer Architecture
Prof J.E. Smith
This paper starts with background on memory allocation and deallocation through computing history, to show why Garbage Collection (GC) is imperative in today’s computing world. It then discusses some of the benefits and costs associated with GC: time, space, and locality tradeoffs.
This paper surveys a number of classical GC algorithms: Reference Counting, Mark and Sweep, Copying, Incremental, and Generational GC. It categorizes the basic classes of collectors and analyzes each approach’s pros and cons in comparison with the other algorithms. It also examines some implementation techniques and critical points, and describes some optimizations that have been devised for them. Emphasis is placed upon the more promising Generational GC, since it has been widely used.
Even though memory sizes have grown exponentially over the years, memory is not an inexhaustible resource and thus requires conservation. In the early days of computing, storage was allocated statically, which is certainly faster than dynamic allocation since no stack-frame-like data structures need to be maintained during program execution. But the obvious downside is that data structure sizes have to be known at compile time and there is no run-time allocation. This was far from satisfying for software developers. In 1958, Algol-58 came to life as one of the first block-structured languages. Block-structured languages do away with some of the limitations of static allocation by allocating storage, or frames, on a stack. A frame is pushed onto the system stack when a procedure is called, and popped when the procedure terminates. Stack allocation gives the programmer more flexibility since each calling instance has its own copy of the local variables. For example, it enables recursive calls. But the Last-In-First-Out (LIFO) characteristic of the stack means that a frame can live no longer than its caller does. And since each stack frame is of fixed size, returned objects have to be of sizes known at compile time. Heap allocation gives the programmer more freedom, for it allows data structures to be created and deallocated in any order: dynamic data can outlive its creator. In heap allocation, data structures can grow or shrink dynamically. This gives the programmer much more flexibility, and space can be assigned more efficiently in response to user requests.
The convenience of heap allocation comes at the cost of deallocation management. In languages like C and Pascal, the burden of deallocation is put on the programmer. But this is often too error prone, as most of us have experienced the trouble of dangling pointers. And in some cases, for example when a dynamically allocated object has been passed on to other functions, it is impossible for the programmer or the compiler to predict when the object will no longer be needed, for it could survive longer than the caller. Today, as the trend toward object-oriented programming continues, languages that are dynamic in nature have a large number of data objects that are created, used, and then die throughout the execution of a program. This necessitates a system-wide memory deallocator, the garbage collector. An efficient collector frees the programmer from the burden of deallocation, eliminates deallocation errors, and can even improve system performance.
Again, benefits come at a cost. Garbage collection (GC) has its overhead. As implementation techniques have improved, the overhead has been reduced substantially; an overhead of 10 percent of overall execution time is reasonable for a well-implemented system [Wilson, 1994]. Some GC algorithms also need a large amount of memory to run (e.g. the copying collector described later). GC may have detrimental effects on performance in some cases, because the placement of objects in memory can occur in a rather non-systematic, unpredictable way. In contrast, most computers depend on data placement in memory for good performance. In particular, high performance computers use memory hierarchies which perform better when data (objects) accessed close together in time are also placed close together in memory; this is a form of data "locality" that is exploited by cache memories and paged virtual memory systems.
Summary of work in the area
A number of GC algorithms have been proposed in the literature. A classical algorithm is "Reference Counting" [Collins, 1960], which associates with each allocated block of memory a counter of the number of pointers pointing to that block. Each additional pointer referencing the block increments the counter; each pointer that stops referencing it decrements the counter. The block is reclaimed as free when its count drops to zero.
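The counting rule above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names Block, add_ref, drop_ref, store and free_list are hypothetical, and a Python list stands in for the free memory pool.

```python
class Block:
    """A heap block that carries its own reference count (hypothetical layout)."""
    def __init__(self, name):
        self.name = name
        self.count = 0      # number of pointers currently referring to this block
        self.fields = []    # outgoing pointers held by this block

free_list = []              # stand-in for the free memory pool


def add_ref(block):
    """A new pointer now refers to block."""
    block.count += 1


def drop_ref(block):
    """A pointer to block has been removed; reclaim the block if it was the last."""
    block.count -= 1
    if block.count == 0:
        # Freeing a block also withdraws every pointer it held.
        for child in block.fields:
            drop_ref(child)
        block.fields = []
        free_list.append(block)


def store(parent, child):
    """parent gains a pointer to child."""
    add_ref(child)
    parent.fields.append(child)
```

Note that reclamation cascades: dropping the last pointer to a block releases whatever that block alone kept alive, which is where the immediacy of the scheme comes from.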
The other class of GC algorithms is the tracing schemes. Instead of maintaining at all times the state of each piece of memory as free or referenced, a tracing collector periodically scans through a large portion of the memory space to determine these states, and then processes the "garbage", that is, reclaims the free memory.
One example of such schemes is the "Mark and Sweep" algorithm [McCarthy, 1960]. The algorithm marks all the live objects in the heap, and then "sweeps" all the unmarked "garbage" back into the free memory pool.
An outgrowth of the tracing schemes is the class of moving collectors. This class of algorithms involves moving data to different areas of memory depending on their liveness. A classical example is the Copy algorithm. It divides the heap into two subspaces, one labeled "FromSpace" and the other "ToSpace". After tracing through all the objects in one half of the memory space, it moves all the live objects from the "FromSpace" to the "ToSpace", and declares the vacated half recycled garbage, i.e. the free memory pool. After a period of time, it performs memory reclamation on the other half and reverses the roles of the two half-spaces.
Each of these algorithms has its strengths and shortcomings. As a result, new and improved algorithms have kept appearing.
One optimization to the tracing and moving collectors is the incremental version. The incremental schemes solve the problem of the active process being halted for a long time while the GC routine is running. An incremental collector interleaves GC with application execution. This makes it attractive for interactive and real-time systems.
A widely acknowledged variation of the moving collector is the Generational Collector [Appel, 1989]. This algorithm divides memory into spaces. When one space is garbage collected, the live objects are moved to another space. It has been observed that some objects are long-lived and some are short-lived. By carefully arranging the objects, objects of similar ages can be kept in the same space, so that certain spaces receive more frequent, smaller-scale GC's.
Below, I analyze each algorithm in more detail, covering some important implementation techniques, the trade-offs associated with each, and some interesting optimizations that have been devised for them.
One advantage of Reference Counting is its immediacy in reclaiming memory: the block is freed as soon as the count drops to zero, which is an ideal characteristic for memory-critical cases. As the reference counter is maintained throughout the run of the program, the GC overhead is distributed throughout the run. This is in contrast to some tracing schemes, where the user programs have to be suspended in order for the collector to do a complete trace of a portion of the objects, which might be unbearable in real-time systems.
But having to update the count upon every allocation or change of reference is in itself a cost, both in time and in the space taken up to hold the count. For example, the user program's performance degradation is more severe when the allocated objects are small and frequently referenced by a large number of other objects. Implementation optimizations have been devised to reduce the overhead. For example, Deutsch and Bobrow’s deferred reference counting technique [Deutsch and Bobrow, 1976] proposed to save transaction time and space overhead by deferring count updates to a convenient time and by not storing the count in the block itself. The algorithm depends on the facts that most cells are referenced exactly once, and thus need not be recorded, and that reference counts need only be accurate when storage is about to be reclaimed. It requires a transaction file that stores changes to reference counts, and several hash tables that record references: the ZCT records cells with a reference count of zero, the MRT records cells referenced more than once, and the VRT records the variables holding pointers into the heap. GC is done by manipulating the tables and scanning the ZCT and VRT for free objects. But I have not found any implementation of it in practice. I think this is probably because one attractive feature of Reference Counting is its simplicity of implementation, which this optimization scheme seems to defeat. The idea of deferring count updates to save work is interesting, but it may also defeat Reference Counting’s immediacy of reclamation if the delay is too long.
The biggest flaw of reference counting is that it doesn't detect cyclic garbage: objects in a cycle keep each other's counts above zero even when nothing else refers to them. Since tracing schemes have no problem detecting cyclic garbage, some authors have suggested the use of reference counting in conjunction with a tracing scheme [Deutsch and Bobrow, 1976]. For example, use reference counting until memory is close to exhaustion, and at that point invoke a tracing collector to pick up the undetected garbage.
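A small sketch may make the cycle problem concrete. It assumes a toy heap of Node objects with explicit counts; Node, link and unreachable are hypothetical names of mine, and unreachable plays the role of the backup tracing collector.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.count = 0      # reference count
        self.fields = []    # outgoing pointers


def link(src, dst):
    """Create a pointer src -> dst and count it."""
    src.fields.append(dst)
    dst.count += 1


def unreachable(heap, roots):
    """Backup trace: return the nodes of heap that cannot be reached from roots."""
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.fields)
    return [n for n in heap if id(n) not in seen]
```

If x and y point to each other and the last outside pointer to x is removed, both counts stay at one, so counting alone never frees them; unreachable(heap, roots) still reports both as garbage.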
Reference Counting has been used as the primary method of memory management in systems that couldn’t tolerate the long pauses of the tracing schemes, such as Smalltalk, Modula-2+, SISAL, awk, and perl. Although elaborate and complex optimizations have been devised for it, Reference Counting has come to be considered inefficient because of its overhead and because less costly algorithms have been developed.
Mark and Sweep
An often-implemented classical algorithm is "Mark and Sweep", which handles cyclic pointer structures nicely. The original algorithm proposed by McCarthy works as follows [McCarthy, 1960]. On allocation, if the free memory pool is empty, the GC routine is invoked. It does a global traversal of all live objects via a Depth-First or Breadth-First search, starting from a root set of active objects, to determine which blocks are available for reclamation. Blocks are not reclaimed until all available storage is exhausted. During the sweep phase, the heap is scanned from the bottom up and the unmarked objects, the garbage, are put back into the free memory pool. If enough free memory is swept, the request is satisfied and the process is resumed.
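The two phases can be sketched as follows. This is a simplified illustration, not McCarthy's original formulation: the heap is assumed to be a plain Python list of objects, and Obj, mark, sweep and collect are hypothetical names.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.marked = False
        self.children = []   # pointers to other objects


def mark(roots):
    """Mark phase: depth-first traversal from the root set, using an explicit stack."""
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj.marked:
            continue
        obj.marked = True
        stack.extend(obj.children)


def sweep(heap):
    """Sweep phase: visit every object in the heap, live or not.

    Returns (live, reclaimed) and clears the mark bits for the next cycle."""
    live, reclaimed = [], []
    for obj in heap:
        if obj.marked:
            obj.marked = False
            live.append(obj)
        else:
            reclaimed.append(obj)
    return live, reclaimed


def collect(heap, roots):
    mark(roots)
    return sweep(heap)
```

The mark loop touches only reachable objects, while the sweep loop walks the entire heap, and an unreachable cycle is reclaimed like any other unmarked object.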
As with all tracing schemes, once GC is started, it has to go through all the live objects non-stop in order to complete the marking phase. This could potentially slow down other processes significantly, not to mention stopping the requesting process completely. For example, Foderaro and Fateman [Foderaro and Fateman, 1981] found that some large Lisp programs were spending 25 to 40% of their execution time marking and sweeping, and that users were waiting for an average of 4.5 seconds out of every 79 seconds. This is clearly not acceptable for real-time systems or where response time is important. But when response time is not critical, "Mark and Sweep" does offer better performance than "Reference Counting", as there is no per-access cost. Still, it bears a high cost since all objects have to be examined during the "sweep" phase; in other words, the workload is proportional to the heap size instead of the number of live objects.
Data in a mark-swept heap tend to be more fragmented due to the scattered locations of the reclaimed garbage. Fragmentation hurts spatial locality and causes more frequent cache misses. Fragmentation also tends to trigger more page faults in a virtual memory system, and leads to an increase in the size of the program's working set. Reference counting, on the other hand, does not affect the heap arrangement.
Also, when memory space is tight, calls to the GC routine become more frequent among active processes. This could result in competition for CPU time between the collector routines and the user programs, while the gain from each GC may be small.
Tracing and moving schemes need to maintain the set of all the roots of the active processes, and to be able to follow the search path to all the active objects. The set of roots and the set of pointers need to be maintained precisely for a rigorous GC, one that can distinguish precisely what is garbage and what is not. One advantage of non-moving tracing collectors over moving collectors is that this requirement can be relaxed. In other words, whenever a pointer-like value is encountered during marking and the collector cannot determine precisely whether it is a pointer, it employs the "conservative GC" strategy: treat the value as if it were a pointer, so that the data it appears to reference is not recycled, but never modify the value itself. Hence there is no risk of updating non-pointer data with an incorrect value, and questionable data is not recycled. A moving scheme, as we will see later, needs to be able to distinguish pointers from non-pointers precisely since it has to move active data structures around and update the pointers to them.
Another classical GC method is the copying algorithm. Like the "Mark and Sweep" algorithm, it does not impose overhead on the active processes (e.g. updating counters). In the process of copying, all live objects can be compacted into the bottom of the new space, mitigating the fragmentation problem. Compaction is a big attraction of the Copying algorithm. By producing a better memory organization, compaction improves cache locality, which is crucial for good performance. Compaction also greatly simplifies allocation compared to allocating from uncompacted memory. A comparison of two pointers and then an increment of an address do the job. If the next available address is beyond the current half-space, it’s time to GC on this half. There is no complication for variable-sized or large structure allocation: no need to go through a large portion of the memory space trying to find a block that fits.
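The "compare, then increment" allocation can be sketched as below. Addresses are modelled as plain integers, and SemiSpace, free, limit and allocate are hypothetical names chosen for this illustration.

```python
class SemiSpace:
    """One half-space of a copying collector, with addresses modelled as integers."""
    def __init__(self, base, size):
        self.free = base            # next available address
        self.limit = base + size    # one past the end of this half-space

    def allocate(self, nbytes):
        """Return the address of a new block, or None when it is time to collect."""
        if self.free + nbytes > self.limit:   # the comparison of two pointers
            return None
        addr = self.free
        self.free += nbytes                   # the increment of an address
        return addr
```

Because the free area is one contiguous region, a request of any size costs the same; there is no free list to search.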
In terms of GC cost, copying is cheaper than "Mark and Sweep", for its workload is proportional to the number of live objects instead of the size of the entire memory space. For many object-oriented languages, it is typical that many objects don’t survive to the next collection, provided the collection interval is long enough. Thus, most objects don’t need to be copied over to the new space. For example, the Standard ML of New Jersey (SML/NJ) implementation typically reclaims over 98 percent of the heap at each garbage collection [Appel, 1992]. On the other hand, we could argue that copying-style collectors probably don’t work well with programs having numerous, persistent, large data objects, because of the cost of copying.
The fundamental weakness of the algorithm is that it requires twice the space of non-copying collectors. Unless the physical memory is large enough, this algorithm could suffer more page faults than its non-copying counterparts. And after all the data has been moved to the new space, the data cache will have been swept clean and will suffer compulsory misses when computation resumes.
As with "Mark and Sweep", it doesn’t work well when memory is tight. As the number of allocation requests increases, the gain from each GC decreases and collections become more frequent, which could lead to thrashing. If the collector is intelligent enough, it should then inform the central memory management unit so that it can declare that the system is out of memory.
The actual process of copying can be costly and complex in the originally proposed recursive algorithm. Cheney’s elegant algorithm led the Copying collector to wide use in practice, and is now commonly used in moving collectors in general [Cheney, 1970]. His algorithm is iterative, which greatly reduces the cost of a run, and it is simpler than the recursive one: it does the job with two pointers. Each object is assigned one of three colors:
Black: indicating the object has been visited by the collector during this collection cycle, all of its children have been scanned, and it is confirmed live.
Grey: indicating this node has been visited, but not all of its children have been scanned.
White: indicating it hasn’t been visited, and if it remains white at the end of the cycle, it’s to be recycled.
A collection ends when all reachable nodes have been scanned, i.e. no grey nodes are left. At the beginning, the collector flips the roles of the two spaces, the FromSpace and the ToSpace, and initializes the two pointers, scan and free, to point to the bottom of the ToSpace. The objects referenced by the roots are copied into the ToSpace. At each iteration of the main loop, the grey node at scan is examined, and any of its children still in the FromSpace are copied to the ToSpace. When all of a grey node’s children have been copied, the grey node turns black, and the same process repeats for the next grey node. The scan pointer always points to the first grey node, just past the last black node, and the free pointer points to the first free location, just past the last grey node. After a node is copied into the ToSpace, a forwarding address is left in its old FromSpace cell to indicate the new address.
Overall, the copying collector has the benefits of compaction and a smaller workload, provided relatively few objects survive a collection. But it requires a large enough memory, for it is more likely to cause page faults due to the regular reuse of the two semi-spaces.
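The scan/free mechanism described above can be sketched as follows. This is an illustration rather than Cheney's original list-cell formulation: the ToSpace is modelled as a Python list, so the free pointer is implicitly the length of that list, and Cell and cheney_collect are hypothetical names.

```python
class Cell:
    def __init__(self, name, fields=()):
        self.name = name
        self.fields = list(fields)   # pointers to other cells
        self.forward = None          # forwarding address, set once the cell is copied


def cheney_collect(roots):
    """Copy everything reachable from roots into a fresh ToSpace.

    Returns (new_roots, to_space); to_space holds the survivors in copy order."""
    to_space = []                    # 'free' is implicitly len(to_space)

    def copy(cell):
        if cell.forward is None:     # white: not copied yet
            clone = Cell(cell.name, cell.fields)   # fields still point into FromSpace
            cell.forward = clone     # leave the forwarding address behind
            to_space.append(clone)   # advances 'free'; the clone is now grey
        return cell.forward

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):      # grey cells lie between scan and free
        cell = to_space[scan]
        cell.fields = [copy(f) for f in cell.fields]   # redirect into ToSpace
        scan += 1                    # the cell is now black
    return new_roots, to_space
```

No recursion stack or separate worklist is needed: the grey cells waiting between scan and free serve as the queue, so the traversal is breadth-first, and cells never reached are simply left behind in the FromSpace.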
The following figures show an example of Cheney’s algorithm in action.
X’ denotes the forwarding address (the cell’s new address) left in the cell after it is copied to the ToSpace, for other pointers to follow.
Figure 1. Cheney’s algorithm: initial configuration
Figure 2. Cheney’s algorithm: A and B scanned, C copied, not scanned yet.
Figure 3: Cheney’s algorithm:
Concurrent Incremental Collectors for Real-time
One thing to keep in mind is that the collectors described above are simplified versions of the actual implementations in the real world. There are many optimizations and variations of each in practice.
One big obstacle that both the tracing and the copying style collectors face is the long pause the application has to experience while the memory space is traced. For truly real-time applications, this is not acceptable. One remedy arises from the idea of allowing the program to run while small-scale GC is done. Incremental collectors, which do fine-grained, piecewise GC interleaved with program execution, are popular. The difficulty with this scheme is that the liveness of objects may be changed by the program during the tracing. The running program is called the mutator, for it may alter the graph of objects while the collector "isn’t watching" [Dijkstra, 1978]. Again, the notion of tri-color marking is used here. GC can be seen intuitively as the movement of the grey wavefront over the white objects.
Conservatism is the key to resolving the possible conflict between the mutator and the collector. For example, if the mutator causes an already traced object to die, this piece of garbage can wait until the next cycle for reclamation. On the other hand, the collector has to guard against the mutator’s meddling with the white objects. The notions of the read barrier and the write barrier are introduced for this purpose [Wilson, 1992]. The read barrier approach prevents the mutator from seeing any white objects. Upon detecting the mutator’s attempt to access a white object, the collector visits it right away and colors it grey. The write barrier method records where the mutator writes a pointer to a white object into a black object, and colors the affected object grey so that the collector will visit or revisit it.
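To illustrate the write barrier, here is a minimal sketch of incremental tri-color marking. It assumes one particular barrier flavor, in which the white target of a black-to-white store is shaded grey; IncrementalMarker, shade, step and write are hypothetical names, and a real collector would do this work at a much finer grain.

```python
WHITE, GREY, BLACK = "white", "grey", "black"


class Obj:
    def __init__(self, name):
        self.name = name
        self.color = WHITE
        self.fields = []


class IncrementalMarker:
    def __init__(self, roots):
        self.grey = []               # worklist: the grey wavefront
        for r in roots:
            self.shade(r)

    def shade(self, obj):
        """Turn a white object grey so the collector will scan it later."""
        if obj.color == WHITE:
            obj.color = GREY
            self.grey.append(obj)

    def step(self):
        """One small unit of collector work; returns False once marking is done."""
        if not self.grey:
            return False
        obj = self.grey.pop()
        for child in obj.fields:
            self.shade(child)
        obj.color = BLACK
        return True

    def write(self, src, dst):
        """Mutator store of a pointer to dst into src, guarded by the write barrier."""
        src.fields.append(dst)
        if src.color == BLACK:
            self.shade(dst)          # never leave a black-to-white pointer behind
```

Without the check in write, a mutator that moves the only pointer to a white object into an already black object would hide that object from the collector, and it would be reclaimed while still live.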
The best known incremental GC is Baker’s copying collector [Baker, 1978]. It uses a basic Cheney copying method plus the read barrier approach, and it allows object allocation during the collection. Objects allocated during the collection are placed in the ToSpace. These new objects are assumed to be alive and are colored black immediately, and thus will not be reclaimed until the next garbage collection cycle. The read barrier is quite expensive on stock hardware: its cost is on the order of tens of percent of execution time on conventional systems, whereas Lisp Machines have special purpose hardware to detect pointers into the FromSpace and trap to a handler [Wilson, 1992].
Simple counting, tracing and moving schemes suffer from a number of flaws, such as the workload of GC, the long delay the requesting process has to experience, the degradation of locality, etc. These schemes also seem to waste a large amount of time on long-lived objects by repeatedly counting, marking or copying them. One solution for this is "generational" garbage collection. By creating generation spaces, it avoids the cost of moving long-lived data structures around.
As a simpler and less costly approach to the long pause delay problem, generational GC collects a portion of the heap at a time, minimizing the delay time, and makes GC feasible for interactive systems.
Generational GC’s characteristic of frequent collection of certain age spaces makes it very application dependent. Many applications today, especially the ones written in functional or object-oriented languages, place a high demand on memory. OOP programs make much greater use of dynamic data structures than traditional procedural programs. For example, an SML/NJ program may allocate a new word every thirty instructions [Appel, 1989]. And there is strong evidence that a majority of objects die young while a small portion live long. Wilson found that 80 to 98 percent of all newly allocated objects die within a few million instructions [Wilson, 1992].
Rooted in the moving collector scheme, a generational garbage collector retains the benefits of compaction, namely enhanced locality and simplified allocation, without paying the high price of the copy collector. It avoids the compulsory misses and the cost of moving by trying to leave most of the live objects in their long-lived spaces. It amplifies the benefit of the small workload even further: not only is the workload proportional to the number of live objects, but the collection also takes place in a smaller space, with possibly higher gains and less extra copying work.
Generational GC implementations first divide the heap into two or more generations, and each generation into two semi-spaces. Allocation starts in the youngest generation. When space is strained, the copy collector scheme is applied to the two semi-spaces of that generation. After surviving a number of GC’s, some objects become old enough to be promoted to an older generation. When an older generation's semi-space fills up, GC is run on that generation. The process of collecting spaces and promoting objects goes on. Many of the implementation issues involved have been researched.
Generational GC has proven to be very successful and is widely used, including in all commercial Lisps, Modula-3, SML/NJ, Glasgow Haskell, and commercial Smalltalk systems from Apple, Digitalk, Tektronix and ParcPlace Systems.
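The allocate, collect and promote cycle can be sketched with two generations. This is a deliberately simplified model under my own assumptions: generations are Python lists rather than pairs of semi-spaces, promote_age is a hypothetical tuning parameter, and a minor collection treats every old object as a root instead of tracking individual old-to-young pointers.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.age = 0          # number of minor collections survived
        self.fields = []


def _reachable_ids(roots):
    """Return the ids of all objects reachable from roots."""
    seen, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) not in seen:
            seen.add(id(obj))
            stack.extend(obj.fields)
    return seen


class GenerationalHeap:
    def __init__(self, promote_age=2):
        self.young = []
        self.old = []
        self.promote_age = promote_age

    def allocate(self, name):
        obj = Obj(name)
        self.young.append(obj)            # allocation starts in the youngest generation
        return obj

    def minor_collect(self, roots):
        """Collect the young generation only; old objects are never reclaimed here."""
        live = _reachable_ids(list(roots) + self.old)
        survivors = [o for o in self.young if id(o) in live]
        self.young = []
        for obj in survivors:
            obj.age += 1
            target = self.old if obj.age >= self.promote_age else self.young
            target.append(obj)

    def major_collect(self, roots):
        """Collect both generations."""
        live = _reachable_ids(roots)
        self.young = [o for o in self.young if id(o) in live]
        self.old = [o for o in self.old if id(o) in live]
```

An object that has been promoted stays in the old generation even after it dies, until a major collection examines that generation; this is the price paid for keeping the frequent minor collections small.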
When designing a generational GC system, the first question is how many generations to have. The number of generations varies widely among implementations. SML/NJ used two generations whereas Tektronix 4406 Smalltalk used seven [Appel, 1989]. I think that when implementing a generational GC, one should bear in mind the applications that will be run on the system, since generational GC is very application dependent. It would be interesting to run some trials for that particular style of application to help determine the number of generations. There are trade-offs between having more small, fine-grained generations and fewer large, age-spanning ones. The smaller the generations, the quicker each GC, and the shorter the pause the user experiences. But smaller generations also fill up faster, causing more frequent GC’s. One would need to know the particular characteristics of the applications’ data to be able to determine the number of generations to have. For example, when designing a collector with Java applications in mind, one could run many Java programs or benchmarks to get an idea of the distribution of object sizes and life spans, and experiment with collectors with different numbers of generations to see if there exists some optimal range.
Other issues related to the number of generations are the promotion rate and collection timing. How many GC’s does an object have to survive to be promoted to an older generation? Promoting or collecting too soon produces more inter-generational pointers, and doing so too late causes unnecessary copying. Ungar [Ungar, 1984] suggested that the number of objects that survive two collections is much smaller than the number that survive one (fast exponential decay), while raising the promotion standard beyond two collections only reduces the number of survivors slightly. SML/NJ takes a different approach [Appel, 1989]. As SML/NJ expects typically 2 percent of the younger generation to survive a collection, the new generation area was maximized to avoid promoting, that is, copying data to the old generation, as much as possible.
I think the timing issue is something of a paradox when decided statically; instead, we could observe the program's behavior and decide dynamically. We could, for example, calculate the time of the next GC taking into account data from the last GC. If little memory was reclaimed, delaying the next GC may be a good idea; if almost no objects survived the last GC, it might suggest we could shorten the collection interval, especially when memory space is in high demand. Fortunately (or unfortunately), this idea has already been explored. Ungar and Jackson came up with an innovative advancement policy called demographic feedback-mediated tenuring, for a two-generation collector [Ungar and Jackson, 1992]. The policy has the following rules:
Only tenure (promote) when it is necessary. The number of objects that survive a collection is a predictor of how long the next scavenge will take, since the collection time is proportional to the number of objects to be copied. If few survive, it’s probably not worth promoting them.
Only tenure as many objects as necessary. If the number of survivors suggests that the next collection would take too long, the age threshold is set to a value designed to promote the excess data. The survivors’ ages are recorded in a table, and the table is then scanned to look up the appropriate promotion threshold for the next collection.
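The table lookup in the second rule might look like the sketch below. It is my own simplified reading of the two rules, not Ungar and Jackson's actual implementation: age_table is assumed to map an age (collections survived) to the bytes of survivors of that age, and max_survivor_bytes is a hypothetical, non-negative bound on how much data the next scavenge may copy within the desired pause.

```python
import math


def choose_tenure_threshold(age_table, max_survivor_bytes):
    """Pick the age threshold for the next collection.

    Objects at least as old as the returned threshold are promoted;
    math.inf means that nothing is promoted."""
    excess = sum(age_table.values()) - max_survivor_bytes
    if excess <= 0:
        return math.inf                     # rule 1: only tenure when necessary
    promoted = 0
    for age in sorted(age_table, reverse=True):    # consider the oldest objects first
        promoted += age_table[age]
        if promoted >= excess:              # rule 2: tenure only as much as necessary
            return age
    return min(age_table, default=0)        # not reached when max_survivor_bytes >= 0
```

For instance, with survivors of 500, 200, 100 and 50 bytes at ages 1 to 4 and a bound of 700 bytes, the excess is 150 bytes, so promoting ages 4 and 3 is enough and the threshold is 3.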
It would be interesting to see how well this collector performs on real systems, but unfortunately I didn’t find much data about it. I find the idea very interesting, and the neat ideas that prevail in practice are often the ones with simple and clean implementations as well. Hopefully this one’s implementation is not so elaborate that it creates more work for the CPU than simpler approaches. I think that dynamically predicting the timing is a promising approach, and there are other methods implementing this approach to be examined. This hopefully serves as an eye opener for another way of looking at GC: there is a lot of room for innovation, and it can go beyond the boundary of static decisions.
There are other important implementation issues associated with generational GC; one concerns pointers. As live objects can be in different generation spaces in this scheme, the issue of inter-generational pointers (e.g. a pointer from an older generation to a younger one) arises. To GC a young generation, the collector must be able to find the complete root set of that generation, which includes all the inter-generational pointers from the older generations into it. This requires cooperation between the collector and the mutator. The old-to-young references created by promotion can be kept track of by the collector, and a write barrier trap can record the old-to-young pointers created via assignment by the mutator. There are typically more young-to-old inter-generational pointers than old-to-young ones for any particular generation, probably because the younger generations are more likely to fill up than the old ones. Hence, it would be more expensive to gather all the young-to-old pointers into an old generation in order to GC it alone. Collecting both the old and the new generations whenever the old one has to be GC’ed would probably be more feasible, since there is no need to keep track of young-to-old pointers, and reasonable, since the younger generation needs to be GC’ed more frequently anyway.
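The bookkeeping for old-to-young pointers created by assignment can be sketched as a remembered set maintained by a write barrier. The names RememberedSet, write_field and young_roots are hypothetical, each object is assumed to carry a generation tag, and pointers created by promotion are left out of this sketch.

```python
class Obj:
    def __init__(self, name, generation="young"):
        self.name = name
        self.generation = generation   # "young" or "old"
        self.fields = []


class RememberedSet:
    def __init__(self):
        self.sources = []   # old objects that may hold pointers into the young generation

    def write_field(self, src, dst):
        """Mutator store of a pointer to dst into src, guarded by a write barrier."""
        src.fields.append(dst)
        if src.generation == "old" and dst.generation == "young":
            if not any(src is s for s in self.sources):
                self.sources.append(src)

    def young_roots(self, program_roots):
        """Root set for collecting the young generation alone."""
        roots = [r for r in program_roots if r.generation == "young"]
        for old in self.sources:
            roots.extend(f for f in old.fields if f.generation == "young")
        return roots
```

Only stores from old to young are recorded; young-to-old stores pass through the barrier untouched, which matches the suggestion above of not tracking them and collecting the young generation whenever the old one is collected.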
Some performance statistics I gathered: for an optimizing compiler for SELF, a highly object-oriented language, Chambers reported a 4 to 27 percent GC overhead [Chambers, 1992]. Appel found a 5 to 10 percent overhead for SML/NJ [Appel, 1989].
The ideal Garbage Collector should have low CPU overhead, impose minimum delay on users, and improve the memory layout to give good virtual memory and cache performance. Since the cost of GC is highly application and language dependent, and relies heavily on demographics such as the distributions of object life spans and sizes, I think it’s reasonable to develop implementations that are specific to certain language styles. As OOP is in demand, a GC that is designed with such language characteristics in mind will have a performance edge. I think that’s an important factor in generational GC’s success: it is very "object oriented". As presented in this paper, optimizations and new ideas have been built one on top of another, and GC has become quite efficient. But there are still unresolved issues and room for improvement, even with the fairly successful generational collector.
[Wilson, 1992] Paul R. Wilson. Uniprocessor garbage collection techniques. Technical report, University of Texas, January 1994.
[Collins, 1960] George E. Collins. A method for overlapping and erasure of lists. Communications of the ACM, 3(12):655-657, December 1960.
[McCarthy, 1960] John McCarthy. Recursive functions of symbolic expressions and their computation by machine. Communications of the ACM, 3:184-195, 1960.
[Appel, 1989] Andrew W. Appel. Simple generational garbage collection and fast allocation. Software Practice and Experience, 19(2):171-183, 1989.
[Deutsch and Bobrow, 1976] L. Peter Deutsch and Daniel G. Bobrow. An efficient incremental automatic garbage collector. Communications of the ACM, 19(9):522-526, September 1976.
[Foderaro and Fateman, 1981] John K. Foderaro and Richard J. Fateman. Characterization of VAX Macsyma. In 1981 ACM Symposium on Symbolic and Algebraic Computation, pages 14-19, Berkeley, CA, 1981. ACM Press.
[Appel, 1992] Andrew W. Appel. Compilers and runtime systems for languages with garbage collection. Proceedings of SIGPLAN'92 Conference on Programming Languages Design and Implementation, volume 27 of ACM SIGPLAN Notices, San Francisco, CA, June 1992. ACM Press.
[Cheney, 1970] C. J. Cheney. A non-recursive list compacting algorithm. Communications of the ACM, 13(11):677-678, November 1970.
[Dijkstra, 1978] Edsger W. Dijkstra, Leslie Lamport, A. J. Martin, C. S. Scholten, and E. F. M. Steffens. On-the-fly garbage collection: An exercise in cooperation. Communications of the ACM, 21(11):965-975, November 1978.
[Baker, 1978] Henry G. Baker. List processing in real-time on a serial computer. Communications of the ACM, 21(4):280-294, 1978.
[Chambers, 1992] Craig Chambers. The Design and Implementation of the SELF Compiler, an Optimizing Compiler for an Object-Oriented Programming Language. PhD thesis, Stanford University, March 1992.
[Ungar, 1984] David M. Ungar. Generation scavenging: A non-disruptive high performance storage reclamation algorithm. ACM SIGPLAN Notices, 19(5):157-167, April 1984.
[Ungar and Jackson, 1992] David M. Ungar and Frank Jackson. An adaptive tenuring policy for generation scavengers. ACM Transactions on Programming Languages and Systems, 14(1):1-27, 1992.