Raghavan's PhD Research Summary

My thesis adviser was Professor Susan Horwitz.

One of the contributions of my PhD thesis is an algorithm [2,5] for automatically extracting a selected fragment (i.e., set of selected statements) into a separate procedure, in a semantics-preserving manner. The statements are selected by the programmer (perhaps using the help of application-specific tools). My algorithm addresses the extraction of "difficult" selected fragments: fragments in which the selected statements are non-contiguous, and/or involve exiting jump statements (jumps whose targets are outside the code region that contains the fragment). My algorithm uses code-reordering transformations to move intervening non-selected statements out of the way, in a safe manner, whenever possible. Previously proposed algorithms for same the problem either perform no code reordering, or do so under conditions that are much more restrictive than those used in my approach. Also, they do not address the issue of exiting jumps at all.

My algorithm has several applications; e.g., it can be used to break a large monolithic procedure that contains several conceptually separate strands of computation into smaller procedures, one per strand (for this, the algorithm is applied individually on each strand). This restructuring improves the understandability of the program.

Another problem I address in my thesis is the detection and elimination of duplicated fragments in source code. It is a well-known fact that real programs contain a lot of duplication. A group of clones (instances of duplicated code fragments) can be eliminated, after detection, by extracting them into a separate new procedure and replacing each clone by a call to this procedure. This restructuring activity reduces code size, and improves modularity (which in turn improves understandability, eases future enhancements and bug fixes, and increases the potential for future reuse).

My first contribution in this context is an algorithm [3] for detecting duplicated fragments (clones) in source code, and a prototype implementation of this algorithm. The novel aspect of my approach is that clones are detected by finding isomorphic subgraphs in a Program Dependence Graph (PDG) representation of the program. The key benefit of this approach is that groups of clones that match dependence-wise but do not match exactly at the source level can be detected; e.g., groups in which matching statements are not in the same order in all clones, groups in which non-matching statements intervene between matching statements (i.e., clones are non-contiguous), and groups in which variable names are not identical. Groups of clones with such difficult characteristics can be created when programmers use copy-and-paste, and then modify pasted code, to implement new features that resemble existing features. Previous approaches to clone detection, which work on the source text, or syntax-tree, or control-flow-graph representation of the program, cannot in general identify clone groups with such difficult characteristics (except for variable renaming).

My second contribution in this context is an algorithm [1] for extracting a group of clones into a single separate procedure. It addresses the extraction of difficult groups of clones, and thereby complements my clone-detection approach. It uses the single-fragment extraction algorithm (described earlier) as a subroutine, to deal with individual clones that are non-contiguous and/or involve exiting jumps. It then makes "out of order" groups "in order" by permuting the statements in each clone to bring matching statements into the same order. My algorithm has several new aspects when compared to previous clone-group-extraction algorithms: it is the first to address exiting jumps, and the first to use a pre-pass to move as many intervening non-cloned statements out of the way as possible. It also often produces better extracted procedures -- procedures that contain cleaner code and have fewer parameters.