Overall merit
=============
Accept
Paper summary
=============
This paper focuses on the equi-join operator. Three classes of algorithms are considered, namely nested-loops, hash-based, and sort-based methods. The hash-based and sort-based methods use a divide-and-conquer approach to divide each input relation into parts that can then be joined. For the hash-based methods, a hash-partitioning approach is used. For the sort-based methods, sorted runs are created for each input relation, and then all the runs are merged.
Hash-based methods are more memory-efficient, requiring about sqrt(|R|) pages of memory to join the relations in two passes, whereas the sort-based methods need about sqrt(|S|) pages. Here R and S are the two relations being joined, |R| is the number of pages in R, and |R| < |S|.
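To make the gap between the two bounds concrete, here is a minimal sketch that evaluates them for a pair of relation sizes. The page counts and the function name `min_pages_two_pass` are hypothetical, and the sketch ignores constant factors such as hash-table overhead, so the numbers are only rough.

```python
import math

def min_pages_two_pass(r_pages: int, s_pages: int) -> dict:
    """Approximate memory (in pages) needed for a two-pass join.

    Assumption: constant factors (e.g. hash-table overhead) are ignored,
    so these are rough lower bounds rather than exact requirements.
    """
    smaller, larger = sorted((r_pages, s_pages))
    return {
        "hash": math.ceil(math.sqrt(smaller)),  # driven by the smaller relation
        "sort": math.ceil(math.sqrt(larger)),   # driven by the larger relation
    }

# Hypothetical sizes: R has 10,000 pages, S has 1,000,000 pages.
need = min_pages_two_pass(r_pages=10_000, s_pages=1_000_000)
print(need)
```

With these made-up sizes the hash-based bound is an order of magnitude smaller, because it depends only on the smaller relation.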
Three flavors of the hash-based methods are discussed: simple hash, GRACE hash, and hybrid hash. Join optimization methods are noted, including a technique that creates a bit-filter or a projection of the join columns on one relation, and uses this information to filter out tuples of the other relation before they are considered for the join. Thus, tuples that don't join can be eliminated early on, improving the overall join performance.
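The bit-filter idea can be sketched as follows. This is an illustration under my own assumptions, not the paper's implementation: the filter size, the use of Python's built-in `hash`, and the helper names `build_bit_filter` and `might_join` are all hypothetical.

```python
def build_bit_filter(keys, num_bits=1024):
    """Set one position per join key seen in the first relation."""
    bits = bytearray(num_bits)  # one byte per "bit", for simplicity
    for k in keys:
        bits[hash(k) % num_bits] = 1
    return bits

def might_join(bits, key):
    """False means the key definitely has no match; True means it may."""
    return bits[hash(key) % len(bits)] == 1

# Hypothetical relations as (join_key, payload) tuples.
r = [(1, "a"), (2, "b"), (3, "c")]
s = [(2, "x"), (3, "y"), (4, "z"), (5, "w")]

bits = build_bit_filter(k for k, _ in r)
# Drop S tuples that cannot join before they reach the join proper.
s_filtered = [t for t in s if might_join(bits, t[0])]
print(s_filtered)
```

The filter can produce false positives (a non-joining tuple may slip through when two keys collide) but never false negatives, so the join result is unchanged and only the work done on non-matching tuples shrinks.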
Hash-based methods have to deal with partition overflow, and a key method is to repartition large partitions.
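A minimal sketch of partitioning with overflow handling by repartitioning is given below. It is an in-memory toy under stated assumptions: `max_build` stands in for the memory budget (counted in tuples rather than pages), `fanout` and `max_level` are hypothetical parameters, and the fallback at `max_level` simply builds the table anyway, which a real system could not do for a partition dominated by one heavily duplicated key.

```python
from collections import defaultdict

def partition(tuples, fanout, level):
    """Split tuples on the join key (first field) into at most `fanout` parts."""
    parts = defaultdict(list)
    for t in tuples:
        # Mix the level into the hash so that repartitioning an overflowing
        # partition actually redistributes its tuples.
        parts[hash((level, t[0])) % fanout].append(t)
    return parts

def hash_join(r, s, max_build=2, fanout=4, level=0, max_level=8):
    """Join r and s on their first field, repartitioning oversized R partitions."""
    if len(r) <= max_build or level >= max_level:
        # The R side fits the (assumed) budget: build a hash table and probe.
        table = defaultdict(list)
        for rt in r:
            table[rt[0]].append(rt)
        return [(rt, st) for st in s for rt in table.get(st[0], [])]
    # Overflow: partition both inputs with the same hash and recurse per pair.
    r_parts = partition(r, fanout, level)
    s_parts = partition(s, fanout, level)
    out = []
    for p, r_part in r_parts.items():
        out.extend(hash_join(r_part, s_parts.get(p, []),
                             max_build, fanout, level + 1, max_level))
    return out

# Hypothetical inputs as (join_key, payload) tuples.
r = [(i, f"r{i}") for i in range(10)]
s = [(i % 12, f"s{i}") for i in range(24)]
result = hash_join(r, s)
print(len(result))
```

Because both relations are split with the same hash function at each level, matching tuples always land in the same pair of partitions, so recursing on an overflowing pair does not lose any join results.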
Comments and questions
======================
C1. There is a discussion of virtual memory management, but perhaps some of that is no longer relevant with modern operating systems.
C2. The evaluation component of this paper is largely analytical, which was likely the norm for papers of that time. It would be good to conduct an evaluation of various join algorithms in modern settings where servers have multiple processors, multiple cores, large memories, and flash storage.
Q1. For sort-merge join, each run has a length of 2 * |M|. Could you please explain where the factor of 2 comes from?