Computer Sciences Dept.

Cristian Estan

Thumbnail portrait
End-biased Samples for Join Cardinality Estimation
Cristian Estan, Jeffrey F. Naughton
UW CS technical report TR1541, October 2005

We present a new technique for using samples to estimate join cardinalities. This technique, which we term ``end-biased samples,'' is inspired by recent work in network traffic measurement. It improves on random samples by using coordinated pseudo-random samples and retaining the sampled values in proportion to their frequency. We show that end-biased samples always provide more accurate estimates than random samples with the same sample size. The comparison with histograms is more interesting --- while end-biased histograms are somewhat better than end-biased samples for uncorrelated data sets, end-biased samples dominate by a large margin when the data is correlated. Finally, we compare end-biased samples to the recently proposed ``skimmed sketches'' and show that neither dominates the other, that each has different and compelling strengths and weaknesses. These results suggest that end-biased samples may be a useful addition to the repertoire of techniques used for data summarization.

Paper in PDF and Postscript.

Computer Sciences | UW Home