B. Chen, R. Ramakrishnan, J. Shavlik & P. Tamma (2006).
Bellwether Analysis: Predicting Global Aggregates from Local Regions. Proceedings of the Thirty-Second International Conference on Very Large Data Bases (VLDB'06), pp. 655-666, Seoul, Korea.
This publication is available in PDF.
Massive datasets are becoming commonplace in a wide range of domains, and mining them is recognized as a challenging problem with great potential value. Motivated by this challenge, much effort has been concentrated on developing scalable versions of machine learning algorithms. An often overlooked issue is that large datasets are rarely labeled with the outputs that we wish to learn to predict, due to the human labor required. We make the key observation that analysts can often use queries to define labels for cases, which leads to the problem of learning to predict such query-produced labels. Of course, if a dataset is available in its entirety, we can simply run the query again to compute labels. The interesting scenarios are those where, after the predictive model is trained, new data is gathered at significant incremental cost and, perhaps, over time. The challenge is to accurately predict the query-labels for the projected completion of new datasets, based only on certain cost-effective subsets, which we call bellwethers.
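As a rough illustration of the bellwether idea described above, the following sketch uses invented toy data and is not the algorithm from the paper. It assumes the query-produced label is each item's total sales across all regions, that each candidate region has a hypothetical data-collection cost, and that a simple least-squares line serves as the predictive model. All names (find_bellwether, regional, totals, costs, budget) are made up for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Constant predictor: fall back to predicting the mean label.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def rmse(xs, ys, a, b):
    """Root-mean-square error of the line a + b*x on the given points."""
    return (sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def find_bellwether(regional, totals, costs, budget):
    """Pick the affordable region whose local values best predict the totals.

    Returns (region, training_rmse, (a, b)), or None if no region fits
    within the budget. Training error is used only to keep the sketch
    short; a real analysis would estimate error on held-out items.
    """
    best = None
    for region, xs in regional.items():
        if costs[region] > budget:
            continue  # too expensive to collect from this region
        a, b = fit_line(xs, totals)
        err = rmse(xs, totals, a, b)
        if best is None or err < best[1]:
            best = (region, err, (a, b))
    return best


# Hypothetical history: early sales of four past items in each region.
regional = {
    "WI": [10, 20, 30, 40],
    "CA": [50, 45, 80, 60],
    "NY": [30, 62, 88, 121],
}
# Query-produced labels: eventual total sales of the same four items.
totals = [100, 200, 300, 400]
# Hypothetical cost of gathering data from each region.
costs = {"WI": 1.0, "CA": 5.0, "NY": 3.0}

result = find_bellwether(regional, totals, costs, budget=4.0)
```

Once a region is chosen, the label for a new item would be predicted from that region's data alone, for example a + b * new_item_sales using the returned coefficients, without waiting for the full dataset to be collected.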
Computer Sciences Department
College of Letters and Science
University of Wisconsin - Madison