I am an assistant professor
in the Department of Computer
Sciences at the University of
Wisconsin-Madison. My interests are theoretical and practical problems
in data management. Details of my work can be found here. I believe that the future of
computing is in data management. If you agree, are an outstanding
student, and are looking to begin graduate work, please send me an
email.
Ongoing Project Descriptions
MystiQ is a
probabilistic relational database designed to handle imprecision
resulting both from newer applications such as information
extraction and social
networking data and classical applications such as object
reconciliation and data cleaning. The central theme is processing
complex SQL queries on large amounts of probabilistic relational
data. This work has developed techniques such as extensional
plans for aggregates, multisimulation,
materialized
views of probabilistic data, processing of NOT EXISTS
predicates, and approximate
lineage. A recent overview of the system is in our upcoming
SUM 2008 paper. For a broader,
biased look at the state of the art, see our tutorial (PowerPoint,
parts I & II) that was
delivered at VLDB 2008 in Auckland, New Zealand or the extended
version of our upcoming CACM
paper.

Lahar is a successor to the Peex project, which is part of the larger Markovian Streams
Project. The goal of both projects is to manage data from the RFID ecosystem, which is a
building-wide RFID deployment at the Paul Allen Center at the
University of Washington. The technical contribution of this work is
a suite of algorithms and access
methods to manage data in both near real-time and historical
streams. This project is joint work with Julie
Letchner and Prof. Magdalena
Balazinska. For an overview, please see our article in the IEEE
Journal of Internet Computing, Challenges for Event
Queries over Markovian Streams. For a more detailed
account, see our ICDE 2009 research paper Access Methods for Markovian
Streams, or check out our upcoming demo at VLDB 2009 in
Lyon, France. NB: We
plan to publish the data from the RFID ecosystem soon; please check
http://lahar.cs.washington.edu
for details. We will also be donating this data to the pdbench
project. If you have probabilistic/uncertain data, I encourage
you to donate it to this great project!
Teaching
Completed Project Descriptions
Dedupalog is
a declarative language for specifying deduplication tasks. In our
upcoming ICDE 2009 paper, Large-Scale
Deduplication with Constraints using Dedupalog, we define a
syntax and semantics for our new language. Further, we provide
algorithms that can cluster massive datasets extremely fast,
e.g., cluster all of CiteSeer in a minute or two. The
technical key is an extremely scalable algorithm that we prove is a
constant-factor approximation of the optimal for a large fragment of
Dedupalog programs. This is joint work with Dr. Arvind Arasu and
Prof. Dan Suciu that was done while visiting the DMX group at
Microsoft Research. This paper has been invited to a special issue
of TKDE for the best papers in ICDE 2009.

Galax is an open-source
implementation of XQuery 1.0, the W3C XML Query Language. My work on Galax included the
design of the algebraic compiler, which recovered classical
optimizations, notably join optimizations, inside the full XQuery
language. The project has continued without me and produced some very
cool work
at SIGMOD 2008.

XQuery! (read:
XQuery-Bang) is a fully compositional update language that extends
XQuery 1.0, the W3C XML Query Language. The contribution is
recovering classical database optimizations (joins, cursors and
indices) while at the same time providing imperative features
(variable assignment).

SilkRoute is a
platform to translate XQuery to SQL in a performant and largely
complete way. It allows users to publish their relational data
effectively and easily.

XBrain is a web-based application built on
SilkRoute designed to allow researchers to query SIG’s Brain
Mapping Database. The query language used is XQuery, and the
resulting XML can be viewed directly, automatically transformed
into HTML or CSV, or visualized on an image of brain regions.