- Probabilistic DB Book is out. My coauthors (Dan Suciu, Dan Olteanu, and
Christoph Koch) have written a wonderful book about probabilistic
databases in Morgan and Claypool's Synthesis Lectures on Data
Management.
- ICDE 2012. Xixuan (Aaron) Feng, Fei Chen and Min Wang from HP
Labs-China, and I have authored a paper titled Optimizing Statistical
Information Extraction Programs Over Evolving Text, about efficiently
maintaining conditional random fields on evolving corpora, i.e.,
corpora in which new documents are added or old documents are
modified. The paper has been accepted to ICDE 2012 in Washington, DC.
(CRC and Full Version)
- TODS in 2012. Dan Suciu and I have a paper, Understanding
Cardinality Estimation using Entropy Maximization, that has been
accepted to TODS. I would like to thank the referees who produced
thorough reviews that helped us to improve the paper. It is the
journal version of our PODS 2010 paper. A preliminary version is here.
- NIPS 2011
- Big Learning. I'm giving a keynote
at Big Learning. The organizers
have put together some great keynotes and one given by some database
guy.
- HOGWILD! Ben Recht, Steve Wright, Feng
Niu, and I have a new way of parallelizing incremental gradient
algorithms. The biggest obstacle to achieving linear speedup is lock
contention. Hogwild's approach is simple: get rid
of locking entirely! We prove that as long as the data are sparse,
Hogwild achieves linear speedups. We demonstrate our theory on a
diverse set of problems including text classification with support
vector machines, cut problems from vision applications, and
recommendation via matrix factorization. (Accepted to NIPS 2011)
- Victor: Mathematical Optimization + Large Data Projects
- Jellyfish. Ben Recht and I have released some software
for large-scale matrix completion. If your algorithm is faster on
billion-entry matrices, send it to us so we can learn how to go
faster. Currently, we are two orders of magnitude faster to the same
error (RMSE) than any algorithm that we know about. The algorithm is
essentially buzzword complete: a large-scale parallel stochastic
gradient algorithm for nonconvex relaxations.
- Code is available (with data generators). Paper on Optimization
Online.
- Thank you, Office of Naval Research and Physical
Layer Systems for supporting this work!
- Victor-SQL integrates incremental schemes with an RDBMS via
a (hopefully) easy-to-use Python interface. Available
at the Victor website.
- VLDB 2011. Look for three papers in the upcoming VLDB 2011:
- "Incrementally Maintaining Classification using an RDBMS" by M. Levent Koc and me. The source code for this project will be released as part of Hazy very soon.
- "Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS" by Feng Niu, AnHai Doan, Jude Shavlik, and me. A prototype version of the system is available from the Hazy website.
- "Automatic Optimization for MapReduce Programs" by Mike Cafarella, Eaman Jahani, and me.
- Tuffy and Manimal get some press. Thank you to those who have
posted references to Tuffy and Manimal on Twitter, Y-combinator,
O'Reilly online, and Radar Online. If you use our stuff and have any
feedback (positive or negative), please send us a note.
- Best of PODS 2010: our paper, Transducing Markov Sequences, has been
invited to a special issue of JACM for the best papers of PODS
2010.
- Best of PODS 2010: our paper, Understanding
Cardinality Estimation using Entropy Maximization, has been
invited to a special issue of TODS for the best papers of PODS 2010.
- A new WebDB 2010 paper with Michael Cafarella, Manimal: Relational
Optimization for Data-Intensive Programs. Manimal is a hybrid system
that does relational-style optimization for MapReduce programs by
performing a static analysis of Java code.
- ACM SIGMOD was kind enough to give me the ACM SIGMOD Jim Gray
Thesis Award for my
dissertation, Managing
Large-Scale Probabilistic Databases. Hearing Jim Gray talk
about his work on the World Wide Telescope project was what inspired
me to study databases. It is a real honor to me that his name is on
this award.
- I have two new PODS 2010 papers.
- Queries and Materialized Views on Probabilistic Databases, with
Nilesh Dalvi and Dan Suciu, appears in the Journal of Computer and
System Sciences. This paper is the best explanation of our work on using
materialized views to answer queries in a probabilistic database.