Old Projects of Christopher Re (Chris Re)

Probabilistic DB Book is out. My coauthors (Dan Suciu, Dan Olteanu, and Christoph Koch) have written a wonderful book about probabilistic databases in Morgan and Claypool's Synthesis Lectures on Data Management.
ICDE 2012 Xixuan (Aaron) Feng, Fei Chen and Min Wang from HP Labs-China, and I have authored a paper titled Optimizing Statistical Information Extraction Programs Over Evolving Text about efficiently maintaining conditional random fields on evolving corpora, i.e., corpora to which new documents are added or in which old documents are modified. The paper has been accepted to ICDE 2012 in Washington, DC. (CRC and Full Version)
TODS in 2012 Dan Suciu and I have a paper, Understanding Cardinality Estimation using Entropy Maximization, that has been accepted to TODS. I would like to thank the referees who produced thorough reviews that helped us to improve the paper. It is the journal version of our PODS 2010 paper. A preliminary version is here.
NIPS 2011
- Big Learning I'm giving a keynote at Big Learning. The organizers have put together some great keynotes and one given by some database guy.
- HOGWILD! Ben Recht, Steve Wright, Feng Niu, and I have a new way of parallelizing incremental gradient algorithms. The biggest obstacle to achieving linear speedup is lock contention. Hogwild's approach is simple: get rid of locking entirely! We prove that as long as the data are sparse, Hogwild achieves linear speedups. We demonstrate our theory on a diverse set of problems including text classification with support vector machines, cut problems from vision applications, and recommendation via matrix factorization. (Accepted NIPS 2011)
  - Code
  - Preprint on Optimization Online (with proofs)
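To make the lock-free idea in the HOGWILD! entry concrete, here is a minimal sketch, not the released code. It assumes a toy sparse least-squares problem, and every name in it (`make_sparse_data`, `hogwild_worker`, `hogwild_sgd`) is hypothetical. Several threads update one shared weight list with no locks, and each update touches only the few coordinates that its example uses. Python threads will not show the speedups discussed above, so this only illustrates the update pattern.

```python
import random
import threading


def make_sparse_data(n_examples, dim, nnz, seed=0):
    """Build examples y = <w_true, x> where each x has only `nnz` nonzeros."""
    rng = random.Random(seed)
    w_true = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    data = []
    for _ in range(n_examples):
        idx = rng.sample(range(dim), nnz)
        vals = [rng.uniform(-1.0, 1.0) for _ in idx]
        y = sum(w_true[i] * v for i, v in zip(idx, vals))
        data.append((idx, vals, y))
    return w_true, data


def mean_squared_error(w, data):
    """Average squared prediction error of weights w over the examples."""
    total = 0.0
    for idx, vals, y in data:
        pred = sum(w[i] * v for i, v in zip(idx, vals))
        total += (pred - y) ** 2
    return total / len(data)


def hogwild_worker(w, data, n_steps, step, seed):
    """Run SGD steps against the shared list w without taking any lock."""
    rng = random.Random(seed)
    for _ in range(n_steps):
        idx, vals, y = data[rng.randrange(len(data))]
        # Read the shared weights; other threads may be writing them right now.
        err = sum(w[i] * v for i, v in zip(idx, vals)) - y
        # Write back only the coordinates this sparse example touches.
        for i, v in zip(idx, vals):
            w[i] -= step * err * v


def hogwild_sgd(data, dim, n_threads=4, epochs=10, step=0.1):
    """Start n_threads unsynchronized workers and return the shared weights."""
    w = [0.0] * dim  # shared state, deliberately unprotected
    n_steps = epochs * len(data)
    threads = [
        threading.Thread(target=hogwild_worker, args=(w, data, n_steps, step, t))
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

Calling `make_sparse_data(500, 50, 3)` and then `hogwild_sgd(data, dim=50)` trains on examples that each touch 3 of 50 coordinates, so two concurrent updates rarely collide. That sparsity is what the entry above relies on.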
Victor: Mathematical Optimization + Large Data Projects
- Jellyfish Ben Recht and I have released some software for large-scale matrix completion. If your algorithm is faster on billion-entry matrices, send it to us so we can learn how to go faster. Currently, we reach the same error (RMSE) two orders of magnitude faster than any algorithm that we know about. The algorithm is essentially buzzword complete: a large-scale parallel stochastic gradient algorithm for nonconvex relaxations.
  - Code is available (with data generators). Paper on Optimization Online.
  - Thank you, Office of Naval Research and Physical Layer Systems for supporting this work!
- Victor-SQL integrates incremental schemes with an RDBMS via a (hopefully) easy-to-use python interface. Available at the Victor website.
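The Jellyfish entry above describes stochastic gradient descent on a nonconvex relaxation of matrix completion. Here is a minimal serial sketch of that general idea, not the Jellyfish algorithm or its parallel scheme. It assumes the relaxation is a low-rank factorization M ≈ L Rᵀ fit to the observed entries only. All function names and parameter values are hypothetical.

```python
import math
import random


def make_low_rank_problem(n_rows, n_cols, rank, observed_fraction, seed=0):
    """Sample a random rank-`rank` matrix and reveal a fraction of its entries."""
    rng = random.Random(seed)
    L_true = [[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(n_rows)]
    R_true = [[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(n_cols)]
    observed = []
    for i in range(n_rows):
        for j in range(n_cols):
            if rng.random() < observed_fraction:
                value = sum(a * b for a, b in zip(L_true[i], R_true[j]))
                observed.append((i, j, value))
    return observed


def init_factors(n_rows, n_cols, rank, seed=1):
    """Random starting factors; the scale 0.5 is an arbitrary choice."""
    rng = random.Random(seed)
    L = [[rng.uniform(-0.5, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    R = [[rng.uniform(-0.5, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    return L, R


def rmse(L, R, entries):
    """Root-mean-square error of L R^T on the given (row, col, value) entries."""
    total = 0.0
    for i, j, value in entries:
        pred = sum(a * b for a, b in zip(L[i], R[j]))
        total += (pred - value) ** 2
    return math.sqrt(total / len(entries))


def sgd_complete(L, R, observed, epochs=200, step=0.05, seed=2):
    """Update L and R in place, one observed entry at a time."""
    rng = random.Random(seed)
    order = list(observed)
    rank = len(L[0])
    for _ in range(epochs):
        rng.shuffle(order)
        for i, j, value in order:
            err = sum(a * b for a, b in zip(L[i], R[j])) - value
            for k in range(rank):
                l_ik, r_jk = L[i][k], R[j][k]
                # Gradient of 0.5 * err**2 with respect to each factor entry.
                L[i][k] -= step * err * r_jk
                R[j][k] -= step * err * l_ik
    return L, R
```

Each step reads and writes only one row of `L` and one row of `R`. This locality is the property that makes this family of methods a natural candidate for parallelization.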
VLDB 2011. Look for three papers in the upcoming VLDB 2011:
- Tuffy and Manimal get some press. Thank you to those who have posted references to Tuffy and Manimal on Twitter, Y-combinator, O'Reilly online, and Radar Online. If you use our stuff and have any feedback (positive or negative), please send us a note.
- "Incrementally Maintaining Classification using an RDBMS" by M. Levent Koc and me. The source code for this project will be released as part of Hazy very soon.
- "Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS" by Feng Niu, AnHai Doan, Jude Shavlik, and me. A prototype version of the system is available from the Hazy website.
- "Automatic Optimization for MapReduce Programs" by Mike Cafarella, Eaman Jahani, and me.
Best of PODS 2010: our paper, Transducing Markov Sequences, has been invited to a special issue of JACM for the best papers of PODS 2010.
Best of PODS 2010: our paper, Understanding Cardinality Estimation using Entropy Maximization, has been invited to a special issue of TODS for the best papers of PODS 2010.
A new WebDB 2010 paper with Michael Cafarella, Manimal: Relational Optimization for Data-Intensive Programs. Manimal is a hybrid system that performs relational-style optimization of MapReduce programs through static analysis of their Java code.
ACM SIGMOD was kind enough to give me the ACM SIGMOD Jim Gray Thesis Award for my dissertation, Managing Large-Scale Probabilistic Databases. Hearing Jim Gray talk about his work on the World Wide Telescope project was what inspired me to study databases. It is a real honor to me that his name is on this award.
I have two new PODS 2010 papers:
- Transducing Markov Sequences, with Benny Kimelfeld, studies how to evaluate transducers (think: automata with output) over the output of a Hidden Markov Model. This problem is motivated by challenges that we faced while building the querying infrastructure of Lahar (see below).
- Understanding Cardinality Estimation using Entropy Maximization with Dan Suciu. In this paper, we ask a very basic question: "Given some statistical information, what is the best cardinality estimate that one can make?" Shockingly, we can (sometimes) answer this question!
Queries and Materialized Views on Probabilistic Databases with Nilesh Dalvi and Dan Suciu appears in the Journal of Computer and System Sciences. This paper is the best explanation of our work on using materialized views to answer queries in a probabilistic database.