AnHai Doan: The Cimple Project on Community Information Management

Building a software platform that can be rapidly deployed to manage data-rich online communities

News

Oct 17: Latest talk on the topic: Data Quality Challenges in Community Systems, invited talk at the 5th Int. Workshop on Quality in Databases (QDB 2007). Talk abstract
Oct 17: Latest papers on the topic:
- Building Community Wikipedias: A Human-Machine Approach, P. DeRose, X. Chai, B. Gao, W. Shen, A. Doan, P. Bohannon, J. Zhu. ICDE-08.
- Efficient Information Extraction over Evolving Text Data, F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE-08.
- Optimizing SQL Queries over Text Databases, A. Jain, A. Doan, L. Gravano. ICDE-08.
Jul 07: The best current papers to read to understand Cimple:
- Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach, P. DeRose, W. Shen, F. Chen, A. Doan, R. Ramakrishnan. VLDB-07.
- Building Community Wikipedias: A Human-Machine Approach, ICDE-08
Jan 07: The Cimple Project on Community Information Management (PPT slides),
given at UC-Berkeley, IBM Almaden, Stanford, and others, by AnHai.
Jan 07: The DBLife prototype system is live. This version is still very preliminary, but it is under active development. The latest features are tried out in the "DBLife Lab".
Apr 06: For a quick overview of the project, read the IEEE Data Engineering Bulletin (DEB) paper.

Overview

The CIM Problem: The Web is teeming with communities, each focusing on a specific set of topics. Examples include moviegoers, football fans, database researchers, and bioinformaticians. Community members often want to aggregate community data, then query and monitor it and discover information in it. For example, database researchers might be interested in questions such as "is there any interesting connection between researchers X and Y (or topics U and V)?", "in which courses is paper P cited?", and "what is new in the past 24 hours in the database community?". As Web communities proliferate, developing effective solutions to support their information needs is becoming increasingly important. We call this problem community information management, or CIM for short.

The Cimple Project: Cimple is a joint project between the Univ. of Wisconsin-Madison and Yahoo! Research. It develops a software platform that can be rapidly deployed and customized to manage data-rich online communities. This platform can be valuable for communities in a broad range of domains, from scientific data management, government agencies, and enterprise intranets to the World-Wide Web.

"Cimple" is thus shorthand for "Community Information Management Platform". The project consists of four thrusts:

Extracting and integrating structure: How to efficiently and accurately extract structured data (e.g., person names, paper titles, conferences) from the raw data? How to group the extracted mentions into real-world entities? How to enable system builders and users to quickly incorporate domain knowledge? How to manage provenance and uncertainty in these contexts?
Maintaining structure: How to maintain the extracted structured data and associated provenance over time, as the underlying raw data evolves?
Exploiting structure: How to provide useful user services over the raw and extracted data? Example services include keyword search, structured (e.g., SQL) queries, browsing, mining, monitoring/tracking, and personalized and context-sensitive services.
Mass collaboration: The extracted structures are inherently imperfect. How to leverage the multitude of community members to improve the extracted structures, by providing either implicit or explicit feedback? How to "engineer" the system in a way that provides incentives for community members to help improve community data? How to make the system manage both community data and users in a synergistic manner?

DBLife - A Prototype System: To drive and validate Cimple, we are building DBLife, a prototype system that manages information for the database research community. Eventually we may want to build more prototypes for research communities, such as AILife and IRLife, as well as for non-research ones, such as the legal community and the community of moviegoers.

Why This Project?: What is new, and how does it relate to databases, Web, AI, IR, data integration, and text management?

Acknowledgments: This project is funded by CAREER award IIS-0347903, a gift grant from Yahoo! Research, an IBM Faculty Award, and a Sloan Fellowship.

People

Faculty: AnHai Doan
Students: Pedro DeRose, Warren Shen, Xiaoyong Chai, Fei Chen
Postdoc: Byron J. Gao
Collaborators: Philip Bohannon (Yahoo! Research), Luis Gravano (Columbia University), Jeff Naughton (Wisconsin-Madison), Raghu Ramakrishnan (Yahoo! Research), Shivakumar Vaithyanathan (IBM Almaden Research Center), Jun Yang (Duke University)
Alumni: Doug Burdick, Robert McCann, Mayssam Sayyadian, Yoonkyong Lee

Talks

Data Quality Challenges in Community Systems, invited talk at the 5th Int. Workshop on Quality in Databases (QDB 2007). Talk abstract
The Cimple Project on Community Information Management (PPT slides),
given at UC-Berkeley, IBM Almaden, Stanford, the Northwest DB Society, and others, by AnHai in late 2006 - early 2007.
From Data Integration to Community Information Management (PPT slides), A. Doan. Yahoo! Research, 2006.

Systems/Software/Data

The DBLife system, under development.

Publications

Overviews, Tutorials, and Demos

DBLife: A Community Information Management Platform for the Database Research Community, P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, R. Ramakrishnan. CIDR-07 (demo).
Community Information Management, A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Databases, 29(1), 2006.
Managing Information Extraction, A. Doan, R. Ramakrishnan, S. Vaithyanathan. SIGMOD-06 Tutorial (PPT slides). See also the tutorial description in SIGMOD proceedings.

System Methodologies and Architecture

Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach, P. DeRose, W. Shen, F. Chen, A. Doan, R. Ramakrishnan. VLDB-07.

Structure Extraction and Integration

Efficient Information Extraction over Evolving Text Data, F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE-08.
Optimizing SQL Queries over Text Databases, A. Jain, A. Doan, L. Gravano. ICDE-08.
Declarative Information Extraction Using Datalog with Embedded Extraction Predicates, W. Shen, A. Doan, J. Naughton, R. Ramakrishnan. VLDB-07.
A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data, E. Chu, A. Baid, T. Chen, A. Doan, J. Naughton. VLDB-07.
OLAP over Imprecise Data with Domain Constraints, D. Burdick, A. Doan, R. Ramakrishnan, S. Vaithyanathan. VLDB-07.
Source-aware Entity Matching: A Compositional Approach, W. Shen, P. DeRose, L. Vu, A. Doan, R. Ramakrishnan. ICDE-07.

Structure Maintenance

Efficient Information Extraction over Evolving Text Data, F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE-08.
Maveric: Mapping Maintenance for Data Integration Systems, R. McCann, B. AlShebli, Q. Le, H. Nguyen, L. Vu, A. Doan. VLDB-05. PPT slides.

Structure Exploitation

Efficient Keyword Search across Heterogeneous Relational Databases, M. Sayyadian, H. LeKhac, A. Doan, L. Gravano. ICDE-07.
SQL Queries over Unstructured Text Databases, A. Jain, A. Doan, L. Gravano. ICDE-07 (poster).

Mass Collaboration and User-Centric Challenges

Building Community Wikipedias: A Human-Machine Approach, P. DeRose, X. Chai, B. Gao, W. Shen, A. Doan, P. Bohannon, J. Zhu. ICDE-08.
User-Centric Research Challenges in Community Information Management Systems, A. Doan, P. Bohannon, R. Ramakrishnan, X. Chai, P. DeRose, B. Gao, W. Shen. IEEE Data Engineering Bulletin, special issue on data management in social networks. 2007, invited.
Web 2.0 Style Schema Matching, R. McCann, W. Shen, A. Kramnik, A. Doan. ICDE-08.
Learning from Multiple Users to Improve Data Integration Tasks, R. McCann, W. Shen, A. Doan. Tech. Report, 2006.
Integrating Data from Disparate Sources: A Mass Collaboration Approach, R. McCann, A. Kramnik, W. Shen, V. Varadarajan, O. Sobulo, A. Doan. ICDE-05. Poster.
Building Data Integration Systems via Mass Collaboration, R. McCann, A. Doan, A. Kramnik, and V. Varadarajan. WebDB-03. Presentation.
Building Data Integration Systems: A Mass Collaboration Approach, A. Doan and R. McCann. Proc. of the IJCAI-03 Workshop on Information Integration on the Web.

Earlier Related Work

OLAP Over Uncertain and Imprecise Data, D. Burdick, P. Deshpande, T. Jayram, R. Ramakrishnan, and S. Vaithyanathan. VLDB-05.
Mass Collaboration: A Case Study, R. Ramakrishnan, A. Baptist, V. Ercegovac, M. Hanselman, N. Kabra, A. Marathe, U. Shaft. Int. Database Engineering and Applications Symposium (IDEAS-04).
The QUIQ Engine: A Hybrid IR-DB System, N. Kabra, R. Ramakrishnan, V. Ercegovac. ICDE-03.
The QUIQ Engine: A Hybrid IR-DB System, N. Kabra, R. Ramakrishnan, and V. Ercegovac. UW Tech. Report TR-1449, 2002.
Mass Collaboration and Data Mining (PPT slides), R. Ramakrishnan. Keynote talk, KDD-01.

Last updated: June 2007.