Overview
The CIM Problem:
The Web is teeming with communities, each focusing on a specific set of topics.
Examples include those of movie goers, football fans, database researchers,
and bioinformatists. Community members often want to aggregate community data,
then query, monitor, and discover certain information. For example,
database researchers might be interested in questions such as "is there
any interesting connection between researchers X and Y (or topics U and V)?",
"in which course is a paper P cited?", and "what is new in the past 24 hours
in the database community?". As Web communities proliferate, the problem of
developing effective solutions to support their information needs is
becoming increasingly important. We call this problem community
information management, or CIM for short.
The Cimple Project: Cimple is a joint project between the Univ. of Wisconsin-Madison
and Yahoo! Research. It develops a software platform that can be rapidly deployed and
customized to manage data-rich online communities. This platform can be
valuable for communities in a broad range of domains, from scientific data
management, government agencies, and enterprise intranets to those on the World-Wide Web.
"Cimple" is thus shorthand for "Community Information Management Platform". The project consists of four thrusts:
- Extracting and integrating structure: How to efficiently and accurately
extract structured data (e.g., people's names, paper titles, and conferences) from the
raw data? How to group the extracted mentions into real-world entities? How to enable system builders and users
to quickly incorporate domain knowledge? How to manage provenance and uncertainties in these
contexts?
- Maintaining structure: How to maintain the extracted structured data
and associated provenance over time,
as the underlying raw data evolves?
- Exploiting structure: How to provide useful user
services over the raw and extracted data? Example services include
keyword search, structured (e.g., SQL) queries, browsing, mining,
monitoring/tracking, and personalized and context-sensitive services.
- Mass collaboration: The extracted structures are inherently imperfect.
How to leverage the multitude of community members to improve the extracted structures,
by providing either implicit or explicit feedback? How to "engineer" the system in
a way that provides incentives for community members to help improve community data?
How to make the system manage both community data and users in a synergistic manner?
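As a rough illustration of the first thrust, the sketch below extracts name mentions from raw text, groups them into entities, and keeps each mention's source page as provenance. It is a toy example under stated assumptions, not how Cimple or DBLife is implemented: the sample pages, the regular expression, the first-initial-plus-last-name matching rule, and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical raw pages as (source id, text) pairs; not real community data.
RAW_PAGES = [
    ("page1", "Jane Smith gave a talk on data integration."),
    ("page2", "J. Smith and Alan Jones co-authored a paper."),
    ("page3", "Jane Smith joined a program committee."),
]

# Toy extractor (an assumption): a capitalized first name or an initial,
# followed by a capitalized last name.
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.) ([A-Z][a-z]+)\b")


def extract_mentions(pages):
    """Yield (name, source) pairs for every name-like string in the pages."""
    for source, text in pages:
        for first, last in NAME_PATTERN.findall(text):
            yield f"{first} {last}", source


def entity_key(name):
    """Naive matching rule (an assumption): first initial plus last name."""
    first, last = name.split()
    return (first[0].lower(), last.lower())


def group_mentions(mentions):
    """Group mentions into entities, keeping the source of each mention."""
    entities = defaultdict(list)
    for name, source in mentions:
        entities[entity_key(name)].append((name, source))
    return dict(entities)


entities = group_mentions(extract_mentions(RAW_PAGES))
for key, mentions in entities.items():
    print(key, mentions)
```

A rule this naive would wrongly merge, say, "John Smith" with "Jane Smith", which is one small instance of why the extracted structures are imperfect and why provenance and community feedback matter.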
DBLife - A Prototype System: To drive and validate Cimple, we
are building DBLife, a prototype system that manages information for the database
research community. Eventually
we may want to build more prototypes for research communities, such as AILife and IRLife,
as well as non-research ones, such as those for the legal community and the community of
movie goers.
Why This Project?: What is new, and how does it relate to
databases, Web, AI, IR, data integration, and text management?
Acknowledgments: This project is funded by
NSF CAREER award IIS-0347903, a gift grant from Yahoo! Research, an
IBM Faculty Award, and a Sloan Fellowship.
People
- Faculty: AnHai Doan
- Students: Pedro DeRose,
Warren Shen,
Xiaoyong Chai,
Fei Chen
- Postdoc: Byron J. Gao
- Collaborators:
Philip Bohannon (Yahoo! Research),
Luis Gravano (Columbia University),
Jeff Naughton (Wisconsin-Madison),
Raghu Ramakrishnan (Yahoo! Research),
Shivakumar Vaithyanathan (IBM Almaden Research Center),
Jun Yang (Duke University)
- Alumni:
Doug Burdick,
Robert McCann,
Mayssam Sayyadian,
Yoonkyong Lee
Talks
Systems/Software/Data
Publications
Overviews, Tutorials, and Demos
- DBLife: A Community Information Management Platform for
the Database Research Community, P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick,
A. Doan, R. Ramakrishnan. CIDR-07 (demo).
- Community Information
Management, A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee,
R. McCann, M. Sayyadian, and W. Shen. IEEE Data Engineering
Bulletin, Special Issue on Probabilistic Databases, 29(1), 2006.
- Managing Information
Extraction, A. Doan, R. Ramakrishnan, S. Vaithyanathan. SIGMOD-06 Tutorial (PPT slides).
See also the tutorial description in SIGMOD
proceedings.
System Methodologies and Architecture
Structure Extraction and Integration
- Efficient Information Extraction over Evolving Text Data, F. Chen,
A. Doan, J. Yang, R. Ramakrishnan. ICDE-08.
- Optimizing SQL Queries over Text Databases, A. Jain,
A. Doan, L. Gravano. ICDE-08.
- Declarative Information Extraction Using
Datalog with Embedded Extraction Predicates, W. Shen,
A. Doan, J. Naughton, R. Ramakrishnan. VLDB-07.
- A Relational Approach to Incrementally
Extracting and Querying Structure in Unstructured Data, E. Chu, A. Baid, T. Chen,
A. Doan, J. Naughton. VLDB-07.
- OLAP over Imprecise Data with
Domain Constraints, D. Burdick, A. Doan, R. Ramakrishnan,
S. Vaithyanathan. VLDB-07.
- Source-aware Entity Matching: A Compositional Approach,
W. Shen, P. DeRose, L. Vu, A. Doan, R. Ramakrishnan. ICDE-07.
Structure Maintenance
- Efficient Information Extraction over Evolving Text Data, F. Chen,
A. Doan, J. Yang, R. Ramakrishnan. ICDE-08.
- Maveric: Mapping
Maintenance for Data Integration Systems, R. McCann, B.
AlShebli, Q. Le, H. Nguyen, L. Vu, A. Doan. VLDB-05.
PPT slides.
Structure Exploitation
Mass Collaboration and User-Centric Challenges
- Building Community Wikipedias: A Human-Machine Approach,
P. DeRose, X. Chai, B. Gao, W. Shen, A. Doan, P. Bohannon, J. Zhu. ICDE-08.
- User-Centric Research Challenges in Community
Information Management Systems, A. Doan, P. Bohannon, R. Ramakrishnan, X. Chai, P. DeRose,
B. Gao, W. Shen. IEEE Data Engineering Bulletin, special issue on data management in social
networks. 2007, invited.
- Web 2.0 Style Schema Matching, R. McCann, W. Shen,
A. Kramnik, A. Doan. ICDE-08.
- Learning from Multiple Users to Improve
Data Integration Tasks, R. McCann, W. Shen, A. Doan. Tech.
Report, 2006.
- Integrating Data from Disparate
Sources: A Mass Collaboration Approach, R. McCann, A. Kramnik,
W. Shen, V. Varadarajan, O. Sobulo,
A. Doan. ICDE-05. Poster.
- Building Data Integration
Systems via Mass Collaboration, R. McCann, A. Doan,
A. Kramnik, and V. Varadarajan. WebDB-03.
Presentation
- Building Data
Integration Systems: A Mass Collaboration Approach, A. Doan
and R. McCann.
Proc. of the IJCAI-03 Workshop on Information Integration on the
Web.
Earlier Related Work
- OLAP Over Uncertain and Imprecise Data,
D. Burdick, P. Deshpande, T. Jayram, R. Ramakrishnan, and
S. Vaithyanathan. VLDB-05.
- Mass Collaboration: A Case Study,
R. Ramakrishnan, A. Baptist, V. Ercegovac, M. Hanselman, N. Kabra,
A. Marathe, U. Shaft. Int. Database Engineering and Applications
Symposium (IDEAS-04).
- The QUIQ Engine: A Hybrid IR-DB System,
N. Kabra, R. Ramakrishnan, V. Ercegovac. ICDE-03.
- The QUIQ Engine: A Hybrid IR-DB System,
N. Kabra, R. Ramakrishnan, and V. Ercegovac. UW Tech. Report TR-1449, 2002.
- Mass Collaboration and Data Mining (PPT slides),
R. Ramakrishnan. Keynote talk, KDD-01.
Last updated: June 2007.