Broader Contexts for the Cimple Project


What is New?

Cimple builds on the wealth of research in information extraction, information integration, entity matching, and text management, which spans the database, AI, Web, and IR communities, but differs from this body of work in five important ways:

How does Cimple Relate to Research in ...

Databases/IR: Cimple attempts to extend the footprint of DBMSs and, more broadly, to apply database technologies to manage Web data. A major problem with doing database and IR research that involves the Web is that the Web is simply too big. In academic environments, it is possible but difficult to build infrastructures and user bases at Web scale in order to uncover more interesting problems and better validate solutions. Cimple can be viewed as attempting to circumvent this problem by focusing on Web communities, which are in effect "mini-Webs". At this scale, it may be easier to build infrastructures and user bases, to perform deeper semantic analysis to infer more complex structured data, and to apply database/IR technologies.

Data Integration: Cimple is an attempt to do best-effort data integration at a community scale: first we apply the best automatic techniques to extract and integrate the data, then we leverage human effort (from the community builders and the users) to improve the extraction and integration process. This can be viewed as an example of a self-improving automatic data integration system. In addition, it would be interesting to consider whether the problem of building a DBLife-like system for the database community can be cast as a data integration challenge (and benchmark).
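The two-stage idea above (automatic techniques first, human effort second) can be illustrated with a minimal sketch. This is not Cimple's or DBLife's actual implementation: the token-overlap matcher, the `accept` threshold, and every name in the code (`auto_match`, `integrate`, `feedback`) are hypothetical choices made only for illustration.

```python
def normalize(name: str) -> str:
    """Lowercase and drop punctuation so trivially different spellings agree."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()


def auto_match(a: str, b: str) -> float:
    """Hypothetical automatic matcher: Jaccard similarity over name tokens."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def integrate(pairs, feedback, accept=0.9):
    """Best-effort integration: automatic scores first, human verdicts override.

    `feedback` maps a pair to True (same entity) or False (different entities);
    pairs without a verdict fall back to the automatic score.
    """
    accepted, needs_review = [], []
    for pair in pairs:
        verdict = feedback.get(pair)
        if verdict is True:
            accepted.append(pair)          # a human confirmed the match
        elif verdict is False:
            continue                       # a human rejected the match
        elif auto_match(*pair) >= accept:
            accepted.append(pair)          # confident automatic match
        else:
            needs_review.append(pair)      # uncertain: ask the community
    return accepted, needs_review


pairs = [("J. Smith", "j smith"), ("Anna Lee", "A. Lee"), ("Bob Ray", "Bob Roy")]
feedback = {("Anna Lee", "A. Lee"): True}   # one assumed user correction
accepted, needs_review = integrate(pairs, feedback)
print(accepted)
print(needs_review)
```

In this sketch, each round of feedback shrinks the review queue, which is one simple way to read "self-improving"; a fuller system might also use the verdicts to retrain or retune the automatic matcher.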

AI: The AI community has developed numerous sophisticated solutions to address individual problems in the CIM process, such as information extraction, entity matching, and relationship discovery. The main focus has largely been on improving accuracy. Cimple also attempts to develop more accurate "blackbox" solutions in the CIM context, but it places a major emphasis on studying how the "blackboxes" can be composed effectively to handle the entire CIM process. The focus is on composing them to maximize accuracy as well as efficiency (since scalability is a major problem). To compose "blackboxes" effectively, we often take cues from machine learning techniques as well as relational optimization technologies.
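As a rough illustration of why composition matters for efficiency, the sketch below places a cheap filter in front of a costly matcher, loosely in the spirit of a relational optimizer applying an inexpensive selection before an expensive operation. The blocker, the matcher, and all names here are hypothetical stand-ins, not components of Cimple.

```python
from itertools import combinations


def surname_blocker(a: str, b: str) -> bool:
    """Cheap filter: keep only pairs whose last token (surname) agrees."""
    return a.lower().split()[-1] == b.lower().split()[-1]


class CountingMatcher:
    """Stand-in for an expensive black box; records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, a: str, b: str) -> bool:
        self.calls += 1
        # Toy rule: same surname and same first initial.
        return surname_blocker(a, b) and a.lower()[0] == b.lower()[0]


def match_all(mentions, matcher, blocker=None):
    """Run the matcher over all pairs, optionally pruned by a blocker first."""
    matches = []
    for a, b in combinations(mentions, 2):
        if blocker is not None and not blocker(a, b):
            continue                      # pruned without paying matcher cost
        if matcher(a, b):
            matches.append((a, b))
    return matches


mentions = ["Alice Chen", "A. Chen", "Bob Chen", "Alice Wong", "B. Chen"]

naive = CountingMatcher()
naive_result = match_all(mentions, naive)

blocked = CountingMatcher()
blocked_result = match_all(mentions, blocked, blocker=surname_blocker)

print(naive.calls, blocked.calls)         # matcher invocations in each plan
print(naive_result == blocked_result)     # both plans find the same matches
```

Here the blocker never discards a true match because the toy matcher itself requires equal surnames. With real black boxes, a filter generally trades some accuracy for speed, which is why the composition has to be chosen with both in mind.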

Web: Cimple can be viewed as building technologies for vertical portals, but at the semantic (e.g., entity-relationship) level. It moves toward a vision of the Web where numerous such community portals exist, each of which can be maintained efficiently with minimal human effort, and where Web search can be moved to the next level by exploiting the structured data at these portals. Cimple also studies how to make community members collectively help build and maintain such portals. (Industrial efforts in this direction can be seen, e.g., in My Web 2.0 at Yahoo! and in Google Base.)

Semantic Web: The vision of the Semantic Web is to have users mark up data on the Web so that it can be exploited more effectively by automated means. Cimple studies how this can be done in the context of communities: how we can "bootstrap" a portion of the Semantic Web by initially marking up data using automatic means, then using services over this initial markup to entice the user base to mark up more data, thereby improving the provided services.


Last updated: Apr 2006.