Under construction. Last updated: Aug 05.
Schema and ontology matching play a fundamental role in numerous data management applications. My Ph.D. research (SIGMOD-01,WWW-02) (2000-2002) made three key contributions to solving this problem. First, it introduced machine learning as an indispensable component of matching solutions. Second, it articulated a multi-component, highly extensible architecture for schema matching. Third, it showed how to learn from past matching efforts (to improve accuracy of subsequent matching tasks).
In my post-Ph.D. research on schema & ontology matching (2003-present), I have significantly extended the above three directions. In particular, I have shown that the multi-component architecture is also well suited for discovering complex semantic matches (SIGMOD-04a) and that, beyond past matching efforts, one can also learn from external data, other related matching tasks, corpora of schemas, and users (SIGMOD-04b, SIGMOD-04a, ICDE-05a, WebDB-03).
Novel Directions: At the same time, I have also moved on to two novel and important schema matching challenges: tuning matching systems and designing schemas for interoperability.
My recent work (VLDB-05a) with several colleagues was the first to articulate the tuning problem. It also developed eTuner, a solution for automatically tuning schema matching systems at virtually no cost to the user. The solution also applies well beyond the schema matching context (VLDB-05b).
Self-Managing Data Integration Systems: Since a key application of schema matching is data integration, I became interested in the topic early. (In retrospect, I also had no choice, given Alon being my boss at the time :). A popular integration architecture lets users pose queries over a mediated schema; the system then translates each query and sends it to a set of sources. Building such a data integration system is extremely labor intensive. Hence I became interested in making such systems as self-managing as possible, along the lines of autonomic systems.
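To make the mediated-schema architecture concrete, here is a minimal sketch, not taken from any of the systems cited here. It assumes each source is described by a simple attribute-to-attribute mapping from the mediated schema to its own schema. The class `Source`, the function `query_mediated`, and the example airline data are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Source:
    """A hypothetical data source with its own local schema."""
    name: str
    mapping: Dict[str, str]      # mediated attribute -> local attribute
    rows: List[Dict[str, Any]]   # local data, keyed by local attribute names

    def answer(self, attrs: List[str], filters: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Skip the source if it cannot express every requested attribute.
        needed = list(attrs) + list(filters)
        if not all(a in self.mapping for a in needed):
            return []
        # Translate the equality filters into the local vocabulary.
        local_filters = {self.mapping[a]: v for a, v in filters.items()}
        results = []
        for row in self.rows:
            if all(row.get(k) == v for k, v in local_filters.items()):
                # Translate matching rows back into mediated attribute names.
                results.append({a: row[self.mapping[a]] for a in attrs})
        return results


def query_mediated(sources: List[Source], attrs: List[str],
                   filters: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Send a mediated-schema query to every source and union the answers."""
    answers: List[Dict[str, Any]] = []
    for source in sources:
        answers.extend(source.answer(attrs, filters))
    return answers


# Illustrative data only; the attribute names are assumptions.
sources = [
    Source("airline_a", {"origin": "from_city", "price": "fare"},
           [{"from_city": "Madison", "fare": 200},
            {"from_city": "Seattle", "fare": 350}]),
    Source("airline_b", {"origin": "dep", "price": "cost"},
           [{"dep": "Madison", "cost": 180}]),
]
result = query_mediated(sources, ["origin", "price"], {"origin": "Madison"})
```

In this sketch the `mapping` dictionaries play the role of semantic mappings. Creating and maintaining them by hand for many sources is where much of the labor goes, which is why automating that step matters.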
Clearly, a first step in this direction is to (semi)automate as many integration tasks as possible. Schema matching is one such task that I have worked on. Since 2003, I have also been working on automating other tasks, including the construction of mediated schemas (SIGMOD-04b, ICDM-05) and tuple deduplication (AAAI-05, Tech Report 05 on Mediate).
Once an integration system has been built, it must be maintained over time as the environment changes. In fact, since the maintenance cost often dominates in the long run, automating maintenance tasks becomes crucial. My first work in this direction maintains semantic mappings/wrappers over time as the underlying sources evolve (VLDB-05b). I am currently exploring more problems along this line.
Hard vs. Soft Data Integration: In enterprise contexts, integration systems such as those described above must be "precise" and kept "precise" over time. Otherwise they are pretty much useless. I call this hard data integration. A quintessential example of this is expedia.com, which integrates numerous sources on plane tickets, hotels, cars, etc. Hard data integration is very expensive but crucial for business, and its cost is amortized over long-running information needs.
Recently, however, there has also been growing interest in what I call soft data integration: scenarios where the integration need is short term (e.g., asking only a few queries), or must be satisfied quickly (so one cannot wait months to set up a "hard" data integration system), or can be satisfied without building expensive "hard" integration systems. Variants of such scenarios are referred to as on-the-fly, approximate, hands-off, or best-effort integration. A prime example of such "soft" systems is Citeseer; others include personal information management systems and systems that manage scientific data in numerous sciences.
Besides my work on hard data integration systems, as described above, I have also started research on soft integration systems, as it becomes increasingly clear that such systems play a critical role in many applications. The key challenges are to develop conceptual models and efficient algorithms where automatic tools do their best, then engage users to help "cover the last mile". I believe this direction will benefit significantly from a combination of database, IR, and learning techniques.
Community Information Management
Traditional data management is about managing data, largely in isolation. My final direction is managing data in a synergistic manner with its users. As more and more user communities appear online, this will become crucially important. My work here is still preliminary, but more will come soon. An early project of mine (MOBS) is on collaboratively building data integration systems (WebDB-03, ICDE-05b). The approach is reminiscent of mass collaboration efforts to build software such as Linux and Apache. Thus, it employs both learning and methods from open-source development.