Research Statement
AnHai Doan

The rapid spread of computers and communication networks has transformed our world into a vast information bazaar, with millions of sources providing data in every imaginable format and mode of interaction. Distributed information processing systems hold the promise of acting as crucial middlemen in this chaotic market, by interacting with data sources, translating, and combining their data in order to obtain the information requested by users. However, today this promise remains largely unfulfilled, because such systems are still very hard to build and costly to operate. They must be told in tedious detail how to interact with the data sources and understand the languages they use. With the sources in constant evolution, once deployed the systems must still be under continuous supervision and told how to deal with the changes. The laborious teaching and supervision incur huge costs and severely limit the deployment of such "middleman" systems in practice. As a consequence, the vast potential of the global information market has so far remained largely untapped.

My research seeks to unleash this potential by making distributed information processing systems much easier to use with far less need for human supervision. My ultimate goal is to achieve the widespread use of online information processing systems that take only minutes to be deployed (instead of weeks or months as is the case today), that require only minimal human coaching to rapidly reach and maintain competence, and that continuously improve over time, in terms of both performance and capabilities.

Toward this goal, I begin by studying an important and representative class of distributed information processing systems: data integration systems. Such systems provide a uniform query interface to a multitude of data sources, thereby freeing the users from the tedious job of manually selecting the relevant sources, querying them, and combining their data to obtain the answers. Most recent research (including some of my own [9,10]) has addressed only the modeling and query processing aspects of data integration systems. I now plan to apply techniques from a variety of fields, most notably databases and machine learning, to significantly reduce the complexity of building and managing such systems. Specifically, I will focus on the following research areas:

Learning Source Descriptions
To process user queries, a data integration system must know the descriptions of data sources. Today, such descriptions are created manually. I will develop techniques to automatically learn source descriptions. Toward this goal, my Ph.D. thesis has addressed the problem of learning the semantic mappings between a source schema and the query interface of the system [2,4,5]. The thesis shows that machine learning techniques can be applied to produce such mappings with high accuracy. Specifically, it describes a multi-strategy learning approach that applies multiple learners to predict mappings, then combines the learners' predictions using a meta-learner. This approach subsumes and generalizes most previous approaches to schema mapping, which employ only a single mapping strategy. The thesis also shows that data instances can be utilized effectively to generate mappings. This is in contrast to most previous work, which utilizes only schema information.
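
The multi-strategy idea can be illustrated with a minimal sketch. It is not the implementation from the thesis: the two toy learners (one over column names, one over data instances), the accuracy-based weighting used as a stand-in for the meta-learner, and all names and sample data (Column, TokenLearner, MetaLearner, the real-estate columns) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    """A source-schema column; `label` is the matching mediated-schema element."""
    name: str
    values: list
    label: Optional[str] = None


def name_tokens(col):
    return col.name.lower().split("_")


def value_tokens(col):
    return [t for v in col.values for t in v.lower().split()]


class TokenLearner:
    """Toy base learner: scores a label by the fraction of the column's tokens
    that were seen with that label during training."""

    def __init__(self, tokenize):
        self.tokenize = tokenize
        self.seen = {}

    def fit(self, columns):
        for col in columns:
            self.seen.setdefault(col.label, set()).update(self.tokenize(col))

    def scores(self, col):
        toks = self.tokenize(col)
        return {
            label: (sum(t in seen for t in toks) / len(toks)) if toks else 0.0
            for label, seen in sorted(self.seen.items())
        }


def best(scores):
    # Ties are broken in favor of the first label in sorted order.
    return max(scores, key=scores.get)


class MetaLearner:
    """Hypothetical meta-learner: weights each base learner by its accuracy on
    held-out labeled columns, then sums the weighted scores."""

    def __init__(self, learners):
        self.learners = learners
        self.weights = [1.0] * len(learners)

    def fit(self, validation):
        self.weights = [
            sum(best(l.scores(c)) == c.label for c in validation) / len(validation)
            for l in self.learners
        ]

    def scores(self, col):
        combined = {}
        for w, learner in zip(self.weights, self.learners):
            for label, s in learner.scores(col).items():
                combined[label] = combined.get(label, 0.0) + w * s
        return combined

    def predict(self, col):
        return best(self.scores(col))


# Invented training data: columns whose mappings are already known.
train = [
    Column("house_addr", ["12 Main St Seattle WA", "4 Oak Ave Miami FL"], "address"),
    Column("list_price", ["$250,000", "$310,000"], "price"),
    Column("agent_phone", ["(206) 555 0100", "(305) 555 0199"], "phone"),
]
validation = [
    Column("contact", ["(206) 555 0142"], "phone"),
    Column("asking_price", ["$250,000"], "price"),
]

name_learner = TokenLearner(name_tokens)
instance_learner = TokenLearner(value_tokens)
for learner in (name_learner, instance_learner):
    learner.fit(train)

meta = MetaLearner([name_learner, instance_learner])
meta.fit(validation)

# The name suggests "phone", but the data instances look like addresses.
new_col = Column("agent_location", ["77 Elm St Seattle WA"])
name_only = best(name_learner.scores(new_col))
combined = meta.predict(new_col)
```

In this toy run the name-based learner alone is misled by the token "agent", while the combined prediction follows the instance-based learner, which proved more accurate on the held-out columns. A real meta-learner would be trained rather than set by a simple accuracy count.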

While research to date on schema mapping has been very promising, it is only the first step. None of the current approaches can automatically learn the complex, non one-to-one mappings that occur frequently in practice. I have already begun to investigate this issue [3]. A second challenge is to develop well-founded notions for semantic mapping. Such notions should help to communicate the meaning of mappings and to leverage specialized techniques for the mapping process. A third challenge is to develop a unified framework for schema mapping that combines in a principled, seamless, and efficient way all the relevant information (e.g., user feedback, mappings from a different application) and techniques (e.g., machine learning, heuristics). My recent work [11] suggests that mappings can be given well-founded definitions based on probabilistic interpretations, and that a unified mapping framework can be developed by leveraging probabilistic representation and reasoning methods such as Bayesian networks. I plan to further investigate these issues. Learning other source characteristics such as schema, reliability, and query-processing capability also raises many fascinating challenges that I plan to pursue. For example, suppose we want to build a data integration system over all C.S. department web sites in the U.S. What would be the schema of such a department web site? How do we characterize and learn it automatically? I believe this problem can be cast as a schema mapping problem, and hence it will benefit from the mapping techniques that I have developed.

Dealing with Changes in Source Descriptions
In dynamic and autonomous environments (e.g., the Internet) sources often undergo modifications with respect to schema, data, and query-processing capabilities. Hence, the operators of a data integration system must constantly monitor the component sources to detect and deal with changes. Clearly, manual monitoring is very expensive and not scalable. My goal therefore is to develop techniques to automate the monitoring and updating process. I believe an effective solution to the monitoring problem is to sample source data periodically, then use machine learning techniques to compare the current sample with previous source samples to detect changes. For example, the problem of detecting if the semantic mappings are still valid can be recast as a schema mapping problem that involves two consecutive source samples. A key challenge in the updating process will be updating the system's query interface to reflect changes in a source schema. This problem is similar to schema integration, a well-known and difficult problem. Several areas in databases and AI, including schema management, ontology merging, and model management, have addressed different aspects of this problem. I plan to build on techniques in these areas to develop an effective solution for updating.
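
As a rough illustration of sample-based monitoring, the sketch below compares two consecutive samples of a source by matching each current column against the columns of the previous sample. The format-profile representation, the cosine comparison, the 0.5 threshold, and all function names and sample data are assumptions chosen for brevity; a learned matcher would replace them in practice.

```python
import math
import re
from collections import Counter


def shape(value):
    """Collapse a value to its format: letter runs -> 'a', digit runs -> '9'."""
    value = re.sub(r"[A-Za-z]+", "a", value)
    return re.sub(r"[0-9]+", "9", value)


def profile(values):
    """Count how often each format occurs among the sampled values."""
    return Counter(shape(v) for v in values)


def cosine(p, q):
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm = math.sqrt(sum(c * c for c in p.values())) * math.sqrt(
        sum(c * c for c in q.values())
    )
    return dot / norm if norm else 0.0


def detect_changes(old_sample, new_sample, threshold=0.5):
    """Match every column of the new sample to its most similar column in the
    old sample and report the columns whose match is missing or has moved."""
    old_profiles = {name: profile(vals) for name, vals in old_sample.items()}
    changes = {}
    for name, values in new_sample.items():
        p = profile(values)
        sims = {old: cosine(p, q) for old, q in old_profiles.items()}
        match = max(sims, key=sims.get)
        if sims[match] < threshold:
            changes[name] = "no matching column in previous sample"
        elif match != name:
            changes[name] = f"now resembles previous column '{match}'"
    return changes


# Invented samples: the source swapped two columns and added a new one.
old_sample = {
    "phone": ["206-555-0100", "305-555-0199"],
    "price": ["$250,000", "$310,000"],
    "city": ["Seattle", "Miami"],
}
new_sample = {
    "phone": ["$199,000", "$275,500"],
    "price": ["206-555-0142", "212-555-0101"],
    "city": ["Boston", "Austin"],
    "zip": ["98105"],
}

changes = detect_changes(old_sample, new_sample)
```

Here the swapped "phone" and "price" columns and the new "zip" column are reported, while "city" is left alone because its current values still match its earlier format.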

Matching Objects across Sources
The problem of deciding if two objects in two sources refer to the same real-world entity lies at the heart of the data integration enterprise. Previous solutions to this problem are unsatisfactory: they are largely ad hoc and as a result have limited applicability. I believe this problem closely resembles the schema mapping problem. Hence, I would like to develop a solution to object matching that builds upon recent advances in schema mapping (including my own work [2,11]) and utilizes techniques from machine learning and probabilistic reasoning.

Incorporating User Feedback
User feedback is critical to many tasks during the construction and maintenance of a data integration system, because of the inherent subjectivity of the tasks and the imperfection of learning techniques. My experience in dealing with user feedback [2] suggests that, unless handled properly, it can quickly become a serious bottleneck in building and maintaining a system. My goal therefore is to develop techniques to minimize necessary user feedback while maximizing the impact of the feedback. My approach is to build a single feedback loop from the user to the system, instead of a separate loop from the user to each of the tasks that requires user supervision. Users would give feedback only on the correctness of the answers to a selected set of queries. The system would then use the feedback to verify the correctness of the semantic mappings, the source schemas, and so on. I will investigate techniques from active learning and intelligent user interfaces for this purpose.
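
One way such a single feedback loop might propagate answer-level feedback down to individual mappings is sketched below. The sketch assumes that the system records which mappings each answer depended on; the FeedbackLoop class, the smoothed reliability estimate, the threshold, and the mapping names are all hypothetical.

```python
from collections import defaultdict


class FeedbackLoop:
    """Collects user verdicts on query answers and attributes them to the
    semantic mappings that were used to compute each answer."""

    def __init__(self):
        self.correct = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, mappings_used, answer_is_correct):
        for mapping in mappings_used:
            self.total[mapping] += 1
            if answer_is_correct:
                self.correct[mapping] += 1

    def reliability(self, mapping):
        # Laplace-smoothed fraction of correct answers involving this mapping.
        return (self.correct[mapping] + 1) / (self.total[mapping] + 2)

    def suspects(self, threshold=0.5):
        """Mappings that deserve re-learning or closer inspection."""
        return sorted(m for m in self.total if self.reliability(m) < threshold)


loop = FeedbackLoop()
loop.record({"price->list_price", "address->house_addr"}, True)
loop.record({"price->list_price", "phone->agent_location"}, False)
loop.record({"phone->agent_location"}, False)
loop.record({"price->list_price"}, True)

suspects = loop.suspects()
```

The user only judges answers; the mapping "phone->agent_location" is singled out because every answer that relied on it was marked incorrect. Choosing which queries to ask about is left open here and is where active learning techniques would apply.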

I plan to validate the above research ideas by applying them to the construction and maintenance of large-scale data integration systems on the Internet and in specific application domains, such as medicine, astronomy, and biology. To ensure the success of this project, I intend to work closely with researchers in the application domains, drawing on my substantial experience in interdisciplinary collaboration (applying AI planning techniques [14,1,6,13,12] to medical diagnosis problems [15,8,7,16]).

While the above research agenda focuses on data integration, it should have implications well beyond that context. Many of the problems that it investigates, such as schema mapping, object identification, schema integration, and user interaction, are fundamental issues in numerous data management and data mining applications. The agenda also necessitates a strong emphasis on extending current machine learning techniques to deal with novel learning problems brought about by data integration. For example, my work on schema mapping has extended classification methods to handle semi-structured data [2], and developed an efficient technique based on relaxation labeling to classify entities that are interrelated in complex ways [11]. Such techniques should also find application in many database and machine learning problems. Hence, in parallel with pursuing my research on data integration, I also plan to investigate its implications for other problems. For example, I have applied my thesis work to the problem of translating between ontologies on the Semantic Web [11], and plan to apply it to the problem of information extraction from text.

Further into the future, I intend to build on my work in data integration to investigate systems with more sophisticated capabilities, such as those that integrate online services, and those that perform peer-to-peer data sharing. Such distributed information processing systems should play an important role in transforming the global information bazaar into a vast knowledge base for humankind, unleashing a revolution of new possibilities. The goal of my research is to see this vision realized.

Bibliography

1
A. Doan.
Modeling probabilistic actions for practical decision-theoretic planning.
In Proc. of the 3rd Int. Conference on AI Planning Systems (AIPS), 1996.

2
A. Doan, P. Domingos, and A. Halevy.
Reconciling schemas of disparate data sources: A machine learning approach.
In Proc. of the ACM Conference on Management of Data (SIGMOD), 2001.

3
A. Doan, P. Domingos, and A. Halevy.
Learning complex mappings between database schemas. 2002.
To be submitted to the Conference on Very Large Databases (VLDB).

4
A. Doan, P. Domingos, and A. Levy.
Learning mappings between data schemas.
In Proc. of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000.

5
A. Doan, P. Domingos, and A. Levy.
Learning source descriptions for data integration.
In Proc. of the Third Int. Workshop on the Web and Databases (WebDB), 2000.

6
A. Doan and P. Haddawy.
Sound abstraction of probabilistic actions in the constraint mass assignment framework.
In Proc. of the 12th Nat. Conference on Uncertainty in AI (UAI), 1996.

7
A. Doan, P. Haddawy, and C. Kahn.
Decision-theoretic refinement planning: A new method for clinical decision analysis.
In Proc. of the 19th AMIA Annual Symposium on Computer Applications in Medical Care (SCAMC), 1995.

8
A. Doan, P. Haddawy, and C. Kahn.
Decision-theoretic planning for clinical decision analysis.
In Proc. of the Annual AI in Medicine Spring Symposium, 1996.

9
A. Doan and A. Halevy.
Efficiently ordering query plans for data integration.
In Proc. of the 18th IEEE Int. Conference on Data Engineering (ICDE), 2002. To appear.

10
A. Doan and A. Levy.
Efficiently ordering query plans for data integration.
In Proc. of the IJCAI-99 Workshop on Intelligent Information Integration, 1999.

11
A. Doan, J. Madhavan, P. Domingos, and A. Halevy.
Learning to map between ontologies on the semantic web. 2002.
Submitted to the World-Wide Web Conference (WWW).

12
V. Ha, A. Doan, V. Vu, and P. Haddawy.
Geometric foundations for interval-based probabilities.
Annals of Mathematics and Artificial Intelligence, 24, 1998.

13
P. Haddawy and A. Doan.
Abstracting probabilistic actions.
In Proc. of the 10th Conference on Uncertainty in AI (UAI), 1994.

14
P. Haddawy, A. Doan, and R. Goodwin.
Efficient decision-theoretic planning: Techniques and empirical analysis.
In Proc. of the 11th Nat. Conference on Uncertainty in AI (UAI), 1995.

15
P. Haddawy, A. Doan, and C.E. Kahn.
Decision-theoretic refinement planning in medical decision making: Management of acute deep venous thrombosis.
Journal of Medical Decision Making, 1996.

16
C. Kahn, A. Doan, and P. Haddawy.
Management of acute deep venous thrombosis of the lower extremities (abstract).
In American Roentgen Ray Society Meeting, 1996.