Dissertation: Learning to Map between Structured Representations of Data

University of Washington-Seattle, 2002.

This dissertation studies representation matching: the problem of creating semantic mappings between two data representations. Examples of data representations are relational schemas, ontologies, and XML DTDs. Examples of semantic mappings include ``element location of one representation maps to element address of the other'', ``contact-phone maps to agent-phone'', and ``listed-price maps to price * (1 + tax-rate)''.

Representation matching lies at the heart of a broad range of information management applications. Virtually any application that manipulates data in different representation formats must establish semantic mappings between the representations to ensure interoperability. Prime examples of such applications arise in data integration, data warehousing, data mining, e-commerce, bioinformatics, knowledge-base construction, and information processing on the World-Wide Web and the emerging Semantic Web. Today, representation matching is still mainly conducted by hand, in an extremely labor-intensive and error-prone process. The prohibitive cost of representation matching has become a key bottleneck hindering the deployment of a wide variety of information management applications.

In this dissertation we describe solutions for semi-automatically creating semantic mappings. We present three systems that deal with successively more expressive data representations and mapping classes. Two systems, LSD and GLUE, find one-to-one mappings such as ``address = location'' in the context of data integration and ontology matching, respectively. The third system, COMAP, finds more complex mappings such as ``name = concatenation(first-name, last-name)''. The key idea underlying these three systems is the incorporation of multiple types of knowledge and multiple machine learning techniques into all stages of the mapping process, with the goal of maximizing mapping accuracy. We present experiments on real-world data that validate the proposed solutions. Finally, we discuss how the solutions generalize previous work in databases and AI on creating semantic mappings.