7
Topic Maps and XTM
Topic maps are designed to facilitate the management and navigation of large quantities of information. They achieve this by providing a "navigation layer" over a set of information resources that is independent of the form and format of the resources themselves. A topic map consists of topics (which are information objects representing specific subjects of interest), associations (which represent relationships among those subjects), and occurrences (information resources that are relevant to those subjects in any way). Through these simple constructs, the topic map enables the navigation of information resources based on their subject matter, and the relationships between various subjects, rather than on their format, structure, or even specific content.
A topic map may be a virtual object, existing in memory within a topic map application, or it may be made persistent for the purposes of storage or exchange. The persistent form of a topic map can be an XML document conforming to the XTM (XML Topic Maps) document type definition. Because the topics, associations, and occurrence information are carried in a document that is external to the resources themselves, the topic map can be constructed without editing or touching the resources in any way. This means that anybody can create a topic map, without requiring read-write access to the files that carry the information the topic map relates to. It is the topic map author who determines which subjects and relationships are of interest, and which specific resources are occurrences of each topic. In this sense, a topic map expresses someone's conceptualization and categorization of an information set.
Any number of different topic maps can be created for a single set of resources. Furthermore, topic maps can be merged, which means that responsibility for developing a topic map can be delegated to several different people, each of whom produces a partial topic map and makes a contribution to the composite world view that the merged topic map represents. In this way, topic maps can be used as an invaluable repository of "corporate knowledge". Topic maps thus provide a powerful bridge between knowledge representation and information management. They enable the information carried by a set of information resources to be categorized and structured, and also enriched by the implicit knowledge that went into the identification of the topics themselves and the relationships between them.
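To make the idea of merging concrete, here is a minimal sketch in Python. It assumes that two topics represent the same subject when they carry the same identifier string; the `Topic` class, the `subject_id` field, the `merge_maps` function, and the file paths are all names invented for this illustration, and real topic map software applies richer merging rules than this.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One subject; `subject_id` is a hypothetical key identifying that subject."""
    subject_id: str
    names: set[str] = field(default_factory=set)
    occurrences: set[str] = field(default_factory=set)


def merge_maps(*maps: dict[str, Topic]) -> dict[str, Topic]:
    """Combine partial topic maps, unifying topics that share a subject_id."""
    merged: dict[str, Topic] = {}
    for partial in maps:
        for subject_id, topic in partial.items():
            # Reuse the topic already created for this subject, if any.
            target = merged.setdefault(subject_id, Topic(subject_id))
            target.names |= topic.names
            target.occurrences |= topic.occurrences
    return merged


# Two authors describe overlapping subjects independently (placeholder data).
map_a = {"glasgow": Topic("glasgow", {"Glasgow"}, {"articles/dome-drawings.html"})}
map_b = {
    "glasgow": Topic("glasgow", {"City of Glasgow"}, {"articles/clyde.html"}),
    "greenwich": Topic("greenwich", {"Greenwich"}, {"articles/dome-closure.html"}),
}
merged = merge_maps(map_a, map_b)
```

After the merge there is a single Glasgow topic carrying both authors' names and both occurrences, while the input maps are left unchanged.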
While the primary purpose of topic maps is to facilitate the management, organization, navigation, and retrieval of information from large pools of disparate interconnected resources, a number of subsidiary purposes can also be identified, including:
- Supporting diversity of language, terminology and viewpoint, while still allowing the common meaning of information to be traceable.
- Breaking down barriers to information access by enabling information retrieval across information resources independently of format.
- Providing a robust but flexible underpinning for the creation of indexes and thesauri, in order to help users find the information they need.
- Allowing information from different sources to be meaningfully brought together and merged, while making sense of the relationships between the information contained therein.
In this chapter we will look at:
- What topic maps are and how they can be used.
- The XML topic map syntax, XTM.
- Creating, processing, and merging topic maps.
How Topic Maps Achieve their Purpose
In order to provide examples of some of the concepts we will be considering in this chapter, we shall draw on a short passage from a newspaper article (from the English newspaper, The Sunday Times of 6 May 2001). The passage is as follows:
- "Newly unearthed drawings have shown that Charles Rennie Mackintosh, one of Britain's greatest architects of the early 20th century, made plans for a dome in Glasgow resembling the ill-fated attraction in Greenwich that closed at the end of last year. The designs by Mackintosh, who died in 1928, lay neglected in an archive at Glasgow University until they were rediscovered earlier this year. They have surprised the Millennium Dome's architects."
There are a number of approaches that have traditionally been applied to the problem of helping readers find an article such as this one. One is the provision of a full-text search engine, allowing users to find the article based on the words it contains. Another is to categorize the article according to its principal themes and build a subject index in which we will find this article referenced from each of those themes. A third approach is to attach metadata tags to the article itself, and build a search engine that queries the content of those tags.
The topic map approach is to identify the subjects of interest within the passage, and to build a map containing a topic representing each of those subjects of interest. The map is then enriched with further objects, known as associations, representing relationships between the subjects. The article itself is considered to be an occurrence of each of the topics drawn from it. When the same subject occurs in several articles, there need only be one topic object representing that subject, and all the relevant articles will be identified as occurrences of that one topic.
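The following Python sketch shows one way this approach might look for our sample passage. The topic identifiers, the relationship labels, the helper `add_occurrence`, and both article addresses are invented for illustration (the second article is purely hypothetical); nothing here is prescribed by the topic map standard.

```python
from collections import defaultdict

ARTICLE = "articles/2001-05-06-mackintosh-dome.html"   # hypothetical address
OTHER_ARTICLE = "articles/another-mackintosh-piece.html"  # hypothetical second article

topics: dict[str, str] = {}                            # topic id -> display name
occurrences: dict[str, set[str]] = defaultdict(set)    # topic id -> resource addresses
associations: list[tuple[str, str, str]] = []          # (topic id, relationship, topic id)


def add_occurrence(topic_id: str, name: str, resource: str) -> None:
    """Create the topic on first sight; afterwards only record the new occurrence."""
    topics.setdefault(topic_id, name)
    occurrences[topic_id].add(resource)


# Subjects of interest identified in the passage.
for topic_id, name in [
    ("mackintosh", "Charles Rennie Mackintosh"),
    ("glasgow", "Glasgow"),
    ("millennium-dome", "Millennium Dome"),
    ("greenwich", "Greenwich"),
]:
    add_occurrence(topic_id, name, ARTICLE)

# The same subject in another article reuses the existing topic.
add_occurrence("mackintosh", "Charles Rennie Mackintosh", OTHER_ARTICLE)

# Relationships between subjects, stated in the passage.
associations.append(("millennium-dome", "located-in", "greenwich"))
associations.append(("mackintosh", "made-plans-for-a-dome-in", "glasgow"))
```

Note that the Mackintosh topic exists only once, even though two articles are recorded as its occurrences.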
This mechanism makes it possible to travel from a particular article to any of the topics of which it is an occurrence, then via associations involving that topic to other topics that are related to it in any way, and then to other articles that are occurrences of those topics. As an example of such navigation, suppose a reader notes from the article above that the Millennium Dome (referred to above as the "ill-fated attraction") is in Greenwich. They might then discover an association in the topic map indicating that Greenwich is a borough within London, and look for articles that are occurrences of the London topic. Or they might go further and look for associations involving architecture and London, and this might take them to the topic for the British Library, and then to occurrences of that topic, which will be articles about the British Library. This navigation scenario does not rely on there being any words in common between the original article and the one about the British Library, nor on there being any hyperlinks between the two articles, nor on the articles having been categorized in the same way in any index. It is sufficient that there is a chain of relationships, captured in association objects within the topic map, that leads from the Millennium Dome to the British Library (in this case via Greenwich, London, and architecture).
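The navigation just described can be sketched as a search over associations. In the Python below, each association is simplified to an unlabeled pair of topic identifiers, and the function `find_chain`, the identifiers, and the article address are all assumptions made for this illustration; a real topic map would also record the type of each association and the roles its members play.

```python
from collections import deque

# Associations reduced to pairs of topic ids (a simplification for this sketch).
ASSOCIATIONS = [
    ("millennium-dome", "greenwich"),
    ("greenwich", "london"),
    ("london", "architecture"),
    ("architecture", "british-library"),
]

# Occurrences of each topic; the address is a placeholder.
OCCURRENCES = {"british-library": ["articles/british-library.html"]}


def find_chain(start, goal, associations):
    """Breadth-first search for a chain of associations linking two topics."""
    neighbours = {}
    for a, b in associations:
        # Associations can be followed in either direction.
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of associations connects the two topics


chain = find_chain("millennium-dome", "british-library", ASSOCIATIONS)
related_articles = OCCURRENCES.get(chain[-1], []) if chain else []
```

The search never looks at the text of any article: it relies only on the associations, which is exactly why no shared words or hyperlinks are needed.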
Reification
We have said that a topic is an information object that represents a subject. The topic exists within the computer. It can be manipulated and interpreted by software that is designed to handle information objects of that type. But what the topic represents will in many cases not be accessible to the computer at all. It is a subject of interest, such as Charles Rennie Mackintosh, the University of Glasgow, the recently discovered drawings of a design for a dome, or the year 1928 (in which, we are told in our sample article, Charles Rennie Mackintosh died).
If the drawings of the dome design had been produced in 1998, they might well have been in an electronic form, and then they would indeed be directly accessible to a computer system, and software designed to handle information objects of that type would be able to manipulate and interpret them. A set of drawings in electronic form is directly accessible to, and addressable by, a computer system. However, a set of drawings on paper, a person, a building, or a calendar year is not directly accessible to the computer. Things that are directly accessible to a computer system are known as resources. Everything else in the world (people, places, physical objects, organizations, abstract concepts, and so on) is known in the language of topic maps as a non-addressable subject. There is a world of difference between resources and non-addressable subjects, but these two worlds, the computable and the non-computable, are bridged within the topic map by the fact that both may be represented by the same kind of information object: the topic.
The topic is said to reify (make real) its subject. It makes the subject accessible to the computer and thus enables the computer system to manipulate it in various ways. Once we have created a topic object to reify a subject, we can associate additional information with the topic object, linking it with other such objects in the complex pool of information that constitutes our topic map, and whose structure reflects the structure of those aspects of the world we wish our topic map to convey.
There are five things we can add to a topic object in order to allow it to be used and manipulated in useful ways:
- We can assign one or more names to it.
- We can identify resources that are its occurrences.
- We can identify the relationships that it has with other topics, and the role that it plays in these relationships.
- In the special case where the topic's subject is a resource, we can identify the resource that is the subject of the topic.
- Whether the topic has a resource as its subject, or in the more usual situation where the topic has a non-addressable subject, we can identify one or more resources that act as subject indicators for the topic. A subject indicator is defined as a resource (an object directly accessible to the computer) whose content indicates what the subject of the topic is.
The last two items in the above list deal with the identity of the topic. They provide the bridge between the topic map and the subjects that the topics represent. It is these that make the topics intelligible to human beings, or to computer systems beyond the topic map system itself.
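As a rough illustration of how topic identity might be written down, the Python sketch below builds a small XTM-style document with the standard library's `xml.etree.ElementTree`. The element names and namespace are those of XTM 1.0 as we understand them and should be checked against the specification; all `example.org` addresses, the topic identifiers, and the helpers `q` and `ref` are placeholders invented for this example. Associations are omitted to keep the sketch short.

```python
import xml.etree.ElementTree as ET

XTM = "http://www.topicmaps.org/xtm/1.0/"   # XTM 1.0 namespace (verify against the spec)
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("", XTM)
ET.register_namespace("xlink", XLINK)


def q(ns, tag):
    """Build a namespace-qualified name in ElementTree's {uri}tag notation."""
    return f"{{{ns}}}{tag}"


def ref(parent, tag, href):
    """Append a reference element that points at a resource via xlink:href."""
    element = ET.SubElement(parent, q(XTM, tag))
    element.set(q(XLINK, "href"), href)
    return element


topic_map = ET.Element(q(XTM, "topicMap"))

# A non-addressable subject (a person), identified through a subject indicator.
person = ET.SubElement(topic_map, q(XTM, "topic"), {"id": "mackintosh"})
identity = ET.SubElement(person, q(XTM, "subjectIdentity"))
ref(identity, "subjectIndicatorRef", "http://example.org/people/mackintosh")
base_name = ET.SubElement(person, q(XTM, "baseName"))
ET.SubElement(base_name, q(XTM, "baseNameString")).text = "Charles Rennie Mackintosh"
occurrence = ET.SubElement(person, q(XTM, "occurrence"))
ref(occurrence, "resourceRef", "http://example.org/articles/dome-drawings.html")

# The special case: a topic whose subject is itself a resource (the article).
article = ET.SubElement(topic_map, q(XTM, "topic"), {"id": "dome-article"})
article_identity = ET.SubElement(article, q(XTM, "subjectIdentity"))
ref(article_identity, "resourceRef", "http://example.org/articles/dome-drawings.html")

xml_text = ET.tostring(topic_map, encoding="unicode")
```

The same article address appears twice with different meanings: as an occurrence of the Mackintosh topic, and as the subject of the second topic. The subject indicator, by contrast, is a resource that merely describes who Mackintosh is.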