XML Data Management: Native XML and XML-Enabled Database Systems

Paperback (Print)

Overview

"This is an excellent book that combines a practical and analytical look at the subject."

—Leo Korman, Principal Software Engineer, KANA Software

As organizations begin to employ XML within their information-management and exchange strategies, data management issues pertaining to storage, retrieval, querying, indexing, and manipulation increasingly arise. New information-modeling challenges appear as well. XML Data Management, with its contributions from experts at the forefront of the XML field, addresses these key issues and challenges, offering insights into the advantages and drawbacks of various XML solutions, best practices for modeling information with XML, and guidance on developing custom, in-house solutions.

In this book, you will find discussions on the newest native XML databases, along with information on working with XML-enabled relational database systems. In addition, XML Data Management thoroughly examines benchmarks and analysis techniques for performance of XML databases.

Topics covered include:

  • The power of good grammar and style in modeling information to alleviate the need for redundant domain knowledge
  • Tamino's XML storage, indexing, querying, and data access features
  • The features and APIs of open source eXist
  • Berkeley DB XML's ability to store XML documents natively
  • IBM's DB2 Universal Database and its support for XML applications
  • Xperanto's method of addressing information integration requirements
  • Oracle's XMLType for managing document-centric XML documents
  • Microsoft SQL Server 2000's support for exporting and importing XML data
  • A generic architecture for storing XML documents in a relational database
  • XOO7, XMach-1, XMark, and other benchmarks for evaluating XML database performance

Numerous case studies demonstrate real-world problems, industry-tested solutions, and creative applications of XML data management solutions.

Written for both XML and relational database professionals, XML Data Management provides a promising new approach to data management, one that is sure to positively impact the way organizations manage and exchange information.


Product Details

  • ISBN-13: 9780201844528
  • Publisher: Addison-Wesley
  • Publication date: 3/7/2003
  • Pages: 688
  • Product dimensions: 7.32 (w) x 8.97 (h) x 1.65 (d)

Meet the Author

Akmal B. Chaudhri works for IBM developerWorks, where he is also Zone Editor for Special Projects. A recognized authority on objects and databases, he has been a regular presenter at many international conferences, including OOPSLA and Object World. In addition, he has edited several books on these topics.

Awais Rashid is a Lecturer in the Computing Department of Lancaster University in the U.K. where he leads research into the application of new technologies, such as XML and aspect-oriented programming, and database systems. He has actively published on these topics and has organized a number of relevant international events.

Roberto Zicari is a full Professor for Databases and Information Systems at the Johann Wolfgang Goethe University in Frankfurt/Main, Germany. He is an internationally recognized expert in Object Technology. He has consulted and lectured in Europe, North America, and Japan.


Read an Excerpt

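The past few years have seen a dramatic increase in the popularity and adoption of XML, the Extensible Markup Language. This explosive growth is driven by its ability to provide a standardized, extensible means of including semantic information within documents describing semi-structured data. This makes it possible to address the shortcomings of existing markup languages such as HTML and support data exchange in e-business environments.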

Consider, for instance, the simple HTML document in Listing P.1. The data contained in the document is intertwined with information about its presentation. In fact, the tags describe only how the data is to be formatted. There is no semantic information that the data represents a person's name and address. Consequently, an interpreter cannot make any sound judgments about the semantics as the tags could as well have enclosed information about a car and its parts. Systems such as WIRE (Aggarwal et al. 1998) can interpret the information by using search templates based on the structure of HTML files and the importance of information enclosed in tags defining headings and so forth. However, such interpretation lacks soundness, and its accuracy is context dependent.

Listing P.1 An HTML Document with Data about a Person

<html>
<head>

<title>Person Information</title>
</head>
<body>
<p> <b>Name: </b>John Doe</p>
<p> <b>Address: </b>10 Church Street, Lancaster LAX 2YZ,
UK</p>
</body>
</html>

Dynamic Web pages, where the data resides in a backend database and is served using predefined templates, reduce the coupling between the data and its representation. However, the semantics of the data can still be confusing when exchanging information in an e-business environment. A particular item could be represented using different names (in the simplest case) in two systems in a business-to-business transaction. This enforces adherence to complex, often proprietary, document standards.
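
XML provides inherent support for addressing the above problems, as the data in an XML document is self-describing. However, the increasing adoption of XML has also raised new challenges. One of the key issues is the management of large collections of XML documents. There is a need for tools and techniques for effective storage, retrieval, and manipulation of XML data. The aim of this book is to discuss the state-of-the-art in such tools and techniques.

This preface introduces the basics of XML and some related technologies before moving on to providing an overview of issues relating to XML data management and approaches addressing these issues. Only an overview of XML and related technologies is provided because several other sources cover these concepts in depth.

P.1 What Is XML?

XML is a W3C standard for document markup. It makes it possible to define custom tags describing the data enclosed by them. An example XML document containing data about a person is shown in Listing P.2. Note that tags in XML can have attributes. However, for simplicity, they have not been used in this example.

Listing P.2 An XML Document with Data about a Person

<?xml version="1.0"?>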
<person>
<name>
<surname>Doe</surname>
<firstname>John</firstname>
</name>
<address>
<housenumber>10</housenumber>
<street>Church Street</street>
<town>Lancaster</town>
<postcode>LAX 2YZ</postcode>
<country>UK</country>
</address>
</person>

Unlike the HTML document in Listing P.1, the document in Listing P.2 contains only the data about the person and no representational information. The data and its meaning can be read from the document and the document formatted in a range of fashions as desired. One standard approach is to use XSL, the eXtensible Stylesheet Language.
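
Because the tags in Listing P.2 name the data they enclose, a program can read the fields by name rather than by position or formatting. The following is a minimal sketch, not part of the original text, that assumes Python's standard `xml.etree.ElementTree` module; the helper name `read_person` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The person document from Listing P.2, inlined so the sketch is self-contained.
PERSON_XML = """<person>
  <name>
    <surname>Doe</surname>
    <firstname>John</firstname>
  </name>
  <address>
    <housenumber>10</housenumber>
    <street>Church Street</street>
    <town>Lancaster</town>
    <postcode>LAX 2YZ</postcode>
    <country>UK</country>
  </address>
</person>"""


def read_person(xml_text):
    """Return the person's leaf fields as a flat dict keyed by tag name."""
    root = ET.fromstring(xml_text)
    fields = {}
    for group in root:          # <name> and <address>
        for leaf in group:      # <surname>, <street>, ...
            fields[leaf.tag] = leaf.text
    return fields


person = read_person(PERSON_XML)
print(person["firstname"], person["surname"])
print(person["town"], person["country"])
```

Nothing in the sketch depends on how the data might later be displayed, which is the separation the paragraph above describes.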
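
The flexible nature of XML makes it an ideal basis for defining arbitrary languages. One such example is WML, the Wireless Markup Language. Similarly, the XML schema language used to describe the structure of XML documents is based on XML itself.

P.1.1 Well-Formed and Valid XML

Although XML syntax is flexible, it is constrained by a grammar that governs the permitted tag names, attachment of attributes to tags, and so on. All XML documents must conform to these basic grammar rules. Such conformant documents are said to be well formed and can be interpreted by an XML interpreter, which means it's not necessary to write an interpreter for each XML document instance.

In addition to being well formed, the structure of a particular XML document can be validated against a Document Type Definition (DTD) or an XML schema. An XML document conforming to a given DTD or schema is said to be valid.

P.1.2 Data-Centric and Document-Centric XML

XML documents can be classified on the basis of data they contain. Data-centric documents capture structured data such as that pertaining to a product catalog, an order, or an invoice. Document-centric documents, on the other hand, capture unstructured data as in articles, books, or e-mails. Of course, the two types can be combined to form hybrid documents that are both data-centric and document-centric. Listings P.3 and P.4 provide examples of data-centric and document-centric XML, respectively.

Listing P.3 Data-Centric XML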
<order>
<customer>Doe</customer>
<position>
<isbn>1-234-56789-0</isbn>
<number>2</number>
<price currency="UKP">30.00</price>
</position>
</order>
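
The regular structure of a data-centric document such as Listing P.3 lends itself to direct computation. The sketch below is illustrative only: it assumes that `number` is a quantity and `price` a per-unit price, which the listing does not state, and it uses Python's standard `xml.etree.ElementTree` and `decimal` modules.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The order from Listing P.3, inlined so the sketch is self-contained.
ORDER_XML = """<order>
<customer>Doe</customer>
<position>
<isbn>1-234-56789-0</isbn>
<number>2</number>
<price currency="UKP">30.00</price>
</position>
</order>"""

root = ET.fromstring(ORDER_XML)

# Assumption: each <position> contributes quantity * unit price to the total.
total = Decimal("0")
for position in root.findall("position"):
    quantity = int(position.findtext("number"))
    unit_price = Decimal(position.findtext("price"))
    total += quantity * unit_price

# The currency is carried as an attribute on <price>.
currency = root.find("position/price").get("currency")
print(total, currency)
```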
Listing P.4 Document-Centric XML
<content>
XML builds on the principles of two
existing languages, <em>HTML</em>
and <em>SGML</em> to create a simple
mechanism . . .
The generalized markup concept . . .
</content>
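
P.2 XML Concepts

This section provides an overview of basic XML concepts: DTDs, XML schemas, DOM, and SAX.

P.2.1 DTDs and XML Schemas

Both DTDs and XML schemas are mechanisms used to define the structure of XML documents. They determine what elements can be contained within the XML document, how they are to be used, what default values their attributes can have, and so on. Given a DTD or XML schema and its corresponding XML document, a parser can validate whether the document conforms to the desired structure and constraints. This is particularly useful in data exchange scenarios as DTDs and XML schemas provide and enforce a common vocabulary for the data to be exchanged.

XML DTDs are subsets of SGML (Standard Generalized Markup Language) DTDs. An XML DTD lists the various elements and attributes in a document and the context in which they are to be used. It can also list any elements a document cannot contain. However, it does not define constraints such as the number of instances of a particular element within a document, the type of data within each element, and so on. Consequently, DTDs are inherently suitable for document-centric XML as compared to data-centric XML because data-typing and instantiation constraints are less critical in the former case. However, they can be and are being used for both types of documents.

Listing P.5 shows a DTD for the simple XML document in Listing P.2. It describes which primitive elements form valid components for the three composite ones: person, name, and address. The keyword #PCDATA signifies that the element does not contain any tags or child elements and only parsed character data.

Listing P.5 A DTD for the Simple XML Document in Listing P.2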
<!ELEMENT person (name, address)>
<!ELEMENT name (surname, firstname)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT address (housenumber, street, town, postcode, country)>
<!ELEMENT housenumber (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT town (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT country (#PCDATA)>

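XML schemas differ from DTDs in that the XML schema definition language is based on XML itself. As a result, unlike DTDs, the set of constructs available for defining an XML document is extensible. XML schemas also support namespaces and richer and more complex structures than DTDs. In addition, stronger typing constraints on the data enclosed by a tag can be described because a range of primitive data types such as string, decimal, and integer are supported. This makes XML schemas highly suitable for defining data-centric documents. Another significant advantage is that XML schema definitions can exploit the same data management mechanisms as designed for XML; an XML schema is an XML document itself. This is in direct contrast with DTDs, which require specific support to be built into an XML data management system.

Listing P.6 shows an XML schema for the simple XML document in Listing P.2. The sequence tag is a compositor indicating an ordered sequence of subelements. There are other compositors for choice and all. Also, note that, as shown for the address element, it is possible to constrain the minimum and maximum instances of an element within a document. Although not shown in the example, it is possible to define custom complex and simple types. For instance, a complex type Address could have been defined for the address element.

Listing P.6 An XML Schema for the Simple XML Document in Listing P.2
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">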
<xs:element name="person">
<xs:complexType>
<xs:sequence>
<xs:element name="name">
<xs:complexType>
<xs:sequence>
<xs:element name="surname" type="xs:string"/>
<xs:element name="firstname" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="address" minOccurs="0" maxOccurs="1">
<xs:complexType>
<xs:sequence>
<xs:element name="housenumber" type="xs:integer"/>
<xs:element name="street" type="xs:string"/>
<xs:element name="town" type="xs:string"/>
<xs:element name="postcode" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
P.2.2 DOM and SAX
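
DOM and SAX are the two main APIs for manipulating XML documents in an application. They are now part of the Java API for XML Processing (JAXP version 1.1). DOM is the W3C standard Document Object Model, a model for storing and manipulating hierarchical documents in memory that is independent of operating system and programming language. A DOM parser parses an XML document and builds a DOM tree, which can then be used to traverse the various nodes. However, the tree has to be constructed before traversal can commence. As a result, memory management is an issue when manipulating large XML documents. Building the full tree is highly resource intensive, especially in cases where only a small section of the document is to be manipulated.

SAX, the Simple API for XML, is a de facto standard. It differs from DOM in that it uses an event-driven model. Each time a starting or closing tag, or processing instruction is encountered, the program is notified. As a result, the whole document does not need to be parsed before it is manipulated. In fact, sections of the document can be manipulated as they are parsed. Therefore, SAX is better suited to manipulating large documents as compared to DOM.

P.3 XML-Related Technologies

This section describes some of the technologies related to XML, namely XPath, XSL, and SOAP.

P.3.1 XPath

XPath, the XML Path Language, provides common syntax and semantics for locating and linking to information contained within an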

  1. A hierarchical fashion based on the ordering of elements in a document tree
  2. An arbitrary manner relying on elements in a document tree having unique identifiers

A few example XPath expressions, based on the sample

Listing P.7 Example XPath Expressions
1. select="firstname"
2. select="name/surname"
3. match="name address"
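
The first two path expressions in Listing P.7 can be tried against the person document from Listing P.2. The following is a rough sketch using Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath; it is an illustration added here, and the abbreviated input document is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET

# Abbreviated form of the person document from Listing P.2.
root = ET.fromstring(
    "<person><name><surname>Doe</surname>"
    "<firstname>John</firstname></name></person>"
)

# "name/surname" evaluated with <person> as the context node.
surname = root.find("name/surname").text

# "firstname" evaluated with <name> as the context node.
name = root.find("name")
firstname = name.find("firstname").text

print(firstname, surname)
```

Both lookups are hierarchical: each step in the path descends one level in the document tree from the current context node.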
P.3.2 XSL

Since an

XSL FO provides formatting and flow semantics for rendering an

Listing P.8 An XSL Style Sheet for the

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<html>
<head><title>Person Information</title></head>

<body>
<xsl:apply-templates select="person/name"/>
</body>
</html>
</xsl:template>
<xsl:template match="name">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="surname">
<p><b><xsl:text>Surname: </xsl:text></b>
<xsl:value-of select="."/></p><br/>
</xsl:template>
<xsl:template match="firstname">
<p><b><xsl:text>First name: </xsl:text></b>
<xsl:value-of select="."/></p>
</xsl:template>
</xsl:stylesheet>
Listing P.9 HTML Resulting from the Transformation in Listing P.8

<html>
<head>
<title>Person Information </title>
</head>
<body>
<p>
<b>Surname: </b>Doe
</p>
<br>
<p>
<b>First name: </b>John
</p>
</body>
</html>
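
Python's standard library has no XSLT processor, so the sketch below does not run the style sheet in Listing P.8. It is a hand-written approximation of the same mapping, added for illustration: each `surname` and `firstname` element under `name` becomes a labelled paragraph. The names `LABELS` and `render` are hypothetical, and the `<br>` element from Listing P.9 is omitted.

```python
import xml.etree.ElementTree as ET

# Abbreviated form of the person document from Listing P.2.
PERSON_XML = (
    "<person><name><surname>Doe</surname>"
    "<firstname>John</firstname></name></person>"
)

# One label per element name, mirroring the two leaf templates in Listing P.8.
LABELS = {"surname": "Surname: ", "firstname": "First name: "}


def render(xml_text):
    """Build an HTML page for a person document, one paragraph per name part."""
    root = ET.fromstring(xml_text)
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = "Person Information"
    body = ET.SubElement(html, "body")
    for child in root.find("name"):
        if child.tag in LABELS:
            paragraph = ET.SubElement(body, "p")
            bold = ET.SubElement(paragraph, "b")
            bold.text = LABELS[child.tag]
            bold.tail = child.text  # the value follows the bold label
    return ET.tostring(html, encoding="unicode", method="html")


page = render(PERSON_XML)
print(page)
```

The point of the comparison is that the XML input carries no formatting at all; the presentation lives entirely in the transformation step.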
P.3.3 SOAP
SOAP is the Simple Object Access Protocol used to invoke code over the Internet using
P.4
So far, we have discussed the basics of
Database vendors have responded to these new data and information management needs. Most commercial relational, object-relational, and object-oriented database systems offer extensions and plug-ins and other mechanisms to support the management of
With the numerous approaches and solutions available in the market, organizations and system developers with
  • What are the various
  • What are the features, services, and tools offered by these different
  • How can an in-house, custom solution be developed instead of using a commercially available system?
  • Which
  • Are there any good practice and domain or application-specific guidelines for information modeling with
  • Are there other examples and applications of

This book is intended to be a support mechanism to address the above challenges. It provides a discussion of the various

P.5 How This Book Is Organized

This book is divided into five parts, each containing a coherent and closely related set of chapters. The parts are self-contained and can be read in any order. The five parts are as follows:

Part I: Introduction

Part II: Native XML Databases

Part III: XML and Relational Databases

Part IV: Applications of XML

Part V: Performance and Benchmarks
The parts are summarized in the sections that follow.
P.5.1 Part I: Introduction

This part contains a chapter that focuses on guidelines for achieving good grammar and style when modeling information using

P.5.2 Part II: Native XML Databases

Two native

In a similar fashion, Chapter 3 by Meier introduces the various features and APIs of the Open Source system eXist. However, in contrast with Chapter 2, the main focus is on how query processing works within the system. As a result, the author provides deeper insight into its indexing and storage architectures. Together both chapters offer a balanced discussion, both on high-level application-programming features of the two systems and underlying indexing and storage mechanisms pertaining to efficient query processing.

Finally in Chapter 4, we have included an example of an embedded

P.5.3 Part III: XML and Relational Databases

This part provides an interesting mix of products and approaches to

Chapter 5 by Benham highlights the technology and architecture of

In Chapter 6, Hohenstein discusses similar features in Oracle9i: the use of Oracle's CLOB functionality and OracleText cartridge, for handling data-centric

In Chapter 7, Rys covers a feature set, similar to the ones in Chapters 5 and 6, for MS SQL Server 2000. He focuses on scenarios involving exporting and importing structured

Edwards describes a generic architecture for storing

While Edwards' architecture is aimed at supporting the traditional relational database programmer, Brown's approach seeks to exploit the advanced features offered by the object-relational model and respective extensions of most relational database systems. He discusses object-relational schema design based on introducing into the DBMS core types and operators equivalent to the ones standardized in

P.5.4 Part IV: Applications of XML

This part presents several applications and case studies in

In Chapter 10, Direen and Jones discuss various challenges in bioinformatics data management and the role of

Kowalski presents two case studies involving

Chapter 12, by Eglin, Hendra, and Pentakalos, describes the design and implementation of the JEDMICS Open Access Interface, an EJB-based API that provides access to image data stored on a variety of storage media and metadata stored in a relational database. The JEDMICS system uses

In Chapter 13, Wilson and her coauthors offer insight into the use of

Rine sketches his vision of an Interstellar Space Wide Web in Chapter 14. He contrasts the issues relating to the development and deployment of such a facility with the problems encountered in today's World Wide Web. He mainly focuses on adapters as configuration mechanisms for large-scale, next-generation distributed systems and as the means to increase the reusability of software components and architectures in this context. His approach to solving the problem is a configuration model and network-aware runtime environment called Space Wide Web Adapter Configuration eXtensible Markup Language (SWWAC

In Chapter 15, Meo and Psaila present an

Chapter 16, the last chapter in this part, describes Baril's and Bellahsene's experiences in designing and managing an

P.5.5 Part V: Performance and Benchmarks

Chapter 17


Table of Contents

Preface.

Acknowledgments.

I. WHAT IS XML?

1. Information Modeling with XML.

Introduction.

XML as an Information Domain.

How XML Expresses Information.

Patterns in XML.

Common XML Information-Modeling Pitfalls.

Attributes Used as Data Elements.

Data Elements Used as Metadata.

Inadequate Use of Tags.

A Very Simple Way to Design XML.

Conclusion.

II. NATIVE XML DATABASES.

2. Tamino: Software AG's Native XML Server.

Introduction.

Tamino Architecture and APIs.

XML Storage.

Collections and Doctypes.

Schemas.

Access to Other Databases: Tamino X-Node.

Mapping Data to Functions: Tamino X-Tension.

Internationalization Issues.

Indexing.

Organization on Disk.

Querying XML.

Query Language: Tamino X-Query.

Sessions and Transactions.

Handling of Results.

Query Execution.

Tools.

Database Browsing.

Schema Editing.

WebDAV Access.

X-Application.

Full Database Functionality.

Conclusion.

3. eXist Native XML Database.

Introduction.

Features.

Schema-less XML Data Store.

Collections.

Index-Based Query Processing.

Extensions for Full-Text Searching.

System Architecture Overview.

Pluggable Storage Backends.

Deployment.

Application Development.

Getting Started.

Query Language Extensions.

Specifying the Input Document Set.

Querying Text.

Outstanding Features.

Application Development.

Programming Java Applications with the XML:DB API.

Accessing eXist with SOAP.

Integration with Cocoon.

Technical Background.

Approaches to Query Execution.

Indexing Scheme.

Index and Storage Implementation.

Query Language Processing.

Query Performance.

Conclusion.

4. Embedded XML Databases.

Introduction.

A Primer on Embedded Databases.

Embedded XML Databases.

Building Applications for Embedded XML Databases.

Overview of Berkeley DB XML.

Configuration.

Indexing and Index Types.

XPath Query Processing.

Programming for Transactions.

Two-Phase Locking and Deadlocks.

Reducing Contention.

Checkpoints.

Recovery Processing after Failures.

Conclusion.

III. XML AND RELATIONAL DATABASES.

5. IBM XML-Enabled Data Management Product Architecture and Technology.

Introduction.

Product and Technology Offering Summaries.

DB2 Universal Database.

Information Integration Technology.

Current Architecture and Technology.

Shared Architecture and Technology.

XML Extender Architecture.

XML Extender Technology.

Using Both XML Collections and XML Columns.

Transforming XML Data.

Searching, Parsing, and Validating XML Data.

XML Extender Federated Support.

SQL XML Support Architecture.

SQL XML Support Technology.

Data Management Web Services Architecture.

Data Management Web Services Technology.

Information Integration-Specific Architecture and Technology.

Future Architecture and Technology.

The Vision.

Application Interface, Data Type, and API Goals.

Storage, Engine, and Data Manager Goals.

Why Support Both XML and Relational Storage in One System?

Why Not Object-Relational Long Term?

Impacted Technology Areas.

Conclusion.

Notices.

6. Supporting XML in Oracle9i.

Introduction.

Storing XML as CLOB.

Using CLOB and the OracleText Cartridge.

Search Predicates in OracleText.

XML-Specific Functionality.

Prerequisites.

XMLType.

Object Type XMLType.

Processing of XMLType in Java.

Using XSU for Fine-Grained Storage.

Canonical Mapping.

Retrieval.

Modifications.

Building XML Documents from Relational Data.

SQL Functions existsNode and extract.

The SQL Function SYS_XMLGen.

The SQL Function SYS_XMLAgg.

PL/SQL Package DBMS_XMLGen.

Web Access to the Database.

The Principle of XSQL.

Posting XML Data into the Database.

Parameterization.

Servlet Invocations.

Special Oracle Features.

URI Support.

Parsers.

Class Generator.

Special Java Beans.

Conclusion.

7. XML Support in Microsoft SQL Server 2000.

Introduction.

XML and Relational Data.

XML Access to SQL Server.

Access via HTTP.

Using the XML Features through SQLOLEDB, ADO, and .NET.

Serializing SQL Query Results into XML.

The Raw Mode.

The Auto and Nested Modes.

The Explicit Mode.

Providing Relational Views over XML.

SQLXML Templates.

Providing XML Views over Relational Data.

Annotated Schemata.

Querying Using XPath.

Updating Using Updategrams.

Bulk Loading.

Conclusion.

8. A Generic Architecture for Storing XML Documents in a Relational Database.

Introduction.

System Architecture.

Installing Xerces.

The Data Model.

DOM Storage in Relational Databases.

The Nested Sets Model.

Creating the Database.

The Physical Data Model.

Creating User-Defined Data Types.

Creating the Tables.

Serializing a Document out of the Repository.

Building an XML Document Manually.

Connecting to the Repository.

The xmlrepDB Class.

Uploading XML Documents.

The xmlrepSAX Class.

Stored Procedures for Data Entry.

The uploadXML Class.

The extractXML Class.

Querying the Repository.

Ad Hoc SQL Queries.

Searching for Text.

Some More Stored Procedures.

Generating XPath Expressions.

Further Enhancements.

Conclusion.

9. An Object-Relational Approach to Building a High-Performance XML Repository.

Introduction.

Overview of XML Use-Case Scenario.

High-Level System Architecture.

Detailed Design Descriptions.

Conclusion.

IV. APPLICATIONS OF XML.

10. Knowledge Management in Bioinformatics.

Introduction.

A Brief Molecular Biology Background.

Life Sciences Are Turning to XML to Model Their Information.

A Genetic Information Model.

NeoCore XMS.

Integration of BLAST into NeoCore XMS.

Sequence Search Types.

Conclusion.

11. Case Studies of XML Used with IBM DB2 Universal Database.

Introduction.

Case Study 1: “Our Most Valued Customers Come First”.

Company Scenario.

How This Business Problem Is Addressed.

Future Extensions.

Case Study 2: “Improve Cash Flow”.

Company Scenario.

How This Business Problem Is Addressed.

Future Extensions.

Conclusion.

Notices.

12. The Design and Implementation of an Engineering Data Management System Using XML and J2EE.

Introduction.

Background and Requirements.

Overview.

Security Service.

Query Service.

Image Query Service.

Print Service.

Design Choices.

Using XML in OAI.

Conversion of XML Input into Objects.

Conversion of Database Data into XML.

Conversion of Image Data into XML.

Database Access.

Validation.

Future Directions.

XSLT.

Web Services.

Mass Transfer Capability.

Messaging.

Conclusion.

13. Geographical Data Interchange Using XML-Enabled Technology within the GIDB System.

Introduction.

GIDB METOC Data Integration.

Background.

Implementation.

GIDB Web Map Service Implementation.

GIDB GML Import and Export.

Conclusion.

14. Space Wide Web by Adapters in Distributed Systems Configuration from Reusable Components.

Introduction.

Advanced Concept Description: The Research Problem.

Future Supporting Communications Satellites Constellations.

Integration of Components with Architecture.

Example.

Future Generation NASA Institute for Advanced Concepts, Space Wide Web Research, and Boundaries.

Advanced Concept Development.

The Research Approach.

The Research Tasks.

Conclusion.

15. XML as a Unifying Framework for Inductive Databases.

Introduction.

Past Work.

Extracting and Evaluating Association Rules.

Classifying Data.

Inductive Databases.

PMML.

The Proposed Data Model: XDM.

Basic Concepts.

Classification with XDM.

Association Rules with XDM.

Benefits of XDM.

Toward Flexible and Open Systems.

Related Work.

Conclusion.

16. Designing and Managing an XML Warehouse.

Introduction.

Why a View Mechanism for XML?

Contributions.

Outline.

Architecture.

Data Warehouse Specification.

View Model for XML Documents.

Graphic Tool for Data Warehouse Specification.

Managing the Metadata.

Data Warehouse.

View Definition.

Mediated Schema Definition.

Storage and Management of the Data Warehouse.

The Different Approaches to Storing XML Data.

Mapping XML to Relational.

View Storage.

Extraction of Data.

DAWAX: A Graphic Tool for the Specification and Management of a Data Warehouse.

Data Warehouse Manager.

The Different DAWAX Packages.

Related Work.

Query Languages for XML.

Storing XML Data.

Systems for XML Data Integration.

Conclusion.

V. PERFORMANCE AND BENCHMARKS.

17. XML Management System Benchmarks.

Introduction.

Benchmark Specification.

Benchmark Data Set.

Benchmark Queries.

Existing Benchmarks for XML.

The XOO7 Benchmark.

The XMach-1 Benchmark.

The XMark Benchmark.

Conclusion.

18. The Michigan Benchmark: A Micro-Benchmark for XML Query Performance Diagnostics.

Introduction.

Related Work.

Benchmark Data Set.

A Discussion of the Data Characteristics.

Schema of Benchmark Data.

Generating the String Attributes and Element Content.

Benchmark Queries.

Selection.

Value-Based Join.

Pointer-Based Join.

Aggregation.

Updates.

Using the Benchmark.

Conclusion.

19. A Comparison of Database Approaches for Storing XML Documents.

Introduction.

Data Models for XML Documents.

The Nontyped DOM Implementation.

The Typed DOM Implementation.

Databases for Storing XML Documents.

Relational Databases.

Object-Oriented Databases.

Directory Servers.

Native XML Databases.

Benchmarking Specification.

Benchmarking a Relational Database.

Benchmarking an Object-Oriented Database.

Benchmarking a Directory Server.

Benchmarking a Native XML Database.

Test Results.

Evaluation of Performance.

Evaluation of Space.

Conclusion.

Related Work.

Studies in Storing and Retrieving XML Documents.

XML and Relational Databases.

XML and Object-Relational Databases.

XML and Object-Oriented Databases.

XML and Directory Servers.

Benchmarks for XML Databases.

Guidelines for Benchmarking XML Databases.

Summary.

20. Performance Analysis between an XML-Enabled Database and a Native XML Database.

Introduction.

Related Work.

Methodology.

Database Design.

Discussion.

Experiment Result.

Database Size.

SQL Operations (Single Record).

SQL Operations (Mass Records).

Reporting.

Conclusion.

21. Conclusion.

References.

Contributors.

Editors.

Chapter 1: Information Modeling with XML.

Chapter 2: Tamino: Software AG's Native XML Server.

Chapter 3: eXist Native XML Database.

Chapter 4: Embedded XML Databases.

Chapter 5: IBM XML-Enabled Data Management Product Architecture and Technology.

Chapter 6: Supporting XML in Oracle9i.

Chapter 7: XML Support in Microsoft SQL Server 2000.

Chapter 8: A Generic Architecture for Storing XML Documents in a Relational Database.

Chapter 9: An Object-Relational Approach to Building a High-Performance XML Repository.

Chapter 10: Knowledge Management in Bioinformatics.

Chapter 11: Case Studies of XML Used with IBM DB2 Universal Database.

Chapter 12: The Design and Implementation of an Engineering Data Management System Using XML and J2EE.

Chapter 13: Geographical Data Interchange Using XML-Enabled Technology within the GIDB System.

Chapter 14: Space Wide Web by Adapters in Distributed Systems Configuration from Reusable Components.

Chapter 15: XML as a Unifying Framework for Inductive Databases.

Chapter 16: Designing and Managing an XML Warehouse.

Chapter 17: XML Management System Benchmarks.

Chapter 18: The Michigan Benchmark: A Micro-Benchmark for XML Query Performance Diagnostics.

Chapter 19: A Comparison of Database Approaches for Storing XML Documents.

Chapter 20: Performance Analysis between an XML-Enabled Database and a Native XML Database.

Index.

Preface

The past few years have seen a dramatic increase in the popularity and adoption of XML, the Extensible Markup Language. This explosive growth is driven by its ability to provide a standardized, extensible means of including semantic information within documents describing semi-structured data. This makes it possible to address the shortcomings of existing markup languages such as HTML and support data exchange in e-business environments.

Consider, for instance, the simple HTML document in Listing P.1. The data contained in the document is intertwined with information about its presentation. In fact, the tags describe only how the data is to be formatted. There is no semantic information that the data represents a person's name and address. Consequently, an interpreter cannot make any sound judgments about the semantics as the tags could as well have enclosed information about a car and its parts. Systems such as WIRE (Aggarwal et al. 1998) can interpret the information by using search templates based on the structure of HTML files and the importance of information enclosed in tags defining headings and so forth. However, such interpretation lacks soundness, and its accuracy is context dependent.

Listing P.1 An HTML Document with Data about a Person



<html>
<head>
<title>Person Information</title>
</head>
<body>
<p> <b>Name: </b>John Doe</p>
<p> <b>Address: </b>10 Church Street, Lancaster LAX 2YZ,
UK</p>
</body>
</html>


Dynamic Web pages, where the data resides in a backend database and is served using predefined templates, reduce the coupling between the data and its representation. However, the semantics of the data can still be confusing when exchanging information in an e-business environment. A particular item could be represented using different names (in the simplest case) in two systems in a business-to-business transaction. This enforces adherence to complex, often proprietary, document standards.

XML provides inherent support for addressing the above problems, as the data in an XML document is self-describing. However, the increasing adoption of XML has also raised new challenges. One of the key issues is the management of large collections of XML documents. There is a need for tools and techniques for effective storage, retrieval, and manipulation of XML data. The aim of this book is to discuss the state-of-the-art in such tools and techniques.

This preface introduces the basics of XML and some related technologies before moving on to providing an overview of issues relating to XML data management and approaches addressing these issues. Only an overview of XML and related technologies is provided because several other sources cover these concepts in depth.

P.1 What Is XML?

XML is a W3C standard for document markup. It makes it possible to define custom tags describing the data enclosed by them. An example XML document containing data about a person is shown in Listing P.2. Note that tags in XML can have attributes. However, for simplicity, they have not been used in this example.

Listing P.2 An XML Document with Data about a Person




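<person>
<name>
<surname>Doe</surname>
<firstname>John</firstname>
</name>
<address>
<housenumber>10</housenumber>
<street>Church Street</street>
<town>Lancaster</town>
<postcode>LAX 2YZ</postcode>
<country>UK</country>
</address>
</person>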

Unlike the HTML document in Listing P.1, the document in Listing P.2 contains only the data about the person and no representational information. The data and its meaning can be read from the document and the document formatted in a range of fashions as desired. One standard approach is to use XSL, the eXtensible Stylesheet Language.

The flexible nature of XML makes it an ideal basis for defining arbitrary languages. One such example is WML, the Wireless Markup Language. Similarly, the XML schema language used to describe the structure of XML documents is based on XML itself.

P.1.1 Well-Formed and Valid XML

Although XML syntax is flexible, it is constrained by a grammar that governs the permitted tag names, attachment of attributes to tags, and so on. All XML documents must conform to these basic grammar rules. Such conformant documents are said to be well formed and can be interpreted by an XML interpreter, which means it's not necessary to write an interpreter for each XML document instance.

In addition to being well formed, the structure of a particular XML document can be validated against a Document Type Definition (DTD) or an XML schema. An XML document conforming to a given DTD or schema is said to be valid.
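
Well-formedness can be checked by any generic XML parser without knowledge of the document's vocabulary. As a minimal sketch, assuming Python's standard `xml.etree.ElementTree` (the helper name `is_well_formed` is hypothetical), a document is treated as well formed if the parser accepts it. Checking validity against a DTD or schema is a separate step that this module does not perform.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<person><name>John Doe</name></person>"
bad = "<person><name>John Doe</person></name>"  # tags closed in the wrong order

print(is_well_formed(good))
print(is_well_formed(bad))
```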

P.1.2 Data-Centric and Document-Centric XML

XML documents can be classified on the basis of data they contain. Data-centric documents capture structured data such as that pertaining to a product catalog, an order, or an invoice. Document-centric documents, on the other hand, capture unstructured data as in articles, books, or e-mails. Of course, the two types can be combined to form hybrid documents that are both data-centric and document-centric. Listings P.3 and P.4 provide examples of data-centric and document-centric XML, respectively.

Listing P.3 Data-Centric XML


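<order>
<customer>Doe</customer>
<position>
<isbn>1-234-56789-0</isbn>
<number>2</number>
<price currency="UKP">30.00</price>
</position>
</order>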


Listing P.4 Document-Centric XML


<content>
XML builds on the principles of two
existing languages, <em>HTML</em>
and <em>SGML</em> to create a simple
mechanism . . .
The generalized markup concept . . .
</content>

P.2 XML Concepts

This section provides an overview of basic XML concepts: DTDs, XML schemas, DOM, and SAX.

P.2.1 DTDs and XML Schemas

Both DTDs and XML schemas are mechanisms used to define the structure of XML documents. They determine what elements can be contained within the XML document, how they are to be used, what default values their attributes can have, and so on. Given a DTD or XML schema and its corresponding XML document, a parser can validate whether the document conforms to the desired structure and constraints. This is particularly useful in data exchange scenarios as DTDs and XML schemas provide and enforce a common vocabulary for the data to be exchanged.

XML DTDs are subsets of SGML (Standard Generalized Markup Language) DTDs. An XML DTD lists the various elements and attributes in a document and the context in which they are to be used. It can also list any elements a document cannot contain. However, it does not define constraints such as the number of instances of a particular element within a document, the type of data within each element, and so on. Consequently, DTDs are inherently suitable for document-centric XML as compared to data-centric XML because data-typing and instantiation constraints are less critical in the former case. However, they can be and are being used for both types of documents.

Listing P.5 shows a DTD for the simple XML document in Listing P.2. It describes which primitive elements form valid components for the three composite ones: person, name, and address. The keyword #PCDATA signifies that the element does not contain any tags or child elements and only parsed character data.

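Listing P.5 A DTD for the Simple XML Document in Listing P.2

<!ELEMENT person (name, address)>
<!ELEMENT name (surname, firstname)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT address (housenumber, street, town, postcode, country)>
<!ELEMENT housenumber (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT town (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT country (#PCDATA)>
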
XML schemas differ from DTDs in that the XML schema definition language is based on XML itself. As a result, unlike DTDs, the set of constructs available for defining an XML document is extensible. XML schemas also support namespaces and richer and more complex structures than DTDs. In addition, stronger typing constraints on the data enclosed by a tag can be described because a range of primitive data types such as string, decimal, and integer are supported. This makes XML schemas highly suitable for defining data-centric documents. Another significant advantage is that XML schema definitions can exploit the same data management mechanisms as designed for XML; an XML schema is an XML document itself. This is in direct contrast with DTDs, which require specific support to be built into an XML data management system.

Listing P.6 shows an XML schema for the simple XML document in Listing P.2. The sequence tag is a compositor indicating an ordered sequence of subelements. There are other compositors for choice and all. Also, note that, as shown for the address element, it is possible to constrain the minimum and maximum instances of an element within a document. Although not shown in the example, it is possible to define custom complex and simple types. For instance, a complex type Address could have been defined for the address element.

Listing P.6 An XML Schema for the Simple XML Document in Listing P.2





























P.2.2 DOM and SAX

DOM and SAX are the two main APIs for manipulating XML documents in an application. They are now part of the Java API for XML Processing (JAXP version 1.1). DOM is the W3C standard Document Object Model, a model for storing and manipulating hierarchical documents in memory that is independent of both operating system and programming language. A DOM parser parses an XML document and builds a DOM tree, which can then be used to traverse the various nodes. However, the entire tree has to be constructed before traversal can commence. As a result, memory consumption is an issue when manipulating large XML documents: building the full tree is highly resource intensive, especially when only a small section of the document is to be manipulated.
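The build-then-traverse behavior can be illustrated with a minimal sketch. It uses Python's standard xml.dom.minidom module rather than JAXP, and the sample document is an assumed stand-in for Listing P.2 (the city element and the element_paths helper are hypothetical, introduced only for illustration).

```python
from xml.dom.minidom import parseString

# Assumed stand-in for the person document of Listing P.2;
# the <city> child is hypothetical.
XML = """<person>
  <name><firstname>John</firstname><surname>Doe</surname></name>
  <address><city>Lancaster</city></address>
</person>"""


def element_paths(node, prefix=""):
    """Walk the in-memory DOM tree depth-first and collect every element path."""
    paths = []
    for child in node.childNodes:
        # Skip whitespace text nodes; only element nodes contribute a path.
        if child.nodeType == child.ELEMENT_NODE:
            path = prefix + "/" + child.tagName
            paths.append(path)
            paths.extend(element_paths(child, path))
    return paths


# The whole document is parsed into a tree before any traversal starts.
document = parseString(XML)
paths = element_paths(document)

# Random access into the finished tree: the text of the first <surname>.
surname = document.getElementsByTagName("surname")[0].firstChild.data

print(paths)
print(surname)
```

Because the complete tree lives in memory, any node can be revisited in any order, which is the convenience that DOM trades against memory use.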

SAX, the Simple API for XML, is a de facto standard. It differs from DOM in that it uses an event-driven model. Each time a starting or closing tag, or processing instruction is encountered, the program is notified. As a result, the whole document does not need to be parsed before it is manipulated. In fact, sections of the document can be manipulated as they are parsed. Therefore, SAX is better suited to manipulating large documents as compared to DOM.
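The event-driven model can be sketched as follows, using Python's standard xml.sax module. The sample document is an assumed stand-in for Listing P.2, and SurnameHandler is a hypothetical handler written for this illustration.

```python
import xml.sax

# Assumed stand-in for the person document of Listing P.2;
# the <city> child is hypothetical.
XML = b"""<person>
  <name><firstname>John</firstname><surname>Doe</surname></name>
  <address><city>Lancaster</city></address>
</person>"""


class SurnameHandler(xml.sax.ContentHandler):
    """Records tag events and collects the text inside <surname>."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.in_surname = False
        self.surname = ""

    def startElement(self, name, attrs):
        # Called by the parser each time an opening tag is read.
        self.events.append(("start", name))
        self.in_surname = (name == "surname")

    def endElement(self, name):
        # Called by the parser each time a closing tag is read.
        self.events.append(("end", name))
        if name == "surname":
            self.in_surname = False

    def characters(self, content):
        # Character data may arrive in several chunks, so accumulate it.
        if self.in_surname:
            self.surname += content


handler = SurnameHandler()
xml.sax.parseString(XML, handler)

print(handler.events[:3])
print(handler.surname)
```

No tree is built here: the handler keeps only the state it needs, so memory use stays small regardless of document size.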

P.3 XML-Related Technologies

This section describes some of the technologies related to XML, namely XPath, XSL, and SOAP.

P.3.1 XPath

XPath, the XML Path Language, provides common syntax and semantics for locating and linking to information contained within an XML document. Using XPath, the information can be addressed in two ways:

  1. A hierarchical fashion based on the ordering of elements in a document tree
  2. An arbitrary manner relying on elements in a document tree having unique identifiers

A few example XPath expressions, based on the sample XML document in Listing P.2, are shown in Listing P.7. Example 1 selects all children named firstname in the current focus element. Example 2 selects the child node surname whose parent node is name within the current focus element, while example 3 tests whether an element is present in the union of the elements name and address. Note that, although not shown in the examples, it is also possible to specify constraints such as the first address of the third person in the document.

Listing P.7 Example XPath Expressions

1. select="firstname"
2. select="name/surname"
3. match="name | address"
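The expressions above can be tried out with a minimal sketch based on Python's standard xml.etree.ElementTree module, which implements only a subset of XPath. The three-person document is an assumption made for illustration (only John Doe comes from the original example), and because ElementTree has no union operator, example 3 is emulated by concatenating two result lists.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with three persons; only "John Doe" is taken
# from the original example, everything else is invented for illustration.
XML = """<people>
  <person>
    <name><firstname>John</firstname><surname>Doe</surname></name>
    <address>First person, only address</address>
  </person>
  <person>
    <name><firstname>Jane</firstname><surname>Roe</surname></name>
    <address>Second person, only address</address>
  </person>
  <person>
    <name><firstname>Max</firstname><surname>Poe</surname></name>
    <address>Third person, first address</address>
    <address>Third person, second address</address>
  </person>
</people>"""

people = ET.fromstring(XML)
person = people.find("person")   # first person serves as the focus element
name = person.find("name")

# Example 1: children named firstname of the focus element (here: name).
first_names = [e.text for e in name.findall("firstname")]

# Example 2: surname children of name within the focus element (here: person).
surnames = [e.text for e in person.findall("name/surname")]

# Example 3: ElementTree lacks "|", so the union is emulated by concatenation
# (this does not guarantee document order in general).
union_tags = [e.tag for e in person.findall("name") + person.findall("address")]

# Positional constraint: the first address of the third person.
third_first_address = people.find("person[3]/address[1]").text

print(first_names, surnames, union_tags, third_first_address)
```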

P.3.2 XSL

Since an XML document does not contain any representational information, it can be formatted in a flexible manner. A standard approach to formatting XML documents is using XSL, the eXtensible Stylesheet Language. The W3C XSL specification is composed of two parts: XSL Formatting Objects (XSL FO) and XSL Transformations (XSLT).

XSL FO provides formatting and flow semantics for rendering an XML document. A rendering agent is responsible for interpreting the abstract constructs provided by XSL FO in order to instantiate the representation for a particular medium.

XSLT offers constructs to transform information from one organization to another. Although designed to transform an XML vocabulary to an XSL FO vocabulary, XSLT can be used for a range of transformations including those to HTML as shown in Listing P.8. The example style sheet uses a set of simple XSLT templates and XPath expressions to transform a part of the XML document in Listing P.2 to HTML (see Listing P.9).

Listing P.8 An XSL Style Sheet for the XML Document in Listing P.2


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">


PersonInformation









Surname:



First name:



Listing P.9 HTML Resulting from the Transformation in Listing P.8



Person Information


Surname: Doe

First name: John



P.3.3 SOAP

SOAP is the Simple Object Access Protocol used to invoke code over the Internet using XML and HTTP. The mechanism is similar to Java Remote Method Invocation (RMI). In SOAP, method calls are converted to XML and transmitted over HTTP. SOAP was designed for compatibility with XML schemas though their use is not mandatory. Being based on XML, XML schemas offer a seamless means to describe and transmit SOAP types.
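How a method call becomes XML can be sketched as follows. This is a minimal illustration that builds and reads a SOAP 1.1 envelope with Python's standard library. The service namespace urn:example:people, the method name getPerson, and the helpers build_call and read_call are all hypothetical, and the HTTP transport step is left out.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical namespace for the remote service being called.
SERVICE_NS = "urn:example:people"


def build_call(method, params):
    """Serialize a method call and its parameters as a SOAP envelope string."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for key, value in params.items():
        ET.SubElement(call, key).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def read_call(message):
    """Recover the method name and parameters on the receiving side."""
    envelope = ET.fromstring(message)
    call = envelope.find(f"{{{SOAP_ENV}}}Body")[0]
    method = call.tag.split("}", 1)[1]   # strip the "{namespace}" prefix
    return method, {child.tag: child.text for child in call}


message = build_call("getPerson", {"surname": "Doe"})
method, params = read_call(message)

print(message)
print(method, params)
```

In a real exchange, the message string would be sent as the body of an HTTP POST request and the response would come back as another envelope.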

P.4 XML Data Management

So far, we have discussed the basics of XML and some of its related technologies. The discussion emphasizes the fundamental advantages of XML, hence providing an insight into the reasons behind its growing popularity and adoption. As more and more organizations and systems employ XML within their information management and exchange strategies, classical data management issues pertaining to XML's efficient and effective storage, retrieval, querying, indexing, and manipulation arise. At the same time, previously uncharted information-modeling challenges appear.

Database vendors have responded to these new data and information management needs. Most commercial relational, object-relational, and object-oriented database systems offer extensions and plug-ins and other mechanisms to support the management of XML data. In addition to supporting XML within existing database management systems, native XML databases have been born. These are designed for seamless storage, retrieval, and manipulation of XML data and integration with related technologies.

With the numerous approaches and solutions available in the market, organizations and system developers with XML data management needs face a variety of challenges:

  • What are the various XML data management solutions available?
  • What are the features, services, and tools offered by these different XML data management systems?
  • How can an in-house, custom solution be developed instead of using a commercially available system?
  • Which XML data management system or approach is the best in terms of performance and efficiency for a particular application?
  • Are there any good practice and domain or application-specific guidelines for information modeling with XML?
  • Are there other examples and applications of XML data management within a particular domain?

This book is intended to be a support mechanism to address the above challenges. It provides a discussion of the various XML data management approaches employed in a range of products and applications. It also offers performance and benchmarking results and guidelines relating to information modeling with XML.

    P.5 How This Book Is Organized

    This book is divided into five parts, each containing a coherent and closely related set of chapters. It should be noted that these parts are self-contained and can be read in any order. The five parts are as follows:

    Part I: Introduction
    Part II: Native XML Databases
    Part III: XML and Relational Databases
    Part IV: Applications of XML
    Part V: Performance and Benchmarks

    The parts are summarized in the sections that follow.

    P.5.1 Part I: Introduction

    This part contains a chapter that focuses on guidelines for achieving good grammar and style when modeling information using XML. Brandin, the author, argues that good grammar alleviates the need for redundant domain knowledge required for interpretation of XML by application programs. Good style, on the other hand, ensures improved application performance, especially when it comes to storing, retrieving, and managing information. The discussion offers insight into information-modeling patterns inherent in XML and common XML information-modeling pitfalls.

    P.5.2 Part II: Native XML Databases

    Two native XML database systems, Tamino and eXist, are covered in this part. In Chapter 2, Schoening provides an overview of Tamino's architecture and APIs before moving on to discussing its XML storage and indexing features. Querying, tool support, and access to data in other types of repositories are also described. The chapter offers a comprehensive discussion of the features that are of key importance during the development of an XML data management application.

    In a similar fashion, Chapter 3 by Meier introduces the various features and APIs of the Open Source system eXist. However, in contrast with Chapter 2, the main focus is on how query processing works within the system. As a result, the author provides deeper insight into its indexing and storage architectures. Together both chapters offer a balanced discussion, both on high-level application-programming features of the two systems and underlying indexing and storage mechanisms pertaining to efficient query processing.

    Finally in Chapter 4, we have included an example of an embedded XML database system. This is based upon the general-purpose embedded database engine, Berkeley DB. Berkeley DB XML is able to store XML documents natively, and it provides indexing and an XPath query interface. Some of the capabilities of the product are demonstrated through code examples.

    P.5.3 Part III: XML and Relational Databases

    This part provides an interesting mix of products and approaches to XML data management in relational and object-relational database systems. Chapters 5, 6, and 7 discuss three commercial products: IBM DB2, Oracle9i, and MS SQL Server 2000, respectively, while Chapters 8 and 9 describe more general, roll-your-own strategies for relational and object-relational systems.

    Chapter 5 by Benham highlights the technology and architecture of XML data management and information integration products from IBM. The focus is on the DB2 Universal Database and Xperanto. The former is the family of products providing relational and object-relational data management support for XML applications through the DB2 XML Extender, extended SQL, and support for Web services. The latter is the planned set of products and functions for addressing information integration requirements, which are aimed at complementing DB2 capabilities with additional support for XML and both structured and unstructured applications.

    In Chapter 6, Hohenstein discusses similar features in Oracle9i: the use of Oracle's CLOB functionality and OracleText cartridge, for handling data-centric XML documents, and XMLType, a new object type based on the object-relational functionality in Oracle9i, for managing document-centric ones. He presents the Oracle SQL extensions for XML and provides examples on how to use them in order to build XML documents from relational data. Special features and tools for XML such as URI (Uniform Resource Identifier) support, parsers, class generator and Java Beans encapsulating these features are also described.

    In Chapter 7, Rys covers a feature set, similar to the ones in Chapters 5 and 6, for MS SQL Server 2000. He focuses on scenarios involving exporting and importing structured XML data. As a result, the focus is on the different building blocks such as HTTP and SOAP access, queryable and updateable XML views, rowset views over XML, and XML serialization of relational results. Rowset views and XML serialization are aimed at providing XML support for users more familiar with the relational world. XML views, on the other hand, offer XML-based access to the database for users more comfortable with XML.

    Collectively, Chapters 5, 6, and 7 furnish an interesting comparison of the functionality offered by the three commercial systems and the various similarities and differences in their XML data management approaches. In contrast, Chapters 8 and 9, by Edwards and Brown, respectively, focus on generic, vendor-independent solutions.

    Edwards describes a generic architecture for storing XML documents in a relational database. The approach is aimed at avoiding vendor-specific database extensions and providing the database application programmer an opportunity to experiment with XML data storage without recourse to implementing much new technology. The database model is based on merging DOM with the Nested Sets Model, hence offering ease of navigation and the ability to store any well-formed XML document. This results in fast serialization and querying but at the expense of update performance.
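The general idea of a nested-sets encoding can be sketched as follows. This is not Edwards's actual schema, which is not shown here. It is a minimal illustration under assumed names (the node table, its lft and rgt columns, and the helpers number_nodes and descendants are all hypothetical), using Python's standard sqlite3 and xml.etree.ElementTree modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed stand-in for the person document of Listing P.2.
XML = ("<person><name><firstname>John</firstname>"
       "<surname>Doe</surname></name><address/></person>")


def number_nodes(element, counter, rows):
    """Assign nested-set bounds: lft on entering a node, rgt on leaving it."""
    lft = counter[0]
    counter[0] += 1
    for child in element:
        number_nodes(child, counter, rows)
    rgt = counter[0]
    counter[0] += 1
    rows.append((lft, rgt, element.tag, (element.text or "").strip()))


rows = []
number_nodes(ET.fromstring(XML), [1], rows)

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE node (lft INTEGER PRIMARY KEY, rgt INTEGER, tag TEXT, text TEXT)"
)
db.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)


def descendants(tag):
    """Tags of all nodes nested inside every element with this tag, in document order."""
    query = (
        "SELECT d.tag FROM node AS a JOIN node AS d "
        "ON d.lft > a.lft AND d.rgt < a.rgt "
        "WHERE a.tag = ? ORDER BY d.lft"
    )
    return [row[0] for row in db.execute(query, (tag,))]


print(descendants("name"))
print(descendants("person"))
```

A subtree is retrieved with a single range comparison, and ordering by lft yields document order, which is why serialization and querying are fast. Inserting a node, by contrast, means renumbering the bounds of every node that follows it, which illustrates the update cost mentioned above.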

    While Edwards' architecture is aimed at supporting the traditional relational database programmer, Brown's approach seeks to exploit the advanced features offered by the object-relational model and respective extensions of most relational database systems. He discusses object-relational schema design based on introducing into the DBMS core types and operators equivalent to the ones standardized in XML. The key functionality required of the DBMS core is an extensible indexing system allowing the comparison operator for built-in SQL types to be overloaded. The new SQL 3 types thus defined act as a basis during the mapping of XPath expressions to SQL 3 queries over the schema.

    P.5.4 Part IV: Applications of XML

    This part presents several applications and case studies in XML data management ranging from bioinformatics, geographical and engineering data management, to customer services and cash flow improvement, through to large-scale distributed systems, data warehouses, and inductive database systems.

    In Chapter 10, Direen and Jones discuss various challenges in bioinformatics data management and the role of XML as a means to capture and express complex biological information. They argue that the flexible and extensible information model employed by XML is well suited for the purpose and that database technology must exhibit the same characteristics if it is to keep in step with biological data management requirements. They discuss the role of the NeoCore XML management system in this context and the integration of a BLAST (Basic Local Alignment Search Tool) sequence search engine to enhance its ability to capture, manipulate, analyze, and grow the information pertaining to complex systems that make up living organisms.

    Kowalski presents two case studies involving XML and IBM's DB2 Universal Database in Chapter 11. Her first case study is that of a customer services unit that needs to react to problems from the most important customers first. The second case study focuses on improving cash flow in a school by reducing the time for reimbursement from the Department of Education. The author presents the scenario and the particular problem to be solved for each case study, which is followed by an analysis identifying existing conditions preventing the solution of the problem. A description of how XML and DB2 have been used to devise an appropriate solution concludes each case study.

    Chapter 12, by Eglin, Hendra, and Pentakalos, describes the design and implementation of the JEDMICS Open Access Interface, an EJB-based API that provides access to image data stored on a variety of storage media and metadata stored in a relational database. The JEDMICS system uses XML as a portable data exchange solution, and the authors discuss issues relating to its integration with the object-oriented core of the system and the relational database providing the persistent storage. A very interesting feature of the chapter is the authors' reflection on their experiences with a range of XML technologies such as DOM, JDOM, JAXB, XSLT, and Oracle XSU in the context of JEDMICS.

    In Chapter 13, Wilson and her coauthors offer insight into the use of XML to enhance the GIDB (Geospatial Information Database) system to exchange geographical data over the Internet. They describe the integration of meteorological and oceanographic data, received remotely via the METCAST system, into GIDB. XML plays a key role here as it is utilized to express the data model catalog for METCAST. The authors also describe their implementation of the OpenGIS Web Map Server (WMS) specification to facilitate displaying georeferenced map layers from multiple WMS-compliant servers. Another interesting feature of this chapter is the implementation of the ability to read and write vector data using the OpenGIS Geographic Markup Language (GML), an XML-based language standard for data interchange in Geographic Information Systems (GISs).

    Rine sketches his vision of an Interstellar Space Wide Web in Chapter 14. He contrasts the issues relating to the development and deployment of such a facility with the problems encountered in today's World Wide Web. He mainly focuses on adapters as configuration mechanisms for large-scale, next-generation distributed systems and as the means to increase the reusability of software components and architectures in this context. His approach to solving the problem is a configuration model and network-aware runtime environment called Space Wide Web Adapter Configuration eXtensible Markup Language (SWWACXML). The language associated with the environment captures component interaction properties and network-level QoS constraints. Adapters are automatically generated from the SWWACXML specifications. This facilitates reuse because components are not tied to interactions or environments. Rine also discusses the role of the SWWACXML runtime system from this perspective as it supports automatic configuration and dynamic reconfiguration.

    In Chapter 15, Meo and Psaila present an XML-based data model used to bridge the gap between various analysis models and the constraints they place on data representation, retrieval, and manipulation in inductive databases. XDM (XML for Data Mining) allows simultaneous representation of source raw data and patterns. It also represents the pattern definition resulting from the pattern derivation process, hence supporting pattern reuse by the inductive database system. One of the significant advantages of XML in this context is the ability to describe complex heterogeneous topologies such as trees and association rules. In addition, the inherent flexibility of XML makes it possible to extend the inductive database framework with new pattern models and data-mining operators resulting in an open system customizable to the needs of the analyst.

    Chapter 16, the last chapter in this part, describes Baril's and Bellahsene's experiences in designing and managing an XML data warehouse. They propose the use of a view model and a graphical tool for the warehouse specification. Views defined in the warehouse allow filtering and restructuring of XML sources. The warehouse is defined as a set of materialized views, and it provides a mediated schema that constitutes a uniform query interface. They also discuss mapping techniques to store XML data using a relational database system without redundancies and with optimized storage space. Finally, the DAWAX system implementing these concepts is presented.

    P.5.5 Part V: Performance and Benchmarks

    XML database management systems face the same stringent efficiency and performance requirements as any other database technology. Therefore, the final part of this book is devoted to a discussion of benchmarks and performance analyses of such systems.

    Chapter 17

    Customer Reviews

    • Anonymous

      Posted Thu May 01 00:00:00 EDT 2003

      Precisely what we needed

      At our company, we write Java applications. Soon, we got to the point that we needed a more formal way to read/write data than merely an ad hoc approach. We use XML. The obvious approach is to use a well tested relational database, like those supplied by IBM, Oracle or Microsoft. A problem was getting detailed, objective explanations of what would be involved with each choice. Each vendor is perfectly willing to be our 'friend' and supply us with reams of documentation. But still... The chapters in this book that describe how to hook up XML to those 3 vendors' databases were excellent and clear. But what we ended up doing was going with something suggested in ANOTHER chapter - building an embedded XML database. You will not see this advocated by a vendor; there is no sale for them here. Other than this book, we found it tough to get lucid explanations of the pros and cons of this route. It will take more work, but we hope it will give better performance - no interprocess communication, for one thing. Plus of course no licence fees, and easier installation and management, since we will have access/own all the source code. This was not our original intention, by any means. But the book's comparative analysis was so persuasive that we ended up taking this road. (Hopefully, it will not be a dead end.) That one chapter on embedded XML databases was, to us, the most precious thing in the entire book!
