Processing XML data using a relational database : Schema-Based XML Storage

In this work, we consider the problem of using a relational database to store and query XML data in a schema-based fashion. There are two main parts to this problem:

Given an XML schema, how do we pick a good relational schema to store the XML data

Once we have fixed the relational schema and stored the XML data in the relations accordingly, given an XML query how do we translate this query into an equivalent SQL query.

The focus of our work is on the query translation problem. It can be broadly classified as follows

Developing query translation algorithms for the case when the XML Schema and/or the XML query may be recursive.
Designing algorithms that make better use of the XML-to-Relational mapping information during the query translation process.
Studying the interaction between the two subproblems: choosing a good relational decomposition for storing the XML data and choosing a query translation algorithm.

We first briefly describe these three ideas.We then look at XML-to-SQL query translation in the XML Publishing scenario, where existing relational data is exported as XML documents.

Recursive Schemas and Recursive Queries

Though there has been a lot of work on proposing alternative relational decompositions for XML data, there has not been much work on developing query translation algorithms.

For the scenario when the XML schema is recursive, while [STZ+99] and [ML02] have proposed different ways for mapping recursive XML schemas into relational tables, there is no published work on translating XML queries into SQL in such a setting. In a recent study [Choi02], of the 60 XML schemas analyzed, about 35 of them were recursive. This indicates that recursive XML schemas are an important class of XML schemas and query translation algorithms in this domain are important.
XML query languages have a descendant operator (//) to specify ancestor-descendant relationships between two elements. For example, the query //section//title retrieves all title elements that have a section ancestor element, which is allowed to occur anywhere in the document. Notice, how the section element can be at an arbitrary depth in the document and the title element can, in turn, be at an arbitrary depth below the section ancestor. We refer to such queries having the descendant operator (//) as recursive XML queries. There is no direct equivalent operator in the relational world and, in many cases, this will have to be translated into a fix-point (recursive) SQL query. Prior work in the XML storage scenario has not looked at how to translate recursive XML queries into SQL, even when the XML schema is non-recursive.

We first present a framework for representing a fairly large class of XML-to-Relational mappings. To the best of our knowledge, this covers all the techniques proposed in published literature. We then present a generic algorithm for translating path expression queries into SQL over a given XML-to-Relational mapping. Here the XML schema and/or path expression query may be recursive. Our algorithm is described in detail here.

Some of the interesting issues that arise and are addressed by our work are as follows:

How do we translate path expression queries over arbitrary XML-to-Relational mappings into equivalent SQL queries?
Is the support for recursion in SQL3 sufficient for supporting path expression queries over arbitrary XML-to-Relational mapping?
Are there any issues in the translation process when the XML schema is non-recursive?
Does XPath semantics introduce any interesting challenges?

Mapping-aware query translation algorithm

We will use the following example scenario to illustrate what we mean by mapping-aware query translation algorithms and the necessity for such algorithms. The XML Schema shown below corresponds to a collection of XML documents, each of which represent the contents of a book. One possible way to map this XML data into relations is shown next to the schema. All the book elements are stored in a Book relation and the corresponding titles
Example XML-to-Relational mapping

Book

id	title

Author

id	parentid	...

Section

id	parentid	parentcode	title

are stored in the title column of the Book relation. For each book, the authors of the book are stored in an Author rela tion. The parentid field in the Author relation is a foreignkey to the id field of the Book relation. Using this key-foreignkey relationship, the parent-child relationship between the book and author elements is maintained. Similarly, the information about the sections of a book are stored in a Section relation. Since, the section element in the XML Schema is recursive, it means that a section element in an instance XML document may have either a book parent element (top-level sections in the book) or a section parent element (nested sections). To keep track of this in the relational schema, there is an additional column in the Section relation called parentcode - in this case, a value of 1 in this field indicates a top-level section and a value of 2 indicates a nested section. The relationship between the XML Schema and the relational schema created to store the XML data can be expressed as annotations on the XML schema. Some of these annotations are shown in red in the figure above.

Consider the following query : Retrieve all the Section titles. This query can be written in XQuery as follows.

Q1:                    for $title in document(*)//section/title
return $title

Notice how both the XML Schema and the XML query are recursive. An equivalent SQL query would be as follows:

SQ1: with Temp(id,title) as (
select S.id, S.title
from Book B, Section S
where B.id = S.parentid and S.parentcode = 1
union all
select S.id, S.title
from Temp T, Section S
where T.id = S.parentid and S.parentcode = 2
)

select title
from Temp

This is the relational query output by the algorithm we developed earlier. Intuitively, our algorithm matches the query with the schema, identifies the corresponding matching paths, and issued the root-to-leaf query for these matching paths. The final query SQ1 corresponds to the matching book/section*/title path in the XML schema.

Looking at the XML-to-Relational mapping we notice that the following, much simpler relational query is also a correct translation under this particular mapping.

MASQ1:           select title
from Section

Why is MASQ1 equivalent to SQ1 and therefore to Q1? For this particular example, it is due to the fact that there is only one schema node mapped to Section.title and all elements corresponding to this schema node were part of the query result. By looking at the annotations in the XML-to-Relational mapping, we can get this information and realize that MASQ1 is a correct translation for Q1. In general, any algorithm that uses the mapping information to make such observations and remove redundant parts of the SQL query can be viewed as a mapping-aware algorithm. We have extended our algorithm for translating path expression queries into SQL to use the mapping information and eliminate redundant parts of the final SQL query.

We next give another example queries to illustrate how our mapping-aware algorithm can generate efficient SQL queries.

Q2:                    Retrieve all the top-level section titles

for $title in document(*)/book/section/title
return $title

Using a non-mapping-aware algorithm, the equivalent SQL query will be

SQ2:           select S.title
from Book B, Section S
where B.id = S.parentid and S.parentcode = 1

On the other hand, the output of our mapping-aware algorithm will be

MASQ2:           select title
from Section
                        where parentcode = 1

Notice how this is an example scenario where the XML schema is recursive but the XML query is not. We next present a scenario where the XML query is recursive but the XML schema is not and mapping-aware techniques are still important. Let us modify the above XML schema to allow only sections and subsections. The modified schema is shown below. The same relational decomposition is still valid. Let us assume that we use the same decomposition.

Sample non-recursive XML schema

Consider the query Q1 over this schema

One equivalent SQL query will be

SQ2:           select S.title
from Book B, Section S
where B.id = S.parentid and S.parentcode = 1
union all
select S2.title
                    from Book B, Section S1, Section S2
                           where B.id = S1.parentid and S1.parentcode = 1 and
                                    S1.id = S2.parentid and S2.parentcode = 2

On the other hand, using the mapping information, we can obtain the far simpler, equivalent SQL query given below.

MASQ1:           select title
from Section

The above examples show how mapping-aware query translation can result in far simpler SQL queries. We have extended our algorithm for translating path expression queries into SQL to use the mapping information and avoid
redundant computation in the SQL query that is implied by the XML-to-Relational mapping. We make the following assumptions about the XML-to-Relational mapping in this case.

All the XML data is completely shred into relations and no part of the XML data is stored multiple times.
There is no data in the relations other than that which is present in the XML document.
Enough information is maintained in the relational data to enable reconstruction of the original XML data.

We would like to point out the every decomposition scheme we have encountered in the literature satisfies the above properties.

Are the two subproblems independent?

Recall that there are two main problems in the XML Storage domain: one is to pick a good relational decomposition and the other is to translate queries over this XML-to-Relational mapping. Are these two independent subproblems that can solved in isolation? In this work, we show that they are not. In fact, the presence of mapping-aware translation algorithms that make use of the properties of the relational decomposition indicate that the two subproblems are closely related.

We show, through experiments with a commercial RDBMS and the XMark XML benchmark, that there exist query translation algorithms T1 and T2, and relational decompositions D1 and D2, such that for a simple path expression query, if we use translation algorithm T1, then decomposition D1 is better than D2, while with translation algorithm T2 decomposition D2 is better than D1. This implies that one cannot talk about the quality of a decomposition without discussing the query translation algorithm to be used.
By examining an interesting problem embedded in choosing a relational decomposition, we show that choices of different translation algorithms and cost models result in very different complexities for the resulting optimization problems. Our results suggest that, contrary to the trend in previous work, the eventual development of practical algorithms for finding relational decompositions for XML workloads will require judicious choices of cost models and translation algorithms, rather than an exclusive focus on the decomposition problem in isolation.

More details can be found in the ICDT 2003 paper On the Difficulty of Finding Optimal Relational Decompositions for XML Workloads: A Complexity Theoretic Perspective.
Next: XML Publishing

[STZ+99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang,Gang He, David J. DeWitt, Jeffrey F. Naughton: Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314
[ML02] Murali Mani, Dongwon Lee: XML to Relational Conversion Using Theory of Regular Tree Grammars. EEXTT 2002: 81-103
[Choi02] Byron Choi: What Are Real DTDs Like, WebDB 2002