Processing XML data using a relational database : Schema-Based XML Storage


In this work, we consider the problem of using a relational database to store and query XML data in a schema-based fashion. There are two main parts to this problem:
The focus of our work is on the query translation problem. It can be broadly classified as follows
  1. Developing query translation algorithms for the case when the XML Schema and/or the XML query may be recursive.
  2. Designing algorithms that make better use of the XML-to-Relational mapping information during the query translation process.
  3. Studying the interaction between the two subproblems: choosing a good relational decomposition for storing the XML data and choosing a query translation algorithm.
We first briefly describe these three ideas.We then look at XML-to-SQL query translation in the XML Publishing scenario, where existing relational data is exported as XML documents.

Recursive Schemas and Recursive Queries

Though there has been a lot of work on proposing alternative relational decompositions for XML data, there has not been much work on developing query translation algorithms.

We first present a framework for representing a fairly large class of XML-to-Relational mappings. To the best of our knowledge, this covers all the techniques proposed in published literature. We then present a generic algorithm for translating path expression queries into SQL over a given XML-to-Relational mapping. Here the XML schema and/or path expression query may be recursive. Our algorithm is described in detail here.

Some of the interesting issues that arise and are addressed by our work are as follows:


Mapping-aware query translation algorithm

We will use the following example scenario to illustrate what we mean by mapping-aware query translation algorithms and the necessity for such algorithms. The XML Schema shown below corresponds to a collection of XML documents, each of which represent the contents of a book. One possible way to map this XML data into relations is shown next to the schema. All the book elements are stored in a Book relation and the corresponding titles
Example XML-to-Relational mapping
Book
id
title



Author
id
parentid
...




Section
id
parentid
parentcode
title





are stored in the title column of the Book  relation. For each book, the authors of the book are stored in an Author rela tion. The parentid field in the Author relation is a foreignkey to the id field of the Book relation. Using this key-foreignkey relationship, the parent-child relationship between the book and author elements is maintained. Similarly, the information about the sections of a book are stored in a Section relation. Since, the section element in the XML Schema is recursive, it means that a section element in an instance XML document may have either a book parent element (top-level sections in the book) or a section parent element (nested sections). To keep track of this in the relational schema, there is an additional column in the Section relation called parentcode - in this case, a value of 1 in this field indicates a top-level section and a value of 2 indicates a nested section. The relationship between the XML Schema and the relational schema created to store the XML data can be expressed as annotations on the XML schema. Some of these annotations are shown in red in the figure above.

Consider the following query : Retrieve all the Section titles. This query can be written in XQuery as follows.
                           
Q1:                      for $title in document(*)//section/title
                            return $title

Notice how both the XML Schema and the XML query are recursive. An equivalent SQL query would be as follows:

SQ1:                   with Temp(id,title) as (
                                select  S.id, S.title
                                from   Book B, Section S
                                where  B.id = S.parentid and S.parentcode = 1
                              union all
                                select S.id, S.title
                                from Temp T, Section S
                                where T.id = S.parentid and S.parentcode = 2
                           )

                           select title
                           from Temp

This is the relational query output by the algorithm we developed earlier. Intuitively, our algorithm matches the query  with the schema, identifies the corresponding matching paths, and issued the root-to-leaf query for these matching paths. The final query SQ1 corresponds to the matching book/section*/title path in the XML schema.
 
Looking at the XML-to-Relational mapping we notice that the following, much simpler relational query is also a  correct translation under this particular mapping.

MASQ1:            select title
                           from Section

 Why is MASQ1 equivalent to SQ1 and therefore to Q1? For this particular example,
it is due to the fact that there is only one schema node mapped to Section.title and all elements corresponding to this schema node were part of the query result. By looking at the annotations in the XML-to-Relational mapping, we can get this information and realize that MASQ1 is a correct translation for Q1. In general, any algorithm that uses the mapping information to make such observations and remove redundant parts of the SQL query can be viewed as a mapping-aware algorithm. We have extended our algorithm for translating path expression queries into SQL to use the mapping information and eliminate redundant parts of the final SQL query.

We next give another example queries to illustrate how our mapping-aware algorithm can  generate efficient SQL queries.

Q2:                      Retrieve all the top-level section titles

                            for $title in document(*)/book/section/title

                            return $title


Using a non-mapping-aware algorithm, the equivalent SQL query will be


SQ2:                  select S.title
                          from Book B, Section S
                          where B.id = S.parentid and S.parentcode = 1
 

On the other hand, the output of our mapping-aware algorithm will be

MASQ2:           select title
                          from Section
                          where parentcode = 1


Notice how this is an example scenario where the XML schema is recursive but the XML query is not. We next present a scenario where the XML query is recursive but the XML schema is not and mapping-aware techniques are still important. Let us modify the above XML schema to allow only sections and subsections. The modified schema is shown below. The same relational decomposition is still valid. Let us assume that we use the same decomposition.

Sample non-recursive XML schemaConsider the query Q1 over this schema

One equivalent SQL query will be

SQ2:                  select  S.title
                          from   Book B, Section S
                          where B.id = S.parentid and S.parentcode = 1
                      union all
                           select  S2.title
                           from   Book B, Section S1, Section S2
                           where  B.id = S1.parentid and S1.parentcode = 1 and
                                        S1.id = S2.parentid and S2.parentcode = 2



On the other hand, using the mapping information, we can obtain the far simpler, equivalent SQL query given below.

MASQ1:            select title
                           from Section

The above examples show how mapping-aware query translation can result in far simpler SQL queries. We have extended our algorithm for translating path expression queries into SQL to use the mapping information and avoid
redundant computation in the SQL query that is implied by the XML-to-Relational mapping. We make the following assumptions about the XML-to-Relational mapping in this case.
We would like to point out the every decomposition scheme we have encountered in the literature satisfies the above properties.

Are the two subproblems independent?

Recall that there are two main problems in the XML Storage domain: one is to pick a good relational decomposition and the other is to translate queries over this XML-to-Relational mapping. Are these two independent subproblems that can solved in isolation? In this work, we show that they are not. In fact, the presence of mapping-aware translation algorithms that make use of the properties of the relational decomposition indicate that the two subproblems are closely related. More details can be found in the ICDT 2003 paper On the Difficulty of Finding Optimal Relational Decompositions for XML Workloads: A Complexity Theoretic Perspective.
                                                                                                                                                Next: XML Publishing

  1.  [STZ+99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang,Gang He, David J. DeWitt, Jeffrey F. Naughton: Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314
  2. [ML02] Murali Mani, Dongwon Lee: XML to Relational Conversion Using Theory of Regular Tree Grammars. EEXTT 2002: 81-103
  3. [Choi02] Byron Choi: What Are Real DTDs Like, WebDB 2002