XML Publishing and Schema-Based XML Storage: Is XML-to-SQL query translation similar in both domains?

There is a lot of interest in exporting existing relational data as XML documents. We refer to this as XML Publishing. Comparing this with the XML Storage scenario, we see that in both cases we have a (logical) XML view of data that is (physically) stored in a relational database. Moreover, in both scenarios, given an XML query over the XML view and the mapping between the XML view and the relational schema, the goal is to obtain an equivalent SQL query. The question we ask is: does the query translation problem differ between the two domains, or are the solutions for one domain directly applicable to the other?

Currently, the research literature focuses on developing query translation algorithms for the XML Publishing domain, and the assumption seems to be that the same algorithms apply directly to the XML Storage domain. This is possible because any instance of the XML Storage problem can be reduced to an instance of the XML Publishing problem through the notion of reconstruction XML views [SSK+01], or default XML views. The main question, therefore, is whether something can be done in a much simpler fashion in the XML Storage scenario than in the XML Publishing context. If so, XML-to-SQL query translation needs to be studied separately for the two problems.

We show that the Schema-Based XML Storage scenario is different from the XML Publishing scenario in the following manner. Previously, we developed mapping-aware translation algorithms for path expression queries for the Schema-Based XML Storage scenario. We show that designing equivalent algorithms for the XML Publishing domain is difficult: it involves using integrity constraints on the underlying relational data in a fairly complex fashion. We develop algorithms for translating path expression queries into SQL in the XML Publishing scenario over a non-recursive XML schema. We hope this difference makes it clear that the XML-to-SQL query translation problem in the XML Storage domain is far simpler than the equivalent problem in the XML Publishing domain, and so needs to be investigated separately.

The main difference between the two cases is as follows. In the XML Storage scenario, the XML-to-Relational mapping completely defines the contents of the relational database. In other words, the relational database contains exactly the same data as the input set of XML documents. In the XML Publishing scenario, on the other hand, it is possible that only parts of the relational data were exported in the XML view. Similarly, it is possible that some other parts of the relational data were exported several times. The XML-to-Relational mapping therefore does not completely describe the underlying relational data. As a result, mapping-aware query translation has to draw on another source of information, namely integrity constraints on the underlying relational data, before we can decide which parts of the SQL query are implied by the mapping and the constraints. We next revisit the example we used for motivating mapping-aware translation in the XML Storage scenario and see what happens in the XML Publishing case.

Suppose we had the following relational schema for the pre-existing relational data.

Figure: Example XML-to-Relational mapping

Book(bookid, title)
Author(authorid, bookid, ...)
Section(sectionid, bookid, sectionparentid, title)

Suppose that we export this relational data as an XML document according to the XML schema shown in the figure. This relational schema and XML schema pair are very similar to the corresponding pair in the XML Storage scenario. Let us revisit the same query Q1, which retrieves all the section titles.

Q1: for $title in document(*)//section/title
return $title

An equivalent SQL query according to the above view definition is:

SQ1: with Temp(id, title) as (
         select S.sectionid, S.title
         from Book B, Section S
         where B.bookid = S.bookid
       union all
         select S.sectionid, S.title
         from Temp T, Section S
         where T.id = S.sectionparentid
     )
     select title
     from Temp

Recall that in the XML Storage scenario, we were able to design a mapping-aware algorithm that simplified this query to the following SQL query:

MASQ1: select title
from Section

Can we do the same thing in the XML Publishing domain? If so, what assumptions are we making? In this particular case, it can be shown that we can simplify the query to MASQ1 if the following conditions hold:

Every Section tuple has exactly one of the two fields bookid and sectionparentid non-null.
Book.bookid is a key for the Book relation.
Section.sectionid is a key for the Section relation.
Section.bookid is a foreign key to the Book relation.
Section.sectionparentid is a foreign key to the Section relation.
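
As an illustration only, the conditions above could be declared as schema constraints. The sketch below uses SQLite through Python's sqlite3 module, which is an assumption: the source does not name a database system. It reads the last condition as a self-reference from sectionparentid to Section.sectionid (also an assumption), and the helper names open_db and try_insert are hypothetical.

```python
import sqlite3

# Hypothetical DDL: one way to state the five conditions as SQLite constraints.
DDL = """
create table Book (
    bookid integer primary key,          -- Book.bookid is a key
    title  text
);
create table Section (
    sectionid       integer primary key, -- Section.sectionid is a key
    bookid          integer references Book(bookid),        -- foreign key to Book
    sectionparentid integer references Section(sectionid),  -- foreign key to Section
    title           text,
    -- exactly one of the two parent pointers is non-null
    check ((bookid is null) <> (sectionparentid is null))
);
"""

def open_db():
    """Create an in-memory database with the constrained schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("pragma foreign_keys = on")  # SQLite leaves FK checks off by default
    conn.executescript(DDL)
    return conn

def try_insert(conn, row):
    """Return True if the Section row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("insert into Section values (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

With one book inserted, a section that points at the book or at an existing section is accepted, while a section with no parent, with two parents, or with a dangling parent reference is rejected.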

If these conditions hold on the relational schema, then we can translate the XML path expression query Q1 into the SQL query MASQ1. Otherwise, we will have to be satisfied with the SQL query SQ1. Notice that the fact that every section has exactly one parent, either a book or a section, is implicit in the XML schema in the XML Storage scenario, while in the XML Publishing scenario it is present only in the constraint information on the underlying relational data. So, in order to get efficient SQL queries for a given XML query, we need to reason with the constraints on the underlying relational data. We have developed a constraint-aware XML-to-SQL translation algorithm for path expression queries in the XML Publishing scenario, when the XML schema is non-recursive. Details of this algorithm can be found here. Extending this to recursive XML schemas is future work.
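
To see why the constraints matter, the following sketch runs SQ1 and MASQ1 on two small instances. It is not the translation algorithm itself, only a check of the two queries on sample data. SQLite (with its "with recursive" spelling), the sample rows, and the helper name run_both are assumptions made for illustration; the tables are deliberately created without constraints so that a violating instance can be loaded.

```python
import sqlite3

# SQ1 from the text, written with SQLite's "with recursive" keyword.
SQ1 = """
with recursive Temp(id, title) as (
    select S.sectionid, S.title
    from Book B, Section S
    where B.bookid = S.bookid
    union all
    select S.sectionid, S.title
    from Temp T, Section S
    where T.id = S.sectionparentid
)
select title from Temp
"""
MASQ1 = "select title from Section"

def run_both(sections):
    """Load one book plus the given Section rows; return sorted results of SQ1 and MASQ1."""
    conn = sqlite3.connect(":memory:")
    # No constraints are declared here, so any instance can be loaded.
    conn.executescript("""
        create table Book (bookid integer, title text);
        create table Section (sectionid integer, bookid integer,
                              sectionparentid integer, title text);
        insert into Book values (1, 'Databases');
    """)
    conn.executemany("insert into Section values (?, ?, ?, ?)", sections)
    sq1 = sorted(t for (t,) in conn.execute(SQ1))
    masq1 = sorted(t for (t,) in conn.execute(MASQ1))
    return sq1, masq1

# Every section has exactly one parent: the conditions hold.
good = [(1, 1, None, "Intro"), (2, None, 1, "Motivation")]
# One section has no parent and one has two: the first condition fails.
bad = good + [(3, None, None, "Orphan"), (4, 1, 1, "Both parents")]

print(run_both(good))
print(run_both(bad))
```

On the first instance both queries return the same titles. On the second, SQ1 never reaches the parentless section and returns the two-parent section twice, whereas MASQ1 returns every Section row exactly once, so the simplification is no longer valid.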

[SSK+01] Jayavel Shanmugasundaram, Eugene J. Shekita, Jerry Kiernan, Rajasekar Krishnamurthy, Stratis Viglas, Jeffrey F. Naughton, Igor Tatarinov: A General Technique for Querying XML Documents using a Relational Database System. SIGMOD Record 30(3): 20-26 (2001)
