|
ABSTRACT
The rapid adoption of XML as the standard for data representation and exchange foreshadows a massive increase in the amounts of XML data collected, maintained, and queried over the Internet or in large corporate data-stores. Inevitably, this will result in the development of on-line decision support systems, where users and analysts interactively explore large XML data sets through a declarative query interface (e.g., XQuery or XSLT). Given the importance of remaining interactive, such on-line systems can employ approximate query answers as an effective mechanism for reducing response time and providing users with early feedback. This approach has been successfully used in relational systems and it becomes even more compelling in the XML world, where the evaluation of complex queries over massive tree-structured data is inherently more expensive.In this paper, we initiate a study of approximate query answering techniques for large XML databases. Our approach is based on a novel, conceptually simple, yet very effective XML-summarization mechanism: TREESKETCH synopses. We demonstrate that, unlike earlier techniques focusing solely on selectivity estimation, our TREESKETCH synopses are much more effective in capturing the complete tree structure of the underlying XML database. We propose novel construction algorithms for building TREESKETCH summaries of limited size, and describe schemes for processing general XML twig queries over a concise TREESKETCH in order to produce very fast, approximate tree-structured query answers. To quantify the quality of such approximate answers, we propose a novel, intuitive error metric that captures the quality of the approximation in terms of both the overall structure of the XML tree and the distribution of document edges. Experimental results on real-life and synthetic data sets verify the effectiveness of our TREESKETCH synopses in producing fast, accurate approximate answers and demonstrate their benefits over previously proposed techniques that focus solely on selectivity estimation. In particular, TREESKETCHes yield faster, more accurate approximate answers and selectivity estimates, and are more efficient to construct. To the best of our knowledge, ours is the first work to address the timely problem of producing fast, approximate tree-structured answers for complex XML queries.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
P. Buneman, M. Grohe, and C. Koch. "Path Queries on Compressed XML". In Proceedings of the 29th Intl. Conf. on Very Large Data Bases, 2003.
|
| |
3
|
|
| |
4
|
Don Chamberlin, James Clark, Daniela Florescu, Jonathan Robie, Jérôme Siméon, and Mugur Stefanescu. "XQuery 1.0: An XML Query Language". W3C Working Draft, 2001.
|
| |
5
|
Zhimin Chen, H. V. Jagadish, Laks V. S. Laksmanan, and Stelios Paparizos. "From Tree Patterns to Generalized Tree Patterns: On Efficient Eavluation of XQuery". In Proceedings of the 29th Intl. Conf. on Very Large Data Bases, 2003.
|
| |
6
|
Zhiyuan Chen , H. V. Jagadish , Flip Korn , Nick Koudas , S. Muthukrishnan , Raymond T. Ng , Divesh Srivastava, Counting Twig Matches in a Tree, Proceedings of the 17th International Conference on Data Engineering, p.595-604, April 02-06, 2001
|
| |
7
|
James Clark. "XSL Transformations (XSLT), Version 1.0". W3C Recommendation, November 1999.
|
| |
8
|
James Clark and Steve DeRose. "XML Path Language (XPath), Version 1.0". W3C Recommendation, November 1999.
|
 |
9
|
Juliana Freire , Jayant R. Haritsa , Maya Ramanath , Prasan Roy , Jérôme Siméon, StatiX: making XML count, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, June 03-06, 2002, Madison, Wisconsin
[doi> 10.1145/564691.564713]
|
| |
10
|
|
| |
11
|
|
| |
12
|
L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Parr. XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. In Proceedings of the 28th Intl. Conf. on Very Large Data Bases, 2002.
|
| |
13
|
|
| |
14
|
|
 |
15
|
|
| |
16
|
N. Polyzotis and M. Garofalakis. "Structure and Value Synopses for XML Data Graphs". In Proceedings of the 28th Intl. Conf. on Very Large Data Bases, 2002.
|
| |
17
|
Neoklis Polyzotis, Minos Garofalakis, and Yannis Ioannidis. Approximate XML Query Answers. 2004.
|
| |
18
|
|
| |
19
|
|
| |
20
|
|
 |
21
|
|
| |
22
|
|
 |
23
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
CITED BY 15
|
|
|
|
|
|
|
Sihem Amer-Yahia , Nick Koudas , Amélie Marian , Divesh Srivastava , David Toman, Structure and content scoring for XML, Proceedings of the 31st international conference on Very large data bases, August 30-September 02, 2005, Trondheim, Norway
|
|
|
|
|
|
|
|
|
|
|
Ning Zhang , Peter J. Haas , Vanja Josifovski , Guy M. Lohman , Chun Zhang, Statistical learning techniques for costing XML queries, Proceedings of the 31st international conference on Very large data bases, August 30-September 02, 2005, Trondheim, Norway
|
|
|
A. Balmin , T. Eliaz , J. Hornibrook , L. Lim , G. M. Lohman , D. Simmen , M. Wang , C. Zhang, Cost-based optimization in DB2 XML, IBM Systems Journal, v.45 n.2, p.299-319, January 2006
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Peer to Peer - Readers of this Article have also read:
-
Web application security assessment by fault injection and behavior monitoring
Proceedings of the 12th international conference on World Wide Web
Yao-Wen Huang
, Shih-Kun Huang
, Tsung-Po Lin
, Chung-Hung Tsai
-
Data structures for quadtree approximation and compression
Communications of the ACM
28, 9
Hanan Samet
-
A hierarchical single-key-lock access control using the Chinese remainder theorem
Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing
Kim S. Lee
, Huizhu Lu
, D. D. Fisher
-
The GemStone object database management system
Communications of the ACM
34, 10
Paul Butterworth
, Allen Otis
, Jacob Stein
-
Putting innovation to work: adoption strategies for multimedia communication systems
Communications of the ACM
34, 12
Ellen Francik
, Susan Ehrlich Rudman
, Donna Cooper
, Stephen Levine
|