article

XML stream processing using tree-edit distance embeddings

Authors:

Minos Garofalakis,

Amit KumarAuthors Info & Claims

ACM Transactions on Database Systems (TODS), Volume 30, Issue 1

Pages 279 - 332

https://doi.org/10.1145/1061318.1061326

Published: 01 March 2005 Publication History

Abstract

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L₁ vector space while guaranteeing a (worst-case) upper bound of O(log²n log*n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.

References

[1]

Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. 1999. Join synopses for approximate query answering. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA). 275--286.

Digital Library

[2]

Alon, N., Gibbons, P. B., Matias, Y., and Szegedy, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Philadeplphia, PA).

Digital Library

[3]

Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing (Philadelphia, PA). 20--29.

Digital Library

[4]

Altinel, M. and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 53--64.

Digital Library

[5]

Apostolico, A. and Galil, Z., Eds. 1997. Pattern Matching Algorithms. Oxford University Press, Oxford, U.K.

Digital Library

[6]

Arasu, A., Babcock, B., Babu, S., McAlister, J., and Widom, J. 2002. Characterizing memory requirements for queries over continuous data streams. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 221--232.

Digital Library

[7]

Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the Twenty-First ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 1--16.

Digital Library

[8]

Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., and Trevisan, L. 2002. Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM'02), (Cambridge, MA).

Digital Library

[9]

Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K. 2000. Approximate query processing using wavelets. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 111--122.

Digital Library

[10]

Chan, C.-Y., Felber, P., Garofalakis, M., and Rastogi, R. 2002. Efficient filtering of XML documents with XPath expressions. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA).

Digital Library

[11]

Charikar, M., Chen, K., and Farach-Colton, M. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming (Malaga, Spain).

Digital Library

[12]

Charikar, M. and Sahai, A. 2002. Dimension reduction in the l1 norm. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science (Vancouver, B.C., Canada).

Digital Library

[13]

Cormode, G., Datar, M., Indyk, P., and Muthukrishnan, S. 2002a. Comparing data streams using hamming norms (how to zero in). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 335--345.

Digital Library

[14]

Cormode, G., Indyk, P., Koudas, N., and Muthukrishnan, S. 2002b. Fast mining of massive tabular data via approximate distance computations. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA).

Digital Library

[15]

Cormode, G. and Muthukrishnan, S. 2002. The string edit distance matching problem with moves. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, CA).

Digital Library

[16]

Dasu, T. and Johnson, T. 2003. Exploratory Data Mining and Data Cleaning. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, NY.

Digital Library

[17]

Diao, Y., Altinel, M., Franklin, M. J., Zhang, H., and Fischer, P. 2003. Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Database Syst. 28, 4 (Dec.), 467--516.

Digital Library

[18]

Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML message brokering. In Proceedings of the 29th International Conference on Very Large Data Bases (Berlin, Germany).

Digital Library

[19]

Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 61--72.

Digital Library

[20]

Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2004. Sketch-based multi-query processing over data streams. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).

[21]

Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 1999. An approximate L1-difference algorithm for massive data streams. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science (New York City, NY).

Digital Library

[22]

Florescu, D., Koller, D., and Levy, A. 1997. Using probabilistic information in data integration. In Proceedings of the 23rd International Conference on Very Large Data Bases (Athens, Greece).

Digital Library

[23]

Ganguly, S., Garofalakis, M., and Rastogi, R. 2004. Processing data-stream join aggregates using skimmed sketches. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).

[24]

Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Querying and mining data streams: you only get one look (Tutorial). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China).

[25]

Garofalakis, M. and Gibbons, P. B. 2001. Approximate query processing: Taming the terabytes (Tutorial). In Proceedings of the 27th International Conference on Very Large Data Bases (Roma, Italy).

Digital Library

[26]

Garofalakis, M. and Kumar, A. 2003. Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (San Diego, CA). 143--154.

Digital Library

[27]

Gilbert, A. C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2002a. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th Annual ACM Symposium on the Theory of Computing (Montreal, P.Q., Canada).

Digital Library

[28]

Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2001. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy).

Digital Library

[29]

Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2002b. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 454--465.

Digital Library

[30]

Gravano, L., Ipeirotis, P. G., Jagadish, H., Koudas, N., Muthuksrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy).

Digital Library

[31]

Greenwald, M. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, CA).

Digital Library

[32]

Guha, S., Jagadish, H., Koudas, N., Srivastava, D., and Yu, T. 2002. Approximate XML joins. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI).

Digital Library

[33]

Gupta, A. and Suciu, D. 2003. Stream processing of XPath queries with predicates. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (San Diego, CA).

Digital Library

[34]

Indyk, P. 2000. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (Redondo Beach, CA). 189--197.

Digital Library

[35]

Indyk, P. 2001. Algorithmic aspects of geometric embeddings. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (Las Vegas, NV).

Digital Library

[36]

Indyk, P., Koudas, N., and Muthukrishnan, S. 2000. Identifying representative trends in massive time series data sets using sketches. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 363--372.

Digital Library

[37]

Ioannidis, Y. E. and Poosala, V. 1999. Histogram-based approximation of set-valued query answers. In Proceedings of the 25th International Conference on Very Large Data Bases (Edinburgh, Scotland).

Digital Library

[38]

Johnson, W. B. and Lindenstrauss, J. 1984. Extensions of lipschitz mappings into Hilbert space. Contemp. Math. 26, 189--206.

[39]

Karp, R. M. and Rabin, M. O. 1987. Efficient randomized pattern-matching algorithms. IBM J. Res. Devel. 31, 2 (Mar.), 249--260.

Digital Library

[40]

Knuth, D. E. 1973. The Art of Computer Programming (Vol. 1/Fundamental Algorithms). Addison-Wesley, Reading, MA.

[41]

Lakshmanan, L. V. S. and Parthasarathy, S. 2002. On efficient matching of streaming XML documents and queries. In Proceedings of the 8th International Conference on Extending Database Technology (EDBT'2002, Prague, Czech Republic).

Digital Library

[42]

Manku, G. S. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 346--357.

Digital Library

[43]

Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press, Cambridge, U.K.

Digital Library

[44]

Nolan, J. P. 2004. Stable distributions: Models for heavy-tailed data. Available online at http://academic2.american.edu/~jpnolan/stable/stable.html.

[45]

Polyzotis, N. and Garofalakis, M. 2002. Statistical synopses for graph-structured XML databases. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI).

Digital Library

[46]

Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004. Approximate XML query answers. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France).

Digital Library

[47]

Schmidt, A., Waas, F., Kersten, M., Carey, M. J., Manolescu, I., and Busse, R. 2002. XMark: A benchmark for XML data management. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China).

Digital Library

[48]

Shapira, D. and Storer, J. A. 2002. Edit distance with move operations. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM'2002), Fukuoka, Japan). 85--98.

Digital Library

[49]

Thaper, N., Guha, S., Indyk, P., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 428--439.

Digital Library

[50]

Uchaikin, V. V. and Zolotarev, V. M. 1999. Chance and Stability : Stable Distributions and their Applications. VSP, Utrecht, The Netherland.

[51]

Ukkonen, E. 1992. Approximate string matching with q-grams and maximal matches. Theoret. Comput. Sci. 92, 191--211.

Digital Library

[52]

Vitter, J. S. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA).

Digital Library

[53]

Zhang, K. and Shasha, D. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18, 6 (Dec.), 1245--1262.

Digital Library

Cited By

Lyu WSridharamurthy RPhillips JWang B(2024)Fast Comparative Analysis of Merge Trees Using Locality Sensitive HashingIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.345638331:1(141-151)Online publication date: 12-Sep-2024
https://dl.acm.org/doi/10.1109/TVCG.2024.3456383
Fang ZHuang JSu XKasai HWilliams BChen YNeville J(2023)Wasserstein graph distance based on L1-approximated tree edit distance between Weisfeiler-Lehman subtreesProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v37i6.25916(7539-7549)Online publication date: 7-Feb-2023
https://dl.acm.org/doi/10.1609/aaai.v37i6.25916
Li HShao YDu JCui BChen L(2022)An I/O-efficient disk-based graph system for scalable second-order random walk of large graphsProceedings of the VLDB Endowment10.14778/3529337.352934615:8(1619-1631)Online publication date: 1-Apr-2022
https://dl.acm.org/doi/10.14778/3529337.3529346
Show More Cited By

Index Terms

XML stream processing using tree-edit distance embeddings

Recommendations

Correlating XML data streams using tree-edit distance embeddings
PODS '03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is ...
XSKETCH synopses for XML data graphs

Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language ...
Analysis of tree edit distance on XML data
CIIT '07: The Sixth IASTED International Conference on Communications, Internet, and Information Technology

The problem of comparing tree structures occurs in various areas in computer science and engineering, including the application to XML data processing. To solve this problem, tree edit distance is a common and significant measurement defining the ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems

ACM Transactions on Database Systems Volume 30, Issue 1

Special Issue: SIGMOD/PODS 2003

March 2005

332 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/1061318

Issue’s Table of Contents

Copyright © 2005 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2005

Published in TODS Volume 30, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

72
Total Citations
View Citations
1,428
Total Downloads

Downloads (Last 12 months)12
Downloads (Last 6 weeks)0

Reflects downloads up to 03 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Lyu WSridharamurthy RPhillips JWang B(2024)Fast Comparative Analysis of Merge Trees Using Locality Sensitive HashingIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.345638331:1(141-151)Online publication date: 12-Sep-2024
https://dl.acm.org/doi/10.1109/TVCG.2024.3456383
Fang ZHuang JSu XKasai HWilliams BChen YNeville J(2023)Wasserstein graph distance based on L1-approximated tree edit distance between Weisfeiler-Lehman subtreesProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v37i6.25916(7539-7549)Online publication date: 7-Feb-2023
https://dl.acm.org/doi/10.1609/aaai.v37i6.25916
Li HShao YDu JCui BChen L(2022)An I/O-efficient disk-based graph system for scalable second-order random walk of large graphsProceedings of the VLDB Endowment10.14778/3529337.352934615:8(1619-1631)Online publication date: 1-Apr-2022
https://dl.acm.org/doi/10.14778/3529337.3529346
Xu ZNiu LJi JLi Q(2022)Structure-Preserving Hashing for Tree-Structured DataSignal, Image and Video Processing10.1007/s11760-022-02166-716:8(2045-2053)Online publication date: 13-Mar-2022
https://doi.org/10.1007/s11760-022-02166-7
Wang JLi CMa KHuo JYan FFeng XXu Y(2021)AUTOGRProceedings of the VLDB Endowment10.14778/3461535.346154114:9(1517-1530)Online publication date: 22-Oct-2021
https://dl.acm.org/doi/10.14778/3461535.3461541
Akutsu TJansson JLi RTakasu ATamura T(2021)New and improved algorithms for unordered tree inclusionTheoretical Computer Science10.1016/j.tcs.2021.06.013883(83-98)Online publication date: Sep-2021
https://doi.org/10.1016/j.tcs.2021.06.013
Huang JAbadi DRen K(2020)Scalable SPARQL querying of large RDF graphsProceedings of the VLDB Endowment10.14778/3402707.34027474:11(1123-1134)Online publication date: 3-Jun-2020
https://dl.acm.org/doi/10.14778/3402707.3402747
Herodotou HBabu S(2020)Profiling, what-if analysis, and cost-based optimization of MapReduce programsProceedings of the VLDB Endowment10.14778/3402707.34027464:11(1111-1122)Online publication date: 3-Jun-2020
https://dl.acm.org/doi/10.14778/3402707.3402746
Cai QGuo WZhang HAgrawal DChen GOoi BTan KTeo YWang S(2018)Efficient distributed memory management with RDMA and cachingProceedings of the VLDB Endowment10.14778/3236187.323620911:11(1604-1617)Online publication date: 1-Jul-2018
https://dl.acm.org/doi/10.14778/3236187.3236209
Abbas ZKalavri VCarbone PVlassov V(2018)Streaming graph partitioningProceedings of the VLDB Endowment10.14778/3236187.323620811:11(1590-1603)Online publication date: 1-Jul-2018
https://dl.acm.org/doi/10.14778/3236187.3236208
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents