skip to main content
article

XML stream processing using tree-edit distance embeddings

Published: 01 March 2005 Publication History

Abstract

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O(log2n log*n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.

References

[1]
Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. 1999. Join synopses for approximate query answering. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA). 275--286.
[2]
Alon, N., Gibbons, P. B., Matias, Y., and Szegedy, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Philadeplphia, PA).
[3]
Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing (Philadelphia, PA). 20--29.
[4]
Altinel, M. and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 53--64.
[5]
Apostolico, A. and Galil, Z., Eds. 1997. Pattern Matching Algorithms. Oxford University Press, Oxford, U.K.
[6]
Arasu, A., Babcock, B., Babu, S., McAlister, J., and Widom, J. 2002. Characterizing memory requirements for queries over continuous data streams. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 221--232.
[7]
Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the Twenty-First ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 1--16.
[8]
Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., and Trevisan, L. 2002. Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM'02), (Cambridge, MA).
[9]
Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K. 2000. Approximate query processing using wavelets. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 111--122.
[10]
Chan, C.-Y., Felber, P., Garofalakis, M., and Rastogi, R. 2002. Efficient filtering of XML documents with XPath expressions. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA).
[11]
Charikar, M., Chen, K., and Farach-Colton, M. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming (Malaga, Spain).
[12]
Charikar, M. and Sahai, A. 2002. Dimension reduction in the l1 norm. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science (Vancouver, B.C., Canada).
[13]
Cormode, G., Datar, M., Indyk, P., and Muthukrishnan, S. 2002a. Comparing data streams using hamming norms (how to zero in). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 335--345.
[14]
Cormode, G., Indyk, P., Koudas, N., and Muthukrishnan, S. 2002b. Fast mining of massive tabular data via approximate distance computations. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA).
[15]
Cormode, G. and Muthukrishnan, S. 2002. The string edit distance matching problem with moves. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, CA).
[16]
Dasu, T. and Johnson, T. 2003. Exploratory Data Mining and Data Cleaning. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, NY.
[17]
Diao, Y., Altinel, M., Franklin, M. J., Zhang, H., and Fischer, P. 2003. Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Database Syst. 28, 4 (Dec.), 467--516.
[18]
Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML message brokering. In Proceedings of the 29th International Conference on Very Large Data Bases (Berlin, Germany).
[19]
Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 61--72.
[20]
Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2004. Sketch-based multi-query processing over data streams. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).
[21]
Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 1999. An approximate L1-difference algorithm for massive data streams. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science (New York City, NY).
[22]
Florescu, D., Koller, D., and Levy, A. 1997. Using probabilistic information in data integration. In Proceedings of the 23rd International Conference on Very Large Data Bases (Athens, Greece).
[23]
Ganguly, S., Garofalakis, M., and Rastogi, R. 2004. Processing data-stream join aggregates using skimmed sketches. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).
[24]
Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Querying and mining data streams: you only get one look (Tutorial). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China).
[25]
Garofalakis, M. and Gibbons, P. B. 2001. Approximate query processing: Taming the terabytes (Tutorial). In Proceedings of the 27th International Conference on Very Large Data Bases (Roma, Italy).
[26]
Garofalakis, M. and Kumar, A. 2003. Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (San Diego, CA). 143--154.
[27]
Gilbert, A. C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2002a. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th Annual ACM Symposium on the Theory of Computing (Montreal, P.Q., Canada).
[28]
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2001. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy).
[29]
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2002b. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 454--465.
[30]
Gravano, L., Ipeirotis, P. G., Jagadish, H., Koudas, N., Muthuksrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy).
[31]
Greenwald, M. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, CA).
[32]
Guha, S., Jagadish, H., Koudas, N., Srivastava, D., and Yu, T. 2002. Approximate XML joins. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI).
[33]
Gupta, A. and Suciu, D. 2003. Stream processing of XPath queries with predicates. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (San Diego, CA).
[34]
Indyk, P. 2000. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (Redondo Beach, CA). 189--197.
[35]
Indyk, P. 2001. Algorithmic aspects of geometric embeddings. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (Las Vegas, NV).
[36]
Indyk, P., Koudas, N., and Muthukrishnan, S. 2000. Identifying representative trends in massive time series data sets using sketches. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 363--372.
[37]
Ioannidis, Y. E. and Poosala, V. 1999. Histogram-based approximation of set-valued query answers. In Proceedings of the 25th International Conference on Very Large Data Bases (Edinburgh, Scotland).
[38]
Johnson, W. B. and Lindenstrauss, J. 1984. Extensions of lipschitz mappings into Hilbert space. Contemp. Math. 26, 189--206.
[39]
Karp, R. M. and Rabin, M. O. 1987. Efficient randomized pattern-matching algorithms. IBM J. Res. Devel. 31, 2 (Mar.), 249--260.
[40]
Knuth, D. E. 1973. The Art of Computer Programming (Vol. 1/Fundamental Algorithms). Addison-Wesley, Reading, MA.
[41]
Lakshmanan, L. V. S. and Parthasarathy, S. 2002. On efficient matching of streaming XML documents and queries. In Proceedings of the 8th International Conference on Extending Database Technology (EDBT'2002, Prague, Czech Republic).
[42]
Manku, G. S. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 346--357.
[43]
Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press, Cambridge, U.K.
[44]
Nolan, J. P. 2004. Stable distributions: Models for heavy-tailed data. Available online at http://academic2.american.edu/~jpnolan/stable/stable.html.
[45]
Polyzotis, N. and Garofalakis, M. 2002. Statistical synopses for graph-structured XML databases. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI).
[46]
Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004. Approximate XML query answers. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France).
[47]
Schmidt, A., Waas, F., Kersten, M., Carey, M. J., Manolescu, I., and Busse, R. 2002. XMark: A benchmark for XML data management. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China).
[48]
Shapira, D. and Storer, J. A. 2002. Edit distance with move operations. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM'2002), Fukuoka, Japan). 85--98.
[49]
Thaper, N., Guha, S., Indyk, P., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 428--439.
[50]
Uchaikin, V. V. and Zolotarev, V. M. 1999. Chance and Stability : Stable Distributions and their Applications. VSP, Utrecht, The Netherland.
[51]
Ukkonen, E. 1992. Approximate string matching with q-grams and maximal matches. Theoret. Comput. Sci. 92, 191--211.
[52]
Vitter, J. S. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA).
[53]
Zhang, K. and Shasha, D. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18, 6 (Dec.), 1245--1262.

Cited By

View all
  • (2024)Fast Comparative Analysis of Merge Trees Using Locality Sensitive HashingIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.345638331:1(141-151)Online publication date: 12-Sep-2024
  • (2023)Wasserstein graph distance based on L1-approximated tree edit distance between Weisfeiler-Lehman subtreesProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v37i6.25916(7539-7549)Online publication date: 7-Feb-2023
  • (2022)An I/O-efficient disk-based graph system for scalable second-order random walk of large graphsProceedings of the VLDB Endowment10.14778/3529337.352934615:8(1619-1631)Online publication date: 1-Apr-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 30, Issue 1
Special Issue: SIGMOD/PODS 2003
March 2005
332 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1061318
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2005
Published in TODS Volume 30, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. XML
  2. approximate query processing
  3. data streams
  4. data synopses
  5. metric-space embeddings
  6. tree-edit distance

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)12
  • Downloads (Last 6 weeks)0
Reflects downloads up to 03 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Fast Comparative Analysis of Merge Trees Using Locality Sensitive HashingIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.345638331:1(141-151)Online publication date: 12-Sep-2024
  • (2023)Wasserstein graph distance based on L1-approximated tree edit distance between Weisfeiler-Lehman subtreesProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v37i6.25916(7539-7549)Online publication date: 7-Feb-2023
  • (2022)An I/O-efficient disk-based graph system for scalable second-order random walk of large graphsProceedings of the VLDB Endowment10.14778/3529337.352934615:8(1619-1631)Online publication date: 1-Apr-2022
  • (2022)Structure-Preserving Hashing for Tree-Structured DataSignal, Image and Video Processing10.1007/s11760-022-02166-716:8(2045-2053)Online publication date: 13-Mar-2022
  • (2021)AUTOGRProceedings of the VLDB Endowment10.14778/3461535.346154114:9(1517-1530)Online publication date: 22-Oct-2021
  • (2021)New and improved algorithms for unordered tree inclusionTheoretical Computer Science10.1016/j.tcs.2021.06.013883(83-98)Online publication date: Sep-2021
  • (2020)Scalable SPARQL querying of large RDF graphsProceedings of the VLDB Endowment10.14778/3402707.34027474:11(1123-1134)Online publication date: 3-Jun-2020
  • (2020)Profiling, what-if analysis, and cost-based optimization of MapReduce programsProceedings of the VLDB Endowment10.14778/3402707.34027464:11(1111-1122)Online publication date: 3-Jun-2020
  • (2018)Efficient distributed memory management with RDMA and cachingProceedings of the VLDB Endowment10.14778/3236187.323620911:11(1604-1617)Online publication date: 1-Jul-2018
  • (2018)Streaming graph partitioningProceedings of the VLDB Endowment10.14778/3236187.323620811:11(1590-1603)Online publication date: 1-Jul-2018
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media