skip to main content
research-article

Inference of concise regular expressions and DTDs

Published:03 May 2010Publication History
Skip Abstract Section

Abstract

We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise regular expressions—the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)—that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm crx that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small datasets.

Skip Supplemental Material Section

Supplemental Material

References

  1. Abiteboul, S., Buneman, P., and Suciu, D. 1999. Data on the Web. Morgan Kaufmann Publishers.Google ScholarGoogle Scholar
  2. Ahonen, H. 1996. Generating grammars for structured documents using grammatical inference methods. Ph.D. thesis, Report A-1996-4. Department of Computer Science, University of Helsinki.Google ScholarGoogle Scholar
  3. Angluin, D. and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Comput. Surv. 15, 3, 237--269. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Barbosa, D., Mendelzon, A. O., Keenleyside, J., and Lyons, K. A. 2002. ToXgene: An extensible template-based data generator for XML. In Proceedings of the 5th International Workshop on the Web and Databases (WebDB 2002). 49--54.Google ScholarGoogle Scholar
  5. Barbosa, D., Mignet, L., and Veltri, P. 2006. Studying the XML web: Gathering statistics from an XML sample. World Wide Web 9, 2, 187--212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Benedikt, M., Fan, W., and Geerts, F. 2008. XPath satisfiability in the presence of DTDs. J. ACM 55, 2, 1--79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bernstein, P. A. 2003. Applying model management to classical meta data problems. In Online Proceedings of the 1st Biennal Conference on Innovative Data Systems Research (CIDR'03).Google ScholarGoogle Scholar
  8. Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. Learning deterministic regular expressions for the inference of schemas from XML data. http://arxiv.org/abs/1004.2372.Google ScholarGoogle Scholar
  9. Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. 2008. Learning deterministic regular expressions for the inference of schemas from XML data. In Proceeding of the 17th International Conference on World Wide Web (WWW'08). 825--834. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Bex, G. J., Neven, F., and den Bussche, J. V. 2004. DTDs versus XML Schema: A practical study. In Proceedings of the International Workshop on Web and Database (WebDB). S. Amer-Yahia and L. Gravano, Eds. 79--84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Bex, G. J., Neven, F., Schwentick, T., and Tuyls, K. 2006. Inference of concise DTDs from XML data. In Proceedings of the International Conference on Database Theory (VLDB). U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, Eds. ACM, 115--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Bex, G. J., Neven, F., and Vansummeren, S. 2007. Inferring XML schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07). 998--1009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Brāzma, A. 1993. Efficient identification of regular expressions from representative examples. In Proceedings of the 6th Annual Conference on Computational Learning Theory (COLT'93). ACM Press, 236--242. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Brüggeman-Klein, A. 1993. Regular expressions into finite automata. Theor. Comput. Sci. 120, 2, 197--213. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Brüggemann-Klein, A. and Wood, D. 1998. One-Unambiguous regular languages. Inform. Comput. 140, 2, 229--253. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Buneman, P., Davidson, S. B., Fernandez, M. F., and Suciu, D. 1997. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory (ICDT'97). Lecture Notes in Computer Science, vol. 1186. Springer, 336--350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Caron, P. and Ziadi, D. 2000. Characterization of Glushkov automata. Theor. Comput. Sci. 233, 1--2, 75--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Castor. The Castor project. www.castor.org.Google ScholarGoogle Scholar
  19. Chidlovskii, B. 2001. Schema extraction from XML: A grammatical inference approach. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases (KRDB'01). CEUR Workshop Proceedings, vol. 45.Google ScholarGoogle Scholar
  20. Clark, J. Trang: Multi-Format schema converter based on RELAX NG. www.thaiopensource.com/relaxng/trang.html.Google ScholarGoogle Scholar
  21. Cover, R. 2003. The Cover Pages. xml.coverpages.org.Google ScholarGoogle Scholar
  22. Delgado, M. and Morais, J. 2004. Approximation to the smallest regular expression for a given regular language. In Proceedings of the, 9th International Conference on Implementation and Application of Automata. Lecture Notes in Computer Science, vol. 3317. Springer, 312--314. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Deutsch, A., Fernandez, M. F., and Suciu, D. 1999. Storing semistructured data with STORED. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, 431--442. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Ehrenfeucht, A. and Zeiger, P. 1976. Complexity measures for regular expressions. J. Comput. Syst. Sci. 12, 134--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Fernandez, M. F. and Suciu, D. 1998. Optimizing regular path expressions using graph schemas. In Proceedings of the 14th International Conference on Data Engineering (ICDE'98). 14--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Fernau, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the 7th International Colloquium on Grammatical Inference: Algorithms and Applications. Lecture Notes in Artificial Intelligence, vol. 3264. Springer, 277--278.Google ScholarGoogle Scholar
  27. Fernau, H. 2009. Algorithms for learning regular expressions from positive data. Inform. Comput. 207, 4, 521--541. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Florescu, D. 2005. Managing semi-structured data. ACMQueue 3, 8, 18--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. García, P. and Vidal, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12, 9, 920--925. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., and Shim, K. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Mining Knowl. Discov. 7, 23--56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Gelade, W. and Neven, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS'08). Dagstuhl Seminar Proceedings, vol. 08001. 325--336.Google ScholarGoogle Scholar
  32. Gold, E. 1967. Language identification in the limit. Inform. Control 10, 5, 447--474.Google ScholarGoogle ScholarCross RefCross Ref
  33. Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97). 436--445. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Gruber, H. and Holzer, M. 2008. Finite automata, digraph connectivity, and regular expression size. In Proceedings of the 35th International Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 5126. Springer, 39--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Han, Y.-S. and Wood, D. 2007. Obtaining shorter regular expressions from finite-state automata. Theor. Comput. Sci. 370, 1--3, 110--120. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Hegewald, J., Naumann, F., and Weis, M. 2006. XStruct: Efficient schema extraction from multiple and large XML documents. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW'06). IEEE Computer Society, 81--97. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Hinkelman, S. 2005. Business integration—Information conformance statements (BI-ICS). Tech. rep., IBM DeveloperWorks.Google ScholarGoogle Scholar
  38. Hopcroft, J. and Ullman, J. 1979. Introduction to Automata Theory, Languages and computation. Addison-Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Huet, G. 1980. Confluent reductions: Abstract properties and applications to term rewriting systems. J. ACM 27, 4, 797--821. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Koch, C., Scherzinger, S., Schweikardt, N., and Stegmaier, B. 2004. Schema-Based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04). 228--239. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Manolescu, I., Florescu, D., and Kossmann, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB'01). 241--250. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Martens, W., Neven, F., and Schwentick, T. 2007. Simple off the shelf abstractions for XML schema. SIGMOD Rec. 36, 3, 15--22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Martens, W., Neven, F., Schwentick, T., and Bex, G. J. 2006. Expressiveness and complexity of XML schema. ACM Trans. Data. Syst. 31, 3. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., and Widom, J. 1997. Lore: A database management system for semistructured data. SIGMOD Rec. 26, 3, 54--66. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Melnik, S. 2004. Generic model management: Concepts and algorithms. Ph.D. thesis, University of Leipzig. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Mignet, L., Barbosa, D., and Veltri, P. 2003. The XML web: A first study. In Proceedings of the 12th International World Wide Web Conference. 500--510. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Miklau, G. 2002. XMLData repository. www.cs.washington.edu/research/xmldatasets.Google ScholarGoogle Scholar
  48. Min, J.-K., Ahn, J.-Y., and Chung, C.-W. 2003. Efficient extraction of schemas for XML documents. Inform. Process. Lett. 85, 1, 7--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Nestorov, S., Abiteboul, S., and Motwani, R. 1998. Extracting schema from semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, 295--306. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Nestorov, S., Ullman, J. D., Wiener, J. L., and Chawathe, S. S. 1997. Representative objects: Concise representations of semistructured, hierarchial data. In Proceedings of the 13th International Conference on Data Engineering. IEEE Computer Society, 79--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Neven, F. and Schwentick, T. 2006. On the complexity of XPath containment in the presence of disjunction, DTDs, and variables. Logical Methods Comput. Sci. 2, 3.Google ScholarGoogle ScholarCross RefCross Ref
  52. Ngu, A. H. H., Rocco, D., Critchlow, T., and Buttler, D. 2005. Automatic discovery and inferencing of complex bioinformatics web interfaces. World Wide Web 8, 4, 463--493. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Oaks, P. and ter Hofstede, A. H. M. 2007. Guided interaction: A mechanism to enable ad hoc service interaction. Inform. Syst. Frontiers 9, 1, 29--51. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Ohlebusch, E. 2001. Implementing conditional term rewriting by graph rewriting. Theor. Comput. Sci. 262, 1, 311--331. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Open Web Application Security Project Consortium. 2004. The top ten most critical web application security vulnerabilities—2004 update. www.owasp.org.Google ScholarGoogle Scholar
  56. Pitt, L. 1989. Inductive inference, DFAs, and computational complexity. In Proceedings of the International Workshop on Analogical and Inductive Inference (AII'89). Springer-Verlag, 18--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Rabiner, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2, 257--286.Google ScholarGoogle ScholarCross RefCross Ref
  58. Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334--350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Sahuguet, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (extended abstract). In Proceedings of the 3rd International Workshop on The World Wide Web and Databases, (WebDB'00), Selected Papers. 171--183. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Sakakibara, Y. 1997. Recent advances of grammatical inference. Theor. Comput. Sci. 185, 1, 15--45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Sankey, J. and Wong, R. K. 2001. Structural inference for semistructured data. In Proceedings of the International Conference on Information and Knowledge Management. ACM Press, 159--166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Sun. Sun JAXB. java.sun.com/webservices/jaxb.Google ScholarGoogle Scholar
  63. Thompson, H. S., Beech, D., Maloney, M., and Mendelsohn, N. 2004. XML Schema part 1: Structures 2nd Ed. World Wide Web Consortium, Recommendation REC-xmlschema-1-20041028.Google ScholarGoogle Scholar
  64. W3C. 2002. XHTML 1.0 The Extensible HyperText Markup Language, 2nd Ed. W3C.Google ScholarGoogle Scholar
  65. Wang, G., Liu, M., Yu, J. X., Sun, B., Yu, G., Lv, J., and Lu, H. 2003. Effective schema-based XML query optimization techniques. In Proceedings of the 7th International Database Engineering and Applications Symposium. 230--235.Google ScholarGoogle Scholar

Index Terms

  1. Inference of concise regular expressions and DTDs

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image ACM Transactions on Database Systems
              ACM Transactions on Database Systems  Volume 35, Issue 2
              April 2010
              336 pages
              ISSN:0362-5915
              EISSN:1557-4644
              DOI:10.1145/1735886
              Issue’s Table of Contents

              Copyright © 2010 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 3 May 2010
              • Accepted: 1 November 2009
              • Revised: 1 July 2009
              • Received: 1 January 2009
              Published in tods Volume 35, Issue 2

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article
              • Research
              • Refereed

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader