Abstract
We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise regular expressions—the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)—that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm crx that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small datasets.
Supplemental Material
Available for Download
Online appendix to inference of concise regular expressions and DTDs on article 11.
- Abiteboul, S., Buneman, P., and Suciu, D. 1999. Data on the Web. Morgan Kaufmann Publishers.Google Scholar
- Ahonen, H. 1996. Generating grammars for structured documents using grammatical inference methods. Ph.D. thesis, Report A-1996-4. Department of Computer Science, University of Helsinki.Google Scholar
- Angluin, D. and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Comput. Surv. 15, 3, 237--269. Google ScholarDigital Library
- Barbosa, D., Mendelzon, A. O., Keenleyside, J., and Lyons, K. A. 2002. ToXgene: An extensible template-based data generator for XML. In Proceedings of the 5th International Workshop on the Web and Databases (WebDB 2002). 49--54.Google Scholar
- Barbosa, D., Mignet, L., and Veltri, P. 2006. Studying the XML web: Gathering statistics from an XML sample. World Wide Web 9, 2, 187--212. Google ScholarDigital Library
- Benedikt, M., Fan, W., and Geerts, F. 2008. XPath satisfiability in the presence of DTDs. J. ACM 55, 2, 1--79. Google ScholarDigital Library
- Bernstein, P. A. 2003. Applying model management to classical meta data problems. In Online Proceedings of the 1st Biennal Conference on Innovative Data Systems Research (CIDR'03).Google Scholar
- Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. Learning deterministic regular expressions for the inference of schemas from XML data. http://arxiv.org/abs/1004.2372.Google Scholar
- Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. 2008. Learning deterministic regular expressions for the inference of schemas from XML data. In Proceeding of the 17th International Conference on World Wide Web (WWW'08). 825--834. Google ScholarDigital Library
- Bex, G. J., Neven, F., and den Bussche, J. V. 2004. DTDs versus XML Schema: A practical study. In Proceedings of the International Workshop on Web and Database (WebDB). S. Amer-Yahia and L. Gravano, Eds. 79--84. Google ScholarDigital Library
- Bex, G. J., Neven, F., Schwentick, T., and Tuyls, K. 2006. Inference of concise DTDs from XML data. In Proceedings of the International Conference on Database Theory (VLDB). U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, Eds. ACM, 115--126. Google ScholarDigital Library
- Bex, G. J., Neven, F., and Vansummeren, S. 2007. Inferring XML schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07). 998--1009. Google ScholarDigital Library
- Brāzma, A. 1993. Efficient identification of regular expressions from representative examples. In Proceedings of the 6th Annual Conference on Computational Learning Theory (COLT'93). ACM Press, 236--242. Google ScholarDigital Library
- Brüggeman-Klein, A. 1993. Regular expressions into finite automata. Theor. Comput. Sci. 120, 2, 197--213. Google ScholarDigital Library
- Brüggemann-Klein, A. and Wood, D. 1998. One-Unambiguous regular languages. Inform. Comput. 140, 2, 229--253. Google ScholarDigital Library
- Buneman, P., Davidson, S. B., Fernandez, M. F., and Suciu, D. 1997. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory (ICDT'97). Lecture Notes in Computer Science, vol. 1186. Springer, 336--350. Google ScholarDigital Library
- Caron, P. and Ziadi, D. 2000. Characterization of Glushkov automata. Theor. Comput. Sci. 233, 1--2, 75--90. Google ScholarDigital Library
- Castor. The Castor project. www.castor.org.Google Scholar
- Chidlovskii, B. 2001. Schema extraction from XML: A grammatical inference approach. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases (KRDB'01). CEUR Workshop Proceedings, vol. 45.Google Scholar
- Clark, J. Trang: Multi-Format schema converter based on RELAX NG. www.thaiopensource.com/relaxng/trang.html.Google Scholar
- Cover, R. 2003. The Cover Pages. xml.coverpages.org.Google Scholar
- Delgado, M. and Morais, J. 2004. Approximation to the smallest regular expression for a given regular language. In Proceedings of the, 9th International Conference on Implementation and Application of Automata. Lecture Notes in Computer Science, vol. 3317. Springer, 312--314. Google ScholarDigital Library
- Deutsch, A., Fernandez, M. F., and Suciu, D. 1999. Storing semistructured data with STORED. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, 431--442. Google ScholarDigital Library
- Ehrenfeucht, A. and Zeiger, P. 1976. Complexity measures for regular expressions. J. Comput. Syst. Sci. 12, 134--146. Google ScholarDigital Library
- Fernandez, M. F. and Suciu, D. 1998. Optimizing regular path expressions using graph schemas. In Proceedings of the 14th International Conference on Data Engineering (ICDE'98). 14--23. Google ScholarDigital Library
- Fernau, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the 7th International Colloquium on Grammatical Inference: Algorithms and Applications. Lecture Notes in Artificial Intelligence, vol. 3264. Springer, 277--278.Google Scholar
- Fernau, H. 2009. Algorithms for learning regular expressions from positive data. Inform. Comput. 207, 4, 521--541. Google ScholarDigital Library
- Florescu, D. 2005. Managing semi-structured data. ACMQueue 3, 8, 18--24. Google ScholarDigital Library
- García, P. and Vidal, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12, 9, 920--925. Google ScholarDigital Library
- Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., and Shim, K. 2003. XTRACT: Learning document type descriptors from XML document collections. Data Mining Knowl. Discov. 7, 23--56. Google ScholarDigital Library
- Gelade, W. and Neven, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS'08). Dagstuhl Seminar Proceedings, vol. 08001. 325--336.Google Scholar
- Gold, E. 1967. Language identification in the limit. Inform. Control 10, 5, 447--474.Google ScholarCross Ref
- Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97). 436--445. Google ScholarDigital Library
- Gruber, H. and Holzer, M. 2008. Finite automata, digraph connectivity, and regular expression size. In Proceedings of the 35th International Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 5126. Springer, 39--50. Google ScholarDigital Library
- Han, Y.-S. and Wood, D. 2007. Obtaining shorter regular expressions from finite-state automata. Theor. Comput. Sci. 370, 1--3, 110--120. Google ScholarDigital Library
- Hegewald, J., Naumann, F., and Weis, M. 2006. XStruct: Efficient schema extraction from multiple and large XML documents. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW'06). IEEE Computer Society, 81--97. Google ScholarDigital Library
- Hinkelman, S. 2005. Business integration—Information conformance statements (BI-ICS). Tech. rep., IBM DeveloperWorks.Google Scholar
- Hopcroft, J. and Ullman, J. 1979. Introduction to Automata Theory, Languages and computation. Addison-Wesley. Google ScholarDigital Library
- Huet, G. 1980. Confluent reductions: Abstract properties and applications to term rewriting systems. J. ACM 27, 4, 797--821. Google ScholarDigital Library
- Koch, C., Scherzinger, S., Schweikardt, N., and Stegmaier, B. 2004. Schema-Based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04). 228--239. Google ScholarDigital Library
- Manolescu, I., Florescu, D., and Kossmann, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB'01). 241--250. Google ScholarDigital Library
- Martens, W., Neven, F., and Schwentick, T. 2007. Simple off the shelf abstractions for XML schema. SIGMOD Rec. 36, 3, 15--22. Google ScholarDigital Library
- Martens, W., Neven, F., Schwentick, T., and Bex, G. J. 2006. Expressiveness and complexity of XML schema. ACM Trans. Data. Syst. 31, 3. Google ScholarDigital Library
- McHugh, J., Abiteboul, S., Goldman, R., Quass, D., and Widom, J. 1997. Lore: A database management system for semistructured data. SIGMOD Rec. 26, 3, 54--66. Google ScholarDigital Library
- Melnik, S. 2004. Generic model management: Concepts and algorithms. Ph.D. thesis, University of Leipzig. Google ScholarDigital Library
- Mignet, L., Barbosa, D., and Veltri, P. 2003. The XML web: A first study. In Proceedings of the 12th International World Wide Web Conference. 500--510. Google ScholarDigital Library
- Miklau, G. 2002. XMLData repository. www.cs.washington.edu/research/xmldatasets.Google Scholar
- Min, J.-K., Ahn, J.-Y., and Chung, C.-W. 2003. Efficient extraction of schemas for XML documents. Inform. Process. Lett. 85, 1, 7--12. Google ScholarDigital Library
- Nestorov, S., Abiteboul, S., and Motwani, R. 1998. Extracting schema from semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, 295--306. Google ScholarDigital Library
- Nestorov, S., Ullman, J. D., Wiener, J. L., and Chawathe, S. S. 1997. Representative objects: Concise representations of semistructured, hierarchial data. In Proceedings of the 13th International Conference on Data Engineering. IEEE Computer Society, 79--90. Google ScholarDigital Library
- Neven, F. and Schwentick, T. 2006. On the complexity of XPath containment in the presence of disjunction, DTDs, and variables. Logical Methods Comput. Sci. 2, 3.Google ScholarCross Ref
- Ngu, A. H. H., Rocco, D., Critchlow, T., and Buttler, D. 2005. Automatic discovery and inferencing of complex bioinformatics web interfaces. World Wide Web 8, 4, 463--493. Google ScholarDigital Library
- Oaks, P. and ter Hofstede, A. H. M. 2007. Guided interaction: A mechanism to enable ad hoc service interaction. Inform. Syst. Frontiers 9, 1, 29--51. Google ScholarDigital Library
- Ohlebusch, E. 2001. Implementing conditional term rewriting by graph rewriting. Theor. Comput. Sci. 262, 1, 311--331. Google ScholarDigital Library
- Open Web Application Security Project Consortium. 2004. The top ten most critical web application security vulnerabilities—2004 update. www.owasp.org.Google Scholar
- Pitt, L. 1989. Inductive inference, DFAs, and computational complexity. In Proceedings of the International Workshop on Analogical and Inductive Inference (AII'89). Springer-Verlag, 18--44. Google ScholarDigital Library
- Rabiner, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2, 257--286.Google ScholarCross Ref
- Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334--350. Google ScholarDigital Library
- Sahuguet, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (extended abstract). In Proceedings of the 3rd International Workshop on The World Wide Web and Databases, (WebDB'00), Selected Papers. 171--183. Google ScholarDigital Library
- Sakakibara, Y. 1997. Recent advances of grammatical inference. Theor. Comput. Sci. 185, 1, 15--45. Google ScholarDigital Library
- Sankey, J. and Wong, R. K. 2001. Structural inference for semistructured data. In Proceedings of the International Conference on Information and Knowledge Management. ACM Press, 159--166. Google ScholarDigital Library
- Sun. Sun JAXB. java.sun.com/webservices/jaxb.Google Scholar
- Thompson, H. S., Beech, D., Maloney, M., and Mendelsohn, N. 2004. XML Schema part 1: Structures 2nd Ed. World Wide Web Consortium, Recommendation REC-xmlschema-1-20041028.Google Scholar
- W3C. 2002. XHTML 1.0 The Extensible HyperText Markup Language, 2nd Ed. W3C.Google Scholar
- Wang, G., Liu, M., Yu, J. X., Sun, B., Yu, G., Lv, J., and Lu, H. 2003. Effective schema-based XML query optimization techniques. In Proceedings of the 7th International Database Engineering and Applications Symposium. 230--235.Google Scholar
Index Terms
Inference of concise regular expressions and DTDs
Recommendations
Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of ...
Learning deterministic regular expressions for the inference of schemas from XML data
WWW '08: Proceedings of the 17th international conference on World Wide WebInferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of ...
Construction of fuzzy automata from fuzzy regular expressions
Li and Pedrycz have proved fundamental results that provide different equivalent ways to represent fuzzy languages with membership values in a lattice-ordered monoid, and generalize the well-known results of the classical theory of formal languages. In ...
Comments