Abstract
The Web is a steadily evolving resource comprising much more than mere HTML pages. With its ever-growing data sources in a variety of formats, it provides great potential for knowledge discovery. In this article, we shed light on some interesting phenomena of the Web: the deep Web, which surfaces database records as Web pages; the Semantic Web, which defines meaningful data exchange formats; XML, which has established itself as a lingua franca for Web data exchange; and domain-specific markup languages, which are designed based on XML syntax with the goal of preserving semantics in targeted domains. We detail these four developments in Web technology, and explain how they can be used for data mining. Our goal is to show that all these areas can be as useful for knowledge discovery as the HTML-based part of the Web.