DOI: 10.1145/2790798.2790810
research-article

Semantics-Assisted Deep Web Query Interface Classification

Published: 13 July 2015

ABSTRACT

A huge number of structured data sources are hidden in databases behind web forms. The volume of deep web content has been estimated at around 500 times that of the surface web. However, many web forms are not deep web query interfaces. To retrieve content from web databases, an important task is therefore to identify the web forms that are deep web query interfaces. Deep web content is normally associated with a specific domain, and many domain semantics are embedded in the web forms. Additionally, the HTML pages returned by deep web queries contain particular patterns, which can assist in identifying query interfaces. Thus, we collect the following semantics to assist the classification: (1) feature words, for non-query forms and for keyword fields in deep web query interfaces; and (2) common fields in a particular domain, together with their valid values, relationships, and synonyms. We design and implement a Semantics-Assisted deep Web Query Interface Classifier (SAWQIC), a system based on heuristics. In the pre-query analysis of SAWQIC, feature words of non-query form attributes are combined with heuristics to filter out non-query forms. For web forms that pass the filtering, we use the collected semantics to fill in valid input data for their components and submit the form. In the post-query analysis of SAWQIC, we then apply heuristics to the returned HTML pages to identify the deep web query interfaces. The SAWQIC system is evaluated against web forms from the "Book" and "Job" domains. The experimental results indicate that SAWQIC classifies these forms effectively.
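
The pre-query filtering step described above can be illustrated with a minimal sketch. It is not the SAWQIC implementation: the word list `NON_QUERY_WORDS`, the helper names `FormCollector`, `passes_pre_query_filter` and `candidate_forms`, and the two heuristics (reject forms with a password field or a non-query feature word; require at least one fillable field) are assumptions chosen for illustration, since the abstract does not list the actual feature words or rules.

```python
from html.parser import HTMLParser

# Hypothetical feature words for non-query forms (assumption, not from the paper).
NON_QUERY_WORDS = {"login", "signin", "register", "signup", "subscribe"}


class FormCollector(HTMLParser):
    """Collects, for each <form>, its attribute values and field types."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form found in the page
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"text": [], "field_types": []}
            self.forms.append(self._current)
        if self._current is None:
            return
        # Attribute values are where feature words tend to appear.
        for key in ("action", "name", "id", "value"):
            if attrs.get(key):
                self._current["text"].append(attrs[key].lower())
        if tag == "input":
            # An <input> without a type attribute defaults to a text field.
            self._current["field_types"].append((attrs.get("type") or "text").lower())
        elif tag == "select":
            self._current["field_types"].append("select")

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def passes_pre_query_filter(form):
    """Return True if the form remains a candidate query interface."""
    if "password" in form["field_types"]:
        return False
    blob = " ".join(form["text"])
    if any(word in blob for word in NON_QUERY_WORDS):
        return False
    # A query interface needs at least one field that can be filled in.
    return any(t in ("text", "search", "select") for t in form["field_types"])


def candidate_forms(html):
    """Parse a page and keep only the forms that pass the filter."""
    parser = FormCollector()
    parser.feed(html)
    return [form for form in parser.forms if passes_pre_query_filter(form)]
```

Forms returned by `candidate_forms` would then go on to the form-filling and post-query steps; a real system would draw on the domain-specific semantics the abstract describes rather than a fixed word set.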


        Published in

          C3S2E '15: Proceedings of the Eighth International C* Conference on Computer Science & Software Engineering
          July 2015
          166 pages
          ISBN: 9781450334198
          DOI: 10.1145/2790798

          Copyright © 2015 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          Overall Acceptance Rate: 12 of 42 submissions, 29%
