Semantics-Assisted Deep Web Query Interface Classification

Author:
Chichang Jou

Department of Information Management, Tamkang University, 25137 Tamsui, New Taipei City, Taiwan


C3S2E '15: Proceedings of the Eighth International C* Conference on Computer Science & Software Engineering, July 2015, Pages 70–78
https://doi.org/10.1145/2790798.2790810

Published: 13 July 2015

ABSTRACT

A huge amount of structured data is hidden in databases behind web forms. The volume of deep web content has been estimated at around 500 times that of the surface web. However, many web forms are not deep web query interfaces, so retrieving the contents of web databases first requires identifying the forms that are. Deep web content is normally associated with a specific domain, and many domain semantics are embedded in the web forms. Additionally, the HTML pages returned by deep web queries contain particular patterns that can help identify query interfaces. We therefore collect the following semantics to assist classification: (1) feature words, both for non-query forms and for keyword fields in deep web query interfaces; and (2) common fields in a particular domain, together with their valid values, relationships, and synonyms. We design and implement a heuristics-based Semantics-Assisted deep Web Query Interface Classifier (SAWQIC). In the pre-query analysis of SAWQIC, feature words of non-query form attributes are combined with heuristics to filter out non-query forms. For each web form that passes the filter, we use the collected semantics to fill its components with valid input data and submit the form. In the post-query analysis, we apply heuristics to the returned HTML pages to identify the deep web query interfaces. SAWQIC is evaluated on web forms from the "Book" and "Job" domains. The experimental results indicate that SAWQIC classifies these forms effectively.
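The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the SAWQIC implementation: the abstract does not list the actual feature words or heuristics, so the word lists, the function names `pre_query_filter` and `looks_like_result_page`, and the `min_hits` threshold are all assumptions chosen for illustration. The sketch also leaves out the form-filling and submission step.

```python
import re
from html.parser import HTMLParser

# Hypothetical feature words; the paper's real lists are not given in the abstract.
NON_QUERY_WORDS = {"login", "password", "signup", "register", "subscribe"}
# Hypothetical phrases suggesting that a query returned nothing.
NO_RESULT_PHRASES = ("no results", "no matches", "nothing found")
FILLABLE_TYPES = {"text", "search", "select", "textarea"}


class FormFieldCollector(HTMLParser):
    """Collects (type, name) for each input, select, and textarea element."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            a = dict(attrs)
            # An <input> without a type attribute defaults to a text field.
            default_type = "text" if tag == "input" else tag
            field_type = (a.get("type") or default_type).lower()
            name = (a.get("name") or a.get("id") or "").lower()
            self.fields.append((field_type, name))


def pre_query_filter(form_html):
    """Pre-query phase: return True if the form is not rejected as a non-query form."""
    collector = FormFieldCollector()
    collector.feed(form_html)
    has_fillable_field = False
    for field_type, name in collector.fields:
        if field_type == "password":
            return False  # password fields suggest a login or registration form
        tokens = set(re.split(r"[^a-z0-9]+", name))
        if tokens & NON_QUERY_WORDS:
            return False  # attribute name contains a non-query feature word
        if field_type in FILLABLE_TYPES:
            has_fillable_field = True
    return has_fillable_field


def looks_like_result_page(page_html, probe, min_hits=2):
    """Post-query phase: a crude check that the returned page lists matches for probe."""
    text = page_html.lower()
    if any(phrase in text for phrase in NO_RESULT_PHRASES):
        return False
    return text.count(probe.lower()) >= min_hits
```

In this sketch a form is classified as a deep web query interface only if it passes `pre_query_filter` and the page returned after submitting a valid domain value (for example, a known book title) passes `looks_like_result_page`. Real heuristics would presumably also inspect repeated record patterns in the returned page rather than counting the probe term alone.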


Published in
C3S2E '15: Proceedings of the Eighth International C* Conference on Computer Science & Software Engineering
July 2015, 166 pages
ISBN: 9781450334198
DOI: 10.1145/2790798
General Chair: Bipin C. Desai, Concordia University, Canada
Program Chair: Motomichi Toyoma, Keio University, Japan
Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Deep Web
Heuristics
Query Interface Classification
Semantics
Web Database
Web Mining
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 12 of 42 submissions, 29%
