ABSTRACT
Large amounts of structured data are hidden in databases behind web forms. The volume of deep web content has been estimated at around 500 times that of the surface web. However, many web forms are not deep web query interfaces, so an important step in retrieving content from web databases is identifying the forms that are. Deep web content is normally associated with a specific domain, and much of that domain's semantics is embedded in the web forms. In addition, the HTML pages returned by deep web queries contain particular patterns that can help identify query interfaces. We therefore collect the following semantics to assist classification: (1) feature words, both for non-query forms and for keyword fields in deep web query interfaces; and (2) common fields in a particular domain, together with their valid values, relationships, and synonyms. We design and implement a heuristics-based Semantics-Assisted deep Web Query Interface Classifier (SAWQIC). In the pre-query analysis of SAWQIC, feature words of non-query form attributes are combined with heuristics to filter out non-query forms. For web forms that pass the filter, we use the collected semantics to fill in valid input data for their components and submit the form. In the post-query analysis of SAWQIC, we apply heuristics to the returned HTML pages to identify the deep web query interfaces. SAWQIC is evaluated on web forms from the "Book" and "Job" domains. The experimental results indicate that SAWQIC achieves highly effective classification.
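The pre-query filtering step can be illustrated with a minimal sketch. This is not the SAWQIC implementation: the feature-word list, the two heuristics (presence of a password field, absence of any fillable field), and all function and class names are assumptions chosen for illustration, since the abstract does not specify them.

```python
from html.parser import HTMLParser

# Hypothetical feature words for non-query forms (assumption, not the paper's list).
NON_QUERY_FEATURE_WORDS = {"login", "signin", "signup", "register", "subscribe", "password"}


class FormCollector(HTMLParser):
    """Collects, per <form>, its attribute values and basic facts about its fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "tokens": [v.lower() for v in attrs.values() if v],
                "has_password": False,
                "fillable_fields": 0,
            }
            self.forms.append(self._current)
        elif self._current is not None and tag in ("input", "select", "textarea"):
            field_type = (attrs.get("type") or "text").lower() if tag == "input" else tag
            for key in ("name", "id"):
                if attrs.get(key):
                    self._current["tokens"].append(attrs[key].lower())
            if field_type == "password":
                self._current["has_password"] = True
            elif field_type not in ("hidden", "submit", "button", "reset", "image"):
                self._current["fillable_fields"] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def is_probable_non_query_form(form):
    """Assumed heuristics: password field, no fillable field, or a feature-word match."""
    if form["has_password"] or form["fillable_fields"] == 0:
        return True
    return any(word in token for token in form["tokens"] for word in NON_QUERY_FEATURE_WORDS)


def candidate_query_forms(html):
    """Return the forms that survive the pre-query filter."""
    collector = FormCollector()
    collector.feed(html)
    return [f for f in collector.forms if not is_probable_non_query_form(f)]
```

Forms returned by `candidate_query_forms` would then proceed to the form-filling and post-query analysis stages described in the abstract; those stages are not sketched here.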
Index Terms
- Semantics-Assisted Deep Web Query Interface Classification