Article

Combining classifiers to identify online databases

Authors:

Luciano Barbosa,

Juliana Freire

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 431 - 440

https://doi.org/10.1145/1242572.1242631

Published: 08 May 2007

Abstract

We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging.

We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
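
The idea of partitioning form features and composing one simple classifier per partition can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the split into a structural partition and a textual partition, the `Form` record, the hand-written rules standing in for learned classifiers, the keyword list, and the example URLs are all hypothetical. The sketch only shows the shape of the composition (select from F the forms accepted by every partition classifier) and how precision and recall would be computed on the selection.

```python
from dataclasses import dataclass


@dataclass
class Form:
    """Hypothetical form record: structural counts plus visible text."""
    url: str
    n_text_inputs: int
    n_password_inputs: int
    text: str


def is_searchable(form):
    # Structural partition: a hand-written rule standing in for a
    # learned classifier (assumption, not the paper's model).
    return form.n_text_inputs >= 1 and form.n_password_inputs == 0


def in_domain(form, domain_terms, min_hits=2):
    # Textual partition: keyword overlap standing in for a learned
    # text classifier (assumption, not the paper's model).
    tokens = set(form.text.lower().split())
    return len(tokens & domain_terms) >= min_hits


def select_forms(forms, domain_terms):
    # Composition: keep a form only if both partition classifiers accept it.
    return [f for f in forms if is_searchable(f) and in_domain(f, domain_terms)]


def precision_recall(selected, relevant):
    # Precision = TP / |selected|, recall = TP / |relevant|.
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical domain vocabulary and form set F.
AIRFARE_TERMS = {"flight", "departure", "arrival", "airline", "passengers"}

FORMS = [
    Form("a.example/search", 2, 0, "Flight search departure city arrival city passengers"),
    Form("b.example/login", 1, 1, "Airline members login flight status"),
    Form("c.example/books", 1, 0, "Search books by title author"),
    Form("d.example/fares", 1, 0, "Find cheap fares from to"),
]

selected_urls = [f.url for f in select_forms(FORMS, AIRFARE_TERMS)]
```

In this toy setup the login form is rejected on structure alone and the book-search form on text alone, so each rule can stay simple because it only looks at its own partition. A real system would presumably replace both rules with classifiers trained on labeled forms.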

Index Terms

Combining classifiers to identify online databases
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals

Information & Contributors

Information

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

1382 pages

ISBN:9781595936547

DOI:10.1145/1242572

General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)

Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

WWW'07

Sponsor:

ACM

WWW'07: 16th International World Wide Web Conference

May 8 - 12, 2007

Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
