DOI: 10.1145/1242572.1242631
Article

Combining classifiers to identify online databases

Published: 08 May 2007

Abstract

We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging.

We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
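The selection task stated in the abstract (keep only the forms in F that lead to databases in D, using simpler classifiers over separate partitions of the form features) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in rather than the authors' implementation: the `WebForm` fields, the split into a structural partition and a textual partition, and the hand-written rules used in place of learned classifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class WebForm:
    """Hypothetical, simplified view of a crawled Web form."""
    num_text_inputs: int
    num_password_inputs: int
    text: str  # visible text in and around the form


Classifier = Callable[[WebForm], bool]


def looks_searchable(form: WebForm) -> bool:
    """Structural partition: a toy rule standing in for a learned classifier."""
    return form.num_text_inputs >= 1 and form.num_password_inputs == 0


def make_domain_classifier(keywords: Iterable[str], min_hits: int = 2) -> Classifier:
    """Textual partition: accept a form if enough domain keywords appear in its text."""
    vocab = {k.lower() for k in keywords}

    def classify(form: WebForm) -> bool:
        tokens = set(form.text.lower().split())
        return len(tokens & vocab) >= min_hits

    return classify


def select_forms(forms: Iterable[WebForm], stages: List[Classifier]) -> List[WebForm]:
    """Keep only the forms accepted by every stage, applied in order."""
    return [f for f in forms if all(stage(f) for stage in stages)]


# Toy set F of crawled forms (made-up data, for illustration only).
forms = [
    WebForm(2, 0, "search used cars by make model and price"),
    WebForm(1, 1, "login username password"),
    WebForm(1, 0, "subscribe to our newsletter email"),
]

# Toy domain D described by a few keywords (an assumption of this sketch).
auto_domain = make_domain_classifier(["cars", "make", "model", "mileage"])
selected = select_forms(forms, [looks_searchable, auto_domain])
```

Because `all` short-circuits, a form rejected by the structural stage never reaches the textual stage, so each stage only has to handle its own partition of the features.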

      Published In

      WWW '07: Proceedings of the 16th international conference on World Wide Web
      May 2007
      1382 pages
      ISBN: 9781595936547
      DOI: 10.1145/1242572
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. hidden web
      2. hierarchical classifiers
      3. learning classifiers
      4. online database directories
      5. web crawlers

      Qualifiers

      • Article

      Conference

      WWW'07: 16th International World Wide Web Conference
      May 8 - 12, 2007
      Banff, Alberta, Canada

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Article Metrics

      • Downloads (Last 12 months): 4
      • Downloads (Last 6 weeks): 1
      Reflects downloads up to 05 Mar 2025

      Cited By

      • (2022) On the feasibility of crawling-based attacks against recommender systems. Journal of Computer Security 30:4 (599-621). DOI: 10.3233/JCS-210041. Online publication date: 1-Jan-2022
      • (2020) SIMHAR - Smart Distributed Web Crawler for the Hidden Web Using SIM+Hash and Redis Server. IEEE Access 8 (117582-117592). DOI: 10.1109/ACCESS.2020.3004756. Online publication date: 2020
      • (2018) Accuracy Crawler: An Accurate Crawler for Deep Web Data Extraction. 2018 International Conference on Control, Power, Communication and Computing Technologies (ICCPCCT) (25-29). DOI: 10.1109/ICCPCCT.2018.8574286. Online publication date: Mar-2018
      • (2018) Smart Focused Web Crawler for Hidden Web. Information and Communication Technology for Competitive Strategies (419-427). DOI: 10.1007/978-981-13-0586-3_42. Online publication date: 31-Aug-2018
      • (2017) A review on extracting underlying content from deep web interfaces. 2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA) (234-237). DOI: 10.1109/ICIMIA.2017.7975609. Online publication date: Feb-2017
      • (2017) Content extraction from deep web interfaces. 2017 International conference of Electronics, Communication and Aerospace Technology (ICECA) (349-353). DOI: 10.1109/ICECA.2017.8203702. Online publication date: Apr-2017
      • (2017) Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation. Information Systems 64:C (93-103). DOI: 10.1016/j.is.2016.06.005. Online publication date: 1-Mar-2017
      • (2017) VR-Tree. Journal of Intelligent Information Systems 49:3 (367-390). DOI: 10.1007/s10844-017-0449-4. Online publication date: 1-Dec-2017
      • (2016) AN EFFICIENT SMARTCRAWLER FOR HARVESTING WEB INTERFACES OF A TWO-STAGE CRAWLER. i-manager's Journal on Information Technology 5:4 (20). DOI: 10.26634/JIT.5.4.10334. Online publication date: 2016
      • (2016) SmartCrawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces. IEEE Transactions on Services Computing 9:4 (608-620). DOI: 10.1109/TSC.2015.2414931. Online publication date: 1-Jul-2016
