DOI: 10.1145/1242572.1242631
Article

Combining classifiers to identify online databases

Published: 08 May 2007

ABSTRACT

We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging.

We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
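
The idea of partitioning form features and composing simpler classifiers can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, since the abstract names neither the features nor the learning techniques: the `WebForm` fields, the split into "structural" and "textual" partitions, the hand-written rules standing in for learned models, and the keyword list are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class WebForm:
    # Hypothetical representation of a crawled form (not from the paper).
    structural: Dict[str, int]  # counts of form elements, e.g. {"text": 2}
    text: str                   # visible text in and around the form


# A classifier is any function that maps a form to a yes/no decision.
Classifier = Callable[[WebForm], bool]


def is_searchable(form: WebForm) -> bool:
    """Structural partition: a hand-written stand-in for a learned model."""
    has_password = form.structural.get("password", 0) > 0
    query_inputs = form.structural.get("text", 0) + form.structural.get("select", 0)
    return query_inputs > 0 and not has_password


def make_domain_classifier(keywords: Set[str], min_hits: int = 2) -> Classifier:
    """Textual partition: accept forms whose text mentions enough domain keywords."""
    def classify(form: WebForm) -> bool:
        tokens = set(form.text.lower().split())
        return len(tokens & keywords) >= min_hits
    return classify


def compose(stages: List[Classifier]) -> Classifier:
    """Accept a form only if every stage accepts it (stops at the first rejection)."""
    return lambda form: all(stage(form) for stage in stages)


def select_forms(forms: List[WebForm], classifier: Classifier) -> List[WebForm]:
    """Select from F only the forms the composed classifier accepts for domain D."""
    return [form for form in forms if classifier(form)]


# Illustrative data: F is a set of crawled forms, D is an "airfare" domain.
forms = [
    WebForm({"text": 2, "select": 3}, "Departure city Arrival city Flight date"),
    WebForm({"text": 1, "password": 1}, "Flight club member login"),
    WebForm({"text": 1}, "Search books by title or author"),
]
airfare = make_domain_classifier({"flight", "departure", "arrival", "airline"})
selected = select_forms(forms, compose([is_searchable, airfare]))
```

Each stage sees only its own partition of the features, so in principle each could be trained with whichever learning technique suits that partition; the rules above merely mark where such models would plug in.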


Published in

WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN: 9781595936547
DOI: 10.1145/1242572

Copyright © 2007 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
