ABSTRACT
We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging.

We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
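The idea of composing simple classifiers over separate partitions of the form features can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the `WebForm` fields, the choice of two partitions (structural features and form text), the hand-written rules standing in for learned classifiers, the keyword set and threshold, and the sequential way the classifiers are combined.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class WebForm:
    # Hypothetical form representation; field names are assumptions.
    num_text_inputs: int
    num_selects: int
    has_password_field: bool
    text: str  # visible text in and around the form


def is_searchable(form: WebForm) -> bool:
    """Partition 1 (structural features, assumed): a hand-written rule
    standing in for a learned classifier that rejects login-style forms."""
    if form.has_password_field:
        return False
    return form.num_text_inputs + form.num_selects >= 1


def make_domain_classifier(
    keywords: Set[str], threshold: float
) -> Callable[[WebForm], bool]:
    """Partition 2 (textual features, assumed): accept a form when the
    fraction of its tokens that are domain keywords reaches the threshold.
    A stand-in for a learned text classifier."""

    def classify(form: WebForm) -> bool:
        tokens = form.text.lower().split()
        if not tokens:
            return False
        hits = sum(1 for token in tokens if token in keywords)
        return hits / len(tokens) >= threshold

    return classify


def select_forms(
    forms: List[WebForm], classifiers: List[Callable[[WebForm], bool]]
) -> List[WebForm]:
    """Keep only the forms accepted by every classifier, applied in order."""
    return [f for f in forms if all(c(f) for c in classifiers)]


# F: forms gathered by a crawler (toy data, invented for this example).
forms = [
    WebForm(1, 0, True, "login username password"),
    WebForm(2, 1, False, "search flights departure city arrival city"),
    WebForm(1, 0, False, "subscribe to our newsletter email"),
]

# D: a hypothetical "airfare" domain described by a few keywords.
airfare = make_domain_classifier(
    {"flights", "departure", "arrival", "airline"}, threshold=0.3
)

selected = select_forms(forms, [is_searchable, airfare])
```

In this sketch each classifier sees only the features of its own partition, so each one can stay simple, and further partitions could be added to the list passed to `select_forms` without changing the others.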