ABSTRACT
Due to resource constraints, Web archiving systems and search engines usually have difficulty keeping their entire local repository synchronized with the Web. We advance the state of the art in sampling-based synchronization techniques by answering a challenging question: given a sampled webpage and its change status, which other webpages are also likely to have changed? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and popularity of the webpages. Extensive experiments on a large dataset of approximately 300,000 webpages demonstrate that additional updated webpages are most likely to be found in the same or upper-level directories as the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in detecting important changes.
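The key observation above, that updated pages tend to cluster in the same or ancestor directories of a changed sample, and the adaptive weighting by update history and popularity, can be illustrated with a minimal sketch. The names below (directory_of, candidate_pages, refresh_priority, repository_urls) and the scoring constants are hypothetical and chosen only for illustration; the paper's actual policies and granularities are defined in the full text.

```python
from urllib.parse import urlparse
import posixpath

def directory_of(url):
    """Directory path of a URL, e.g. 'http://a.com/x/y/p.html' -> '/x/y/'."""
    return posixpath.dirname(urlparse(url).path) + "/"

def candidate_pages(changed_sample, repository_urls):
    """Given one sampled URL known to have changed, rank the other locally
    archived URLs of the same site: pages in the sample's own directory first,
    then pages in its upper (ancestor) directories. Scores are illustrative."""
    sample_host = urlparse(changed_sample).netloc
    sample_dir = directory_of(changed_sample)
    scored = []
    for url in repository_urls:
        parsed = urlparse(url)
        if parsed.netloc != sample_host or url == changed_sample:
            continue
        page_dir = directory_of(url)
        if page_dir == sample_dir:
            score = 2.0   # same directory: most likely to have changed
        elif sample_dir.startswith(page_dir):
            score = 1.0   # upper directory: next most likely
        else:
            score = 0.0   # elsewhere on the site: lowest priority
        scored.append((score, url))
    scored.sort(reverse=True)
    return [url for score, url in scored if score > 0]

def refresh_priority(change_history, popularity):
    """Adaptive weight sketch: pages that changed often in past crawls and are
    more popular (e.g. by PageRank) are re-downloaded first. The product below
    is an assumed combination, not the paper's exact model."""
    observed_rate = sum(change_history) / max(len(change_history), 1)
    return observed_rate * popularity
```

For example, candidate_pages("http://a.com/news/2006/p1.html", repository_urls) would place other pages under /news/2006/ ahead of pages under /news/, and refresh_priority could then order the resulting re-download queue.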
Index Terms
- Designing efficient sampling techniques to detect webpage updates