Geographically focused collaborative crawling
Source
International World Wide Web Conference
Proceedings of the 15th international conference on World Wide Web
Edinburgh, Scotland
SESSION: Search engine engineering
Pages: 287 - 296
Year of Publication: 2006
ISBN: 1-59593-323-9
ABSTRACT
A collaborative crawler is a group of crawling nodes in which each node is responsible for a specific portion of the web. We study the problem of collecting geographically aware pages using collaborative crawling strategies. We first propose several collaborative crawling strategies for geographically focused crawling, whose goal is to collect web pages about specified geographic locations, by considering features such as the URL address of a page, the content of a page, the extended anchor text of a link, and others. We then propose various evaluation criteria to qualify the performance of such crawling strategies. Finally, we study our crawling strategies experimentally by crawling real web data, showing that some of them greatly outperform simple URL-hash-based partition collaborative crawling, in which crawling assignments are determined by a hash value computed over URLs. More precisely, features such as the URL address of a page and the extended anchor text of a link are shown to yield the best overall performance for geographically focused crawling.
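The URL-hash based partition used as the baseline can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `assign_node` and `partition`, the choice of SHA-1, and the decision to hash the full URL string rather than only the host are all assumptions made for illustration.

```python
import hashlib


def assign_node(url: str, num_nodes: int) -> int:
    """Map a URL to a crawling-node index using a stable hash.

    Hypothetical helper: hashing the full URL with SHA-1 is an assumption;
    the abstract only says assignments come from a hash over URLs.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo
    # the number of nodes so every URL lands on exactly one node.
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes):
    """Group URLs by the node responsible for crawling them."""
    buckets = {node: [] for node in range(num_nodes)}
    for url in urls:
        buckets[assign_node(url, num_nodes)].append(url)
    return buckets


if __name__ == "__main__":
    sample = [
        "http://example.com/a",
        "http://example.org/b",
        "http://example.net/c",
    ]
    for node, assigned in partition(sample, 4).items():
        print(node, assigned)
```

A stable hash from `hashlib` is used instead of Python's built-in `hash()`, whose string hashing is randomized per process and would give different assignments on different nodes. A geographically focused strategy would presumably replace `assign_node` with a rule based on page or link features, which this sketch does not attempt to model.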