DOI: 10.1145/1031453.1031458
Article

Probabilistic models for focused web crawling

Published: 12 November 2004

Abstract

A focused crawler must use information gleaned from previously crawled page sequences to estimate the relevance of a newly seen URL. Therefore, good performance depends on powerful modelling of context as well as the current observations. Probabilistic models, such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), can potentially capture both formatting and context. In this paper, we present the use of HMMs for focused web crawling and compare it with the Best-First strategy. Furthermore, we discuss the concept of using CRFs to overcome the difficulties with HMMs and to support the use of many arbitrary and overlapping features. Finally, we describe the design of a system, currently being implemented, that applies CRFs to focused web crawling.
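
The idea of scoring a newly seen URL from the sequence of pages that led to it can be illustrated with a small sketch. The code below is not the model from the paper: the two hidden states, the binary observations, every probability value, and the names `forward_filter`, `next_on_topic_probability`, and `Frontier` are assumptions chosen for illustration. It runs the standard HMM forward recursion over a crawl path and uses the resulting one-step prediction as the priority of an outgoing link in a best-first style frontier.

```python
import heapq

# Hypothetical two-state model (illustrative numbers, not taken from the paper):
# state 0 = "on-topic region", state 1 = "off-topic region".
INITIAL = [0.5, 0.5]
TRANSITION = [[0.8, 0.2],   # P(next state | current state 0)
              [0.3, 0.7]]   # P(next state | current state 1)
# Observation 0 = page text looks relevant, 1 = page text looks irrelevant.
EMISSION = [[0.9, 0.1],     # P(observation | state 0)
            [0.2, 0.8]]     # P(observation | state 1)


def forward_filter(observations):
    """Return P(current hidden state | observations so far) using the forward recursion."""
    n = len(INITIAL)
    if not observations:
        return list(INITIAL)
    # Initialise with the first observation, then normalise.
    belief = [INITIAL[s] * EMISSION[s][observations[0]] for s in range(n)]
    total = sum(belief)
    belief = [b / total for b in belief]
    for obs in observations[1:]:
        # Predict one step ahead, then weight by the likelihood of the new observation.
        predicted = [sum(belief[p] * TRANSITION[p][s] for p in range(n)) for s in range(n)]
        belief = [predicted[s] * EMISSION[s][obs] for s in range(n)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief


def next_on_topic_probability(observations):
    """Estimate P(the page behind a link is in the on-topic state) from the path so far."""
    belief = forward_filter(observations)
    return sum(belief[p] * TRANSITION[p][0] for p in range(len(belief)))


class Frontier:
    """Priority queue of unvisited URLs; the highest-scoring URL is popped first."""

    def __init__(self):
        self._heap = []

    def push(self, url, score):
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


if __name__ == "__main__":
    frontier = Frontier()
    # Two hypothetical links, reached via a relevant path and an irrelevant path.
    frontier.push("http://example.org/a", next_on_topic_probability([0, 0]))
    frontier.push("http://example.org/b", next_on_topic_probability([1, 1]))
    print(frontier.pop())
```

A plain Best-First crawler would score each link from the current page alone; in this sketch the only change is that the score also depends on the earlier pages along the path.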

Published In

WIDM '04: Proceedings of the 6th annual ACM international workshop on Web information and data management
November 2004
168 pages
ISBN:1581139780
DOI:10.1145/1031453

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. conditional random fields
  2. focused crawling
  3. hidden Markov models
  4. web graph
  5. world wide web

Qualifiers

  • Article

Conference

CIKM04: Conference on Information and Knowledge Management
November 12 - 13, 2004
Washington DC, USA
