A content based approach for discovering missing anchor text for web search

Published: 19 July 2010

Abstract

Although anchor text provides very useful information for web search, a large portion of web pages have few or no incoming hyperlinks (anchors), which is known as the anchor text sparsity problem. In this paper, we propose a language modeling based technique for overcoming anchor text sparsity by discovering a web page's plausible missing anchor text from its similar web pages' in-link anchor text. We design experiments with two publicly available TREC web corpora (GOV2 and ClueWeb09) to evaluate different approaches for discovering missing anchor text. Experimental results show that our approach can effectively discover plausible missing anchor terms. We then use the web named page finding task in the TREC Terabyte track to explore the utility of missing anchor text information discovered by our approach for helping retrieval. Experimental results show that our approach can statistically significantly improve retrieval performance, compared with several approaches that only use anchor text aggregated over the web graph.
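
The core idea in the abstract, borrowing anchor text from content-similar pages, can be illustrated with a minimal sketch. This is not the paper's actual estimator: cosine similarity over raw term counts stands in for the language modeling based similarity, and the names `tokenize`, `cosine`, `discover_anchor_terms`, the parameter `k`, and the toy pages are all hypothetical. The sketch builds a term distribution for the target page as a similarity-weighted mixture of the anchor text distributions of its top-k most similar pages.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def discover_anchor_terms(target_content, neighbors, k=2):
    """Estimate P(term | missing anchor text of the target page).

    neighbors: list of (page_content, list_of_anchor_texts) pairs.
    Pages without anchor text or with zero similarity are skipped.
    """
    target_vec = Counter(tokenize(target_content))
    scored = []
    for content, anchors in neighbors:
        if not anchors:
            continue
        sim = cosine(target_vec, Counter(tokenize(content)))
        if sim > 0:
            scored.append((sim, anchors))
    # Keep the k most similar pages.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    total = sum(sim for sim, _ in top)

    model = {}
    for sim, anchors in top:
        counts = Counter(t for a in anchors for t in tokenize(a))
        n = sum(counts.values())
        for term, c in counts.items():
            # Mixture weight (sim / total) times the page's anchor term probability.
            model[term] = model.get(term, 0.0) + (sim / total) * (c / n)
    return model


# Toy data: the target page has no incoming anchors of its own.
target = "city park opening hours trails map"
pages = [
    ("park trails map opening hours for visitors", ["park trails", "trail map"]),
    ("tax forms filing deadlines", ["tax forms"]),
    ("city park picnic areas", []),
]
anchor_model = discover_anchor_terms(target, pages, k=2)
for term, prob in sorted(anchor_model.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{prob:.3f}")
```

Normalizing the similarity scores over the selected pages keeps the result a proper probability distribution, so it could be combined with other document representations in a retrieval model.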

Cited By

  • (2023) Unsupervised Dense Retrieval Training with Web Anchors. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2476-2480. DOI: 10.1145/3539618.3592080. Online publication date: 19-Jul-2023
  • (2021) Pre-training for Ad-hoc Retrieval. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 1212-1221. DOI: 10.1145/3459637.3482286. Online publication date: 26-Oct-2021
  • (2013) Incorporating social anchors for ad hoc retrieval. Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, pp. 181-188. DOI: 10.5555/2491748.2491786. Online publication date: 15-May-2013

Published In

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
July 2010
944 pages
ISBN: 9781450301534
DOI: 10.1145/1835449

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. anchor text
  2. anchor text sparsity
  3. content similarity
  4. language models
  5. relevance models
  6. web search

Qualifiers

  • Research-article

Conference

SIGIR '10

Acceptance Rates

SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%
