research-article

Quality-biased ranking of web documents

Authors: Michael Bendersky, W. Bruce Croft, Yanlei Diao

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

Pages 95 - 104

https://doi.org/10.1145/1935826.1935849

Published: 09 February 2011

Abstract

Many existing retrieval approaches do not take into account the content quality of the retrieved documents, although link-based measures such as PageRank are commonly used as a form of document prior. In this paper, we present a quality-biased ranking method that promotes documents containing high-quality content and penalizes low-quality documents. The quality of document content can be determined by its readability, layout, and ease of navigation, among other factors. Accordingly, instead of using a single estimate of document quality, we consider multiple content-based features that are directly integrated into a state-of-the-art retrieval method. These content-based features are easy to compute, store, and retrieve, even for large web collections. We use several query sets and web collections to empirically evaluate the performance of our quality-biased retrieval method. In each case, our method consistently improves, by a large margin, on the retrieval performance of text-based and link-based retrieval methods that do not take into account the quality of the document content.
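The general idea of biasing a ranking by content quality can be illustrated with a minimal sketch. This is not the paper's model: the linear combination, the feature names (`readability`, `layout`, `navigation`), the weights, and the `bias` parameter are all assumptions made here for illustration, and the abstract does not specify how the features are actually integrated or weighted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    doc_id: str
    text_score: float           # query-dependent score from some base retrieval model
    quality: Dict[str, float]   # query-independent content features, assumed scaled to [0, 1]


# Hypothetical feature weights, chosen only for illustration (not taken from the paper).
DEFAULT_WEIGHTS: Dict[str, float] = {"readability": 0.5, "layout": 0.25, "navigation": 0.25}


def quality_biased_score(doc: Doc,
                         weights: Optional[Dict[str, float]] = None,
                         bias: float = 0.5) -> float:
    """Add a weighted sum of quality features to the text score.

    `bias` controls how strongly quality influences the final score;
    bias=0 reduces to the plain text-based ranking.
    """
    weights = DEFAULT_WEIGHTS if weights is None else weights
    quality_term = sum(w * doc.quality.get(name, 0.0) for name, w in weights.items())
    return doc.text_score + bias * quality_term


def rank(docs: List[Doc], bias: float = 0.5) -> List[str]:
    """Return document ids ordered by descending quality-biased score."""
    ordered = sorted(docs, key=lambda d: quality_biased_score(d, bias=bias), reverse=True)
    return [d.doc_id for d in ordered]


if __name__ == "__main__":
    docs = [
        Doc("a", text_score=1.0, quality={"readability": 0.0, "layout": 0.0, "navigation": 0.0}),
        Doc("b", text_score=0.9, quality={"readability": 1.0, "layout": 1.0, "navigation": 1.0}),
    ]
    print(rank(docs, bias=0.0))  # text score alone decides the order
    print(rank(docs, bias=0.5))  # quality features can promote a document
```

Under these assumptions, a document with a slightly lower text score can be promoted above a higher-scoring one when its content-quality features are stronger, which is the behavior the abstract describes as promoting high-quality content and penalizing low-quality documents.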




Index Terms

Information systems
  Information retrieval



Published In


WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

February 2011

870 pages

ISBN:9781450304931

DOI:10.1145/1935826

General Chair: Irwin King (CUHK, Hong Kong)

Program Chairs: Wolfgang Nejdl (L3S and University of Hannover, Germany), Hang Li (Microsoft Research Asia, China)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.


Publisher

Association for Computing Machinery

New York, NY, United States


Conference

WSDM'11: Fourth ACM International Conference on Web Search and Data Mining

February 9 - 12, 2011

Hong Kong, China

Acceptance Rates

WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%
