DOI: 10.1145/1935826.1935849
research-article

Quality-biased ranking of web documents

Published: 09 February 2011

Abstract

Many existing retrieval approaches do not take into account the content quality of the retrieved documents, although link-based measures such as PageRank are commonly used as a form of document prior. In this paper, we present a quality-biased ranking method that promotes documents containing high-quality content and penalizes low-quality documents. The quality of document content can be determined by its readability, layout, and ease of navigation, among other factors. Accordingly, instead of using a single estimate of document quality, we consider multiple content-based features that are directly integrated into a state-of-the-art retrieval method. These content-based features are easy to compute, store, and retrieve, even for large web collections. We use several query sets and web collections to empirically evaluate the performance of our quality-biased retrieval method. In each case, our method improves, consistently and by a large margin, on the retrieval performance of text-based and link-based retrieval methods that do not take into account the quality of the document content.
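The idea of folding several content-based quality features into a retrieval score can be illustrated with a small sketch. This is not the paper's model: the abstract does not list the features, the weights, or the underlying retrieval function, so the three features below (average term length, stopword fraction, term entropy), the toy `text_score`, and the `WEIGHTS` values are all hypothetical stand-ins chosen only to show the general shape of a linear combination of a text score with query-independent quality features.

```python
import math
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "that"}


def quality_features(text):
    """Compute a few illustrative, query-independent content features."""
    terms = text.lower().split()
    if not terms:
        return {"avg_term_len": 0.0, "stopword_frac": 0.0, "entropy": 0.0}
    n = len(terms)
    counts = Counter(terms)
    # Shannon entropy of the term distribution (low for repetitive text).
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "avg_term_len": sum(len(t) for t in terms) / n,
        "stopword_frac": sum(t in STOPWORDS for t in terms) / n,
        "entropy": entropy,
    }


def text_score(query, text):
    """Toy text-based score: fraction of query terms found in the document."""
    query_terms = query.lower().split()
    doc_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return sum(t in doc_terms for t in query_terms) / len(query_terms)


# Hypothetical weights; a real system would learn them from relevance data.
WEIGHTS = {"avg_term_len": 0.02, "stopword_frac": 0.5, "entropy": 0.05}


def quality_biased_score(query, text, weights=WEIGHTS):
    """Text score plus a weighted sum of the quality features."""
    feats = quality_features(text)
    return text_score(query, text) + sum(w * feats[k] for k, w in weights.items())


def rank(query, docs):
    """Return document names ordered by descending quality-biased score."""
    return sorted(
        docs,
        key=lambda name: quality_biased_score(query, docs[name]),
        reverse=True,
    )
```

In this sketch, two documents that match a query equally well on text alone can still be separated: a keyword-stuffed page has low term entropy and no stopwords, so it scores below a page written in ordinary prose. Because the features depend only on the document, they could be computed once at indexing time and stored alongside it.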

Supplementary Material

JPG File (wsdm2011_bendersky_qbr_01.jpg)
MP4 File (wsdm2011_bendersky_qbr_01.mp4)

    Published In

    WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
    February 2011
    870 pages
    ISBN: 9781450304931
    DOI: 10.1145/1935826
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. document quality
    2. quality-biased ranking

    Qualifiers

    • Research-article

    Conference

    Acceptance Rates

    WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;
    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Bibliometrics & Citations

    Article Metrics

    • Downloads (Last 12 months): 31
    • Downloads (Last 6 weeks): 5
    Reflects downloads up to 03 Jan 2025

    Citations

    Cited By

    • (2024) Ranking-Incentivized Document Manipulations for Multiple Queries. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval. DOI: 10.1145/3664190.3672516 (61-70). Online publication date: 2-Aug-2024
    • (2024) Neural Passage Quality Estimation for Static Pruning. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. DOI: 10.1145/3626772.3657765 (174-185). Online publication date: 10-Jul-2024
    • (2024) A set of novel HTML document quality features for Web information retrieval: Including applications to learning to rank for information retrieval. Expert Systems with Applications 246. DOI: 10.1016/j.eswa.2024.123177 (123177). Online publication date: Jul-2024
    • (2024) A systematic review of multidimensional relevance estimation in information retrieval. WIREs Data Mining and Knowledge Discovery 14:5. DOI: 10.1002/widm.1541. Online publication date: 7-May-2024
    • (2023) Content-Based Relevance Estimation in Retrieval Settings with Ranking-Incentivized Document Manipulations. Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval. DOI: 10.1145/3578337.3605124 (205-214). Online publication date: 9-Aug-2023
    • (2022) Towards Automated Safety Vetting of Smart Contracts in Decentralized Applications. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. DOI: 10.1145/3548606.3559384 (921-935). Online publication date: 7-Nov-2022
    • (2022) Competitive Search. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. DOI: 10.1145/3477495.3532771 (2838-2849). Online publication date: 6-Jul-2022
    • (2021) Extraction of Effective Textual and Semantic Features in Learning to Rank for Web Document Retrieval. Iranian Journal of Information Processing and Management 36:4. DOI: 10.52547/jipm.36.4.1081 (1081-1112). Online publication date: 1-Jul-2021
    • (2021) Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study. Proceedings of the 14th ACM International Conference on Web Search and Data Mining. DOI: 10.1145/3437963.3441809 (301-309). Online publication date: 8-Mar-2021
    • (2021) A Multi-Task Learning Model for Multidimensional Relevance Assessment. Experimental IR Meets Multilinguality, Multimodality, and Interaction. DOI: 10.1007/978-3-030-85251-1_9 (103-115). Online publication date: 14-Sep-2021
