DOI: 10.1145/1390334.1390366
research-article

The opposite of smoothing: a language model approach to ranking query-specific document clusters

Published: 20 July 2008

Abstract

Exploiting information induced from (query-specific) clustering of top-retrieved documents has long been proposed as a means of improving precision at the very top ranks of the returned results. We present a novel language model approach to ranking query-specific clusters by the presumed percentage of relevant documents that they contain. While most previous cluster ranking approaches focus on the cluster as a whole, our model also exploits information induced from documents associated with the cluster. Our model substantially outperforms previous approaches for identifying clusters containing a high relevant-document percentage. Furthermore, using the model to produce a document ranking yields precision-at-top-ranks performance that is consistently better than that of the initial ranking upon which clustering is performed; the performance also compares favorably with that of a state-of-the-art pseudo-feedback retrieval method.
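
The two quantities the abstract refers to, precision at top ranks and the percentage of relevant documents in a cluster, can be illustrated with a minimal sketch. This is not the paper's ranking model: the function names, document identifiers, clusters, and relevance judgments below are hypothetical toy data, and the "best" cluster is chosen using known relevance labels, which a real system would have to estimate.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranking[:k]
    return sum(1 for doc in top if doc in relevant) / k


def relevant_percentage(cluster: List[str], relevant: Set[str]) -> float:
    """Percentage of documents in a cluster that are relevant."""
    if not cluster:
        return 0.0
    return 100.0 * sum(1 for doc in cluster if doc in relevant) / len(cluster)


# Hypothetical toy data: an initial ranking, relevance judgments, and
# overlapping query-specific clusters built from the top-retrieved documents.
initial_ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d2", "d4", "d5"}
clusters: Dict[str, List[str]] = {
    "c1": ["d1", "d2", "d3"],
    "c2": ["d2", "d4", "d5"],
    "c3": ["d3", "d6", "d1"],
}

# Oracle choice of the cluster with the highest relevant-document percentage.
best = max(clusters, key=lambda c: relevant_percentage(clusters[c], relevant))

print(precision_at_k(initial_ranking, relevant, 3))
print(best, relevant_percentage(clusters[best], relevant))
```

In this toy setup, placing the documents of the oracle-best cluster at the top of the list would raise precision at 3 above that of the initial ranking, which is the motivation for ranking clusters by their presumed relevant-document percentage.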

Published In

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN:9781605581644
DOI:10.1145/1390334
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ad hoc retrieval
  2. cluster-based language models
  3. cluster-ranking
  4. language models
  5. query-specific clusters

Qualifiers

  • Research-article

Conference

SIGIR '08

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 1
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 19 Feb 2025

Cited By

  • (2022) From Cluster Ranking to Document Ranking. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2137-2141. DOI: 10.1145/3477495.3531819. Online publication date: 6-Jul-2022
  • (2018) Selective Cluster Presentation on the Search Results Page. ACM Transactions on Information Systems, 36:3, pages 1-42. DOI: 10.1145/3158672. Online publication date: 28-Feb-2018
  • (2016) Selective Cluster-Based Document Retrieval. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 1473-1482. DOI: 10.1145/2983323.2983737. Online publication date: 24-Oct-2016
  • (2016) Improving Difficult Queries by Leveraging Clusters in Term Graph. Information Retrieval Technology, pages 426-432. DOI: 10.1007/978-3-319-28940-3_37. Online publication date: 22-Jan-2016
  • (2015) Learning Asymmetric Co-Relevance. Proceedings of the 2015 International Conference on The Theory of Information Retrieval, pages 281-290. DOI: 10.1145/2808194.2809454. Online publication date: 27-Sep-2015
  • (2015) Mining document, concept, and term associations for effective biomedical retrieval: introducing MeSH-enhanced retrieval models. Information Retrieval Journal, 18:5, pages 413-444. DOI: 10.1007/s10791-015-9264-0. Online publication date: 4-Sep-2015
  • (2014) The Cluster Hypothesis in Information Retrieval. Proceedings of the 36th European Conference on IR Research on Advances in Information Retrieval - Volume 8416, pages 823-826. DOI: 10.1007/978-3-319-06028-6_105. Online publication date: 13-Apr-2014
  • (2013) Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction. Computational Linguistics, 39:3, pages 709-754. DOI: 10.1162/COLI_a_00148. Online publication date: Sep-2013
  • (2012) Exploring the cluster hypothesis, and cluster-based retrieval, over the web. Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2507-2510. DOI: 10.1145/2396761.2398678. Online publication date: 29-Oct-2012
  • (2012) Query-performance prediction and cluster ranking. Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2459-2462. DOI: 10.1145/2396761.2398666. Online publication date: 29-Oct-2012
