DOI: 10.1145/1835449.1835528
research-article

Multi-style language model for web scale information retrieval

Published: 19 July 2010

Abstract

Web documents are typically associated with many text streams, including the body, the title, and the URL, which are determined by the authors, and the anchor text or search queries used by others to refer to the documents. Through a systematic large-scale analysis of their cross entropy, we show that these text streams appear to be composed in different language styles, and hence warrant separate language models to properly describe their properties. We propose a language modeling approach to Web document retrieval in which each document is characterized by a mixture model with components corresponding to the various text streams associated with the document. Immediate issues for such a mixture model arise because not all text streams are always present for a document, and the streams do not share the same lexicon, making it challenging to properly combine the statistics from the mixture components. To address these issues, we introduce an 'open-vocabulary' smoothing technique so that all the component language models have the same cardinality and their scores can simply be linearly combined. To ensure that the approach can cope with Web-scale applications, the model training algorithm is designed to require no labeled data and can be fully automated with few heuristics and no empirical parameter tuning. The evaluation on Web document ranking tasks shows that the component language models indeed have varying degrees of capabilities as predicted by the cross-entropy analysis, and the combined mixture model outperforms the state-of-the-art BM25F-based system.
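
The idea of linearly combining per-stream language models over a shared vocabulary can be illustrated with a small sketch. This is not the paper's algorithm: it assumes unigram models, uses plain additive smoothing with a single hypothetical `<unk>` token as a stand-in for the 'open-vocabulary' technique, and uses hand-picked mixture weights where the paper estimates its parameters automatically. The names `stream_model`, `mixture_log_prob`, `UNK`, and `alpha` are all made up for illustration.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for any out-of-vocabulary word


def stream_model(tokens, vocab, alpha=0.5):
    """Additively smoothed unigram model over the shared vocabulary plus UNK.

    Every stream model is defined over the same support, so all models have
    the same cardinality even when a stream is empty (it becomes uniform).
    """
    counts = Counter(t if t in vocab else UNK for t in tokens)
    support = set(vocab) | {UNK}
    total = len(tokens) + alpha * len(support)
    return {w: (counts[w] + alpha) / total for w in support}


def mixture_log_prob(query, models, weights):
    """log P(query | document) under a linear mixture of stream models."""
    log_p = 0.0
    for term in query:
        # Unseen query terms fall back to each stream's UNK probability.
        p = sum(weights[s] * m.get(term, m[UNK]) for s, m in models.items())
        log_p += math.log(p)
    return log_p


# Toy document with three streams; the anchor stream is absent.
doc_streams = {
    "title": "multi style language model".split(),
    "body": "web documents have many text streams such as body title and anchor text".split(),
    "anchor": [],
}
vocab = {t for toks in doc_streams.values() for t in toks}
models = {name: stream_model(toks, vocab) for name, toks in doc_streams.items()}
weights = {"title": 0.3, "body": 0.5, "anchor": 0.2}  # illustrative, not learned

on_topic = mixture_log_prob("language model".split(), models, weights)
off_topic = mixture_log_prob("quantum chess".split(), models, weights)
print(on_topic > off_topic)
```

Because every component assigns a probability to every term, including unseen ones, the per-stream probabilities are directly comparable and a weighted sum is well defined even for the missing anchor stream.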


    Published In

    SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
    July 2010
    944 pages
    ISBN: 9781450301534
    DOI: 10.1145/1835449
    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. information retrieval
    2. mixture language models
    3. parameter estimation
    4. probabilistic relevance model
    5. smoothing

    Qualifiers

    • Research-article

    Conference

    SIGIR '10
    Acceptance Rates

    SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%
