Exploring web scale language models for search query processing

Authors:

C. Lee Giles

WWW '10: Proceedings of the 19th international conference on World wide web

Pages 451 - 460

https://doi.org/10.1145/1772690.1772737

Published: 26 April 2010

Abstract

It has been widely observed that search queries are composed in a very different style from that of the body or the title of a document. Many techniques that explicitly account for this discrepancy in language style have shown promising results for information retrieval, yet a large-scale analysis of the extent of the language differences has been lacking. In this paper, we present an extensive study of this issue by examining the language model properties of search queries and of the three text streams associated with each web document: the body, the title, and the anchor text. Our information-theoretic analysis shows that queries seem to be composed in a way most similar to how authors summarize documents in anchor texts or titles, offering a quantitative explanation for the observations in past work.

We apply these web-scale n-gram language models to three search query processing (SQP) tasks: query spelling correction, query bracketing, and long query segmentation. By controlling the size and the order of the different language models, we find the perplexity metric to be a good indicator of accuracy for these query processing tasks. We show that using smoothed language models yields significant accuracy gains, for instance in query bracketing, compared with using web counts as in the literature. We also demonstrate that web-scale language models can have a marked accuracy advantage over smaller ones.
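To make the perplexity metric mentioned above concrete, the following is a minimal sketch of how perplexity can be computed for a smoothed bigram model. It is a toy illustration, not the models or the smoothing method used in the paper: the add-one (Laplace) smoothing, the three-sentence training set, and the function names `train_bigram`, `bigram_prob`, and `perplexity` are all assumptions made for this example, and test tokens are assumed to occur in the training vocabulary.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed P(word | prev). Add-one is an assumed stand-in smoothing choice."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def perplexity(sentences, unigrams, bigrams):
    """Perplexity = 2 ** (average negative log2 probability per predicted token)."""
    log_prob, n_predictions = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            log_prob += math.log2(bigram_prob(prev, word, unigrams, bigrams))
            n_predictions += 1
    return 2 ** (-log_prob / n_predictions)


# Hypothetical "title-like" training text (illustrative data only).
train = [
    ["cheap", "flights", "new", "york"],
    ["new", "york", "hotels"],
    ["cheap", "hotels", "new", "york"],
]
unigrams, bigrams = train_bigram(train)

# A query in the training style versus the same words in scrambled order.
in_style = perplexity([["cheap", "flights", "new", "york"]], unigrams, bigrams)
scrambled = perplexity([["york", "new", "flights", "cheap"]], unigrams, bigrams)
print(f"in-style: {in_style:.2f}, scrambled: {scrambled:.2f}")
```

The query whose word order matches the training text receives the lower perplexity, which is the sense in which perplexity measures how well a language model fits a text stream. A web-scale setup would need a more careful smoothing scheme and explicit handling of out-of-vocabulary words.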

Index Terms

1. Information systems
  1. Information retrieval

Published In

WWW '10: Proceedings of the 19th international conference on World wide web

April 2010

1407 pages

ISBN: 9781605587998

DOI: 10.1145/1772690

General Chairs:
Michael Rappa, North Carolina State University, USA
Paul Jones, University of North Carolina at Chapel Hill, USA

Program Chairs:
Juliana Freire, University of Utah, USA
Soumen Chakrabarti, Indian Institute of Technology, India

Copyright © 2010 International World Wide Web Conference Committee (IW3C2).

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WWW '10: The 19th International World Wide Web Conference

April 26 - 30, 2010

Raleigh, North Carolina, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
