Abstract
Web-search queries are typically short and ambiguous. To classify these queries into certain target categories is a difficult but important problem. In this article, we present a new technique called query enrichment, which takes a short query and maps it to intermediate objects. Based on the collected intermediate objects, the query is then mapped to target categories. To build the necessary mapping functions, we use an ensemble of search engines to produce an enrichment of the queries. Our technique was applied to the ACM Knowledge Discovery and Data Mining competition (ACM KDDCUP) in 2005, where we won the championship on all three evaluation metrics (precision, F1 measure, which combines precision and recall, and creativity, which is judged by the organizers) among a total of 33 teams worldwide. In this article, we show that, despite the difficulty of an abundance of ambiguous queries and lack of training data, our query-enrichment technique can solve the problem satisfactorily through a two-phase classification framework. We present a detailed description of our algorithm and experimental evaluation. Our best result for F1 and precision is 42.4% and 44.4%, respectively, which is 9.6% and 24.3% higher than those from the runner-ups, respectively.
- Alpaydin, E. 2004. Introduction to Machine Learning. MIT Press, Cambridge, MA.]] Google Scholar
- Bauer, E. and Kohavi, R. 1999. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. J. Mach. Learn. 36, 1--2, 105--139.]] Google Scholar
- Beeferman, D. and Berger, A. 2000. Agglomerative clustering of a search engine query log. In KDD '00: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 407--416.]] Google Scholar
- Beitzel, S. M., Jensen, E. C., Frieder, O., Grossman, D., Lewis, D. D., Chowdhury, A., and Kolcz, A. 2005. Automatic web query classification using labeled and unlabeled training data. In SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, 581--582.]] Google Scholar
- Cann, A. J. 2003. Maths from Scratch for Biologists. John Wiley & Sons, New York, NY.]]Google Scholar
- Caruana, R., Niculescu-Mizil, A., Crew, G., and Ksikes, A. 2004. Ensemble selection from libraries of models. In ICML '04: Proceedings of the 21st International Conference on Machine Learning. ACM Press, New York, 18.]] Google Scholar
- Chang, C.-H. and Hsu, C.-C. 1998. Integrating query expansion and conceptual relevance feedback for personalized web information retrieval. In WWW7: Proceedings of the 7th International Conference on World Wide Web 7. Elsevier Science Publishers B. V., Amsterdam, 621--623.]] Google Scholar
- Chekuri, C., Goldwasser, M., Raghavan, P., and Upfal, E. 1997. Web search using automated classification. 6th International World Wide Web Conference (WWW6). Poster presentation.]]Google Scholar
- Chen, H. and Dumais, S. 2000. Bringing order to the web: Automatically categorizing search results. In CHI '00: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM Press, New York, 145--152.]] Google Scholar
- Dietterich, T. G. 2000. Ensemble methods in machine learning. In Multiple Classifier Systems. 1--15.]] Google Scholar
- Fan, W., Stolfo, S. J., and Zhang, J. 1999. The application of adaboost for distributed, scalable and on-line learning. In KDD '99: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 362--366.]] Google Scholar
- Hansen, L. K. and Salamon, P. 1990. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12, 10, 993--1001.]] Google Scholar
- HITEC. http://categorizer.tmit.bmr.hu.]]Google Scholar
- Hoel, P. G. 1966. Elementary Statistics, 2nd ed. Wiley, New York, NY.]]Google Scholar
- Howe, A. E. and Dreilinger, D. 1997. SAVVYSEARCH: A metasearch engine that learns which search engines to query. AI Mag. 18, 2, 19--25.]]Google Scholar
- Jansen, B. J. 2000. The effect of query complexity on web searching results. Inf. Res. 6, 1.]]Google Scholar
- Joachims, T. 1998. Text categorization with suport vector machines: Learning with many relevant features. In ECML'98 Proceedings of the 10th European Conference on Machine Learning, 137--142.]] Google Scholar
- Joachims, T. 1999. Transductive inference for text classification using support vector machines. In ICML '99: Proceedings of the 16th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, 200--209.]] Google Scholar
- Jones, K. S. 1971. Automatic Keyword Classifications for Information Retrieval. Butterworth, London, UK.]]Google Scholar
- Kang, I.-H. and Kim, G. 2003. Query type classification for web document retrieval. In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval. ACM Press, New York, 64--71.]] Google Scholar
- Kardkovács, Z. T., Tikk, D., and Bánsághi, Z. 2005. The ferrety algorithm for the kdd cup 2005 problem. SIGKDD Explor. Newsl. 7, 2, 111--116.]] Google Scholar
- Kittler, J., Hatef, M., Duin, R. P. W., and Matas, J. 1998. On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20, 3, 226--239.]] Google Scholar
- Lewis, D. D. and Gale, W. A. 1994. A sequential algorithm for training text classifiers. In SIGIR, W. B. Croft and C. J. van Rijsbergen, Eds. Springer Verlag, Berlin, Germany, 3--12.]] Google Scholar
- Li, Y., Zheng, Z., and Dai, H. K. 2005. Kdd cup-2005 report: Facing a great challenge. SIGKDD Explor. Newsl. 7, 2, 91--99.]] Google Scholar
- McCallum, A. and Nigam, K. 1998. A comparison of event models for naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization.]]Google Scholar
- Meyer, D. A. and Brown, T. A. 1998. Statistical mechanics of voting. Phys. Revei. Lett. 81, 8, 1718--1721.]]Google Scholar
- Miller, G., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. 1990. Introduction to wordnet: An on-line lexical database. Int. J. Lexicography 3, 4, 23--244.]]Google Scholar
- Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The pagerank citation ranking: Bringing order to the web. Tech. rep., Stanford Digital Library Technologies Project.]]Google Scholar
- Selberg, E. and Etzioni, O. 1995. Multi-service search and comparison using the MetaCrawler. In Proceedings of the 4th International World-Wide web Conference. Darmstadt, Germany.]]Google Scholar
- Shen, D., Pan, R., Sun, J.-T., Pan, J. J., Wu, K., Yin, J., and Yang, Q. 2005. Q2c@ust: Our winning solution to query classification in kddcup 2005. SIGKDD Explor. Newsl. 7, 2, 100--110.]] Google Scholar
- Shen, D., Sun, J.-T., Yang, Q., and Chen, Z. 2006. Building bridges for web query classification. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.]] Google Scholar
- Silverstein, C., Marais, H., Henzinger, M., and Moricz, M. 1999. Analysis of a very large web search engine query log. SIGIR Forum 33, 1, 6--12.]] Google Scholar
- Tikk, D., Biró, Gy., and Yang, J. D. 2005. Experiments with a hierarchial text categorization method on WIPO patent collections. In Applied Research in Uncertainty Modelling and Analysis, N. O. Attok-Okine and B. M. Ayyub, Eds. International Series in Intelligent Technologies, vol. 20, Springer-Verlag, 283--302.]]Google Scholar
- Van, R. C. 1979. Information Retrieval, 2nd ed. Butterworth, London, UK.]]Google Scholar
- Vogel, D., Bickel, S., Haider, P., Schimpfky, R., Siemen, P., Bridges, S., and Scheffer, T. 2005. Classifying search engine queries using the web as background knowledge. SIGKDD Explor. Newsl. 7, 2, 117--122.]] Google Scholar
- Voorhees, E. M. 1994. Query expansion using lexical-semantic relations. In SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer Verlag, Berlin, Germany. 61--69.]] Google Scholar
- Wen, J.-R., Nie, J.-Y., and Zhang, H.-J. 2002. Query clustering using user logs. ACM Trans. Inf. Syst. 20, 1, 59--81.]] Google Scholar
- Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Inf. Retr. 1, 1--2, 69--90.]] Google Scholar
- Yang, Y. and Pedersen, J. O. 1997. A comparative study on feature selection in text categorization. In ICML '97: Proceedings of the 14th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 412--420.]] Google Scholar
Index Terms
- Query enrichment for web-query classification
Recommendations
Automatic web query classification using labeled and unlabeled training data
SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrievalAccurate topical categorization of user queries allows for increased effectiveness, efficiency, and revenue potential in general-purpose web search systems. Such categorization becomes critical if the system is to return results not just from a general ...
Context-aware query classification
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrievalUnderstanding users'search intent expressed through their search queries is crucial to Web search and online advertisement. Web query classification (QC) has been widely studied for this purpose. Most previous QC algorithms classify individual queries ...
Q2C@UST: our winning solution to query classification in KDDCUP 2005
In this paper, we describe our ensemble-search based approach, Q2C@UST (http://webprojectl.cs.ust.hk/q2c/), for the query classification task for the KDDCUP 2005. There are two aspects to the key difficulties of this problem: one is that the meaning of ...
Comments