Large-scale bot detection for search engines

Authors:

Zijian Zheng

WWW '10: Proceedings of the 19th international conference on World wide web

Pages 501 - 510

https://doi.org/10.1145/1772690.1772742

Published: 26 April 2010

Abstract

In this paper, we propose a semi-supervised learning approach for distinguishing program-generated (bot) web search traffic from that of genuine human users. The work is motivated by the challenge that the enormous amount of search data poses to traditional approaches that rely on fully annotated training samples. We propose a semi-supervised framework that addresses the problem on multiple fronts. First, we use the CAPTCHA technique and simple heuristics to extract from the data logs a large set of training samples with initial labels; directly using these training data is problematic, however, because data sampled this way are biased. To tackle this problem, we further develop a semi-supervised learning algorithm that takes advantage of the unlabeled data to improve classification performance. These two proposed algorithms can be seamlessly combined and are very cost-efficient for scaling the training process. In our experiment, the proposed approach showed a significant (i.e., 2:1) improvement over the traditional supervised approach.
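
As a rough illustration of how unlabeled data can extend a small seed set of labeled sessions, the following is a minimal self-training sketch in Python. It is not the algorithm from the paper, whose details the abstract does not give. The nearest-centroid classifier, the two session features (queries per minute and click-through rate), the confidence threshold, and all sample values are assumptions made purely for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit(labeled):
    """Return {label: centroid} for a list of (features, label) pairs."""
    by_label = {}
    for x, y in labeled:
        by_label.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_label.items()}


def predict(model, x):
    """Return (label, confidence) for a model with at least two labels.

    Confidence is the relative distance margin between the two nearest
    centroids, a value in [0, 1].
    """
    ranked = sorted(model, key=lambda y: dist(x, model[y]))
    d_best, d_next = dist(x, model[ranked[0]]), dist(x, model[ranked[1]])
    total = d_best + d_next
    return ranked[0], ((d_next - d_best) / total if total else 0.0)


def self_train(labeled, unlabeled, threshold=0.5, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled points, then refit."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = fit(labeled)
        scored = [(x, predict(model, x)) for x in pool]
        accepted = [(x, y) for x, (y, conf) in scored if conf >= threshold]
        if not accepted:
            break  # nothing confident enough is left in the pool
        labeled.extend(accepted)
        taken = {x for x, _ in accepted}
        pool = [x for x in pool if x not in taken]
    return fit(labeled)


# Hypothetical seed labels, standing in for CAPTCHA- or heuristic-derived ones.
# Each feature tuple is (queries per minute, click-through rate).
seeds = [((60.0, 0.0), "bot"), ((2.0, 0.6), "human")]
sessions = [(55.0, 0.05), (3.0, 0.5), (40.0, 0.1), (10.0, 0.4), (30.0, 0.2)]

model = self_train(seeds, sessions)
label, confidence = predict(model, (50.0, 0.0))
```

Ambiguous sessions that never reach the threshold stay unlabeled and do not influence the final centroids, which is one simple way to limit the damage from wrong pseudo-labels. A production system would need a stronger base classifier and feature scaling.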

Cited By

Hemmatpour M, Zheng C, Zilberman N (2024). E-Commerce Bot Traffic: In-Network Impact, Detection, and Mitigation. 2024 27th Conference on Innovation in Clouds, Internet and Networks (ICIN), pages 179--185. Online publication date: 11-Mar-2024.
https://doi.org/10.1109/ICIN60470.2024.10494459
Jastrzębska A, Owsiński J, Opara K, Gajewski M, Hryniewicz O, Kozakiewicz M, Zadrożny S, Zwierzchowski T (2023). The Problem and Its Key Characteristics. Analysing Web Traffic, pages 1--14. Online publication date: 27-Jun-2023.
https://doi.org/10.1007/978-3-031-32503-8_1
Yang D, Li Z, Wang X, Salamatian K, Xie G (2021). Exploiting the Community Structure of Fraudulent Keywords for Fraud Detection in Web Search. Journal of Computer Science and Technology, 36:5, pages 1167--1183. Online publication date: 30-Sep-2021.
https://doi.org/10.1007/s11390-021-0218-2

Index Terms

Large-scale bot detection for search engines
1. Information systems
  1. Information systems applications
    1. Data mining

Published In

WWW '10: Proceedings of the 19th international conference on World wide web

April 2010

1407 pages

ISBN:9781605587998

DOI:10.1145/1772690

General Chairs:
Michael Rappa, North Carolina State University, USA
Paul Jones, University of North Carolina at Chapel Hill, USA

Program Chairs:
Juliana Freire, University of Utah, USA
Soumen Chakrabarti, Indian Institute of Technology, India

Copyright © 2010 International World Wide Web Conference Committee (IW3C2).

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WWW '10

WWW '10: The 19th International World Wide Web Conference

April 26 - 30, 2010

Raleigh, North Carolina, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
