DOI: 10.1145/1772690.1772742
research-article

Large-scale bot detection for search engines

Published: 26 April 2010

Abstract

In this paper, we propose a semi-supervised learning approach for distinguishing program-generated (bot) web search traffic from that of genuine human users. The work is motivated by the challenge that the enormous volume of search data poses to traditional approaches that rely on fully annotated training samples. We propose a semi-supervised framework that addresses the problem on multiple fronts. First, we use the CAPTCHA technique and simple heuristics to extract from the data logs a large set of training samples with initial labels; directly using these training data is problematic, however, because the data thus sampled are biased. To tackle this problem, we further develop a semi-supervised learning algorithm that takes advantage of the unlabeled data to improve classification performance. These two proposed algorithms can be seamlessly combined and are very cost-efficient for scaling the training process. In our experiment, the proposed approach showed a significant (i.e., 2:1) improvement over the traditional supervised approach.
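
To make the general idea concrete, the following is a minimal sketch of generic self-training, one common semi-supervised scheme. It is not the algorithm from the paper, which the abstract does not specify. The nearest-centroid classifier, the two session features (queries per minute, click-through rate), the sample values, and the `min_margin` threshold are all assumptions chosen for illustration.

```python
from math import dist

def centroids(samples):
    """Mean feature vector per label, from (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict(cents, x):
    """Return (label, margin): the nearest centroid's label and how much
    closer it is than the runner-up. Assumes at least two labels."""
    ranked = sorted((dist(x, c), y) for y, c in cents.items())
    (d0, y0), (d1, _) = ranked[0], ranked[1]
    return y0, d1 - d0

def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Grow the labeled set with confidently pseudo-labeled samples,
    refit, and repeat until no confident samples remain."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(labeled)
        confident, rest = [], []
        for x in pool:
            y, margin = predict(cents, x)
            (confident if margin >= min_margin else rest).append((x, y))
        if not confident:
            break
        labeled.extend(confident)
        pool = [x for x, _ in rest]
    return centroids(labeled)

# Hypothetical seed labels, standing in for CAPTCHA outcomes and heuristics.
# Features are (queries per minute, click-through rate); values are made up.
seed = [((30.0, 0.0), "bot"), ((25.0, 0.1), "bot"),
        ((1.0, 0.6), "human"), ((2.0, 0.5), "human")]
unlabeled = [(28.0, 0.05), (1.5, 0.7), (20.0, 0.1), (3.0, 0.4), (12.0, 0.3)]

model = self_train(seed, unlabeled)
label, margin = predict(model, (27.0, 0.0))
print(label)
```

Self-training reinforces whatever bias the seed labels carry, so a production system would need a correction step of the kind the abstract alludes to; this sketch omits one.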

    Published In

    WWW '10: Proceedings of the 19th international conference on World wide web
    April 2010
    1407 pages
    ISBN:9781605587998
    DOI:10.1145/1772690

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bot detection
    2. captcha
    3. click logs
    4. query logs
    5. search engine
    6. semi-supervised learning

    Qualifiers

    • Research-article

    Conference

    WWW '10
    WWW '10: The 19th International World Wide Web Conference
    April 26 - 30, 2010
    Raleigh, North Carolina, USA

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
