DOI: 10.1145/3093241.3093281
research-article

A Statistical Learning Approach to Detect Abusive Twitter Accounts

Published: 19 May 2017

ABSTRACT

The increased use of social media has motivated spammers to carry out malicious activities on social network sites. Some of these spammers use adult content to spread those activities further. Moreover, the large number of users posting adult content on social media degrades the experience of other users for whom such content is unwanted or inappropriate. In this paper, we aim to detect abusive accounts that post adult content in Arabic to target Arabic speakers. Natural language processing (NLP) resources for Arabic are limited, and to the best of our knowledge no prior research has addressed the detection of Arabic-language adult accounts in social media. We used a statistical learning approach to analyze Twitter content and detect abusive accounts that use obscenity, profanity, slang, and swear words in Arabic text. Our approach achieved a predictive accuracy of 96% and overcomes limitations of the bag-of-words (BOW) approach.

Published in

ICCDA '17: Proceedings of the International Conference on Compute and Data Analysis
May 2017
307 pages
ISBN: 9781450352413
DOI: 10.1145/3093241

Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited
