DOI: 10.1145/2983323.2983740
research-article

Vandalism Detection in Wikidata

Published: 24 October 2016

ABSTRACT

Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and in various other kinds of information systems, which imposes high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it is frequently vandalized, exposing all information systems that use it to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We propose a set of 47 features that exploit both content and context information, and we report on 4 classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015, where it achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.991. It significantly outperforms the state of the art represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia within the Objective Revision Evaluation Service (0.859 ROC-AUC).
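The abstract reports its results as ROC-AUC values. As a reminder of what that number measures, the following is a minimal sketch, not the authors' evaluation code: it computes ROC-AUC as the fraction of (vandalism, regular) revision pairs that a scorer ranks in the right order, using the rank-sum (Mann-Whitney U) identity. The function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for the positive class (e.g. vandalism), 0 otherwise.
    scores: classifier scores, higher means "more likely positive".
    Tied scores receive their average rank, so a tie counts as half.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of equal scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one example of each class")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Hypothetical toy data: 3 vandalism revisions among 8, with made-up scores.
labels = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.05, 0.10, 0.80, 0.30, 0.65, 0.20, 0.70, 0.90]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 14 of the 15 positive/negative pairs are ordered correctly
```

Because the value depends only on how revisions are ranked, a score of 0.5 corresponds to random ordering and 1.0 to a perfect separation of the two classes.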

References

  1. B. Adler, L. de Alfaro, and I. Pye. Detecting Wikipedia Vandalism Using WikiTrust. CLEF Notebooks 2010.
  2. B. Adler, L. de Alfaro, S. M. Mola-Velasco, P. Rosso, and A. G. West. Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features. CICLing 2011.
  3. E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. Finding High-Quality Content in Social Media. WSDM 2008.
  4. J. Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201: 81--105, Aug. 2013.
  5. K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. SIGMOD 2008.
  6. K. Boyd, V. Santos Costa, J. Davis, and C. D. Page. Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation. ICML 2012.
  7. L. Breiman. Bagging Predictors. Machine Learning, 24 (2): 123--140, 1996.
  8. L. Breiman. Random Forests. Machine Learning, 45 (1): 5--32, Oct. 2001.
  9. S. Burgstaller-Muehlbacher, A. Waagmeester, E. Mitraka, J. Turner, T. E. Putman, J. Leong, P. Pavlidis, L. Schriml, B. M. Good, and A. I. Su. Wikidata as a Semantic Framework for the Gene Wiki Initiative. bioRxiv 032144, 2015.
  10. J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves. ICML 2006.
  11. T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27 (8): 861--874, 2006.
  12. T. Gärtner, P. Flach, A. Kowalczyk, and A. Smola. Multi-Instance Kernels. ICML 2002.
  13. H. He and E. Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21 (9): 1263--1284, Sept. 2009.
  14. S. Heindorf, M. Potthast, B. Stein, and G. Engels. Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis. SIGIR 2015.
  15. IPligence. Ipligence. http://www.ipligence.com, 2014.
  16. K. Y. Itakura and C. L. A. Clarke. Using Dynamic Markov Compression to Detect Vandalism in the Wikipedia. SIGIR 2009.
  17. A. Ladsgroup and A. Halfaker. Wikidata features. https://github.com/wiki-ai/wb-vandalism/blob/31d74f8a50a8c43dd446d41cafee89ada5a051f8/wb_vandalism/feature_lists/wikidata.py.
  18. B. Li, T. Jin, M. R. Lyu, I. King, and B. Mak. Analyzing and predicting question quality in community question answering services. WWW 2012.
  19. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
  20. E. Mitraka, A. Waagmeester, S. Burgstaller-Muehlbacher, L. M. Schriml, A. I. Su, and B. M. Good. Wikidata: A platform for data integration and dissemination for the life sciences and beyond. bioRxiv 031971, 2015.
  21. S. M. Mola-Velasco. Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals: Lab Report for PAN at CLEF 2010. CLEF Notebooks 2010.
  22. P. Neis, M. Goetz, and A. Zipf. Towards Automatic Vandalism Detection in OpenStreetMap. ISPRS International Journal of Geo-Information, 2012.
  23. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12: 2825--2830, 2011.
  24. M. Potthast, B. Stein, and R. Gerling. Automatic Vandalism Detection in Wikipedia. ECIR 2008.
  25. L. Ramaswamy, R. Tummalapenta, K. Li, and C. Pu. A Content-Context-Centric Approach for Detecting Vandalism in Wikipedia. Collaboratecom 2013.
  26. C. Shah and J. Pomerantz. Evaluating and Predicting Answer Quality in Community QA. SIGIR 2010.
  27. C. H. Tan, E. Agichtein, P. Ipeirotis, and E. Gabrilovich. Trust, but Verify: Predicting Contribution Quality for Knowledge Base Construction and Curation. WSDM 2014.
  28. T. P. Tanon, D. Vrandecic, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. WWW 2016.
  29. K.-N. Tran and P. Christen. Cross Language Prediction of Vandalism on Wikipedia Using Article Views and Revisions. PAKDD 2013.
  30. K.-N. Tran and P. Christen. Cross-Language Learning from Bots and Users to Detect Vandalism on Wikipedia. IEEE Transactions on Knowledge and Data Engineering, 27 (3): 673--685, Mar. 2015.
  31. K.-N. Tran, P. Christen, S. Sanner, and L. Xie. Context-Aware Detection of Sneaky Vandalism on Wikipedia Across Multiple Languages. PAKDD 2015.
  32. L. von Ahn. Offensive/Profane Word List. http://www.cs.cmu.edu/~biglou/resources/, 2009.
  33. D. Vrandečić and M. Krötzsch. Wikidata: A Free Collaborative Knowledgebase. Communications of the ACM, 2014.
  34. W. Y. Wang and K. R. McKeown. "Got You!": Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-semantic Modeling. COLING 2010.
  35. A. West and I. Lee. Multilingual Vandalism Detection Using Language-Independent & Ex Post Facto Evidence. CLEF Notebooks 2011.
  36. A. G. West, S. Kannan, and I. Lee. Detecting Wikipedia Vandalism via Spatio-temporal Analysis of Revision Metadata. EUROSEC 2010.
  37. Q. Wu, D. Irani, C. Pu, and L. Ramaswamy. Elusive Vandalism Detection in Wikipedia: A Text Stability-based Approach. CIKM 2010.
  38. Wikimedia Foundation. Wikidata Abuse Filter. https://www.wikidata.org/wiki/Special:AbuseFilter, 2015.
  39. Wikimedia Foundation. Objective Revision Evaluation Service. https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service, 2016.
  40. Wikimedia Foundation. Wikidata:Rollbackers. https://www.wikidata.org/wiki/Wikidata:Rollbackers, 2016.

Published in

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN: 9781450340731
DOI: 10.1145/2983323

Copyright © 2016 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

CIKM '16 paper acceptance rate: 160 of 701 submissions, 23%
Overall acceptance rate: 1,861 of 8,427 submissions, 22%
