Vandalism Detection in Wikidata

Authors:
Stefan Heindorf, Paderborn University, Paderborn, Germany
Martin Potthast, Bauhaus-Universität Weimar, Weimar, Germany
Benno Stein, Bauhaus-Universität Weimar, Weimar, Germany
Gregor Engels, Paderborn University, Paderborn, Germany

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, October 2016, Pages 327–336
https://doi.org/10.1145/2983323.2983740

Published: 24 October 2016

ABSTRACT

Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and various other kinds of information systems, imposing high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it frequently gets vandalized, exposing all information systems using it to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We propose a set of 47 features that exploit both content and context information, and we report on 4 classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015, where it achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.991. It significantly outperforms the state of the art represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia within the Objective Revision Evaluation Service (0.859 ROC-AUC).
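
To make the reported metric concrete, the following minimal sketch computes ROC-AUC for a handful of scored revisions. It is an illustration only, not the authors' evaluation code: the function name `roc_auc`, the labels, and the scores are all hypothetical. It relies on the standard interpretation of ROC-AUC as the probability that a randomly chosen vandalism revision receives a higher score than a randomly chosen regular revision, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Return ROC-AUC as the fraction of (positive, negative) pairs
    in which the positive example has the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = vandalism revision, 0 = regular revision.
labels = [0, 0, 1, 0, 1, 0]
# Hypothetical classifier scores (higher = more likely vandalism).
scores = [0.05, 0.20, 0.90, 0.40, 0.35, 0.10]

auc = roc_auc(labels, scores)
print(f"ROC-AUC = {auc:.3f}")  # prints "ROC-AUC = 0.875"
```

In this toy example, 7 of the 8 vandalism/regular pairs are ranked correctly, giving 0.875. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the 0.991, 0.865, and 0.859 figures above should be read. The pairwise loop is quadratic and only suitable for small inputs.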

Published in
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN: 9781450340731
DOI: 10.1145/2983323
General Chairs:
Snehasis Mukhopadhyay, Indiana University Purdue University Indianapolis, USA
ChengXiang Zhai, University of Illinois at Urbana-Champaign, USA
Program Chairs:
Elisa Bertino, Purdue University
Fabio Crestani, University of Lugano
Javed Mostafa, University of North Carolina
Jie Tang, Tsinghua University
Luo Si, Alibaba Group Inc & Purdue University
Xiaofang Zhou, University of Queensland
Yi Chang, Yahoo Research
Yunyao Li, IBM Research - Almaden
Parikshit Sondhi, WalmartLabs
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data quality
knowledge base
trust
vandalism
Qualifiers
- research-article
Acceptance Rates
CIKM '16 Paper Acceptance Rate: 160 of 701 submissions, 23%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%