Wikidata Vandalism Detection

This website collects the material from our research on Wikidata vandalism detection in one central place.

Vandalism Corpora

Vandalism Corpus WDVC-2016

The Wikidata Vandalism Corpus WDVC-2016 is a corpus for training, validating, and testing automatic vandalism detectors at Wikidata. It contains 83 million revisions, of which about 200,000 are labeled as vandalism.

The corpus is available as part of the WSDM Cup 2017.

Publications

Stefan Heindorf, Martin Potthast, Gregor Engels, and Benno Stein. Overview of the Wikidata Vandalism Detection Task at WSDM Cup 2017. In WSDM Cup 2017 Notebook Papers, 2017. [BibTex] [Paper]

Stefan Heindorf, Martin Potthast, Hannah Bast, Björn Buchhold, and Elmar Haussmann. WSDM Cup 2017: Vandalism Detection and Triple Scoring. In WSDM, pages 827-828. ACM, 2017. [BibTex] [Paper]

Vandalism Corpus WDVC-2015

The Wikidata Vandalism Corpus WDVC-2015 is a corpus for training, validating, and testing automatic vandalism detectors at Wikidata. It contains 24 million revisions, of which about 100,000 are labeled as vandalism.
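The revision counts quoted for the two corpora imply a strong class imbalance, which matters when choosing evaluation metrics for a detector. The following is a minimal sketch of that arithmetic. It uses only the rounded figures stated on this page, not exact corpus counts, and the names `CORPORA` and `vandalism_rate` are hypothetical helpers introduced for illustration.

```python
# Rounded figures quoted on this page (assumption: not exact corpus counts).
CORPORA = {
    "WDVC-2015": {"revisions": 24_000_000, "vandalism": 100_000},
    "WDVC-2016": {"revisions": 83_000_000, "vandalism": 200_000},
}


def vandalism_rate(revisions: int, vandalism: int) -> float:
    """Return the fraction of revisions labeled as vandalism."""
    return vandalism / revisions


for name, counts in CORPORA.items():
    rate = vandalism_rate(**counts)
    print(f"{name}: {rate:.2%} of revisions are labeled as vandalism")
# prints:
# WDVC-2015: 0.42% of revisions are labeled as vandalism
# WDVC-2016: 0.24% of revisions are labeled as vandalism
```

With well under 1% positive examples in both corpora, a classifier that never predicts vandalism would already reach over 99% accuracy, so plain accuracy is a poor yardstick for detectors trained on this kind of data.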

If you use the corpus in your research, please send us a copy of your publication. We kindly ask you to cite the corpus using the following entry: [BibTex].

Download from Zenodo (4.5 GB)

Wikidata Vandalism Corpus 2015 by Stefan Heindorf, Martin Potthast, Benno Stein, and Gregor Engels is licensed under a Creative Commons Attribution 4.0 International License.

Publication

Stefan Heindorf, Martin Potthast, Benno Stein, and Gregor Engels. Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis. In SIGIR, pages 831-834. ACM, 2015. [BibTex] [Paper] [Poster]

Vandalism Detectors

Vandalism Detectors with Low Bias

The Wikidata vandalism detectors FAIR-E and FAIR-S are machine learning-based approaches for automatic vandalism detection in Wikidata that reduce biases against anonymous editors.

The source code is available here:

The data is available here:

Publications

Stefan Heindorf, Yan Scholten, Gregor Engels, and Martin Potthast. Debiasing Vandalism Detection Models at Wikidata. In WWW, pages 670-680. ACM, 2019. [BibTex] [Paper] [Code]

Stefan Heindorf, Yan Scholten, Gregor Engels, and Martin Potthast. Debiasing Vandalism Detection Models at Wikidata (Extended Abstract). In INFORMATIK, pages 289-290, 2019. [BibTex] [Paper] [Code]

Vandalism Detectors with High Predictive Performance at WSDM Cup 2017

The Wikidata Vandalism Detector WDVD is a machine learning-based approach for automatic vandalism detection in Wikidata that was employed in the WSDM Cup 2017 as a strong baseline.

The source code is available here:

The data is available here:

Publications

Stefan Heindorf, Martin Potthast, Gregor Engels, and Benno Stein. Overview of the Wikidata Vandalism Detection Task at WSDM Cup 2017. In WSDM Cup 2017 Notebook Papers, 2017. [BibTex] [Paper] [Code]

Stefan Heindorf, Martin Potthast, Hannah Bast, Björn Buchhold, and Elmar Haussmann. WSDM Cup 2017: Vandalism Detection and Triple Scoring. In WSDM, pages 827-828. ACM, 2017. [BibTex] [Paper]

Vandalism Detectors with High Predictive Performance

The Wikidata Vandalism Detector WDVD is a machine learning-based approach for automatic vandalism detection in Wikidata.

The source code is available here:

The data is available here:

Publication

Stefan Heindorf, Martin Potthast, Benno Stein, and Gregor Engels. Vandalism Detection in Wikidata. In CIKM, pages 327-336. ACM, 2016. [BibTex] [Paper] [Slides] [Code]