DOI: 10.1145/2983323.2983808

Finding News Citations for Wikipedia

Published: 24 October 2016


An important editing policy in Wikipedia is to provide citations for added statements in Wikipedia pages, where statements can be arbitrary pieces of text, ranging from a sentence to a paragraph. In many cases citations are either outdated or missing altogether.
In this work we address the problem of finding and updating news citations for statements in entity pages. We propose a two-stage supervised approach to this problem. In the first stage, we train a classifier that determines whether a statement needs a news citation or another kind of citation (web, book, journal, etc.). In the second stage, we develop a news citation algorithm for Wikipedia statements, which recommends appropriate citations from a given news collection. Apart from IR techniques that use the statement to query the news collection, we also formalize three properties of an appropriate citation: (i) the citation should entail the Wikipedia statement, (ii) the statement should be central to the citation, and (iii) the citation should be from an authoritative source.
We perform an extensive evaluation of both stages, using 20 million articles from a real-world news collection. Our results show that this task can be performed with high precision and at scale.
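The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cue-word rule stands in for the trained classifier, token overlap stands in for entailment and centrality, and the names `NEWS_CUES`, `AUTHORITY`, `NewsArticle`, and `recommend`, the example domains, and the equal weighting of the three scores are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


def tokens(text):
    """Lowercased word tokens; a crude stand-in for real NLP preprocessing."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class NewsArticle:
    title: str
    body: str
    source: str  # hypothetical: the publishing domain


# Hypothetical cue words standing in for the trained stage-1 classifier.
NEWS_CUES = {"announced", "reported", "elected", "died", "won"}

# Hypothetical authority priors in [0, 1]; not values from the paper.
AUTHORITY = {"bigpaper.example": 0.9, "smallblog.example": 0.3}


def needs_news_citation(statement):
    """Stage 1 (toy): does the statement call for a news citation?"""
    return bool(tokens(statement) & NEWS_CUES)


def entailment_proxy(statement, article):
    """(i) Share of statement tokens found anywhere in the article."""
    s = tokens(statement)
    a = tokens(article.title + " " + article.body)
    return len(s & a) / len(s) if s else 0.0


def centrality_proxy(statement, article):
    """(ii) Share of statement tokens found in the article title."""
    s = tokens(statement)
    return len(s & tokens(article.title)) / len(s) if s else 0.0


def authority(article):
    """(iii) Prior for the source; unknown sources get a neutral 0.5."""
    return AUTHORITY.get(article.source, 0.5)


def score(statement, article):
    """Equal-weight average of the three properties (an assumption)."""
    return (entailment_proxy(statement, article)
            + centrality_proxy(statement, article)
            + authority(article)) / 3


def recommend(statement, collection, k=1):
    """Stage 2 (toy): rank the collection and return the top-k articles."""
    if not needs_news_citation(statement):
        return []
    ranked = sorted(collection, key=lambda a: score(statement, a), reverse=True)
    return ranked[:k]


collection = [
    NewsArticle("Mayor announced new bridge",
                "The mayor announced a new bridge over the river.",
                "bigpaper.example"),
    NewsArticle("Local bake sale", "Cookies were sold.", "smallblog.example"),
]
top = recommend("The mayor announced a new bridge", collection)
```

In a realistic setting each proxy would be replaced by a learned feature, and the ranking would come from a supervised model rather than a fixed average.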








    Published In

    CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
    October 2016
    2566 pages



    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. missing citations
    2. news citations
    3. wikipedia
    4. wikipedia enrichment


    • Research-article


    CIKM'16: ACM Conference on Information and Knowledge Management
    October 24 - 28, 2016
    Indianapolis, Indiana, USA

    Acceptance Rates

    CIKM '16 Paper Acceptance Rate 160 of 701 submissions, 23%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%



    Cited By

    • (2024) The Most Cited Scientific Information Sources in Wikipedia Articles Across Various Languages. Biblioteka (269-294). DOI: 10.14746/b.2023.27.12. Online publication date: 7-Mar-2024
    • (2024) Unifying Corroborative and Contributive Attributions in Large Language Models. 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) (665-683). DOI: 10.1109/SaTML59370.2024.00039. Online publication date: 9-Apr-2024
    • (2024) Polarization and reliability of news sources in Wikipedia. Online Information Review 48:5 (908-925). DOI: 10.1108/OIR-02-2023-0084. Online publication date: 18-Jan-2024
    • (2023) Companies in Multilingual Wikipedia: Articles Quality and Important Sources of Information. Information Technology for Management: Approaches to Improving Business and Society (48-67). DOI: 10.1007/978-3-031-29570-6_3. Online publication date: 28-Mar-2023
    • (2022) Countering Disinformation by Finding Reliable Sources: a Citation-Based Approach. 2022 International Joint Conference on Neural Networks (IJCNN) (1-8). DOI: 10.1109/IJCNN55064.2022.9891941. Online publication date: 18-Jul-2022
    • (2021) Wikipedia citations: A comprehensive data set of citations with identifiers extracted from English Wikipedia. Quantitative Science Studies 2:1 (1-19). DOI: 10.1162/qss_a_00105. Online publication date: 8-Apr-2021
    • (2021) Assessing the Quality of Sources in Wikidata Across Languages: A Hybrid Approach. Journal of Data and Information Quality 13:4 (1-35). DOI: 10.1145/3484828. Online publication date: 15-Oct-2021
    • (2021) How Inclusive Are Wikipedia’s Hyperlinks in Articles Covering Polarizing Topics? 2021 IEEE International Conference on Big Data (Big Data) (1300-1307). DOI: 10.1109/BigData52589.2021.9671943. Online publication date: 15-Dec-2021
    • (2021) Discovering communities based on mention distance. Scientometrics. DOI: 10.1007/s11192-021-03863-9. Online publication date: 6-Feb-2021
    • (2020) Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information 11:5 (263). DOI: 10.3390/info11050263. Online publication date: 13-May-2020
