MultiWiki: Interlingual Text Passage Alignment in Wikipedia

Authors: Simon Gottschalk, Elena Demidova

ACM Transactions on the Web (TWEB), Volume 11, Issue 1

Article No.: 6, Pages 1 - 30

https://doi.org/10.1145/3004296

Published: 04 April 2017

Abstract

In this article, we address the problem of text passage alignment across interlingual article pairs in Wikipedia. We develop methods that identify and interlink text passages that are written in different languages and contain overlapping information. Interlingual text passage alignment can help Wikipedia editors and readers better understand the language-specific context of entities, provide valuable insights into cultural differences, and build a basis for qualitative analysis of the articles. An important challenge in this context is the tradeoff between the granularity of the extracted text passages and the precision of the alignment: short text passages can be aligned more precisely, whereas longer text passages give a better overview of the differences in an article pair. To better understand these aspects from the user perspective, we conduct a user study using the German, Russian, and English Wikipedias as examples and collect a user-annotated benchmark. We then propose MultiWiki, a method that adopts an integrated approach to text passage alignment using semantic similarity measures and greedy algorithms and achieves precise results with respect to the user-defined alignment. The MultiWiki demonstration is publicly available and currently supports four language pairs.
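The abstract describes the approach only as a combination of similarity measures and greedy algorithms. As a rough illustration of that general idea, and not of the MultiWiki method itself, the sketch below greedily pairs passages from two articles by descending similarity. The names `token_overlap` and `greedy_align`, the threshold value, and the token-overlap score are all assumptions made for this example; a real interlingual setting would need a cross-lingual semantic similarity measure in place of the monolingual stand-in used here.

```python
from typing import Callable, List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased tokens.

    Hypothetical stand-in for a cross-lingual semantic similarity measure;
    it only works when both passages share a vocabulary.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def greedy_align(
    left: List[str],
    right: List[str],
    sim: Callable[[str, str], float] = token_overlap,
    threshold: float = 0.3,  # assumed cut-off, not taken from the article
) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one alignment: accept candidate pairs from the most
    similar to the least similar, skipping passages that are already aligned."""
    scored = sorted(
        ((sim(l, r), i, j) for i, l in enumerate(left) for j, r in enumerate(right)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_left, used_right = set(), set()
    pairs: List[Tuple[int, int, float]] = []
    for score, i, j in scored:
        if score < threshold:
            break  # all remaining candidates score lower
        if i in used_left or j in used_right:
            continue
        used_left.add(i)
        used_right.add(j)
        pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    article_a = ["berlin is the capital of germany", "the city has many museums"]
    article_b = ["many museums are in the city", "berlin is the capital city of germany"]
    for i, j, score in greedy_align(article_a, article_b):
        print(i, j, round(score, 3))
```

The threshold in this sketch loosely mirrors the tradeoff the abstract mentions: a higher value keeps only precise pairs, while a lower value aligns more passages and so gives a broader but noisier overview.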

Index Terms

MultiWiki: Interlingual Text Passage Alignment in Wikipedia
1. Information systems
  1. Information systems applications
    1. Collaborative and social computing systems and tools
      1. Wikis
  2. World Wide Web
    1. Web applications

Published In

ACM Transactions on the Web Volume 11, Issue 1

February 2017

203 pages

ISSN:1559-1131

EISSN:1559-114X

DOI:10.1145/3062397

Editors: Brian D. Davison (Lehigh University, USA), Marianne Winslett (University of Illinois at Urbana-Champaign)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 April 2017

Accepted: 01 November 2016

Revised: 01 November 2016

Received: 01 June 2016

Published in TWEB Volume 11, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

WDAqua
COST Action IC1302 (KEYSTONE)
ERC under ALEXANDRIA (ERC 339233)
