research-article

MultiWiki: Interlingual Text Passage Alignment in Wikipedia

Published: 04 April 2017

Abstract

In this article, we address the problem of text passage alignment across interlingual article pairs in Wikipedia. We develop methods that enable the identification and interlinking of text passages written in different languages and containing overlapping information. Interlingual text passage alignment can enable Wikipedia editors and readers to better understand the language-specific context of entities, provide valuable insights into cultural differences, and build a basis for qualitative analysis of the articles. An important challenge in this context is the tradeoff between the granularity of the extracted text passages and the precision of the alignment. Whereas short text passages can result in more precise alignment, longer text passages can facilitate a better overview of the differences in an article pair. To better understand these aspects from the user perspective, we conduct a user study using the German, Russian, and English Wikipedia as examples and collect a user-annotated benchmark. Then we propose MultiWiki, a method that adopts an integrated approach to text passage alignment using semantic similarity measures and greedy algorithms and achieves precise results with respect to the user-defined alignment. The MultiWiki demonstration is publicly available and currently supports four language pairs.
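
To make the combination of a similarity measure and a greedy algorithm concrete, the following is a minimal sketch and not the MultiWiki implementation. The abstract does not specify the similarity features or the greedy procedure, so everything here is an assumption made for illustration: the function names `jaccard` and `greedy_align`, the token-overlap similarity (a monolingual stand-in for a cross-lingual semantic measure), the one-to-one pairing, and the `threshold` value.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (hypothetical stand-in for a semantic measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def greedy_align(
    left: List[str],
    right: List[str],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.3,  # assumed cut-off, not taken from the article
) -> List[Tuple[int, int, float]]:
    """Pair passages one-to-one, always taking the most similar free pair next."""
    scored = [
        (sim(l, r), i, j)
        for i, l in enumerate(left)
        for j, r in enumerate(right)
    ]
    # Highest score first; indices break ties so the result is deterministic.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    used_left, used_right = set(), set()
    pairs: List[Tuple[int, int, float]] = []
    for score, i, j in scored:
        if score < threshold:
            break  # all remaining candidates are too dissimilar
        if i in used_left or j in used_right:
            continue  # one of the two passages is already aligned
        used_left.add(i)
        used_right.add(j)
        pairs.append((i, j, score))
    return pairs


left_passages = [
    "berlin is the capital of germany",
    "the river spree flows through the city",
]
right_passages = [
    "the spree river flows through berlin",
    "germany has berlin as its capital",
    "unrelated sentence about music",
]
alignment = greedy_align(left_passages, right_passages)
```

Raising `threshold` in this sketch yields fewer aligned pairs, each with higher similarity. This is only a loose analogy to the granularity and precision tradeoff described above, which concerns passage length rather than a similarity cut-off. A real interlingual setting would replace `jaccard` with a measure that works across languages.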

Published In

ACM Transactions on the Web  Volume 11, Issue 1
February 2017
203 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/3062397
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 April 2017
Accepted: 01 November 2016
Revised: 01 November 2016
Received: 01 June 2016
Published in TWEB Volume 11, Issue 1

Author Tags

  1. Interlingual text alignment
  2. Wikipedia

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • WDAqua
  • COST Action IC1302 (KEYSTONE)
  • ERC under ALEXANDRIA (ERC 339233)

Cited By

  • (2024) INCEPT: A Framework for Duplicate Posts Classification with Combined Text Representations. ACM Transactions on the Web 18:3 (1-24). DOI: 10.1145/3677322. Online publication date: 15-Jul-2024
  • (2024) Event Analysis Through QuoteKG: A Multilingual Knowledge Graph of Quotes. Event Analytics across Languages and Communities (123-148). DOI: 10.1007/978-3-031-64451-1_7. Online publication date: 17-Jun-2024
  • (2023) LaSER. Web Semantics: Science, Services and Agents on the World Wide Web 75:C. DOI: 10.1016/j.websem.2022.100759. Online publication date: 1-Jan-2023
  • (2022) User Access Models to Event-Centric Information. Companion Proceedings of the Web Conference 2022 (329-333). DOI: 10.1145/3487553.3524193. Online publication date: 25-Apr-2022
  • (2022) Crosslingual Section Title Alignment in Wikipedia. 2022 IEEE International Conference on Big Data (Big Data) (5892-5901). DOI: 10.1109/BigData55660.2022.10020462. Online publication date: 17-Dec-2022
  • (2022) QuoteKG: A Multilingual Knowledge Graph of Quotes. The Semantic Web (353-369). DOI: 10.1007/978-3-031-06981-9_21. Online publication date: 31-May-2022
  • (2021) WDProp: Web Application to Analyse Multilingual Aspects of Wikidata Properties. Proceedings of the 17th International Symposium on Open Collaboration (1-12). DOI: 10.1145/3479986.3479996. Online publication date: 15-Sep-2021
  • (2019) EventKG – the hub of event knowledge on the web – and biographical timeline generation. Semantic Web 10:6 (1039-1070). DOI: 10.3233/SW-190355. Online publication date: 1-Jan-2019
  • (2019) Crosslingual Document Embedding as Reduced-Rank Ridge Regression. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (744-752). DOI: 10.1145/3289600.3291023. Online publication date: 30-Jan-2019
  • (2018) Analyzing and Visualizing Translation Patterns of Wikidata Properties. Experimental IR Meets Multilinguality, Multimodality, and Interaction (128-134). DOI: 10.1007/978-3-319-98932-7_12. Online publication date: 15-Aug-2018
