ABSTRACT
Text mining on a lexical basis is quite well developed for the English language. In compounding languages, however, lexicalized words are often a combination of two or more semantic units. New words can easily be built by concatenating existing ones, with no white space in between.
This poses a problem for existing search algorithms: such compounds could be highly relevant to a search request, but how can one determine whether a compound comprises a given lexeme? A string match is an indication, but it does not prove a semantic relation. The same problem arises in lexicon-based approaches, where signal words are defined as lexemes only and must be identified in every form in which they appear, including as a component of a compound. This paper explores the characteristics of compounds and their constituent elements for German, and compares seven algorithms with regard to runtime and error rates. The results of this study are relevant to query analysis and term weighting approaches in information retrieval system design.
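To illustrate why a string match alone is only an indication, the following minimal Python sketch contrasts a plain substring test with a lexicon-based split. It is not one of the seven algorithms compared in the paper; the toy lexicon, the simplified list of linking elements, and the function names `substring_match`, `splits`, and `comprises` are all assumptions made for this example.

```python
# Toy lexicon, assumed for illustration only; a real system would need a full German word list.
LEXICON = {"wein", "glas", "schwein", "braten"}

# Simplified set of German linking elements (Fugenelemente); "" means no linker.
LINKERS = ("", "e", "s", "n", "en", "es")


def substring_match(compound, lexeme):
    """Naive check: does the lexeme occur anywhere in the compound string?"""
    return lexeme.lower() in compound.lower()


def splits(compound):
    """Return every way to cover the compound with lexicon words,
    allowing an optional linking element between two components."""
    word = compound.lower()

    def helper(start):
        if start == len(word):
            return [()]  # one valid way to split the empty remainder
        results = []
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part not in LEXICON:
                continue
            for linker in LINKERS:
                nxt = end + len(linker)
                if word[end:nxt] != linker:
                    continue
                if linker and nxt == len(word):
                    continue  # a linker must be followed by another component
                for rest in helper(nxt):
                    results.append((part,) + rest)
        return results

    return helper(0)


def comprises(compound, lexeme):
    """True if the lexeme is a whole component in at least one split."""
    target = lexeme.lower()
    return any(target in parts for parts in splits(compound))


if __name__ == "__main__":
    for compound in ("Weinglas", "Schweinebraten"):
        print(compound,
              substring_match(compound, "Wein"),
              comprises(compound, "Wein"),
              splits(compound))
```

With this toy lexicon, "Wein" (wine) matches "Schweinebraten" (roast pork) as a substring, yet the only split found is "schwein" + "braten", so the lexeme is not a component. Even a successful split remains evidence rather than proof of a semantic relation, and the result depends entirely on the coverage of the lexicon and the linker list.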
Index Terms
- Algorithms for the verification of the semantic relation between a compound and a given lexeme