ABSTRACT
The discovery of multiword units is one of the key steps in the preprocessing of raw text. In this paper, we propose a knowledge-free approach for the discovery of such entities. It not only outperforms state-of-the-art approaches but is also fully unsupervised. Furthermore, it does not require setting any threshold, making it suitable for use by non-experts. The proposed approach is evaluated against five other metrics on a medical corpus.
Index Terms
- Knowledge-free discovery of domain-specific multiword units