ABSTRACT
Multiview learning has been shown to be a natural and efficient framework for supervised or semi-supervised learning of multilingual document categorizers. The state-of-the-art co-regularization approach relies on alternate minimizations of a combination of language-specific categorization errors and a disagreement between the outputs of the monolingual text categorizers. This is typically solved by repeatedly training categorizers on each language with the appropriate regularizer. We extend and improve this approach by introducing an on-line learning scheme, where language-specific updates are interleaved in order to iteratively optimize the global cost in one pass. Our experimental results show that this produces similar performance as the batch approach, at a fraction of the computational cost.
- M.-R. Amini and C. Goutte. A co-classification approach to learning from multilingual corpora. Machine Learning, 79(1--2), 2010. Google ScholarDigital Library
- M. R. Amini, C. Goutte, and N. Usunier. Combining coregularization and consensus-based self-training for multilingual text categorization. In SIGIR'10, 2010. Google ScholarDigital Library
- L. Bottou and Y. LeCun. Large scale online learning. In NIPS 16, 2004.Google Scholar
- A. Eisele and Y. Chen. MultiUN: A multilingual corpus from united nation documents. In LREC'10, 2010.Google Scholar
- D. D. Lewis, Y. Yang, T. Rose, and F. Li. A new benchmark collection for text categorization research. J. Machine Learning Research, 5:361--397, 2004. Google ScholarDigital Library
- B. Pouliquen, R. Steinberger, and C. Ignat. Automatic annotation of multilingual text collections with a conceptual thesaurus. CoRR, abs/cs/0609059, 2006.Google Scholar
Index Terms
- Fast on-line learning for multilingual categorization
Recommendations
Multilingual sentence categorization and novelty mining
A challenge for sentence categorization and novelty mining is to detect not only when text is relevant to the user's information need, but also when it contains something new which the user has not seen before. It involves two tasks that need to be ...
Unsupervised multilingual learning for POS tagging
EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language ProcessingWe demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The key hypothesis of multilingual learning is that by combining cues from multiple languages, the structure of each becomes more apparent. We formulate a ...
Comments