ABSTRACT
We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words, and the likelihood that the sentence makes sense (grammatically and semantically) using a language model. The entire network is trained jointly on all these tasks using weight-sharing, an instance of multitask learning. All the tasks use labeled data except the language model, which is learnt from unlabeled text and represents a novel form of semi-supervised learning for the shared tasks. We show how both multitask learning and semi-supervised learning improve the generalization of the shared tasks, resulting in state-of-the-art performance.
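The weight-sharing idea can be illustrated with a toy sketch. This is not the network described above: it assumes a single shared word-embedding table feeding two hypothetical softmax heads ("pos" and "chunk"), leaves out the convolutional layers and the language model entirely, and uses made-up labels. All names here (`shared_embeddings`, `make_head`, `sgd_step`, `train`) are illustrative assumptions.

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "mat"]
DIM = 4

# Shared layer (assumption): one embedding vector per word, used by every task.
shared_embeddings = {
    w: [random.uniform(-0.1, 0.1) for _ in range(DIM)] for w in VOCAB
}


def make_head(num_labels):
    """Task-specific linear layer: one weight vector per label."""
    return [[random.uniform(-0.1, 0.1) for _ in range(DIM)]
            for _ in range(num_labels)]


def softmax(scores):
    import math
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(head, word):
    """Label probabilities for `word` under one task's head."""
    x = shared_embeddings[word]
    return softmax([sum(w * v for w, v in zip(row, x)) for row in head])


def sgd_step(head, word, label, lr=0.5):
    """One cross-entropy gradient step.

    Updates the task's own head AND the shared embedding, so every other
    task that reads this embedding is affected too.
    Returns the probability of the correct label before the update.
    """
    x = shared_embeddings[word]
    probs = predict(head, word)
    # Gradient of the loss w.r.t. each label score: p - onehot(label).
    grad_scores = [p - (1.0 if k == label else 0.0)
                   for k, p in enumerate(probs)]
    # Gradient w.r.t. the shared embedding, computed before the head changes.
    grad_x = [sum(g * row[i] for g, row in zip(grad_scores, head))
              for i in range(DIM)]
    for g, row in zip(grad_scores, head):
        for i in range(DIM):
            row[i] -= lr * g * x[i]
    for i in range(DIM):
        x[i] -= lr * grad_x[i]
    return probs[label]


# Two toy tasks with invented labels; each has its own head and data.
tasks = {
    "pos": {"head": make_head(3),
            "data": [("the", 0), ("cat", 1), ("sat", 2), ("on", 0), ("mat", 1)]},
    "chunk": {"head": make_head(2),
              "data": [("the", 0), ("cat", 0), ("sat", 1), ("on", 1), ("mat", 0)]},
}


def train(steps=200):
    """Joint training: pick a random task, then a random example from it."""
    names = sorted(tasks)
    for _ in range(steps):
        task = tasks[random.choice(names)]
        word, label = random.choice(task["data"])
        sgd_step(task["head"], word, label)


if __name__ == "__main__":
    train()
    for name in sorted(tasks):
        print(name, [round(p, 2) for p in predict(tasks[name]["head"], "cat")])
```

In this sketch a gradient step taken for one task moves the shared embedding, which is the mechanism by which one task's training signal can influence another's predictions.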
Index Terms
- A unified architecture for natural language processing: deep neural networks with multitask learning