ABSTRACT
The language barrier limits the ability of millions of non-English-speaking developers to make effective use of the vast knowledge in Stack Overflow, which is archived in English. For cross-lingual question retrieval, one may use translation-based methods that first translate non-English queries into English and then perform monolingual question retrieval in English. However, translation-based methods suffer from two problems: semantic deviation caused by inappropriate translation, especially of domain-specific terms, and the lexical gap between queries and questions that share few words. To overcome these issues, we propose a novel cross-lingual question retrieval approach based on word embeddings and a convolutional neural network (CNN), which are state-of-the-art deep learning techniques for capturing word- and sentence-level semantics. The CNN is trained on a large number of Stack Overflow duplicate questions and their machine translations, which guides it to learn informative word and sentence features for recognizing and quantifying semantic similarity in the presence of semantic deviations and lexical gaps. A distinctive feature of our approach is that the trained CNN maps documents in two languages (e.g., Chinese queries and English questions) into a single dual-language vector space. This reduces cross-lingual question retrieval to a simple k-nearest-neighbors search in that space, with no query or question translation required. Our evaluation shows that our approach significantly outperforms the translation-based method and can be extended to dual-language document retrieval from different sources.
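The retrieval step described above, a k-nearest-neighbors search over vectors that share one space, can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the abstract does not name the similarity measure (cosine similarity is used here), the function names `cosine_similarity` and `k_nearest_questions` are hypothetical, and the three-dimensional vectors are hand-written toy values standing in for the output of a trained CNN.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def k_nearest_questions(
    query_vec: Sequence[float],
    question_vecs: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k questions most similar to the query.

    Both the query vector and the question vectors are assumed to already
    live in the same vector space, so no translation step appears here.
    """
    scored = [
        (i, cosine_similarity(query_vec, vec))
        for i, vec in enumerate(question_vecs)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins (assumption): in a real system these vectors would be
# produced by the trained model from English questions and a Chinese query.
question_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.0, 0.7],
]
query_vec = [1.0, 0.0, 0.1]

top = k_nearest_questions(query_vec, question_vecs, k=2)
for index, score in top:
    print(f"question {index}: similarity {score:.3f}")
```

A brute-force scan like this is linear in the number of questions; for a large archive one would normally swap in an indexed or approximate nearest-neighbor search, which the abstract does not discuss.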
Index Terms
Learning a dual-language vector space for domain-specific cross-lingual question retrieval