DOI: 10.1145/2970276.2970317
research-article

Learning a dual-language vector space for domain-specific cross-lingual question retrieval

Published: 25 August 2016

ABSTRACT

The language barrier limits the ability of millions of non-English-speaking developers to make effective use of the tremendous knowledge in Stack Overflow, which is archived in English. For cross-lingual question retrieval, one may use translation-based methods that first translate the non-English queries into English and then perform monolingual question retrieval in English. However, translation-based methods suffer from two problems: semantic deviation due to inappropriate translation, especially of domain-specific terms, and the lexical gap between queries and questions that share few words in common. To overcome these issues, we propose a novel cross-lingual question retrieval approach based on word embeddings and a convolutional neural network (CNN), state-of-the-art deep learning techniques for capturing word- and sentence-level semantics. The CNN model is trained on a large number of examples drawn from Stack Overflow duplicate questions and their machine translations, which guides the CNN to learn informative word and sentence features for recognizing and quantifying semantic similarity in the presence of semantic deviations and lexical gaps. A distinctive feature of our approach is that the trained CNN maps documents in two languages (e.g., Chinese queries and English questions) into a dual-language vector space, and thus reduces the cross-lingual question retrieval problem to a simple k-nearest neighbors search in that space, where no query or question translation is required. Our evaluation shows that our approach significantly outperforms the translation-based method, and that it can be extended to dual-language document retrieval from different sources.
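The retrieval step described in the abstract, a k-nearest neighbors search over vectors that share one space, can be illustrated with a minimal sketch. It is not the authors' implementation: the trained CNN encoder is not shown, the three-dimensional vectors are made-up stand-ins for its output, cosine similarity is an assumed choice of similarity measure, and the names `cosine`, `knn`, `english_questions`, and `chinese_query_vec` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn(query_vec, question_vecs, k):
    """Return the k (question, similarity) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in question_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(text, score) for score, text in scored[:k]]


# Assumption: toy vectors standing in for the output of a trained encoder.
english_questions = {
    "how to reverse a list in python": [0.9, 0.1, 0.0],
    "java null pointer exception on startup": [0.0, 0.2, 0.9],
    "python reverse string": [0.7, 0.6, 0.1],
}

# Assumption: the vector an encoder might produce for a Chinese query
# about reversing a list; no translation of the query is involved.
chinese_query_vec = [0.8, 0.2, 0.0]

top2 = knn(chinese_query_vec, english_questions, k=2)
for text, score in top2:
    print(f"{score:.3f}  {text}")
```

Because queries and questions live in the same space, ranking needs only vector comparisons. A brute-force scan like this one is linear in the number of questions, so a real index over a large archive would likely use an approximate nearest-neighbor structure instead.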

      • Published in

        ASE '16: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering
        August 2016
        899 pages
        ISBN: 9781450338455
        DOI: 10.1145/2970276
        • General Chair: David Lo
        • Program Chairs: Sven Apel, Sarfraz Khurshid

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate: 82 of 337 submissions, 24%
