DOI: 10.1145/2884781.2884862
research-article

From word embeddings to document similarities for improved information retrieval in software engineering

Published: 14 May 2016

ABSTRACT

The application of information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g., English), and retrieved documents, usually expressed in code (e.g., programming languages). This is often the case in bug and feature location, community question answering, or more generally the communication between technical personnel and non-technical stakeholders in a software project. In this paper, we propose bridging the lexical gap by projecting natural language statements and code snippets as meaning vectors in a shared representation space. In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated in order to estimate semantic similarities between documents. Empirical evaluations show that the learned vector space embeddings lead to improvements in a previously explored bug localization task and a newly defined task of linking API documents to computer programming questions.
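
The idea of aggregating word embeddings into a document-level similarity can be illustrated with a minimal sketch. It is not the paper's method: the aggregation here is a plain average of word vectors compared by cosine similarity, chosen only because it is the simplest option. The three-dimensional vectors, the EMBEDDINGS table, and the function names are all hypothetical; a real system would learn the embeddings from a corpus such as API documentation.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
# A real system would learn these vectors from a training corpus.
EMBEDDINGS = {
    "open":   [0.9, 0.1, 0.0],
    "read":   [0.8, 0.2, 0.1],
    "file":   [0.7, 0.3, 0.0],
    "stream": [0.6, 0.4, 0.1],
    "button": [0.0, 0.1, 0.9],
    "click":  [0.1, 0.0, 0.8],
}


def document_vector(tokens):
    """Average the embeddings of the known tokens; None if no token is known."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(query_tokens, doc_tokens):
    """Similarity between a query and a document via averaged embeddings."""
    q = document_vector(query_tokens)
    d = document_vector(doc_tokens)
    if q is None or d is None:
        return 0.0
    return cosine(q, d)


# The query shares no token with either document, so exact term matching
# would score both at zero; the averaged embeddings still separate them.
query = ["open", "file"]
io_doc = ["read", "stream"]
ui_doc = ["button", "click"]
print(similarity(query, io_doc) > similarity(query, ui_doc))
```

With these toy vectors the query ranks the I/O document above the UI document even though no words overlap, which is the kind of lexical gap the abstract describes.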



            Reviews

            Mariam Kiran

            Writing comments in code is one of the main practices of software engineering. The authors map software comments to the code they describe. They do a very good job of approaching natural language processing and semantic processing in a different way: natural language statements and code snippets are represented as vectors in a shared representation space in order to capture their meaning. Usually, we use web ontology language (OWL) ontologies in these situations, which are documents that the semantic world uses to map words to a data dictionary. But it is still difficult to parse these statements to extract the meaning of the words and the context in which they are used. Investigating techniques such as latent semantic indexing (LSI) and latent Dirichlet allocation (LDA) for feature location and bug localization, the authors organize their results into four categories. The research questions cover (1) adding word embeddings to help improve extraction, (2) training the word embeddings, (3) investigating whether training helps improve results, and (4) whether similarity can be predicted. The techniques described seem very similar to the clustering and prediction methods that machine learning uses to understand text. This is a strong approach for natural language and semantic processing researchers, in which models are trained on text to capture meaning. Here the paper applies it to finding software bugs, which is a very interesting case study. This is a new approach that should definitely be expanded on in future work.

            Online Computing Reviews Service

            • Published in

              ACM Conferences
              ICSE '16: Proceedings of the 38th International Conference on Software Engineering
              May 2016
              1235 pages
              ISBN: 9781450339001
              DOI: 10.1145/2884781

              Copyright © 2016 ACM


              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Acceptance Rates

              Overall Acceptance Rate: 276 of 1,856 submissions, 15%
