ABSTRACT
Applying information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g., English), and retrieved documents, usually expressed in code (e.g., programming languages). This gap is common in bug and feature location, community question answering, and, more generally, communication between technical personnel and non-technical stakeholders in a software project. In this paper, we propose bridging the lexical gap by projecting natural language statements and code snippets as meaning vectors in a shared representation space. In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated to estimate semantic similarities between documents. Empirical evaluations show that the learned vector space embeddings lead to improvements in a previously explored bug localization task and in a newly defined task of linking API documents to computer programming questions.
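The aggregation step can be illustrated with a minimal sketch. The abstract does not specify the aggregation function, so the one below (average best-match cosine similarity, symmetrized) is an assumption chosen for illustration and may differ from the method in the paper. The embedding values, the vocabulary, and the names `EMBEDDINGS`, `directed_similarity`, and `document_similarity` are all hypothetical.

```python
import math

# Toy 3-dimensional word embeddings. The values are made up for
# illustration; real embeddings would be trained on API documents,
# tutorials, and reference documents.
EMBEDDINGS = {
    "open":   [0.9, 0.1, 0.0],
    "file":   [0.1, 0.9, 0.0],
    "fopen":  [0.8, 0.3, 0.0],
    "stream": [0.2, 0.8, 0.1],
    "color":  [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def directed_similarity(source, target, emb):
    """Average, over source words, of the best cosine match in target.

    Words without an embedding are ignored (an assumption of this sketch).
    """
    src = [w for w in source if w in emb]
    tgt = [w for w in target if w in emb]
    if not src or not tgt:
        return 0.0
    best = [max(cosine(emb[s], emb[t]) for t in tgt) for s in src]
    return sum(best) / len(src)


def document_similarity(doc_a, doc_b, emb=EMBEDDINGS):
    """Symmetric document similarity: mean of the two directed scores."""
    return 0.5 * (directed_similarity(doc_a, doc_b, emb)
                  + directed_similarity(doc_b, doc_a, emb))


query = ["open", "file"]        # natural language side
related = ["fopen", "stream"]   # code-like tokens
unrelated = ["color"]

print(document_similarity(query, related))
print(document_similarity(query, unrelated))
```

In this toy setup the query shares no tokens with either candidate, so a purely lexical match would score both at zero; the embedding-based score still ranks the related snippet above the unrelated one, which is the kind of lexical gap the abstract describes.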
Index Terms
- From word embeddings to document similarities for improved information retrieval in software engineering