research-article

From word embeddings to document similarities for improved information retrieval in software engineering

Authors:
Xin Ye, Hui Shen, Xiao Ma, Razvan Bunescu, Chang Liu
Ohio University, Athens, Ohio

ICSE '16: Proceedings of the 38th International Conference on Software Engineering, May 2016, Pages 404–415
https://doi.org/10.1145/2884781.2884862

Published: 14 May 2016

ABSTRACT

The application of information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g., English), and retrieved documents, usually expressed in code (e.g., programming languages). This is often the case in bug and feature location, community question answering, or, more generally, the communication between technical personnel and non-technical stakeholders in a software project. In this paper, we propose bridging the lexical gap by projecting natural language statements and code snippets as meaning vectors in a shared representation space. In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated to estimate semantic similarities between documents. Empirical evaluations show that the learned vector space embeddings lead to improvements in a previously explored bug localization task and a newly defined task of linking API documents to computer programming questions.
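
The abstract does not say how the word embeddings are aggregated into a document-level score. As a rough illustration only, the sketch below uses one common choice: each word is matched to its most similar word in the other document by cosine similarity, the matches are averaged, and the two directions are combined. The toy vectors, the function names, and the averaging scheme are assumptions made for this example and are not taken from the paper; real embeddings would be trained on a corpus such as API documents and tutorials.

```python
import math

# Toy 3-dimensional embeddings, invented purely for illustration.
# In practice these would be learned (e.g. with a skip-gram model).
TOY_EMBEDDINGS = {
    "open":   [0.9, 0.1, 0.0],
    "file":   [0.1, 0.9, 0.1],
    "fopen":  [0.8, 0.3, 0.0],
    "stream": [0.2, 0.8, 0.2],
    "color":  [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_to_doc(word, doc, emb):
    """Score a word against a document by its best-matching word."""
    if word not in emb:
        return 0.0
    scores = [cosine(emb[word], emb[w]) for w in doc if w in emb]
    return max(scores, default=0.0)


def directed_similarity(src, dst, emb):
    """Average best-match score of the known words in src against dst."""
    known = [w for w in src if w in emb]
    if not known:
        return 0.0
    return sum(word_to_doc(w, dst, emb) for w in known) / len(known)


def document_similarity(doc_a, doc_b, emb=TOY_EMBEDDINGS):
    """Symmetric score: mean of the two directed similarities."""
    return 0.5 * (directed_similarity(doc_a, doc_b, emb)
                  + directed_similarity(doc_b, doc_a, emb))


query = ["open", "file"]        # natural language side
snippet = ["fopen", "stream"]   # code side, no shared tokens with the query
unrelated = ["color"]

related_score = document_similarity(query, snippet)
unrelated_score = document_similarity(query, unrelated)
```

With these toy vectors, the query and the snippet share no tokens, yet `related_score` comes out higher than `unrelated_score`, which is the kind of lexical gap a purely term-matching retrieval model cannot bridge. Words missing from the embedding table are simply skipped here; a real system would need a deliberate policy for them.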

Reviews

Reviewer: Mariam Kiran

Writing comments in code is one of the main practices of software engineering. The authors map software comments to the code being described. They do a very good job of discussing natural language processing and semantic processing in a different way: they use vectors and shared memory maps to capture the meaning of natural language statements and code snippets. Usually, web ontology language (OWL) ontologies are used in these situations; these are documents that the semantic world uses to map words to a data dictionary. Even so, it remains difficult to parse such statements to extract the meaning of the words and the context in which they are used.

Investigating techniques such as latent semantic indexing (LSI) and latent Dirichlet allocation (LDA) for feature location and bug localization, the authors strengthen their study by grouping the results into four categories. Their research questions (1) add word embeddings to help improve extraction, (2) train word embeddings, (3) investigate whether training helps improve results, and (4) ask whether similarity can be predicted. The techniques described seem very similar to clustering and prediction methods from machine learning that are used to understand text.

This is a strong approach for natural language and semantic processing researchers, in which models can be trained on text to learn meanings. Here, the paper applies it to finding software bugs, which is a very interesting case study. This is a new approach that should definitely be expanded on in future work.

Online Computing Reviews Service

Published in
ICSE '16: Proceedings of the 38th International Conference on Software Engineering
May 2016
1235 pages
ISBN: 9781450339001
DOI: 10.1145/2884781
General Chair: Laura Dillon (Michigan State University)
Program Chairs: Willem Visser (Stellenbosch University, South Africa), Laurie Williams (North Carolina State University)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
API documents
bug localization
bug reports
skip-gram model
word embeddings
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 276 of 1,856 submissions, 15%

Article Metrics
- Total Citations: 205
- Total Downloads: 1,863
- Downloads (Last 12 months): 131
- Downloads (Last 6 weeks): 18