Predicting Program Properties from "Big Code"

Authors: Veselin Raychev, Martin Vechev, Andreas Krause

ACM SIGPLAN Notices, Volume 50, Issue 1

Pages 111 - 124

https://doi.org/10.1145/2775051.2677009

Published: 14 January 2015

Abstract

We present a new approach for predicting program properties from massive codebases (aka "Big Code"). Our approach first learns a probabilistic model from existing data and then uses this model to predict properties of new, unseen programs.

The key idea of our work is to transform the input program into a representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic graphical models such as conditional random fields (CRFs) in order to perform joint prediction of program properties.

As an example of our approach, we built a scalable prediction engine called JSNice for solving two kinds of problems in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNice predicts correct names for 63% of name identifiers, and its type annotation predictions are correct in 81% of the cases. In the first week after its release, JSNice was used by more than 30,000 developers, and in only a few months it has become a popular tool in the JavaScript developer community.

By formulating the problem of inferring program properties as structured prediction and showing how to perform both learning and inference in this context, our work opens up new possibilities for attacking a wide range of difficult problems in the context of "Big Code" including invariant generation, decompilation, synthesis and others.
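
The structured-prediction formulation described in the abstract can be illustrated with a minimal sketch. The code below is not the JSNice implementation: the node names, relation names, candidate labels, and weights are all invented for illustration, and the approximate inference routine is iterated conditional modes, swapped in as one simple way to search for a high-scoring joint assignment. It assumes a log-linear score that sums pairwise feature weights over the edges of a small dependency network, and it predicts the labels of two unknown identifiers jointly rather than one at a time.

```python
from itertools import product

# Hypothetical toy problem: two unknown identifier names (nodes "a" and "b")
# related to each other and to one known property (node "k"). Candidates,
# relation names, and weights are invented for illustration only.
CANDIDATES = {"a": ["i", "len", "step"], "b": ["i", "len", "step"]}
KNOWN = {"k": "length"}

# Edges of the dependency network: (node, node, relation).
EDGES = [("a", "b", "less_than"), ("b", "k", "assigned_from")]

# In a real system these weights would be learned from a code corpus.
WEIGHTS = {
    ("i", "len", "less_than"): 0.9,
    ("i", "step", "less_than"): 0.3,
    ("step", "len", "less_than"): 0.4,
    ("len", "length", "assigned_from"): 0.8,
    ("step", "length", "assigned_from"): 0.1,
}


def score(assignment):
    """Sum pairwise feature weights over all edges (a log-linear score)."""
    labels = {**KNOWN, **assignment}
    return sum(WEIGHTS.get((labels[u], labels[v], rel), 0.0)
               for u, v, rel in EDGES)


def map_exhaustive():
    """Exact MAP by enumerating every joint assignment (tiny graphs only)."""
    nodes = sorted(CANDIDATES)
    best = max(product(*(CANDIDATES[n] for n in nodes)),
               key=lambda labels: score(dict(zip(nodes, labels))))
    return dict(zip(nodes, best))


def map_icm(start, max_rounds=10):
    """Approximate MAP: re-label one node at a time while the score improves."""
    current = dict(start)
    for _ in range(max_rounds):
        changed = False
        for node in sorted(CANDIDATES):
            best = max(CANDIDATES[node],
                       key=lambda c: score({**current, node: c}))
            if score({**current, node: best}) > score(current):
                current[node] = best
                changed = True
        if not changed:
            break
    return current
```

On this toy data, `map_exhaustive()` selects `a = "i"` and `b = "len"`, and `map_icm({"a": "step", "b": "step"})` reaches the same assignment. Exhaustive enumeration grows exponentially with the number of unknown nodes, which is why approximate inference is the practical choice on real programs.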

Published In

ACM SIGPLAN Notices, Volume 50, Issue 1
POPL '15, January 2015
682 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2775051
Editor: Andy Gill, University of Kansas, Lawrence, KS

POPL '15: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
January 2015
716 pages
ISBN: 9781450333009
DOI: 10.1145/2676726
General Chair: Sriram Rajamani, Microsoft Research, India
Program Chair: David Walker, Princeton University, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
