
Predicting Program Properties from "Big Code"

Published: 14 January 2015

Abstract

We present a new approach for predicting program properties from massive codebases (aka "Big Code"). Our approach first learns a probabilistic model from existing data and then uses this model to predict properties of new, unseen programs.
The key idea of our work is to transform the input program into a representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic graphical models such as conditional random fields (CRFs) in order to perform joint prediction of program properties.
As an example of our approach, we built a scalable prediction engine called JSNice for solving two kinds of problems in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNice predicts correct names for 63% of name identifiers, and its type annotation predictions are correct in 81% of cases. In the first week after its release, JSNice was used by more than 30,000 developers, and within only a few months it became a popular tool in the JavaScript developer community.
By formulating the problem of inferring program properties as structured prediction and showing how to perform both learning and inference in this context, our work opens up new possibilities for attacking a wide range of difficult problems in the context of "Big Code" including invariant generation, decompilation, synthesis and others.
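
To make the idea of joint prediction concrete, here is a minimal sketch in Python, not the JSNice implementation. It assumes a program has already been turned into a small graph whose nodes are identifiers and whose edges are named relations. The training edges, the relation names, the candidate list, and the helper names `learn_weights`, `score`, and `map_inference` are all hypothetical. Feature weights are simple co-occurrence counts and inference is an exhaustive search, both swapped in for simplicity in place of the learning and inference procedures developed in the paper.

```python
from collections import Counter
from itertools import product

# Toy "Big Code": edges (label_a, relation, label_b) seen in existing programs.
# These edges and relation names are invented for illustration only.
TRAINING_EDGES = [
    ("i", "index_of", "items"),
    ("i", "less_than", "length"),
    ("i", "index_of", "items"),
    ("j", "index_of", "rows"),
    ("items", "has_property", "length"),
    ("len", "assigned_from", "length"),
]

def learn_weights(edges):
    """Assumed simplification: a feature's weight is how often it was observed."""
    return Counter(edges)

def score(assignment, graph, known, weights):
    """Sum the weights of all pairwise features induced by a full labeling."""
    labels = {**known, **assignment}
    return sum(weights[(labels[a], rel, labels[b])] for a, rel, b in graph)

def map_inference(unknown, candidates, graph, known, weights):
    """Exhaustively search for the jointly highest-scoring assignment."""
    best, best_score = None, float("-inf")
    for combo in product(candidates, repeat=len(unknown)):
        assignment = dict(zip(unknown, combo))
        current = score(assignment, graph, known, weights)
        if current > best_score:
            best, best_score = assignment, current
    return best, best_score

# A minified program: `a` and `b` have unknown names, `n1` is the known
# property `length`. Each edge relates two nodes of the program.
graph = [
    ("a", "index_of", "b"),
    ("a", "less_than", "n1"),
    ("b", "has_property", "n1"),
]
known = {"n1": "length"}
weights = learn_weights(TRAINING_EDGES)
candidates = ["i", "j", "items", "rows", "len"]

prediction, total = map_inference(["a", "b"], candidates, graph, known, weights)
print(prediction, total)  # {'a': 'i', 'b': 'items'} 4
```

The point of the sketch is that the names of `a` and `b` are chosen together: the label picked for one node changes which features fire for its neighbors, which is what distinguishes structured prediction from predicting each name independently. Exhaustive search only works for tiny graphs; a realistic system needs approximate inference.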

Published In

ACM Conferences
POPL '15: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
January 2015
716 pages
ISBN: 9781450333009
DOI: 10.1145/2676726
  • ACM SIGPLAN Notices, Volume 50, Issue 1
    POPL '15
    January 2015
    682 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2775051
    Editor: Andy Gill
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big code
  2. closure compiler
  3. conditional random fields
  4. javascript
  5. names
  6. program properties
  7. structured prediction
  8. types

Qualifiers

  • Research-article

Conference

POPL '15

Acceptance Rates

POPL '15 paper acceptance rate: 52 of 227 submissions (23%)
Overall acceptance rate: 860 of 4,328 submissions (20%)
