
COBAYN: Compiler Autotuning Framework Using Bayesian Networks

Published: 14 June 2016

Abstract

The variety of today’s architectures forces programmers to spend a great deal of time porting and tuning application codes across different platforms. Compilers themselves need additional tuning, which is a complex task because the standard optimization levels, usually designed for the average case and a specific target architecture, often fail to deliver the best results.

This article proposes COBAYN (Compiler autotuning framework using BAYesian Networks), a compiler autotuning methodology that uses machine learning to speed up application performance and to reduce the cost of the compiler optimization phases. The framework characterizes applications dynamically by means of microarchitecture-independent features and uses Bayesian networks to predict promising compiler configurations. The article also presents an evaluation based on static analysis and hybrid feature-collection approaches, and compares Bayesian networks with several state-of-the-art machine-learning models.
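To make the flow concrete, below is a minimal, hypothetical Python sketch of the kind of model this suggests: learn, from training compilations, how likely each compiler flag is to help a program with given characteristics, then condition on the features of a new application and sample candidate configurations from the learned distribution. This is not the authors' implementation; the full Bayesian network is deliberately simplified here to independent per-flag probabilities conditioned on a discrete feature cluster, and all names (fit, sample_configs, the flag list) are illustrative.

    # Minimal, hypothetical sketch of COBAYN-style flag prediction; not the
    # authors' implementation. The full Bayesian network is simplified to
    # independent per-flag probabilities conditioned on a discrete
    # program-feature cluster.
    import random
    from collections import defaultdict

    FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer"]

    def fit(train_runs):
        """train_runs: (cluster_id, flag_vector) pairs taken from the
        best-performing compilations seen in training. Returns Laplace-
        smoothed estimates of P(flag = on | cluster)."""
        counts = defaultdict(lambda: [1, 2])  # [on-count, total], Laplace prior
        for cluster, flags in train_runs:
            for i, on in enumerate(flags):
                counts[(cluster, i)][0] += on
                counts[(cluster, i)][1] += 1
        return {key: on / total for key, (on, total) in counts.items()}

    def sample_configs(model, cluster, n):
        """Condition on the new application's feature cluster (the evidence)
        and sample n candidate flag configurations."""
        return [[f for i, f in enumerate(FLAGS)
                 if random.random() < model.get((cluster, i), 0.5)]
                for _ in range(n)]

    # Toy training data: cluster-0 programs benefited from vectorization.
    model = fit([(0, [0, 1, 1]), (0, [1, 1, 0]), (1, [1, 0, 1])])
    for cfg in sample_configs(model, cluster=0, n=3):
        print("gcc -O2", " ".join(cfg))

Only the sampled configurations, a small fraction of the full flag search space, then need to be compiled and measured on the target.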

Experiments were carried out on an ARM embedded platform with the GCC compiler, considering two benchmark suites comprising 39 applications. The set of compiler configurations selected by the model (less than 7% of the search space) demonstrated an application speedup of up to 4.6× on Polybench (1.85× on average) and 3.1× on cBench (1.54× on average) with respect to the standard optimization levels. Moreover, comparing the proposed technique with (i) random iterative compilation, (ii) machine-learning-based iterative compilation, and (iii) noniterative predictive modeling techniques shows, on average, 1.2×, 1.37×, and 1.48× speedup, respectively. Finally, the proposed method demonstrates 4× and 3× speedup on cBench and Polybench, respectively, in terms of exploration efficiency for the same quality of solutions generated by the random iterative-compilation model.
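For context on baseline (i), the following hedged sketch shows what random iterative compilation amounts to: draw random subsets of optimization flags on top of -O2, compile, time the binary, and keep the best configuration found within a fixed budget. The source file name, flag list, and budget are illustrative assumptions, not the paper's actual experimental setup.

    # Hedged sketch of the random iterative-compilation baseline: draw
    # random flag subsets on top of -O2, compile, time the binary, keep
    # the best. Source file, flag list, and budget are illustrative.
    import random
    import subprocess
    import time

    FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer",
             "-finline-functions", "-fprefetch-loop-arrays"]

    def evaluate(flags, src="kernel.c", exe="./kernel"):
        """Compile src with the given flags and return the wall-clock
        runtime of the resulting binary in seconds."""
        subprocess.run(["gcc", "-O2", *flags, src, "-o", exe], check=True)
        start = time.perf_counter()
        subprocess.run([exe], check=True)
        return time.perf_counter() - start

    def random_search(budget=20):
        best_flags, best_time = [], float("inf")
        for _ in range(budget):
            flags = [f for f in FLAGS if random.random() < 0.5]
            runtime = evaluate(flags)
            if runtime < best_time:
                best_flags, best_time = flags, runtime
        return best_flags, best_time

The exploration-efficiency claim above says the model reaches the quality such a random search finds, but with roughly 3× to 4× fewer compile-and-run evaluations.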



Published in

ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 2
June 2016, 200 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2952301

Copyright © 2016 ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 November 2015
• Revised: 1 March 2016
• Accepted: 1 April 2016
• Published: 14 June 2016
