Research Article
DOI: 10.1145/2925426.2926269

Peruse and Profit: Estimating the Accelerability of Loops

Published: 01 June 2016

Abstract

Developers today can target a multitude of execution models, ranging from general-purpose processors to fixed-function hardware accelerators, with many variations in between. There is a growing demand to assess the potential benefit of porting or rewriting an application for a target architecture in order to fully exploit the performance and/or energy efficiency such targets offer. As a first step in this process, however, it is necessary to determine whether the application has characteristics suitable for acceleration.
In this paper, we present Peruse, a tool that characterizes the loops in an application and helps the programmer understand their amenability to acceleration. We consider a diverse set of features ranging from loop characteristics (e.g., loop exit points) and operation mixes (e.g., control vs. data operations) to wider code-region characteristics (e.g., idempotency, vectorizability). Peruse is language-, architecture-, and input-independent and performs its characterization on the compiler's intermediate representation. Relying on static analysis keeps Peruse scalable and enables it to analyze large applications to identify and extract interesting loops suitable for acceleration. We show analysis results for unmodified applications from the SPEC CPU benchmark suite, Polybench, and HPC workloads.
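To make the loop-characterization step concrete, the sketch below shows the kind of per-loop feature vector a static analysis in this spirit could produce. The toy IR, the opcode categories, and the feature names are illustrative assumptions only; Peruse itself works on real compiler IR and computes a richer feature set.

```python
# Illustrative sketch only: a toy "IR" and a few loop features in the spirit of
# those the abstract lists (exit points, operation mix). The representation and
# feature names are assumptions, not Peruse's actual implementation.
from collections import Counter

# A loop body modeled as (opcode, kind) pairs, where kind is a coarse category:
# 'control', 'memory', or 'compute'. This stands in for compiler IR instructions.
TOY_LOOP = [
    ("load",  "memory"),
    ("fmul",  "compute"),
    ("fadd",  "compute"),
    ("store", "memory"),
    ("icmp",  "control"),
    ("br",    "control"),   # the single exit/backedge branch of this toy loop
]

def loop_features(instructions):
    """Return a dict of simple, statically derivable per-loop features."""
    kinds = Counter(kind for _, kind in instructions)
    total = len(instructions)
    exit_branches = sum(1 for op, _ in instructions if op == "br")
    return {
        "num_instructions":  total,
        "frac_control_ops":  kinds["control"] / total,
        "frac_memory_ops":   kinds["memory"] / total,
        "frac_compute_ops":  kinds["compute"] / total,
        "num_exit_branches": exit_branches,  # proxy for loop exit points
    }

if __name__ == "__main__":
    print(loop_features(TOY_LOOP))
```

One such feature dictionary per loop is exactly the kind of row that can feed the speedup predictor described in the next paragraph.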
For an end user, an estimate of the potential speedup from acceleration is even more useful. We therefore use Peruse's workload-characterization results as features and develop a machine-learning-based model that predicts the potential speedup of a loop when offloaded to a fixed-function hardware accelerator. Using this model to predict the speedup of the loops selected by Peruse, we achieve an accuracy of 79%.
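The abstract does not fix a particular learner, so the sketch below uses scikit-learn as a stand-in to show the overall shape of the approach: static per-loop features in, a "profitable to accelerate" label out, with accuracy estimated by cross-validation. The synthetic data, feature layout, and choice of gradient boosting are assumptions for illustration; the paper's model, features, and the 79% figure come from its own training setup.

```python
# Illustrative stand-in for the speedup model described in the abstract:
# per-loop static features -> predicted accelerability. scikit-learn is used
# purely for convenience; the paper's actual learner and features may differ.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: each row is a loop, columns are static features
# (e.g., fraction of compute ops, fraction of memory ops, exit-branch count).
X = rng.random((200, 3))

# Synthetic label: 1 if offloading the loop is assumed worthwhile (speedup
# above some threshold), 0 otherwise, from a made-up rule just to have labels.
y = (X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200) > 0.25).astype(int)

model = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

Framing the prediction as classification against a speedup threshold is one plausible design; a regression over continuous speedups would be an equally reasonable variant.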


Cited By

  • (2023) Early DSE and Automatic Generation of Coarse-grained Merged Accelerators. ACM Transactions on Embedded Computing Systems, 22(2):1-29. DOI: 10.1145/3546070. Online publication date: 24-Jan-2023.
  • (2021) NOVIA: A Framework for Discovering Non-Conventional Inline Accelerators. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 507-521. DOI: 10.1145/3466752.3480094. Online publication date: 18-Oct-2021.
  • (2021) Characterizing Loop Acceleration in Heterogeneous Computing. 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 445-455. DOI: 10.1109/CLOUD53861.2021.00059. Online publication date: Sep-2021.
  • (2019) Would it be Profitable Enough to Re-adapt Algorithmic Thinking for Parallelism Paradigm. 2019 2nd International Conference on new Trends in Computing Sciences (ICTCS), 1-6. DOI: 10.1109/ICTCS.2019.8923085. Online publication date: Oct-2019.

    Published In

    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016
    547 pages
    ISBN: 9781450343619
    DOI: 10.1145/2925426

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 2016


    Author Tags

    1. Accelerator
    2. machine learning
    3. static analysis

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICS '16

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%


