research-article

Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign

Authors:
Tony Nowatzki

University of California

University of California
View Profile

,
Newsha Ardalani

SimpleMachines, Inc.

SimpleMachines, Inc.
View Profile

,
Karthikeyan Sankaralingam

University of Wisconsin-Madison

University of Wisconsin-Madison
View Profile

,
Jian Weng

University of California

University of California
View Profile

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation TechniquesNovember 2018Article No.: 36Pages 1–15https://doi.org/10.1145/3243176.3243212

Published:01 November 2018Publication History

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

Pages 1–15

ABSTRACT

Recent programmable accelerators are faster and more energy efficient than general purpose processors, but expose complex hardware/software abstractions for compilers. A key problem is instruction scheduling, which requires sophisticated algorithms for mapping instructions to distributed processing elements, routing of operand dependences, and timing the arrival of operands to enable high throughput.

The complex dependences between mapping, communication and timing make prior scheduling techniques insufficient. Optimization-based approaches are too slow, and heuristic-based approaches cannot achieve high quality. Our first insight is that the algorithm can be solved in a series of phases with overlapping responsibilities to reduce complexity. Second, it is possible to combine optimization-based and stochastic-heuristic based search strategies, to exploit the best features of both. This leads to the two primary techniques we explore, phase overlapping and hybridization.

In this work we explore the codesign of scheduling algorithms with a challenging-to-schedule programmable accelerator. We show we can improve its area by 35% by trimming its scheduling-friendly structures, using a scheduling algorithm that is 5× faster than the state-of-the-art optimization-based scheduler, with up to 2× better throughput.

References

{n. d.}. GAMS, http://www.gams.com/. ({n. d.}).Google Scholar
{n. d.}. IBM ILOG CPLEX, https://www.ibm.com/us-en/marketplace/ibm-ilog-cplex. ({n. d.}).Google Scholar
Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. 1995. Software Pipelining. ACM Comput. Surv. 27, 3 (Sept. 1995), 367--432. Google ScholarDigital Library
S. Amarasinghe, D. R. Karger, W. Lee, and V. S. Mirrokni. 2002. A Theoretical and Practical Approach to Instruction Scheduling on Spatial Architectures. Technical Report. MIT.Google Scholar
S. Amellal and B. Kaminska. 1994. Functional synthesis of digital systems with TASS. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 13, 5 (may 1994), 537 --552. Google ScholarDigital Library
Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: Constructing Hardware in a Scala Embedded Language. In Proceedings of the 49th Annual Design Automation Conference (DAC'12). ACM, New York, NY, USA, 1216--1225. Google ScholarDigital Library
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. SIGARCH Comput. Archit. News (2011). Google ScholarDigital Library
HJ Caulfield, WT Rhodes, MJ Foster, and Sam Horvitz. 1981. Optical implementation of systolic array processing. Optics Communications 40, 2 (1981), 86--90.Google ScholarCross Ref
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 269--284. Google ScholarDigital Library
Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-efficient Dataflow for Convolutional Neural Networks. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 367--379. Google ScholarDigital Library
Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, and Krisztian Flautner. 2004. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization. In MICRO. Google ScholarDigital Library
Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Hui Huang, and Glenn Reinman. 2013. Composable Accelerator-rich Microprocessor Enhanced for Adaptivity and Longevity. In ISLPED. Google ScholarDigital Library
Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, and Glenn Reinman. 2012. CHARM: A Composable Heterogeneous Accelerator-rich Microprocessor. In ISPLED. Google ScholarDigital Library
Jason Cong, Karthik Gururaj, Guoling Han, and Wei Jiang. 2009. Synthesis algorithm for application-specific homogeneous processor networks. IEEE Trans. Very Large Scale Integr. Syst. 17, 9 (Sept. 2009). Google ScholarDigital Library
J. Cong, H. Huang, and M. A. Ghodrat. 2016. A scalable communication-aware compilation flow for programmable accelerators. In 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC). 503--510.Google Scholar
Jason Cong, Hui Huang, Chiyuan Ma, Bingjun Xiao, and Peipei Zhou. 2014. A fully pipelined and dynamically composable architecture of CGRA. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on. IEEE, 9--16. Google ScholarDigital Library
Katherine E. Coons, Xia Chen, Doug Burger, Kathryn S. McKinley, and Sundeep K. Kushwaha. 2006. A spatial path scheduling algorithm for EDGE architectures. SIGARCH Comput. Archit. News 34, 5 (Oct. 2006), 129--140. Google ScholarDigital Library
John R. Ellis. 1985. Bulldog: a compiler for vliw architectures. Ph.D. Dissertation. Google ScholarDigital Library
Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. International Journal of Parallel Programming 21 (1992), 313--347. Issue 5. Google ScholarDigital Library
Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. 2012. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing. IEEE Micro 32, 5 (Sept. 2012), 38--51. Google ScholarDigital Library
Venkatraman Govindaraju, Tony Nowatzki, and Karthikeyan Sankar-alingam. 2013. Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG. In PACT. Google ScholarDigital Library
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August. 2011. Bundled execution of recurring traces for energy-efficient general purpose processing. In MICRO. Google ScholarDigital Library
Zhining Huang, Sharad Malik, Nahri Moreano, and Guido Araujo. 2004. The design of dynamically reconfigurable datapath coprocessors. ACM Trans. Embed. Comput. Syst. 3, 2 (May 2004), 361--384. Google ScholarDigital Library
Rajeev Joshi, Greg Nelson, and Keith Randall. 2002. Denali: a goal-directed superoptimizer. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation (PLDI '02). 304--314. Google ScholarDigital Library
Krishnan Kailas, Ashok Agrawala, and Kemal Ebcioglu. 2001. CARS: A New Code Generation Framework for Clustered ILP Processors. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01). 133--. http://dl.acm.org/citation.cfm?id=580550.876436 Google ScholarDigital Library
Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Re-configurable Interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 461--475. Google ScholarDigital Library
Monica Sin-Ling Lam. 1987. A Systolic Array Optimizing Compiler. Ph.D. Dissertation. Pittsburgh, PA, USA. AAI8814722.Google Scholar
Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe. 1998. Space-time Scheduling of Instruction-level Parallelism on a Raw Machine. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII). ACM, New York, NY, USA, 46--57. Google ScholarDigital Library
D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K. Kim, and H. Esmaeilzadeh. 2016. TABLA: A unified template-based framework for accelerating statistical machine learning. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 14--26.Google Scholar
Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, and Rudy Lauwereins. 2003. ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix. In International Conference on Field Programmable Logic and Applications. Springer, 61--70.Google ScholarCross Ref
B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins. 2003. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. IEE Proceedings - Computers and Digital Techniques 150, 5 (Sept 2003), 255--61--.Google ScholarCross Ref
Martha Mercaldi, Steven Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, and Susan J. Eggers. 2006. Instruction scheduling for a tiled dataflow architecture. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (ASPLOS XII). 141--150. Google ScholarDigital Library
Ramadass Nagarajan, Sundeep K. Kushwaha, Doug Burger, Kathryn S. McKinley, Calvin Lin, and Stephen W. Keckler. 2004. Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT '04). 74--84. Google ScholarDigital Library
Chris Nicol. 2017. A Coarse Grain Reconfigurable Array (CGRA) for Statically Scheduled Data Flow Computing. Wave Computing White Paper (2017).Google Scholar
Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. 2017. Stream-Dataflow Acceleration. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 416--429. Google ScholarDigital Library
Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam, and Greg Wright. 2016. Pushing the limits of accelerator efficiency while retaining programmability. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 27--39.Google ScholarCross Ref
Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2013. A General Constraint-centric Scheduling Framework for Spatial Architectures. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 495--506. Google ScholarDigital Library
Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2014. A Scheduling Framework for Spatial Architectures Across Multiple Constraint-Solving Theories. ACM Trans. Program. Lang. Syst. 37, 1, Article 2 (Nov. 2014), 30 pages. Google ScholarDigital Library
Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2014. A Scheduling Framework for Spatial Architectures Across Multiple Constraint-Solving Theories. ACM Trans. Program. Lang. Syst. 37, 1, Article 2 (Nov. 2014), 30 pages. Google ScholarDigital Library
Emre Özer, Sanjeev Banerjia, and Thomas M. Conte. 1998. Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures. In Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture (MICRO 31). 308--315. http://dl.acm.org/citation.cfm?id=290940.291004 Google ScholarDigital Library
Jens Palsberg and MpSOC Mayur Naik. 2004. ILP-based Resource-aware Compilation. (2004).Google Scholar
Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, and Joel Emer. 2013. Triggered Instructions: A Control Paradigm for Spatially-programmed Architectures. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 142--153. Google ScholarDigital Library
Hyunchul Park, Kevin Fan, Scott A. Mahlke, Taewook Oh, Heeseok Kim, and Hong-seok Kim. 2008. Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08). 166--176. Google ScholarDigital Library
Jongse Park, Hardik Sharma, Divya Mahajan, Joon Kyung Kim, Preston Olds, and Hadi Esmaeilzadeh. 2017. Scale-out Acceleration for Machine Learning. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 '17). ACM, New York, NY, USA, 367--381. Google ScholarDigital Library
Yongjun Park, Hyunchul Park, and Scott Mahlke. 2009. CGRA Express: Accelerating Execution Using Dynamic Operation Fusion. In Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '09). ACM, New York, NY, USA, 271--280. Google ScholarDigital Library
Yongjun Park, Jason Jong Kyu Park, Hyunchul Park, and Scott Mahlke. 2012. Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 84--95. Google ScholarDigital Library
Phitchaya Mangpo Phothilimthana, Tikhon Jelvis, Rohin Shah, Nishant Totla, Sarah Chasins, and Rastislav Bodik. 2014. Chlorophyll: Synthesisaided Compiler for Low-power Spatial Architectures. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 396--407. Google ScholarDigital Library
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A Reconfigurable Architecture For Parallel Paterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 389--402. Google ScholarDigital Library
Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. Machsuite: Benchmarks for accelerator design and customized architectures. In Workload Characterization (IISWC), 2014 IEEE International Symposium on. IEEE, 110--119.Google ScholarCross Ref
Roddy Urquhart and Will Moore and Andrew McCabe. 1987. Systolic Arrays. Institute of Physics Publishing.Google Scholar
Karthikeyan Sankaralingam, Ramadass Nagarajan, Robert McDonald, Rajagopalan Desikan, Saurabh Drolia, M.S. Govindan, Paul Gratz, Divya Gulati, Heather Hanson, Changkyu Kim, Haiming Liu, Nitya Ranganathan, Simha Sethumadhavan, Sadia Sharif, Premkishore Shivakumar, Stephen W. Keckler, and Doug Burger. 2006. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor. In MICRO. Google ScholarDigital Library
Nadathur Satish, Kaushik Ravindran, and Kurt Keutzer. 2007. A decomposition-based constraint optimization approach for statically scheduling task graphs with communication delays to multiprocessors. In DATE '07. Google ScholarDigital Library
Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. 2003. WaveScalar. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, Washington, DC, USA, 291--. Google ScholarDigital Library
Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. 2002. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro 22, 2 (March 2002), 25--35. Google ScholarDigital Library
J. J. Tithi, N. C. Crago, and J. S. Emer. 2014. Exploiting spatial architectures for edit distance algorithms. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 23--34.Google Scholar
Dani Voitsechov and Yoav Etsion. 2014. Single-graph Multiple Flows: Energy Efficient Design Alternative for GPGPUs. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 205--216. Google ScholarDigital Library
Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017 (DAC '17). ACM, New York, NY, USA, Article 29, 6 pages. Google ScholarDigital Library

Recommendations

Effective instruction scheduling with limited registers
Read More
Lazy instruction scheduling: keeping performance, reducing power
ISLPED '08: Proceedings of the 2008 international symposium on Low Power Electronics & Design

An important approach to reduce power dissipation is reducing the number of instructions executed by the processor. To achieve this goal, this paper introduces a novel instruction scheduling algorithm that executes an instruction only when its result is ...
Read More
Compiler optimization on VLIW instruction scheduling for low power

In this article, we investigate compiler transformation techniques regarding the problem of scheduling VLIW instructions aimed at reducing power consumption of VLIW architectures in the instruction bus. The problem can be categorized into two types: ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
November 2018
494 pages
ISBN:9781450359863
DOI:10.1145/3243176
General Chair:
Skevos Evripidou
University of Cyprus, Cyprus
,
Program Chairs:
Per Stenström
Chalmers University of Technology, Sweden
,
Michael O'Boyle
University of Edinburgh, UK
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 November 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate121of471submissions,26%
Upcoming Conference
PACT '24

Sponsor:

sigarch

International Conference on Parallel Architectures and Compilation Techniques

October 14 - 16, 2024

Southern California , CA , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 33
  Total Citations
  View Citations
- 447
  Total Downloads
- Downloads (Last 12 months)52
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

ABSTRACT

References

Cited By

Recommendations

Effective instruction scheduling with limited registers

Lazy instruction scheduling: keeping performance, reducing power

Compiler optimization on VLIW instruction scheduling for low power

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

ABSTRACT

References

Cited By

Recommendations

Effective instruction scheduling with limited registers

Lazy instruction scheduling: keeping performance, reducing power

Compiler optimization on VLIW instruction scheduling for low power

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media