ABSTRACT
Recent programmable accelerators are faster and more energy efficient than general purpose processors, but expose complex hardware/software abstractions for compilers. A key problem is instruction scheduling, which requires sophisticated algorithms for mapping instructions to distributed processing elements, routing of operand dependences, and timing the arrival of operands to enable high throughput.
The complex dependences between mapping, communication and timing make prior scheduling techniques insufficient. Optimization-based approaches are too slow, and heuristic-based approaches cannot achieve high quality. Our first insight is that the algorithm can be solved in a series of phases with overlapping responsibilities to reduce complexity. Second, it is possible to combine optimization-based and stochastic-heuristic based search strategies, to exploit the best features of both. This leads to the two primary techniques we explore, phase overlapping and hybridization.
In this work we explore the codesign of scheduling algorithms with a challenging-to-schedule programmable accelerator. We show we can improve its area by 35% by trimming its scheduling-friendly structures, using a scheduling algorithm that is 5× faster than the state-of-the-art optimization-based scheduler, with up to 2× better throughput.
- {n. d.}. GAMS, http://www.gams.com/. ({n. d.}).Google Scholar
- {n. d.}. IBM ILOG CPLEX, https://www.ibm.com/us-en/marketplace/ibm-ilog-cplex. ({n. d.}).Google Scholar
- Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. 1995. Software Pipelining. ACM Comput. Surv. 27, 3 (Sept. 1995), 367--432. Google ScholarDigital Library
- S. Amarasinghe, D. R. Karger, W. Lee, and V. S. Mirrokni. 2002. A Theoretical and Practical Approach to Instruction Scheduling on Spatial Architectures. Technical Report. MIT.Google Scholar
- S. Amellal and B. Kaminska. 1994. Functional synthesis of digital systems with TASS. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 13, 5 (may 1994), 537 --552. Google ScholarDigital Library
- Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: Constructing Hardware in a Scala Embedded Language. In Proceedings of the 49th Annual Design Automation Conference (DAC'12). ACM, New York, NY, USA, 1216--1225. Google ScholarDigital Library
- Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. SIGARCH Comput. Archit. News (2011). Google ScholarDigital Library
- HJ Caulfield, WT Rhodes, MJ Foster, and Sam Horvitz. 1981. Optical implementation of systolic array processing. Optics Communications 40, 2 (1981), 86--90.Google ScholarCross Ref
- Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 269--284. Google ScholarDigital Library
- Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-efficient Dataflow for Convolutional Neural Networks. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 367--379. Google ScholarDigital Library
- Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, and Krisztian Flautner. 2004. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization. In MICRO. Google ScholarDigital Library
- Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Hui Huang, and Glenn Reinman. 2013. Composable Accelerator-rich Microprocessor Enhanced for Adaptivity and Longevity. In ISLPED. Google ScholarDigital Library
- Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, and Glenn Reinman. 2012. CHARM: A Composable Heterogeneous Accelerator-rich Microprocessor. In ISPLED. Google ScholarDigital Library
- Jason Cong, Karthik Gururaj, Guoling Han, and Wei Jiang. 2009. Synthesis algorithm for application-specific homogeneous processor networks. IEEE Trans. Very Large Scale Integr. Syst. 17, 9 (Sept. 2009). Google ScholarDigital Library
- J. Cong, H. Huang, and M. A. Ghodrat. 2016. A scalable communication-aware compilation flow for programmable accelerators. In 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC). 503--510.Google Scholar
- Jason Cong, Hui Huang, Chiyuan Ma, Bingjun Xiao, and Peipei Zhou. 2014. A fully pipelined and dynamically composable architecture of CGRA. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on. IEEE, 9--16. Google ScholarDigital Library
- Katherine E. Coons, Xia Chen, Doug Burger, Kathryn S. McKinley, and Sundeep K. Kushwaha. 2006. A spatial path scheduling algorithm for EDGE architectures. SIGARCH Comput. Archit. News 34, 5 (Oct. 2006), 129--140. Google ScholarDigital Library
- John R. Ellis. 1985. Bulldog: a compiler for vliw architectures. Ph.D. Dissertation. Google ScholarDigital Library
- Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. International Journal of Parallel Programming 21 (1992), 313--347. Issue 5. Google ScholarDigital Library
- Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. 2012. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing. IEEE Micro 32, 5 (Sept. 2012), 38--51. Google ScholarDigital Library
- Venkatraman Govindaraju, Tony Nowatzki, and Karthikeyan Sankar-alingam. 2013. Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG. In PACT. Google ScholarDigital Library
- Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August. 2011. Bundled execution of recurring traces for energy-efficient general purpose processing. In MICRO. Google ScholarDigital Library
- Zhining Huang, Sharad Malik, Nahri Moreano, and Guido Araujo. 2004. The design of dynamically reconfigurable datapath coprocessors. ACM Trans. Embed. Comput. Syst. 3, 2 (May 2004), 361--384. Google ScholarDigital Library
- Rajeev Joshi, Greg Nelson, and Keith Randall. 2002. Denali: a goal-directed superoptimizer. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation (PLDI '02). 304--314. Google ScholarDigital Library
- Krishnan Kailas, Ashok Agrawala, and Kemal Ebcioglu. 2001. CARS: A New Code Generation Framework for Clustered ILP Processors. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01). 133--. http://dl.acm.org/citation.cfm?id=580550.876436 Google ScholarDigital Library
- Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Re-configurable Interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 461--475. Google ScholarDigital Library
- Monica Sin-Ling Lam. 1987. A Systolic Array Optimizing Compiler. Ph.D. Dissertation. Pittsburgh, PA, USA. AAI8814722.Google Scholar
- Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe. 1998. Space-time Scheduling of Instruction-level Parallelism on a Raw Machine. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII). ACM, New York, NY, USA, 46--57. Google ScholarDigital Library
- D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K. Kim, and H. Esmaeilzadeh. 2016. TABLA: A unified template-based framework for accelerating statistical machine learning. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 14--26.Google Scholar
- Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, and Rudy Lauwereins. 2003. ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix. In International Conference on Field Programmable Logic and Applications. Springer, 61--70.Google ScholarCross Ref
- B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins. 2003. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. IEE Proceedings - Computers and Digital Techniques 150, 5 (Sept 2003), 255--61--.Google ScholarCross Ref
- Martha Mercaldi, Steven Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, and Susan J. Eggers. 2006. Instruction scheduling for a tiled dataflow architecture. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (ASPLOS XII). 141--150. Google ScholarDigital Library
- Ramadass Nagarajan, Sundeep K. Kushwaha, Doug Burger, Kathryn S. McKinley, Calvin Lin, and Stephen W. Keckler. 2004. Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT '04). 74--84. Google ScholarDigital Library
- Chris Nicol. 2017. A Coarse Grain Reconfigurable Array (CGRA) for Statically Scheduled Data Flow Computing. Wave Computing White Paper (2017).Google Scholar
- Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. 2017. Stream-Dataflow Acceleration. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 416--429. Google ScholarDigital Library
- Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam, and Greg Wright. 2016. Pushing the limits of accelerator efficiency while retaining programmability. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 27--39.Google ScholarCross Ref
- Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2013. A General Constraint-centric Scheduling Framework for Spatial Architectures. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 495--506. Google ScholarDigital Library
- Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2014. A Scheduling Framework for Spatial Architectures Across Multiple Constraint-Solving Theories. ACM Trans. Program. Lang. Syst. 37, 1, Article 2 (Nov. 2014), 30 pages. Google ScholarDigital Library
- Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2014. A Scheduling Framework for Spatial Architectures Across Multiple Constraint-Solving Theories. ACM Trans. Program. Lang. Syst. 37, 1, Article 2 (Nov. 2014), 30 pages. Google ScholarDigital Library
- Emre Özer, Sanjeev Banerjia, and Thomas M. Conte. 1998. Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures. In Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture (MICRO 31). 308--315. http://dl.acm.org/citation.cfm?id=290940.291004 Google ScholarDigital Library
- Jens Palsberg and MpSOC Mayur Naik. 2004. ILP-based Resource-aware Compilation. (2004).Google Scholar
- Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, and Joel Emer. 2013. Triggered Instructions: A Control Paradigm for Spatially-programmed Architectures. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 142--153. Google ScholarDigital Library
- Hyunchul Park, Kevin Fan, Scott A. Mahlke, Taewook Oh, Heeseok Kim, and Hong-seok Kim. 2008. Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08). 166--176. Google ScholarDigital Library
- Jongse Park, Hardik Sharma, Divya Mahajan, Joon Kyung Kim, Preston Olds, and Hadi Esmaeilzadeh. 2017. Scale-out Acceleration for Machine Learning. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 '17). ACM, New York, NY, USA, 367--381. Google ScholarDigital Library
- Yongjun Park, Hyunchul Park, and Scott Mahlke. 2009. CGRA Express: Accelerating Execution Using Dynamic Operation Fusion. In Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '09). ACM, New York, NY, USA, 271--280. Google ScholarDigital Library
- Yongjun Park, Jason Jong Kyu Park, Hyunchul Park, and Scott Mahlke. 2012. Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 84--95. Google ScholarDigital Library
- Phitchaya Mangpo Phothilimthana, Tikhon Jelvis, Rohin Shah, Nishant Totla, Sarah Chasins, and Rastislav Bodik. 2014. Chlorophyll: Synthesisaided Compiler for Low-power Spatial Architectures. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 396--407. Google ScholarDigital Library
- Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A Reconfigurable Architecture For Parallel Paterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 389--402. Google ScholarDigital Library
- Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. Machsuite: Benchmarks for accelerator design and customized architectures. In Workload Characterization (IISWC), 2014 IEEE International Symposium on. IEEE, 110--119.Google ScholarCross Ref
- Roddy Urquhart and Will Moore and Andrew McCabe. 1987. Systolic Arrays. Institute of Physics Publishing.Google Scholar
- Karthikeyan Sankaralingam, Ramadass Nagarajan, Robert McDonald, Rajagopalan Desikan, Saurabh Drolia, M.S. Govindan, Paul Gratz, Divya Gulati, Heather Hanson, Changkyu Kim, Haiming Liu, Nitya Ranganathan, Simha Sethumadhavan, Sadia Sharif, Premkishore Shivakumar, Stephen W. Keckler, and Doug Burger. 2006. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor. In MICRO. Google ScholarDigital Library
- Nadathur Satish, Kaushik Ravindran, and Kurt Keutzer. 2007. A decomposition-based constraint optimization approach for statically scheduling task graphs with communication delays to multiprocessors. In DATE '07. Google ScholarDigital Library
- Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. 2003. WaveScalar. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, Washington, DC, USA, 291--. Google ScholarDigital Library
- Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. 2002. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro 22, 2 (March 2002), 25--35. Google ScholarDigital Library
- J. J. Tithi, N. C. Crago, and J. S. Emer. 2014. Exploiting spatial architectures for edit distance algorithms. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 23--34.Google Scholar
- Dani Voitsechov and Yoav Etsion. 2014. Single-graph Multiple Flows: Energy Efficient Design Alternative for GPGPUs. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 205--216. Google ScholarDigital Library
- Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017 (DAC '17). ACM, New York, NY, USA, Article 29, 6 pages. Google ScholarDigital Library
Recommendations
Lazy instruction scheduling: keeping performance, reducing power
ISLPED '08: Proceedings of the 2008 international symposium on Low Power Electronics & DesignAn important approach to reduce power dissipation is reducing the number of instructions executed by the processor. To achieve this goal, this paper introduces a novel instruction scheduling algorithm that executes an instruction only when its result is ...
Compiler optimization on VLIW instruction scheduling for low power
In this article, we investigate compiler transformation techniques regarding the problem of scheduling VLIW instructions aimed at reducing power consumption of VLIW architectures in the instruction bus. The problem can be categorized into two types: ...
Comments