skip to main content
10.1145/3243176.3243212acmconferencesArticle/Chapter ViewAbstractPublication PagespactConference Proceedingsconference-collections
research-article

Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign

Published:01 November 2018Publication History

ABSTRACT

Recent programmable accelerators are faster and more energy efficient than general purpose processors, but expose complex hardware/software abstractions for compilers. A key problem is instruction scheduling, which requires sophisticated algorithms for mapping instructions to distributed processing elements, routing of operand dependences, and timing the arrival of operands to enable high throughput.

The complex dependences between mapping, communication and timing make prior scheduling techniques insufficient. Optimization-based approaches are too slow, and heuristic-based approaches cannot achieve high quality. Our first insight is that the algorithm can be solved in a series of phases with overlapping responsibilities to reduce complexity. Second, it is possible to combine optimization-based and stochastic-heuristic based search strategies, to exploit the best features of both. This leads to the two primary techniques we explore, phase overlapping and hybridization.

In this work we explore the codesign of scheduling algorithms with a challenging-to-schedule programmable accelerator. We show we can improve its area by 35% by trimming its scheduling-friendly structures, using a scheduling algorithm that is 5× faster than the state-of-the-art optimization-based scheduler, with up to 2× better throughput.

References

  1. {n. d.}. GAMS, http://www.gams.com/. ({n. d.}).Google ScholarGoogle Scholar
  2. {n. d.}. IBM ILOG CPLEX, https://www.ibm.com/us-en/marketplace/ibm-ilog-cplex. ({n. d.}).Google ScholarGoogle Scholar
  3. Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. 1995. Software Pipelining. ACM Comput. Surv. 27, 3 (Sept. 1995), 367--432. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Amarasinghe, D. R. Karger, W. Lee, and V. S. Mirrokni. 2002. A Theoretical and Practical Approach to Instruction Scheduling on Spatial Architectures. Technical Report. MIT.Google ScholarGoogle Scholar
  5. S. Amellal and B. Kaminska. 1994. Functional synthesis of digital systems with TASS. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 13, 5 (may 1994), 537 --552. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: Constructing Hardware in a Scala Embedded Language. In Proceedings of the 49th Annual Design Automation Conference (DAC'12). ACM, New York, NY, USA, 1216--1225. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. SIGARCH Comput. Archit. News (2011). Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. HJ Caulfield, WT Rhodes, MJ Foster, and Sam Horvitz. 1981. Optical implementation of systolic array processing. Optics Communications 40, 2 (1981), 86--90.Google ScholarGoogle ScholarCross RefCross Ref
  9. Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 269--284. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-efficient Dataflow for Convolutional Neural Networks. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 367--379. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, and Krisztian Flautner. 2004. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization. In MICRO. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Hui Huang, and Glenn Reinman. 2013. Composable Accelerator-rich Microprocessor Enhanced for Adaptivity and Longevity. In ISLPED. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, and Glenn Reinman. 2012. CHARM: A Composable Heterogeneous Accelerator-rich Microprocessor. In ISPLED. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Jason Cong, Karthik Gururaj, Guoling Han, and Wei Jiang. 2009. Synthesis algorithm for application-specific homogeneous processor networks. IEEE Trans. Very Large Scale Integr. Syst. 17, 9 (Sept. 2009). Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Cong, H. Huang, and M. A. Ghodrat. 2016. A scalable communication-aware compilation flow for programmable accelerators. In 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC). 503--510.Google ScholarGoogle Scholar
  16. Jason Cong, Hui Huang, Chiyuan Ma, Bingjun Xiao, and Peipei Zhou. 2014. A fully pipelined and dynamically composable architecture of CGRA. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on. IEEE, 9--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Katherine E. Coons, Xia Chen, Doug Burger, Kathryn S. McKinley, and Sundeep K. Kushwaha. 2006. A spatial path scheduling algorithm for EDGE architectures. SIGARCH Comput. Archit. News 34, 5 (Oct. 2006), 129--140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. John R. Ellis. 1985. Bulldog: a compiler for vliw architectures. Ph.D. Dissertation. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. International Journal of Parallel Programming 21 (1992), 313--347. Issue 5. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. 2012. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing. IEEE Micro 32, 5 (Sept. 2012), 38--51. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Venkatraman Govindaraju, Tony Nowatzki, and Karthikeyan Sankar-alingam. 2013. Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG. In PACT. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August. 2011. Bundled execution of recurring traces for energy-efficient general purpose processing. In MICRO. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Zhining Huang, Sharad Malik, Nahri Moreano, and Guido Araujo. 2004. The design of dynamically reconfigurable datapath coprocessors. ACM Trans. Embed. Comput. Syst. 3, 2 (May 2004), 361--384. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Rajeev Joshi, Greg Nelson, and Keith Randall. 2002. Denali: a goal-directed superoptimizer. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation (PLDI '02). 304--314. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Krishnan Kailas, Ashok Agrawala, and Kemal Ebcioglu. 2001. CARS: A New Code Generation Framework for Clustered ILP Processors. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01). 133--. http://dl.acm.org/citation.cfm?id=580550.876436 Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Re-configurable Interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 461--475. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Monica Sin-Ling Lam. 1987. A Systolic Array Optimizing Compiler. Ph.D. Dissertation. Pittsburgh, PA, USA. AAI8814722.Google ScholarGoogle Scholar
  28. Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe. 1998. Space-time Scheduling of Instruction-level Parallelism on a Raw Machine. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII). ACM, New York, NY, USA, 46--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K. Kim, and H. Esmaeilzadeh. 2016. TABLA: A unified template-based framework for accelerating statistical machine learning. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 14--26.Google ScholarGoogle Scholar
  30. Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, and Rudy Lauwereins. 2003. ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix. In International Conference on Field Programmable Logic and Applications. Springer, 61--70.Google ScholarGoogle ScholarCross RefCross Ref
  31. B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins. 2003. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. IEE Proceedings - Computers and Digital Techniques 150, 5 (Sept 2003), 255--61--.Google ScholarGoogle ScholarCross RefCross Ref
  32. Martha Mercaldi, Steven Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, and Susan J. Eggers. 2006. Instruction scheduling for a tiled dataflow architecture. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (ASPLOS XII). 141--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Ramadass Nagarajan, Sundeep K. Kushwaha, Doug Burger, Kathryn S. McKinley, Calvin Lin, and Stephen W. Keckler. 2004. Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT '04). 74--84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Chris Nicol. 2017. A Coarse Grain Reconfigurable Array (CGRA) for Statically Scheduled Data Flow Computing. Wave Computing White Paper (2017).Google ScholarGoogle Scholar
  35. Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. 2017. Stream-Dataflow Acceleration. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 416--429. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam, and Greg Wright. 2016. Pushing the limits of accelerator efficiency while retaining programmability. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 27--39.Google ScholarGoogle ScholarCross RefCross Ref
  37. Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2013. A General Constraint-centric Scheduling Framework for Spatial Architectures. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 495--506. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2014. A Scheduling Framework for Spatial Architectures Across Multiple Constraint-Solving Theories. ACM Trans. Program. Lang. Syst. 37, 1, Article 2 (Nov. 2014), 30 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2014. A Scheduling Framework for Spatial Architectures Across Multiple Constraint-Solving Theories. ACM Trans. Program. Lang. Syst. 37, 1, Article 2 (Nov. 2014), 30 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Emre Özer, Sanjeev Banerjia, and Thomas M. Conte. 1998. Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures. In Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture (MICRO 31). 308--315. http://dl.acm.org/citation.cfm?id=290940.291004 Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Jens Palsberg and MpSOC Mayur Naik. 2004. ILP-based Resource-aware Compilation. (2004).Google ScholarGoogle Scholar
  42. Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, and Joel Emer. 2013. Triggered Instructions: A Control Paradigm for Spatially-programmed Architectures. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 142--153. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Hyunchul Park, Kevin Fan, Scott A. Mahlke, Taewook Oh, Heeseok Kim, and Hong-seok Kim. 2008. Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08). 166--176. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Jongse Park, Hardik Sharma, Divya Mahajan, Joon Kyung Kim, Preston Olds, and Hadi Esmaeilzadeh. 2017. Scale-out Acceleration for Machine Learning. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 '17). ACM, New York, NY, USA, 367--381. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Yongjun Park, Hyunchul Park, and Scott Mahlke. 2009. CGRA Express: Accelerating Execution Using Dynamic Operation Fusion. In Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '09). ACM, New York, NY, USA, 271--280. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Yongjun Park, Jason Jong Kyu Park, Hyunchul Park, and Scott Mahlke. 2012. Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 84--95. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Phitchaya Mangpo Phothilimthana, Tikhon Jelvis, Rohin Shah, Nishant Totla, Sarah Chasins, and Rastislav Bodik. 2014. Chlorophyll: Synthesisaided Compiler for Low-power Spatial Architectures. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 396--407. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A Reconfigurable Architecture For Parallel Paterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 389--402. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. Machsuite: Benchmarks for accelerator design and customized architectures. In Workload Characterization (IISWC), 2014 IEEE International Symposium on. IEEE, 110--119.Google ScholarGoogle ScholarCross RefCross Ref
  50. Roddy Urquhart and Will Moore and Andrew McCabe. 1987. Systolic Arrays. Institute of Physics Publishing.Google ScholarGoogle Scholar
  51. Karthikeyan Sankaralingam, Ramadass Nagarajan, Robert McDonald, Rajagopalan Desikan, Saurabh Drolia, M.S. Govindan, Paul Gratz, Divya Gulati, Heather Hanson, Changkyu Kim, Haiming Liu, Nitya Ranganathan, Simha Sethumadhavan, Sadia Sharif, Premkishore Shivakumar, Stephen W. Keckler, and Doug Burger. 2006. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor. In MICRO. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Nadathur Satish, Kaushik Ravindran, and Kurt Keutzer. 2007. A decomposition-based constraint optimization approach for statically scheduling task graphs with communication delays to multiprocessors. In DATE '07. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. 2003. WaveScalar. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, Washington, DC, USA, 291--. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. 2002. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro 22, 2 (March 2002), 25--35. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. J. J. Tithi, N. C. Crago, and J. S. Emer. 2014. Exploiting spatial architectures for edit distance algorithms. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 23--34.Google ScholarGoogle Scholar
  56. Dani Voitsechov and Yoav Etsion. 2014. Single-graph Multiple Flows: Energy Efficient Design Alternative for GPGPUs. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 205--216. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017 (DAC '17). ACM, New York, NY, USA, Article 29, 6 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Conferences
    PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
    November 2018
    494 pages
    ISBN:9781450359863
    DOI:10.1145/3243176

    Copyright © 2018 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 1 November 2018

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article

    Acceptance Rates

    Overall Acceptance Rate121of471submissions,26%

    Upcoming Conference

    PACT '24
    International Conference on Parallel Architectures and Compilation Techniques
    October 14 - 16, 2024
    Southern California , CA , USA

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader