Abstract
This article presents the first optimal algorithm for trace scheduling. The trace is a global scheduling region used by compilers to exploit instruction-level parallelism across basic block boundaries. Several heuristic techniques have been proposed for trace scheduling, but the precision of these techniques has not been studied relative to optimality. This article describes a technique for finding provably optimal trace schedules, where optimality is defined in terms of a weighted sum of schedule lengths across all code paths in a trace. The optimal algorithm uses branch-and-bound enumeration to efficiently explore the entire solution space. Experimental evaluation of the algorithm shows that, with a time limit of 1 s per problem, 91% of the hard trace scheduling problems in the SPEC CPU 2006 Integer Benchmarks are solved optimally. For 58% of these hard problems, the optimal schedule is improved compared to that produced by a heuristic scheduler with a geometric mean improvement of 3.2% in weighted schedule length and 18% in compensation code size.
- Bernstein, D. and Rodeh, M. 1991. Global scheduling for super-scalar machines. In Proceedings of the Conference on Programming Language Design and Implementation. ACM, New York. Google ScholarDigital Library
- Bharadwaj, J., Menezes, K., and Mckinsey, C. 2000. Wavefront scheduling: path based data representation and scheduling of subgraphs. J. Instruc.-Level Paral., 2, 7.Google Scholar
- Chekuri, C., Johnson, R., Motwani, R., Natarajan, B., Rau, B., and Schlansker, M. 1996. Profile-driven instruction-level parallel scheduling with applications to superblocks. In Proceedings of the 29th International Symposium on Microarchitecture. IEEE, Los Alamitos, CA. Google ScholarDigital Library
- Chou, H. and Chung, C. 1995. An optimal instruction scheduler for superscalar processor. IEEE Trans. Parall. Distrib. Syst. 6, 303--313. Google ScholarDigital Library
- Fisher, J. 1981. Trace Scheduling: a technique for global micro-code compaction. IEEE Trans. Comput. 30, 478--490. Google ScholarDigital Library
- Freudenberger, S., Gross, T., and Lowney, P. 1994. Avoidance and suppression of compensation code in a trace scheduling compiler. ACM Trans. Program. Lang. Syst. 16, 1156--1214. Google ScholarDigital Library
- Havanki, W., Banerjia, S., and Conte, T., 1998. Treegion scheduling for wide-issue processors. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA-4), IEEE, Los Alamitos, CA. Google ScholarDigital Library
- Hennessy, J. and Gross, T. 1983. Post-pass code optimization of pipeline constraints. ACM Trans. Program. Lang. Syst. 5, 422--448. Google ScholarDigital Library
- Hwu, W., Mahlke, S., Chen, W., Chang, P., Warter, N., Bringmann, R., Ouellette, R., Hank, R., Kiyohara, T., Haab, G., Holm J., and Lavery, D. 1993. The superblock: an effective technique for VLIW and superscalar compilation. J. Supercomput. 7, 229--248. Google ScholarDigital Library
- Lah, J. and Atkin, D. 1983. Tree compaction of micro-programs. In Proceedings of the 16th Annual Workshop on Micro-Programming. ACM, New York.Google Scholar
- Langevin, M. and Cerny, E. 1996. A recursive technique for computing lower-bound performance of schedules. ACM Trans. Des. Autom. Electron. Syst. 1, 443--456. Google ScholarDigital Library
- Linn, J. 1983. SRDAG Compaction—a generalization of trace scheduling to increase the use of global context information. In Proceedings of the 16th Annual Workshop on Micro-Programming. ACM, New York.Google Scholar
- Mahlke, S., Lin, D., Chen, W., Hank, R., and Bringmann, R. 1992. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture. IEEE, Los Alamitos, CA. Google ScholarDigital Library
- Muchnick, S. 1997. Advanced Compiler Design and Implementation. Morgan Kaufmann, San Francisco, CA. Google ScholarDigital Library
- Narasimhan, M. and Ramanujam, J. 2001. A fast approach to computing exact solutions to the resource-constrained scheduling problem. ACM Trans. Des. Autom. Electron. Syst. 6, 490--500. Google ScholarDigital Library
- Rim, M. and Jain, R. 1994. Lower-bound performance estimation for the high-level synthesis scheduling problem. IEEE Trans. Comput. Aid. Des. Integr. Circ. Syst. 13, 451--458.Google ScholarDigital Library
- Shobaki, G. and Wilken, K. 2004. Optimal superblock scheduling using enumeraion. In Proceedings of the 37th International Symposium on Microarchitecture. IEEE, Los Alamitos, CA. Google ScholarDigital Library
- Shobaki, G. 2006. Optimal global instruction scheduling using enumeration. PhD dissertation, Dept. of Computer Science, University of California, Davis. Google ScholarDigital Library
- Smith, M., Horowitz, M., and Lam, M. 1992. Efficient superscalar performance through boosting. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York. Google ScholarDigital Library
- Su, B., Ding, S., and Jin, L. 1984. An improvement of trace scheduling for global microcode compaction. In Proceedings of the 17th Annual Workshop on Microprogramming. ACM, New York. Google ScholarDigital Library
- Sun Microsystems, 2004. UltraSPARC Processors Documentation, www.sun.com/processors/documentation.html.Google Scholar
- Wilken, K., Liu, J., and Heffernan, M. 2000. Optimal instruction scheduling using integer programming. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York. Google ScholarDigital Library
- Winkel S. 2007. Optimal versus heuristic global code scheduling. In Proceedings of the 40th International Symposium on Microarchitecture. IEEE, Los Alamitos, CA. Google ScholarDigital Library
- Wolsey, L. 1998. Integer Programming. John Wiley and Sons, New York.Google Scholar
Index Terms
- Optimal trace scheduling using enumeration
Recommendations
Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach
Balancing Instruction-Level Parallelism (ILP) and register pressure during preallocation instruction scheduling is a fundamentally important problem in code generation and optimization. The problem is known to be NP-complete. Many heuristic techniques ...
Exploring an Alternative Cost Function for Combinatorial Register-Pressure-Aware Instruction Scheduling
Multiple combinatorial algorithms have been proposed for doing pre-allocation instruction scheduling with the objective of minimizing register pressure or balancing register pressure and instruction-level parallelism. The cost function that is minimized ...
Warp-aware trace scheduling for GPUs
PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilationGPU performance depends not only on thread/warp level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks, it is also necessary to exploit opportunities for ILP optimization ...
Comments