Abstract
This paper introduces three new compiler transformations for representing and transforming sparse matrix computations and their data representations. In cooperation with run-time inspection, our compiler derives transformed matrix representations and the associated transformed code, implementing a variety of representations that target different architectural platforms. This systematic approach to combining code and data transformations on sparse computations, which extends a polyhedral transformation and code generation framework, permits the compiler to compose these transformations with others and generate code whose performance is on average within 5% of, and often exceeds, that of the manually tuned high-performance sparse matrix libraries CUSP and OSKI. Additionally, the compiler-generated inspector codes are on average 1.5× faster than OSKI's and perform comparably to CUSP's.
Index Terms
- Loop and data transformations for sparse matrix code
Published in PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation.