DOI: 10.1145/2491956.2462181

research-article

SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication

Published: 16 June 2013

ABSTRACT

Sparse matrix-vector multiplication (SpMV) is an important kernel in both traditional high-performance computing and emerging data-intensive applications. To date, SpMV libraries have been optimized through either application-specific or architecture-specific approaches, making them too complicated to be used extensively in real applications. In this work we develop a Sparse Matrix-vector multiplication Auto-Tuning system (SMAT) to bridge the gap between specific optimizations and general-purpose usage. SMAT provides users with a unified programming interface in compressed sparse row (CSR) format and automatically determines the optimal format and implementation for any input sparse matrix at runtime. For this purpose, SMAT leverages a learning model, generated in an off-line stage by a machine-learning method from a training set of more than 2000 matrices in the UF sparse matrix collection, to quickly predict the best combination of matrix feature parameters. Our experiments show that SMAT achieves impressive performance of up to 51 GFLOPS in single precision and 37 GFLOPS in double precision on mainstream x86 multi-core processors, in both cases more than 3 times faster than the Intel MKL library. We also demonstrate its adaptability in an algebraic multigrid solver from the Hypre library, where it delivers above 20% performance improvement.
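The CSR format that the abstract names as SMAT's unified interface can be sketched as follows. This is an illustrative reference kernel, not SMAT's actual API: the function name, argument layout, and the toy feature computation at the end are all hypothetical (the paper's real feature set is not listed in this abstract).

```python
def csr_spmv(n_rows, row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix A stored in compressed sparse row (CSR) form.

    row_ptr[i]..row_ptr[i+1] delimit the nonzeros of row i inside the
    parallel arrays col_idx (column indices) and vals (values).
    """
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


# A 3x3 example matrix:
#     [ 4  0  1 ]
#     [ 0  3  0 ]
#     [ 2  0  5 ]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

print(csr_spmv(3, row_ptr, col_idx, vals, x))  # [7.0, 6.0, 17.0]

# Cheap structural features of the kind a learned format selector might
# inspect (hypothetical here): nonzeros per row and their spread.
nnz_per_row = [row_ptr[i + 1] - row_ptr[i] for i in range(3)]
print(nnz_per_row)  # [2, 1, 2]
```

Because row_ptr, col_idx, and vals are plain contiguous arrays, CSR is a natural "lowest common denominator" input format from which an auto-tuner can convert to more specialized layouts when the matrix structure warrants it.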

References

  1. Intel Math Kernel Library. http://software.intel.com/en-us/intel-mkl.
  2. Data Mining Tools See5 and C5.0. http://www.rulequest.com/see5-info.html.
  3. M. D. Adam Hill. The International Thermonuclear Experimental Reactor. Technical report, 2005.
  4. J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: a language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '09, pages 38--49. ACM, 2009.
  5. W. Armstrong and A. Rendell. Reinforcement learning for automated performance tuning: initial evaluation for sparse matrix format selection. In 2008 IEEE International Conference on Cluster Computing, pages 411--420, 2008.
  6. S. Balay, J. Brown, K. Buschelman, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang. PETSc users manual. Technical Report ANL-95/11 Revision 3.3, Argonne National Laboratory, 2012.
  7. N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.
  8. G. Belter, E. R. Jessup, I. Karlin, and J. G. Siek. Automating the generation of composed linear algebra kernels. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 59:1--59:12. ACM, 2009.
  9. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web, WWW7, pages 107--117. Elsevier Science Publishers B. V., 1998.
  10. A. Buluc, S. Williams, L. Oliker, and J. Demmel. Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS '11, pages 721--733. IEEE Computer Society, 2011.
  11. J. W. Choi, A. Singh, and R. W. Vuduc. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, pages 115--126. ACM, 2010.
  12. T. A. Davis. The University of Florida sparse matrix collection. NA Digest, 92, 1994.
  13. R. Falgout. An introduction to algebraic multigrid computing. Computing in Science & Engineering, 8(6):24--33, Nov.--Dec. 2006.
  14. R. D. Falgout and U. M. Yang. hypre: a library of high performance preconditioners. In Lecture Notes in Computer Science, pages 632--641, 2002.
  15. M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216--231, 2005. Special issue on "Program Generation, Optimization, and Platform Adaptation".
  16. D. Grewe and A. Lokhmotov. Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pages 12:1--12:8. ACM, 2011.
  17. J. Harris. pOSKI: an extensible autotuning framework to perform optimized SpMVs on multicore architectures. 2009.
  18. M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley. An overview of the Trilinos project. ACM Trans. Math. Softw., 31(3):397--423, Sept. 2005.
  19. E.-J. Im, K. Yelick, and R. Vuduc. SPARSITY: optimization framework for sparse matrix kernels. International Journal of High Performance Computing Applications, 18, 2004.
  20. M. F. Khairoutdinov and D. A. Randall. A cloud resolving model as a cloud parameterization in the NCAR Community Climate System Model: preliminary results. Geophys. Res. Lett., 28(18):3617--3620, 2001.
  21. K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris. CSX: an extended compression format for SpMV on shared memory systems. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, pages 247--256. ACM, 2011.
  22. K. Nagar and J. Bakos. A sparse matrix personality for the Convey HC-1. In 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 1--8, May 2011.
  23. CUDA CUSPARSE Library. NVIDIA, 2010.
  24. M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):232--275, 2005.
  25. Y. Saad. SPARSKIT: a basic tool kit for sparse matrix computations. Technical report, 1994.
  26. N. Spirin and J. Han. Survey on web spam detection: principles and algorithms. SIGKDD Explor. Newsl., 13(2):50--64, May 2012.
  27. A. Srinivasa and M. Sosonkina. Nonuniform memory affinity strategy in multithreaded sparse matrix computations. In Proceedings of the 2012 Symposium on High Performance Computing, HPC '12, pages 9:1--9:8, 2012.
  28. J. P. Stevenson, A. Firoozshahian, A. Solomatnikov, M. Horowitz, and D. Cheriton. Sparse matrix-vector multiply on the HICAMP architecture. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12, pages 195--204. ACM, 2012.
  29. B.-Y. Su and K. Keutzer. clSpMV: a cross-platform OpenCL SpMV framework on GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12, pages 353--364. ACM, 2012.
  30. X. Sun, Y. Zhang, T. Wang, G. Long, X. Zhang, and Y. Li. CRSD: application specific auto-tuning of SpMV for diagonal sparse matrices. In Proceedings of the 17th International Conference on Parallel Processing, Part II, Euro-Par '11, pages 316--327, 2011.
  31. R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: a library of automatically tuned sparse matrix kernels. In Proc. SciDAC, J. Physics: Conf. Ser., volume 16, pages 521--530, 2005.
  32. R. W. Vuduc and H.-J. Moon. Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the First International Conference on High Performance Computing and Communications, HPCC '05, pages 807--816. Springer-Verlag, 2005.
  33. R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, Supercomputing '98, pages 1--27. IEEE Computer Society, 1998.
  34. S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC '07, pages 38:1--38:12. ACM, 2007.
  35. S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Comput., 35(3):178--194, Mar. 2009.
  36. X. Yang, S. Parthasarathy, and P. Sadayappan. Fast sparse matrix-vector multiplication on GPUs: implications for graph mining. Proc. VLDB Endow., 4(4):231--242, Jan. 2011.
  37. K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance BLAS? 2005.

Published in

PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2013, 546 pages. ISBN 978-1-4503-2014-6. DOI: 10.1145/2491956.

Also in ACM SIGPLAN Notices, Volume 48, Issue 6 (PLDI '13), June 2013, 515 pages. ISSN 0362-1340, EISSN 1558-1160. DOI: 10.1145/2499370.

        Copyright © 2013 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

PLDI '13 paper acceptance rate: 46 of 267 submissions (17%). Overall acceptance rate: 406 of 2,067 submissions (20%).
