ACM Home Page
Please provide us with feedback. Feedback
Analyzing block locality in Morton-order and Morton-hybrid matrices
Full text PdfPdf (516 KB)
Source
ACM SIGARCH Computer Architecture News archive
Volume 35 ,  Issue 4  (September 2007) table of contents
SPECIAL ISSUE: Medea 2006 workshop table of contents
Pages 6-12  
Year of Publication: 2007
ISSN:0163-5964
Authors
K. Patrick Lorton  Schrodinger, New York, NY
David S. Wise  Indiana University, Bloomington, IN
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 1,   Downloads (12 Months): 66,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1327312.1327315
What is a DOI?

ABSTRACT

As the architectures of computers change, introducing more caches onto multicore chips, even more locality becomes necessary. With the bandwidth between caches and RAM now even more valuable, additional locality from new matrix representations will be important to keep multiple processors busy. The default storage representations of both C and Fortran, row- and column-major respectively, have fundamental deficiencies with many matrix computations. By switching the storage representation from cartesian to block indices, one is able to take better advantage of cache locality at all levels from L1 to paging. This paper only changes storage representation from row-major to Morton-hybrid, and applies it to matrix multiplication. Its purpose is to show that, even with only traditional iterative algorithms, simply changing storage representation offers significant speedups.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
Bader, M., and Zenger, C. Cache oblivious matrix multiplication using an element ordering based on the Peano curve. In Parallel Processing and Applied Mathematics (Berlin, 2006), vol. 3911 of Lecture Notes in Comput. Sci., Springer, pp. 1042--1049. http://dx.doi.org/10.1007/11752578_126
 
4
5
 
6
Fox, G. C. A graphical approach to load balancing and sparse matrix-vector multiplication. In Numerical Algorithms for Modern Parallel Architectures, M. Schultz, Ed., vol. 13 of IMA Vol. in Math. & Appl. Springer, New York, 1988, pp. 37--61.
7
 
8
 
9
Gabriel, S. T., Chenoweth, B., Lorton, K. P., Carlson, M., and Wise, D. S. The Opie Compiler Distribution. Indiana University, Bloomington, IN, Apr. 2005. http://www/cs.indiana.edu/~dswise/Opie/distribution.html
10
11
 
12
 
13
Goto, K., and van de Geijn, R. On reducing TLB misses in matrix multiplication. FLAME Working Note 9, Univ. of Texas, Austin, Nov. 2002. http://www.cs.utexas.edu/users/flame/pubs/GOTO.ps.gz
 
14
Goto, K., and van de Geijn, R. A. Anatomy of high-performance matrix multiplication. Tech. rep., Univ. of Texas, Austin. Submittted for publication. Visited Sept. 2006. http://www.cs.utexas.edu/users/flame/pubs/GOTO_TOMS.pdf
 
15
Innovative Computing Laboratory, Univ. of Tennessee. Performance Application Programming Interface (PAPI). Knoxville, TN, Dec. 2005. http://icl.cs.utk.edu/papi/
 
16
Johnson, D. S. A theoretician's guide to the experimental analysis of algorithms. In Data Structures, Near Neighbor Searches, and Methodology: 5th & 6th DIMACS Implementation Challenges, M. H. Goldwasser, D. S. Johnson, and C. C. McGeoch, Eds., vol. 59 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci. Amer. Math. Soc, Providence, 2002, pp. 215--250. http://www.research.att.com/~dsj/papers.html
 
17
 
18
Markoff, J. Writing the fastest code, by hand, for fun: A human computer keeps speeding up chips. The New York Times CLV, 53, 412 (2005 Nov. 28), C1, C6. http://www.nytimes.com/2005/11/28/technology/28super.html
 
19
Morton, G. M. A computer oriented geodetic data base and a new technique in file sequencing. Tech. rep., IBM Ltd., Ottawa, Ontario, Mar. 1966.
 
20
 
21
 
22
 
23
 
24
Valsalam, V., and Skjellum, A. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concur. Comp. Prac. Exper. 14, 10 (2002), 805--839. http://dx.doi.org/10.1002/cpe.630
 
25
 
26
Wise, D. S., Citro, C. L., Hursey, J. J., Liu, F., and Rainey, M. A. A paradigm for parallel matrix algorithms: Scalable Cholesky. In Euro-Par 2005 --- Parallel Processing, J. C. Cunha and P. D. Medeiros, Eds., no. 3648 in Lecture Notes in Comput. Sci. Springer, Berlin, Aug. 2005, pp. 687--698. http://dx.doi.org/10.1007/11549468_76
27

Collaborative Colleagues:
K. Patrick Lorton: colleagues
David S. Wise: colleagues