Versatility of extended subwords and the matrix register file
Source
ACM Transactions on Architecture and Code Optimization (TACO) archive
Volume 5, Issue 1 (May 2008)
Article No. 5  
Year of Publication: 2008
ISSN:1544-3566
Authors
Asadollah Shahbahrami  Delft University of Technology, The Netherlands
Ben Juurlink  Delft University of Technology, The Netherlands
Stamatis Vassiliadis  Delft University of Technology, The Netherlands
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1369396.1369401

ABSTRACT

Extended subwords and the matrix register file (MRF) are two microarchitectural techniques that address some of the limitations of existing SIMD architectures. Extended subwords are wider than the data stored in memory: for every byte of data stored in memory, there are four extra bits in the media register file. This avoids the need for data-type conversion instructions. The MRF is a register file organization that provides both conventional row-wise and column-wise access to the register file. In other words, it allows the register file to be viewed as a matrix in which corresponding subwords in different registers form a column of the matrix. It was introduced to accelerate matrix transposition, which is a very common operation in multimedia applications. In this paper, we show that the MRF is very versatile, since it can also be used for permutations other than matrix transposition. Specifically, we show how it can be used to provide efficient access to strided data, as is needed in, for example, color space conversion. Furthermore, we show that special-purpose instructions (SPIs), such as the sum-of-absolute-differences (SAD) instruction, have limited usefulness when extended subwords and a few general SIMD instructions that we propose are supported, for two reasons. First, when extended subwords are supported, the SAD instruction provides only a relatively small performance improvement. Second, the SAD instruction processes 8-bit subwords only, which is sufficient neither for quarter-pixel resolution nor for the cost functions used in image and video retrieval. Results obtained by extending the SimpleScalar toolset show that the proposed techniques provide a speedup of up to 3.00 over the MMX architecture. The results also show that using at most 13 extra media registers yields an additional performance improvement ranging from 1.38 to 1.57.
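
The two ideas in the abstract can be illustrated with a small software model. The sketch below is only a toy, not the authors' hardware design or instruction set: the class `MatrixRegisterFile`, the helper names, and the choice of 8 registers with 8 subwords each are assumptions made for illustration. It shows (a) how writing rows and reading columns yields a transposition, and (b) how 12-bit subwords (8 data bits plus the 4 extra bits mentioned above) can accumulate absolute differences of bytes without first widening the data.

```python
# Toy software model; all names and sizes here are hypothetical.

EXT_BITS = 12                    # 8 data bits + 4 extra bits per byte (from the abstract)
EXT_MASK = (1 << EXT_BITS) - 1   # largest value a 12-bit subword can hold (4095)


class MatrixRegisterFile:
    """n registers of n subwords, readable by row or by column (assumed n = 8)."""

    def __init__(self, n=8):
        self.n = n
        self.regs = [[0] * n for _ in range(n)]

    def write_row(self, r, values):
        # Conventional access: fill one whole register.
        self.regs[r] = list(values)

    def read_column(self, c):
        # Column-wise access: subword c of every register.
        return [row[c] for row in self.regs]


def transpose_via_mrf(matrix):
    """Transpose a square matrix by writing rows and reading columns."""
    mrf = MatrixRegisterFile(len(matrix))
    for r, row in enumerate(matrix):
        mrf.write_row(r, row)
    return [mrf.read_column(c) for c in range(mrf.n)]


def sad_extended(cur_rows, ref_rows):
    """Sum of absolute differences using 12-bit per-subword accumulators.

    Each |a - b| of two bytes is at most 255, so 16 rows give at most
    16 * 255 = 4080 per subword, which still fits in 12 bits.
    """
    acc = [0] * len(cur_rows[0])
    for cur, ref in zip(cur_rows, ref_rows):
        diff = [abs(a - b) for a, b in zip(cur, ref)]
        acc = [(s + d) & EXT_MASK for s, d in zip(acc, diff)]  # wraps at 12 bits
    return sum(acc)  # final reduction across subwords
```

For example, `transpose_via_mrf([[1, 2], [3, 4]])` gives `[[1, 3], [2, 4]]`. The arithmetic in `sad_extended` suggests why the extra bits matter: with plain 8-bit subwords, a single accumulation step could already overflow, forcing a conversion to a wider type.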

