ABSTRACT
How to develop efficient and scalable parallel applications is the key challenge for emerging many-core architectures. We investigate this question by implementing and comparing two parallel H.264 decoders on the Cell architecture. Future many-cores are expected to use a Cell-like local store memory hierarchy rather than a non-scalable shared memory. The two implemented parallel algorithms, the Task Pool (TP) and the novel Ring-Line (RL) approach, both exploit macroblock-level parallelism. The TP implementation follows the master-slave paradigm and is highly dynamic, so that in theory perfect load balancing can be achieved. The RL approach is distributed and more predictable in the sense that the mapping of macroblocks to processing elements is fixed. This fixed mapping makes it possible to exploit data locality better, to overlap communication with computation, and to reduce communication and synchronization overhead. While TP is more scalable in theory, RL scales better in practice. Using 16 Synergistic Processing Elements (SPEs), RL obtains a scalability of 12x, while TP achieves only 10.3x. More importantly, the absolute performance of RL is much higher: with 16 SPEs, RL achieves a throughput of 139.6 frames per second (fps), while TP achieves only 76.6 fps. A large part of this additional performance advantage is due to hiding the memory latency. From these results we conclude that, to fully leverage the performance of future many-cores, a centralized master should be avoided and the mapping of tasks to cores should be predictable, so that the memory latency can be hidden.
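The figures quoted above can be put side by side with a little arithmetic. The sketch below is illustrative only and not part of the paper: it assumes that "scalability" means speedup relative to a one-SPE run of the same implementation, and the function and variable names are hypothetical. Under that assumption, the 1.82x throughput gap between RL and TP splits into a scaling factor of about 1.17x and an implied per-SPE factor of about 1.56x.

```python
# Figures quoted in the abstract for 16 SPEs.
SPES = 16
RESULTS = {
    "RL": {"scalability": 12.0, "fps": 139.6},
    "TP": {"scalability": 10.3, "fps": 76.6},
}


def parallel_efficiency(scalability, cores):
    """Fraction of ideal linear scaling that is actually achieved."""
    return scalability / cores


def implied_single_spe_fps(fps, scalability):
    """Throughput implied for one SPE.

    Assumption (not stated in the abstract): scalability is the speedup
    over a one-SPE run of the same implementation.
    """
    return fps / scalability


def summarize():
    """Derive efficiency and ratio figures from the quoted numbers."""
    summary = {}
    for name, r in RESULTS.items():
        summary[name] = {
            "efficiency": parallel_efficiency(r["scalability"], SPES),
            "single_spe_fps": implied_single_spe_fps(r["fps"], r["scalability"]),
        }
    rl, tp = RESULTS["RL"], RESULTS["TP"]
    summary["throughput_ratio"] = rl["fps"] / tp["fps"]
    summary["scalability_ratio"] = rl["scalability"] / tp["scalability"]
    summary["single_spe_ratio"] = (
        summary["RL"]["single_spe_fps"] / summary["TP"]["single_spe_fps"]
    )
    return summary


if __name__ == "__main__":
    s = summarize()
    for name in ("RL", "TP"):
        print(f"{name}: efficiency {s[name]['efficiency']:.1%}, "
              f"implied 1-SPE throughput {s[name]['single_spe_fps']:.2f} fps")
    print(f"RL/TP throughput ratio:  {s['throughput_ratio']:.2f}")
    print(f"RL/TP scalability ratio: {s['scalability_ratio']:.2f}")
    print(f"RL/TP implied 1-SPE ratio: {s['single_spe_ratio']:.2f}")
```

Read this way, most of RL's advantage comes from higher per-SPE throughput rather than from better scaling alone, which is consistent with the abstract's remark that a large part of the advantage is due to hiding the memory latency.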
Index Terms
- Evaluation of parallel H.264 decoding strategies for the Cell Broadband Engine