ABSTRACT
We present an application-driven customization methodology for energy-efficient inter-core communication in embedded multiprocessors. The methodology leverages configurable cache architectures and integrates software and hardware support to achieve energy-efficient data sharing between producer and consumer tasks. The technique is especially beneficial for data-streaming applications that exploit pipeline parallelism, where computational phases are mapped to separate processor cores. The application-driven data cache partitioning achieves low-power and low-latency (no coherence misses) inter-core data sharing. The basic premise of the proposed technique is to use cache partitioning to separate the private data from the several shared data buffers used by each producer/consumer task. Such partitioning yields the following benefits: 1) data cache accesses caused by the processor and the coherence mechanism need to access only one cache partition instead of the entire cache structure, resulting in significant power reductions; 2) interference (caused by both processor and coherence activities) between private data and the several shared data buffers is eliminated, which in turn enables the efficient implementation of application-driven remote cache updates at synchronization boundaries.
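The first benefit can be illustrated with a small sketch. It is not the authors' implementation: it assumes a hypothetical way-partitioned cache in which each address region (private data, or one shared buffer) is assigned a fixed number of ways, and it counts the ways probed per access as a crude stand-in for access energy. The names `Partition`, `find_partition`, `ways_probed`, `LAYOUT`, and `TRACE`, as well as the address ranges and way counts, are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One cache partition and the address region mapped to it (hypothetical)."""
    name: str
    base: int   # first byte address mapped to this partition
    size: int   # length of the mapped address region in bytes
    ways: int   # number of cache ways assigned to this partition


def find_partition(partitions, addr):
    """Return the partition whose address region contains addr."""
    for p in partitions:
        if p.base <= addr < p.base + p.size:
            return p
    raise ValueError(f"address {addr:#x} is not mapped to any partition")


def ways_probed(partitions, trace, partitioned):
    """Count the ways probed over a trace of processor or snoop accesses.

    With partitioning, an access probes only the ways of its own partition;
    without it, every access probes all ways of the cache.
    """
    total_ways = sum(p.ways for p in partitions)
    probes = 0
    for addr in trace:
        if partitioned:
            probes += find_partition(partitions, addr).ways
        else:
            probes += total_ways
    return probes


# Assumed layout: a 4-way cache split into private data and two shared buffers.
LAYOUT = [
    Partition("private", 0x0000, 0x8000, 2),
    Partition("buf_in", 0x8000, 0x1000, 1),
    Partition("buf_out", 0x9000, 0x1000, 1),
]

# Assumed access trace: three private accesses and three shared-buffer accesses.
TRACE = [0x0010, 0x0020, 0x8000, 0x8004, 0x9000, 0x0030]

if __name__ == "__main__":
    print("unpartitioned way probes:", ways_probed(LAYOUT, TRACE, partitioned=False))
    print("partitioned way probes:  ", ways_probed(LAYOUT, TRACE, partitioned=True))
```

For this made-up trace the unpartitioned cache probes 24 ways and the partitioned one probes 9. The ratio depends entirely on the assumed layout and trace, and a real energy estimate would need a proper cache energy model rather than a probe count.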
Index Terms
Low-power inter-core communication through cache partitioning in embedded multiprocessors