ABSTRACT
Extracting performance from modern parallel architectures requires that applications be divided into many different threads of execution. Unfortunately selecting the appropriate number of threads for an application is a daunting task. Having too many threads can quickly saturate shared resources, such as cache capacity or memory bandwidth, thus degrading performance. On the other hand, having too few threads makes inefficient use of the resources available. Beyond static resource assignment, the program inputs and dynamic system state (e.g., what other applications are executing in the system) can have a significant impact on the right number of threads to use for a particular application.
To address this problem we present the Thread Tailor, a dynamic system that automatically adjusts the number of threads in an application to optimize system efficiency. The Thread Tailor leverages offline analysis to estimate what type of threads will exist at runtime and the communication patterns between them. Using this information Thread Tailor dynamically combines threads to better suit the needs of the target system. Thread Tailor adjusts not only to the architecture, but also other applications in the system, and this paper demonstrates that this type of adjustment can lead to significantly better use of thread-level parallelism in real-world architectures.
- L. O. Andersen. Program analysis and specialization for the c programming language. Technical report, University of Copenhagen, 1994.Google Scholar
- C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, Oct. 2008. Google ScholarDigital Library
- M. Chu, R. Ravindran, and S. Mahlke. Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures. In Proc. of the 40th Annual International Symposium on Microarchitecture, pages 369--378, 2007. Google ScholarDigital Library
- J. Corbalan, X. Martorell, and J. Labarta. Performance-driven processor allocation. In Proc. of the 2000 International Symposium on on Operating Systems Design and Implementation, pages 255--266, 2008. Google ScholarDigital Library
- M. Curtis-Maury, J. Dzierwa, C. D. Antonopoulos, and D. S. Nikolopoulos. Online power-performance adaptation of multithreaded programs using hardware event-based prediction. In Proc. of the 2006 International Conference on Supercomputing, pages 157--166, 2006. Google ScholarDigital Library
- L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. IEEE Computer Science and Engineering, 5(1):46--55, 1998. Google ScholarDigital Library
- Y. Ding, M. Kandemir, P. Raghavan, and M. Irwin. A helper thread based edp reduction scheme for adapting application execution in cmps. In Proc. of the 2008 IEEE Symposium on on Parallel and Distributed Processing, pages 1--14, 2008.Google ScholarCross Ref
- C. Fiduccia and R. Mattheyses. A linear time heuristic for improving network partitions. In Proc. of the 19th Design Automation Conference, pages 175--181, 1982. Google ScholarDigital Library
- M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979. Google ScholarDigital Library
- A. Ghuloum. Ct: C for throughput computing (whitepaper), 2009. http://techresearch.intel.com/articles/Tera-Scale/1514.htm.Google Scholar
- M. Hirzel, D. von Dincklage, A. Diwan, and M. Hind. Fast online pointer analysis. ACM Transactions on Programming Languages and Systems, 29(2):11, Apr. 2007. Google ScholarDigital Library
- C. Jung, D. Lim, J. Lee, and S. Han. Adaptive execution techniques for SMT multiprocessor architectures. In Proc. of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 236--246, 2005. Google ScholarDigital Library
- B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291--207, Feb. 1970.Google ScholarCross Ref
- M. Kudlur and S. Mahlke. Orchestrating the execution of stream programs on multicore platforms. In Proc. of the SIGPLAN '08 Conference on Programming Language Design and Implementation, pages 114--124, June 2008. Google ScholarDigital Library
- M. Kulkarni et al. Optimistic Parallelism Requires Abstractions. In Proc. of the SIGPLAN '07 Conference on Programming Language Design and Implementation, pages 211--222, June 2007. Google ScholarDigital Library
- R. Kumar, G. Agrawal, and G. Gao. Compiling several classes of communication patterns on a multithreaded architecture. In Proc. of the 2002 IEEE Symposium on on Parallel and Distributed Processing, pages 18--23, 2002. Google ScholarDigital Library
- R. Kumar et al. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In Proc. of the 31st Annual International Symposium on Computer Architecture, 2004. Google ScholarDigital Library
- R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In Proc. of the 36th Annual International Symposium on Microarchitecture, pages 81--92, Dec. 2003. Google ScholarDigital Library
- N. Lakshiminarayana, S. Rao, and H. Kim. Asymmetricity Aware Scheduling Algorithms for Asymmetric Processors. In Workshop on the Interaction between Operating Systems and Computer Architecture, 2009.Google Scholar
- C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. In Proc. of the 2004 International Symposium on Code Generation and Optimization, pages 75--86, 2004. Google ScholarDigital Library
- J. Li and J. Martinez. Dynamic power-performance adaptation of parallel computation on chip multiprocessors. In Proc. of the 12th International Symposium on High-Performance Computer Architecture, pages 77--87, 2006.Google Scholar
- U. Nawathe et al. An 8-core, 64-thread, 64-bit, power efficient SPARC SoC (Niagara2), Feb. 2007. In Proc. of ISSCC.Google Scholar
- J. Nieplocha et al. Evaluating the potential of multithreaded platforms for irregular scientific computations. In Proc. of the 2007 ACM Conference on Computing Frontiers, pages 47--58, 2007. Google ScholarDigital Library
- Nvidia. CUDA Programming Guide, June 2007. http://developer.download.nvidia.com/compute/cuda.Google Scholar
- K. Olukotun et al. The case for a single chip multiprocessor. In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 2--11, 1996. Google ScholarDigital Library
- M. Qureshi and Y. Patt. Partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proc. of the 39th Annual International Symposium on Microarchitecture, pages 423--432, 2006. Google ScholarDigital Library
- R. Ravindran, R. Senger, E. Marsman, G. Dasika, M. Guthaus, S. Mahlke, and R. Brown. Increasing the Number of Effective Registers in a Low-Power Processor Using a Windowed Register File. In Proc. of the 2003 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 125--136, 2003. Google ScholarDigital Library
- H. Schwartz, D. Marpe, and T. Wiegand. Overview of the scalable video coding extension of the h.264/avc standard. IEEE Transactions on Circuits and Systems for Video Technology, 17(9):1103--1120, Sept. 2007. Google ScholarDigital Library
- M. A. Suleman, M. Qureshi, and Y. Patt. Feedback Driven Threading: Power-Efficient and High-Performance Execution of Multithreaded Workloads on CMPs. In 16th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 277--286, 2008. Google ScholarDigital Library
- P. Wang et al. EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system. In Proc. of the SIGPLAN '07 Conference on Programming Language Design and Implementation, pages 156--166, 2007. Google ScholarDigital Library
- S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd annual international symposium on Computer architecture, pages 24--36, 1995. Google ScholarDigital Library
- Q. Wu et al. A dynamic compilation framework for controlling microprocessor energy and performance. In Proc. of the 38th Annual International Symposium on Microarchitecture, pages 271--282, 2005. Google ScholarDigital Library
- Y. Xie and G. H. Loh. PIPP: Promotion/Insertion Pseudo-Partitioning of Multi-Core Shared Caches. In Proc. of the 36th Annual International Symposium on Computer Architecture, pages 174--183, 2009. Google ScholarDigital Library
Index Terms
Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications
Recommendations
Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications
ISCA '10Extracting performance from modern parallel architectures requires that applications be divided into many different threads of execution. Unfortunately selecting the appropriate number of threads for an application is a daunting task. Having too many ...
SOS: saving time in dynamic race detection with stationary analysis
OOPSLA '11Data races are subtle and difficult to detect errors that arise during concurrent program execution. Traditional testing techniques fail to find these errors, but recent research has shown that targeted dynamic analysis techniques can be developed to ...
SOS: saving time in dynamic race detection with stationary analysis
OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applicationsData races are subtle and difficult to detect errors that arise during concurrent program execution. Traditional testing techniques fail to find these errors, but recent research has shown that targeted dynamic analysis techniques can be developed to ...
Comments