ABSTRACT
Evaluation techniques in microprocessor design are mostly based on simulating selected application samples using a cycle-accurate simulator. These samples usually correspond to different phases of the application stream. To identify these phases, relevant high-level application statistics are collected and clustered in a process named "Off-Line Phase Classification". The purpose of phase classification is to reduce the number of samples that need to be simulated with minimal loss in accuracy (compared to simulating the complete set of samples).
Unfortunately, when directly applied to HW/SW co-designed processors, traditional phase classification does not provide a good trade-off between accuracy and the number of samples. As an example, according to our experimental results, achieving a 4% error (compared to simulating all the samples) requires simulating 2.5X more samples for HW/SW co-designed processors than for HW-only processors.
In this paper, we propose a novel off-line phase classification scheme called TOL Description Vector (TDV), which is suitable for HW/SW co-designed processors. TDV aims to capture the TOL particularities and, on average, gives significantly better accuracy than traditional phase classification for any number of selected samples. For instance, TDV reaches an average error of 3% with 3X fewer samples than traditional classification. These benefits hold across different TOL and microarchitecture configurations.
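The generic off-line flow described above (collect one statistics vector per sample, cluster the vectors, simulate one representative per cluster, and weight the results) can be sketched as follows. This is only an illustration of traditional phase classification, not of TDV itself: the abstract does not name a clustering algorithm, so plain k-means is an assumption here, and all function names (`kmeans`, `pick_representatives`, `estimate_metric`) and the `simulate` callback are hypothetical.

```python
import math
import random


def kmeans(vectors, k, iters=50, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm).

    Returns (labels, centroids), where labels[i] is the cluster of vectors[i].
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each sample to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def pick_representatives(vectors, labels, centroids):
    """Pick, per non-empty cluster, the sample closest to the centroid.

    Returns a list of (sample_index, weight), where weight is the fraction
    of all samples that fall into that cluster.
    """
    reps = []
    for c, centroid in enumerate(centroids):
        members = [i for i, l in enumerate(labels) if l == c]
        if not members:
            continue
        best = min(members, key=lambda i: math.dist(vectors[i], centroid))
        reps.append((best, len(members) / len(vectors)))
    return reps


def estimate_metric(reps, simulate):
    """Weighted estimate of a whole-run metric (e.g. CPI).

    `simulate` is a hypothetical callback standing in for a cycle-accurate
    simulation of one sample; only the representatives are simulated.
    """
    return sum(weight * simulate(idx) for idx, weight in reps)
```

The error figures quoted in the abstract correspond to comparing such a weighted estimate against the value obtained by simulating every sample; using fewer clusters means fewer simulated samples but typically a larger error.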
Index Terms
- Accurate off-line phase classification for HW/SW co-designed processors