ABSTRACT
Evaluation techniques in microprocessor design are mostly based on simulating selected application samples with a cycle-accurate simulator. To achieve accurate results, microarchitectural structures are warmed up for a few million instructions before statistics collection begins. Unfortunately, this strategy cannot be applied to HW/SW co-designed processors, in which a Transparent Optimization software Layer (TOL) translates and optimizes code on the fly from a guest ISA to an internal custom host microarchitecture. We show that the warm-up period in this case needs to be 3-4 orders of magnitude longer than for traditional microprocessor designs, because the TOL state needs to be warmed up as well.
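To give a sense of scale, the arithmetic behind the "3-4 orders of magnitude" claim can be sketched as follows. The baseline of 2 million instructions is only an assumed stand-in for "a few million", and the helper name `required_warmup` is hypothetical.

```python
def required_warmup(baseline_instructions, orders_of_magnitude):
    """Scale a baseline warm-up length by a power of ten."""
    return baseline_instructions * 10 ** orders_of_magnitude


# Assumed value for "a few million instructions"; not a figure from the paper.
BASELINE = 2_000_000

low = required_warmup(BASELINE, 3)   # 3 orders of magnitude longer
high = required_warmup(BASELINE, 4)  # 4 orders of magnitude longer
print(f"{low:,} to {high:,} instructions")
```

Under this assumption, the warm-up grows from millions to billions or tens of billions of instructions per sample.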
In this paper, we propose a novel simulation technique for HW/SW co-designed processors that adapts the optimization promotion thresholds using high-level application statistics in order to find the best trade-off between accuracy and simulation cost. In particular, the proposed technique reduces the simulation cost by 65x with an average error of just 0.75%. Furthermore, unlike other alternatives, the proposed technique satisfies the additional requirement of allowing evaluation with different TOL and microarchitectural configurations.
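The idea of adapting promotion thresholds can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-mode model (interpreted, translated, optimized), the threshold values, the per-block execution counts, and the uniform scaling factor are hypothetical. The paper derives its adaptation from high-level application statistics, which this toy model does not attempt to reproduce.

```python
def tol_mode(exec_count, thresholds):
    """Mode a code block would be in after exec_count executions."""
    translate_at, optimize_at = thresholds
    if exec_count >= optimize_at:
        return "optimized"
    if exec_count >= translate_at:
        return "translated"
    return "interpreted"


def scale_thresholds(thresholds, factor):
    """Divide each promotion threshold by factor, never going below 1."""
    return tuple(max(1, t // factor) for t in thresholds)


def modes(exec_counts, thresholds):
    """Mode of every block given its execution count."""
    return [tol_mode(c, thresholds) for c in exec_counts]


# Hypothetical promotion thresholds (execution counts per block).
FULL_THRESHOLDS = (500, 10_000)

# Hypothetical per-block execution counts after a long warm-up.
full_counts = [3_000, 40_000, 900_000, 2_000_000]

# A warm-up that is 100x shorter sees roughly 100x fewer executions per block.
FACTOR = 100
short_counts = [c // FACTOR for c in full_counts]

reference = modes(full_counts, FULL_THRESHOLDS)   # long warm-up
naive = modes(short_counts, FULL_THRESHOLDS)      # short warm-up, same thresholds
adapted = modes(short_counts, scale_thresholds(FULL_THRESHOLDS, FACTOR))

print("reference:", reference)
print("naive:    ", naive)
print("adapted:  ", adapted)
```

In this toy model, the short warm-up with unchanged thresholds leaves blocks in less optimized modes than the reference, while the scaled thresholds reproduce the reference TOL state at a fraction of the warm-up length.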
Warm-Up Simulation Methodology for HW/SW Co-Designed Processors
CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization