ABSTRACT
Existing GPU simulators are too slow to evaluate neural networks running on GPUs. For fast performance estimation, we propose a novel hybrid method that combines analytical performance modeling with sampled simulation of GPUs. By exploiting the repeated computation patterns of neural networks, we devise three sampling techniques: inter-kernel sampling, intra-kernel sampling, and streaming multiprocessor (SM) sampling. The key technique is to estimate the average IPC through sampled simulation while accounting for the effects of the warp scheduler and memory access contention. Compared with GPGPU-Sim, the proposed technique reduces simulation time by up to 450 times with less than 5.0% accuracy loss.
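The inter-kernel sampling idea from the abstract can be illustrated with a minimal sketch: neural-network workloads launch the same GPU kernels many times, so a slow detailed simulation is run only once per unique kernel and its measured IPC is reused analytically for every repeated launch. All names below (`Kernel`, `detailed_sim_ipc`, the example launch trace) are hypothetical illustrations, not NNsim's actual interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Kernel:
    name: str          # kernel identifier
    instructions: int   # dynamic instruction count per launch

def detailed_sim_ipc(kernel: Kernel) -> float:
    """Stand-in for a slow cycle-level simulation (e.g. GPGPU-Sim)
    that would return the kernel's average IPC."""
    # Placeholder model only; a real flow would invoke the simulator.
    return 1.0 + (len(kernel.name) % 3) * 0.5

def estimate_cycles(trace: list[Kernel]) -> tuple[float, int]:
    """Estimate total cycles for a launch trace, running the detailed
    simulation only once per unique kernel (inter-kernel sampling)."""
    ipc_cache: dict[Kernel, float] = {}
    total_cycles = 0.0
    for k in trace:
        if k not in ipc_cache:              # simulate only on first sight
            ipc_cache[k] = detailed_sim_ipc(k)
        total_cycles += k.instructions / ipc_cache[k]
    return total_cycles, len(ipc_cache)     # (cycles, #detailed simulations)

# A ResNet-like trace: conv/relu kernels repeated across many layers.
conv = Kernel("conv2d", 1_000_000)
relu = Kernel("relu", 100_000)
trace = [conv, relu] * 50                   # 100 launches, 2 unique kernels
cycles, sims = estimate_cycles(trace)
print(sims)  # only 2 detailed simulations instead of 100
```

Intra-kernel and SM sampling would shrink the cost of each `detailed_sim_ipc` call further, by simulating only a slice of a kernel's execution or a subset of the streaming multiprocessors.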
REFERENCES
- Greg Diamos. 2016. Baidu Releases AI Benchmark. EE Times (September 2016). https://www.eetimes.com/document.asp?doc_id=1330521 [Online; posted 26-09-2016].
- Bakhoda et al. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS. 163--174.
- Farooqui et al. 2011. A framework for dynamically instrumenting GPU compute applications within GPU Ocelot. In GPGPU-4. 9.
- Fang et al. 2013. FastLanes: An FPGA accelerated GPU microarchitecture simulator. In ICCD. 241--248.
- Huang et al. 2014. TBPoint: Reducing simulation time for large-scale GPGPU kernels. In IPDPS. 437--446.
- He et al. 2016. Deep residual learning for image recognition. In CVPR. 770--778.
- Huang et al. 2016. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993.
- Ko et al. 2014. Hardware-in-the-loop simulation for CPU/GPU heterogeneous platforms. In DAC. 1--6.
- Lee et al. 2016. Parallel GPU architecture simulation framework exploiting architectural-level parallelism with timing error prediction. IEEE TC 65, 4 (2016), 1253--1265.
- Redmon et al. 2016. YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242.
- Sim et al. 2012. A performance analysis framework for identifying potential benefits in GPGPU applications. In ACM SIGPLAN Notices, Vol. 47. 11--22.
- Wang et al. 2017. CGPredict: Embedded GPU performance estimation from single-threaded applications. ACM TECS 16 (2017), 146.
- Yu et al. 2015. GPGPU-MiniBench: Accelerating GPGPU micro-architecture simulation. IEEE TC 64, 11 (2015), 3153--3166.
Index Terms
- NNsim: fast performance estimation based on sampled simulation of GPGPU kernels for neural networks