ABSTRACT
Existing GPU simulators are too slow to evaluate neural networks running on GPUs. For fast performance estimation, we propose a novel hybrid method that combines analytical performance modeling with sampled simulation of GPUs. By exploiting the repeated computation patterns of neural networks, we devise three sampling techniques: inter-kernel sampling, intra-kernel sampling, and streaming multiprocessor (SM) sampling. The key technique is to estimate the average IPC through sampled simulation while accounting for the effects of the warp scheduler and memory access contention. Compared with GPGPU-Sim, the proposed technique reduces simulation time by up to 450 times with less than 5.0% accuracy loss.
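The inter-kernel sampling idea from the abstract can be illustrated with a minimal sketch: neural-network workloads launch the same GPU kernels many times, so a slow detailed simulation is run only once per unique kernel and its measured IPC is reused analytically for every repeated launch. All names below (`Kernel`, `detailed_sim_ipc`, the example launch trace) are hypothetical illustrations, not NNsim's actual interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Kernel:
    name: str          # kernel identifier
    instructions: int   # dynamic instruction count per launch

def detailed_sim_ipc(kernel: Kernel) -> float:
    """Stand-in for a slow cycle-level simulation (e.g. GPGPU-Sim)
    that would return the kernel's average IPC."""
    # Placeholder model only; a real flow would invoke the simulator.
    return 1.0 + (len(kernel.name) % 3) * 0.5

def estimate_cycles(trace: list[Kernel]) -> tuple[float, int]:
    """Estimate total cycles for a launch trace, running the detailed
    simulation only once per unique kernel (inter-kernel sampling)."""
    ipc_cache: dict[Kernel, float] = {}
    total_cycles = 0.0
    for k in trace:
        if k not in ipc_cache:              # simulate only on first sight
            ipc_cache[k] = detailed_sim_ipc(k)
        total_cycles += k.instructions / ipc_cache[k]
    return total_cycles, len(ipc_cache)     # (cycles, #detailed simulations)

# A ResNet-like trace: conv/relu kernels repeated across many layers.
conv = Kernel("conv2d", 1_000_000)
relu = Kernel("relu", 100_000)
trace = [conv, relu] * 50                   # 100 launches, 2 unique kernels
cycles, sims = estimate_cycles(trace)
print(sims)  # only 2 detailed simulations instead of 100
```

Intra-kernel and SM sampling would shrink the cost of each `detailed_sim_ipc` call further, by simulating only a slice of a kernel's execution or a subset of the streaming multiprocessors.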
REFERENCES
- Greg Diamos. 2016. Baidu Releases AI Benchmark. EE Times (September 2016). https://www.eetimes.com/document.asp?doc_id=1330521 [Online; posted 26-09-2016].
- Bakhoda et al. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS. 163--174.
- Farooqui et al. 2011. A framework for dynamically instrumenting GPU compute applications within GPU Ocelot. In GPGPU-4. 9.
- Fang et al. 2013. FastLanes: An FPGA accelerated GPU microarchitecture simulator. In ICCD. 241--248.
- Huang et al. 2014. TBPoint: Reducing simulation time for large-scale GPGPU kernels. In IPDPS. 437--446.
- He et al. 2016. Deep residual learning for image recognition. In CVPR. 770--778.
- Huang et al. 2016. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993.
- Ko et al. 2014. Hardware-in-the-loop simulation for CPU/GPU heterogeneous platforms. In DAC. 1--6.
- Lee et al. 2016. Parallel GPU architecture simulation framework exploiting architectural-level parallelism with timing error prediction. IEEE TC 65, 4 (2016), 1253--1265.
- Redmon et al. 2016. YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242.
- Sim et al. 2012. A performance analysis framework for identifying potential benefits in GPGPU applications. In ACM SIGPLAN Notices, Vol. 47. 11--22.
- Wang et al. 2017. CGPredict: Embedded GPU performance estimation from single-threaded applications. ACM TECS 16 (2017), 146.
- Yu et al. 2015. GPGPU-MiniBench: Accelerating GPGPU micro-architecture simulation. IEEE TC 64, 11 (2015), 3153--3166.
Index Terms
- NNsim: fast performance estimation based on sampled simulation of GPGPU kernels for neural networks