DOI: 10.1145/3123266.3123435
research-article

Wheel: Accelerating CNNs with Distributed GPUs via Hybrid Parallelism and Alternate Strategy

Published: 19 October 2017

Abstract

Convolutional Neural Networks (CNNs) have been widely used and achieve impressive performance, typically at the cost of expensive computation. Some methods accelerate CNN training with distributed GPUs, i.e., GPUs deployed across multiple servers. Unfortunately, these methods must transmit large amounts of data among servers, which leads to long data transmission times and long GPU idle times. To this end, we propose a novel hybrid parallelism architecture named "Wheel" that accelerates CNN training by reducing the transmitted data while keeping GPUs fully utilized. Specifically, Wheel first partitions the layers of a CNN into two kinds of modules, a convolutional module and a fully-connected module, and deploys them following the proposed hybrid parallelism. In this way, Wheel transmits only a small fraction of the CNN's parameters across servers and keeps most parameter traffic within a single server, significantly reducing data transmission time. Second, to keep each GPU busy and reduce idle time, Wheel devises an alternate strategy that deploys multiple workers on each GPU: when one worker is suspended to receive data, another worker on the same GPU executes its computing task. The workers on each GPU thus run alternately and repeatedly, like a wheel. Experiments show that the proposed scheme outperforms state-of-the-art parallel approaches.
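
The alternate strategy described above can be made concrete with a small scheduling sketch. The following Python snippet is a minimal illustration under stated assumptions, not the paper's implementation: two workers share one GPU, and whenever one worker is blocked receiving parameters from another server, the other worker acquires the GPU and runs its computing task. All names here (worker, recv_params, compute_step, gpu_lock) are hypothetical.

import threading
import time

# Hypothetical sketch (not the paper's code): two workers share one GPU.
# While one worker waits to receive parameters from another server, the
# other worker holds the GPU and runs its computing task, so the GPU
# stays busy instead of idling during communication.

gpu_lock = threading.Lock()  # only one worker may occupy the GPU at a time

def recv_params(worker_id):
    """Stand-in for receiving parameters over the network (communication-bound)."""
    time.sleep(0.05)  # simulated transmission latency
    return {"weights": f"params-for-worker-{worker_id}"}

def compute_step(worker_id, params, step):
    """Stand-in for a forward/backward pass on the GPU (computation-bound)."""
    with gpu_lock:  # occupy the GPU only while computing
        time.sleep(0.02)  # simulated kernel execution time
        print(f"worker {worker_id}: step {step} using {params['weights']}")

def worker(worker_id, num_steps=3):
    for step in range(num_steps):
        params = recv_params(worker_id)   # GPU is free while this worker waits
        compute_step(worker_id, params, step)

# Two workers per GPU: their communication and computation phases interleave,
# which is the alternating, "wheel-like" behavior described in the abstract.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

In a real system the simulated sleeps would be replaced by actual network transfers and GPU kernels, and the number of workers per GPU would be chosen so that communication overlaps with computation.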

Cited By

  • (2022) Optimization and acceleration of convolutional neural networks: A survey. Journal of King Saud University - Computer and Information Sciences 34(7), 4244-4268. https://doi.org/10.1016/j.jksuci.2020.10.004
  • (2020) GPUs Utilization of Residual Network Training for Colon Histopathological Images Classification. 2020 International Conference on Computer Science and Its Application in Agriculture (ICOSICA), 1-8. https://doi.org/10.1109/ICOSICA49951.2020.9243276
  • (2019) Cauchy Matrix Factorization for Tag-Based Social Image Retrieval. IEEE Access 7, 132302-132310. https://doi.org/10.1109/ACCESS.2019.2940598

        Published In

        MM '17: Proceedings of the 25th ACM international conference on Multimedia
        October 2017
        2028 pages
        ISBN:9781450349062
        DOI:10.1145/3123266
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 19 October 2017

        Author Tags

        1. acceleration
        2. alternate strategy
        3. CNN
        4. distributed GPUs
        5. hybrid parallelism

        Qualifiers

        • Research-article

        Funding Sources

        • 973 Program of China

        Conference

        MM '17
        Sponsor:
        MM '17: ACM Multimedia Conference
        October 23 - 27, 2017
        Mountain View, California, USA

        Acceptance Rates

        MM '17 Paper Acceptance Rate 189 of 684 submissions, 28%;
        Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
