DOI: 10.1145/3035918.3035933

Heterogeneity-aware Distributed Parameter Servers

Published: 09 May 2017

Abstract

In this work we study distributed machine learning in heterogeneous environments. We first conduct a systematic study of existing systems that run distributed stochastic gradient descent and find that, although they work well in homogeneous environments, they can suffer performance degradation of up to 10x in heterogeneous environments where stragglers are common, because their synchronization protocols are not designed for such settings. Our first contribution is a heterogeneity-aware algorithm that applies a constant learning rate schedule to updates before they are added to the global parameter, which suppresses the harm stragglers cause to convergence. Our second contribution is a more sophisticated learning rate schedule that further takes the delay of each update into account. We theoretically prove the convergence of both approaches and implement a prototype system in the production cluster of our industrial partner Tencent Inc. We validate the prototype's performance on a range of machine-learning workloads: it is 2-12x faster than state-of-the-art systems such as Spark, Petuum, and TensorFlow, and our proposed algorithm needs up to 6x fewer iterations to converge.
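
To make the delay-aware idea concrete, below is a minimal Python sketch of how a parameter server could damp stale updates with a delay-dependent step size. It only illustrates the general technique described above, not the paper's actual algorithm: the ToyParameterServer class, the delay_aware_lr helper, the 1/(1 + decay * delay) schedule, and all constants are assumptions chosen for the example.

    import numpy as np

    def delay_aware_lr(base_lr, delay, decay=0.5):
        # Illustrative schedule (an assumption, not the paper's formula):
        # the larger an update's delay, the smaller the step it is given.
        return base_lr / (1.0 + decay * delay)

    class ToyParameterServer:
        def __init__(self, dim, base_lr=0.1):
            self.w = np.zeros(dim)   # global parameter vector
            self.clock = 0           # number of updates applied so far
            self.base_lr = base_lr

        def pull(self):
            # A worker fetches the current parameters and the server clock.
            return self.w.copy(), self.clock

        def push(self, grad, worker_clock):
            # A worker pushes a gradient computed against parameters pulled
            # at `worker_clock`; the gap to the current clock is the delay.
            delay = self.clock - worker_clock
            lr = delay_aware_lr(self.base_lr, delay)
            self.w -= lr * np.asarray(grad)
            self.clock += 1

    # Usage: a fresh update gets the full step, a straggler's is damped.
    if __name__ == "__main__":
        ps = ToyParameterServer(dim=3)
        _, t_slow = ps.pull()                # straggler pulls at clock 0
        for _ in range(5):
            _, t_fast = ps.pull()
            ps.push(np.ones(3), t_fast)      # delay 0 -> lr = 0.1
        ps.push(np.ones(3), t_slow)          # delay 5 -> lr ~ 0.029
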


Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN:9781450341974
DOI: 10.1145/3035918


Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. heterogeneous environment
  2. machine learning
  3. parameter server
  4. stochastic gradient descent
  5. straggler
  6. synchronization protocol

Qualifiers

  • Research-article


Conference

SIGMOD/PODS'17

Acceptance Rates

Overall acceptance rate: 785 of 4,003 submissions (20%)

Cited By

  • Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs Over Heterogeneous Infrastructure. IEEE Transactions on Parallel and Distributed Systems 36(2):150-167, Feb 2025. https://doi.org/10.1109/TPDS.2024.3506588
  • A Survey on Parameter Server Architecture: Approaches for Optimizing Distributed Centralized Learning. IEEE Access 13:30993-31015, 2025. https://doi.org/10.1109/ACCESS.2025.3535085
  • From Sancus to Sancus^q: staleness and quantization-aware full-graph decentralized training in graph neural networks. The VLDB Journal 34(2), Jan 2025. https://doi.org/10.1007/s00778-024-00897-2
  • Dynamic Client Clustering, Bandwidth Allocation, and Workload Optimization for Semi-Synchronous Federated Learning. Electronics 13(23):4585, Nov 2024. https://doi.org/10.3390/electronics13234585
  • A trust-aware model based on reliability-based friendly relationship method in IoT networks. Journal of High Speed Networks 30(4):639-655, Oct 2024. https://doi.org/10.3233/JHS-240037
  • Leveraging Reinforcement Learning for Autonomous Data Pipeline Optimization and Management. SSRN Electronic Journal, 2024. https://doi.org/10.2139/ssrn.4908414
  • Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Survey. ACM Computing Surveys 57(4):1-35, Dec 2024. https://doi.org/10.1145/3703453
  • TorchQL: A Programming Framework for Integrity Constraints in Machine Learning. Proceedings of the ACM on Programming Languages 8(OOPSLA1):833-863, Apr 2024. https://doi.org/10.1145/3649841
  • TensAIR: Real-Time Training of Neural Networks from Data-streams. In Proceedings of the 2024 8th International Conference on Machine Learning and Soft Computing, pages 73-82, Jan 2024. https://doi.org/10.1145/3647750.3647762
  • Demystifying Data Management for Large Language Models. In Companion of the 2024 International Conference on Management of Data, pages 547-555, Jun 2024. https://doi.org/10.1145/3626246.3654683
