
Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent

Yudong Chen, Lili Su, and Jiaming Xu
Published: 19 December 2017

Abstract

We consider the distributed statistical learning problem over decentralized systems that are prone to adversarial attacks. This setup arises in many practical applications, including Google's Federated Learning. Formally, we focus on a decentralized system that consists of a parameter server and m working machines; each working machine keeps N/m data samples, where N is the total number of samples. In each iteration, up to q of the m working machines suffer Byzantine faults -- a faulty machine in a given iteration behaves arbitrarily badly against the system and has complete knowledge of the system. Moreover, the set of faulty machines may change across iterations. Our goal is to design robust algorithms such that the system can learn the underlying true parameter, which is of dimension d, despite the disruption caused by the Byzantine attacks.
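In schematic notation (ours, not the abstract's), the fault model can be written as follows: the message the server receives from machine i at iteration t is the honest local gradient unless i lies in the current Byzantine set.

```latex
% Schematic fault model (our notation: \mathcal{B}_t is the Byzantine set at
% iteration t, f_i the empirical risk on machine i's N/m local samples).
\[
g_i^{(t)} =
\begin{cases}
  \nabla f_i\bigl(\theta^{(t)}\bigr), & i \notin \mathcal{B}_t \quad \text{(honest)}, \\[4pt]
  \star, & i \in \mathcal{B}_t \quad \text{(arbitrary)},
\end{cases}
\qquad \bigl|\mathcal{B}_t\bigr| \le q,
\]
```

Here ⋆ may be any vector, chosen with complete knowledge of the system, and the set B_t may change from iteration to iteration.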

In this paper, based on the geometric median of means of the gradients, we propose a simple variant of the classical gradient descent method. We show that our method can tolerate q Byzantine failures up to 2(1+ε)q ≤ m for an arbitrarily small but fixed constant ε > 0. The parameter estimate converges in O(log N) rounds with an estimation error on the order of max{√(dq/N), √(d/N)}, which is larger than the minimax-optimal error rate √(d/N) in the centralized and failure-free setting by at most a factor of √q. The total computational complexity of our algorithm is O((Nd/m) log N) at each working machine and O(md + kd log³ N) at the central server, where k is the number of batches used in the median-of-means aggregation; the total communication cost is O(md log N). We further provide an application of our general results to the linear regression problem.
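To make the aggregation rule concrete, here is a minimal sketch, in our own code rather than the authors' implementation, of the median-of-means Byzantine gradient descent loop for a least-squares objective; the Weiszfeld solver, the step size eta, the attack (large random vectors), and the batch count k are illustrative assumptions:

```python
# Minimal sketch of Byzantine gradient descent with geometric-median-of-means
# aggregation (illustrative; not the authors' implementation).
import numpy as np

def geometric_median(points, iters=100, tol=1e-8):
    """Approximate geometric median of the rows of `points` via Weiszfeld's
    fixed-point iteration; any sufficiently accurate solver would do."""
    z = points.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(points - z, axis=1), tol)
        w = 1.0 / dist
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z

def byzantine_gd(X_parts, y_parts, q, k, eta=0.5, rounds=50):
    """Parameter server with m = len(X_parts) workers; each round, q workers
    report adversarial gradients. Honest reports are least-squares gradients;
    the server averages reports within k batches and descends along the
    geometric median of the batch means."""
    m, d = len(X_parts), X_parts[0].shape[1]
    theta = np.zeros(d)
    batches = np.array_split(np.arange(m), k)            # fixed median-of-means grouping
    for _ in range(rounds):
        grads = [X.T @ (X @ theta - y) / len(y)          # local empirical gradient
                 for X, y in zip(X_parts, y_parts)]
        for j in np.random.choice(m, q, replace=False):  # Byzantine set changes per round
            grads[j] = 1e3 * np.random.randn(d)          # arbitrary adversarial message
        means = np.array([np.mean([grads[j] for j in b], axis=0) for b in batches])
        theta -= eta * geometric_median(means)           # robust descent step
    return theta

# Toy run mirroring the linear regression application: d = 5, m = 20 workers
# (100 samples each), q = 3 Byzantine, k = 8 batches.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=5)
X = rng.normal(size=(2000, 5))
y = X @ theta_star + 0.1 * rng.normal(size=2000)
theta_hat = byzantine_gd(np.array_split(X, 20), np.array_split(y, 20), q=3, k=8)
print(np.linalg.norm(theta_hat - theta_star))            # small despite the attacks
```

The geometric median is what buys robustness here: at most q of the k batch means can be corrupted, and the geometric median of a point set is stable as long as strictly fewer than half the points are corrupted, so any k with 2q < k ≤ m works -- exactly the room that the condition 2(1+ε)q ≤ m provides.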

A key challenge in the above problem is that Byzantine failures create arbitrary and unspecified dependencies among the iterations and the aggregated gradients. To handle this issue in the analysis, we prove that the aggregated gradient, viewed as a function of the model parameter, converges uniformly to the true gradient function.
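Schematically, and in our own notation rather than the paper's, the uniform-convergence guarantee has the following shape, with the deviation matching the error rate quoted above:

```latex
% Schematic statement of the uniform convergence guarantee (our notation):
% G(theta) is the geometric-median-of-means aggregate of the reported
% gradients, F(theta) the population risk, and Theta the search space.
\[
\sup_{\theta \in \Theta}
  \bigl\| G(\theta) - \nabla F(\theta) \bigr\|_2
  \;\lesssim\;
  \max\!\left\{ \sqrt{\frac{dq}{N}},\; \sqrt{\frac{d}{N}} \right\}
  \quad \text{with high probability.}
\]
```

Because the bound holds simultaneously over all θ, it remains valid at the algorithm's iterates even though those iterates are data-dependent and adversarially influenced, which is what breaks the circular dependency between the Byzantine behavior and the trajectory of the descent.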


• Published in

  Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 1, Issue 2, December 2017
  480 pages
  EISSN: 2476-1249
  DOI: 10.1145/3175501

            Copyright © 2017 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States
