Abstract
We consider the distributed statistical learning problem over decentralized systems that are prone to adversarial attacks. This setup arises in many practical applications, including Google's Federated Learning. Formally, we focus on a decentralized system that consists of a parameter server and m working machines; each working machine keeps N/m data samples, where N is the total number of samples. In each iteration, up to q of the m working machines suffer Byzantine faults -- a faulty machine in a given iteration behaves arbitrarily badly against the system and has complete knowledge of the system. Additionally, the sets of faulty machines may differ across iterations. Our goal is to design robust algorithms such that the system can learn the underlying true parameter, which is of dimension d, despite the disruption caused by Byzantine attacks.
In this paper, based on the geometric median of means of the gradients, we propose a simple variant of the classical gradient descent method. We show that our method can tolerate q Byzantine failures, up to 2(1+ε)q ≤ m for an arbitrarily small but fixed constant ε > 0. The parameter estimate converges in O(log N) rounds with an estimation error on the order of max{√(dq/N), √(d/N)}, which is larger than the minimax-optimal error rate √(d/N) in the centralized and failure-free setting by at most a factor of √q. The total computational complexity of our algorithm is O((Nd/m) log N) at each working machine and O(md + kd log³ N) at the central server, where k is the number of groups used in the median of means, and the total communication cost is O(md log N). We further provide an application of our general results to the linear regression problem.
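To make the aggregation rule concrete, the following is a minimal Python sketch of one round structure consistent with the description above: group the m reported gradients into k batches, average each batch, and descend along the geometric median of the batch means (computed here with Weiszfeld's iteration). All function names and hyper-parameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def geometric_median(points, n_iter=100, tol=1e-7):
    """Weiszfeld's iteration: approximates the point minimizing the sum of
    Euclidean distances to the rows of `points` (one standard method)."""
    z = points.mean(axis=0)
    for _ in range(n_iter):
        dists = np.maximum(np.linalg.norm(points - z, axis=1), tol)  # avoid /0
        w = 1.0 / dists
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z

def aggregate(grads, k):
    """Geometric median of means: partition the m reported gradients into
    k groups, average within each group, then take the geometric median."""
    groups = np.array_split(np.asarray(grads), k)
    means = np.stack([g.mean(axis=0) for g in groups])
    return geometric_median(means)

def byzantine_gd(theta0, local_grad_fns, k, lr=0.1, rounds=50):
    """Server-side loop: each round, every worker reports a gradient at the
    current iterate (a Byzantine worker may report an arbitrary vector)."""
    theta = theta0
    for _ in range(rounds):
        grads = [g(theta) for g in local_grad_fns]  # possibly corrupted
        theta = theta - lr * aggregate(grads, k)
    return theta
```

Intuitively, and in line with the 2(1+ε)q ≤ m condition above, choosing k on the order of 2(1+ε)q ensures that at most q of the k group means can be corrupted in any round, so the corrupted means are always a strict minority and the geometric median remains a reliable descent direction.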
A key challenge in the above problem is that Byzantine failures create arbitrary and unspecified dependencies across iterations and among the aggregated gradients. To handle this issue in the analysis, we prove that the aggregated gradient, as a function of the model parameter, converges uniformly to the true gradient function.