skip to main content
10.1145/2133173.2133177acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
short-paper

Robust distributed orthogonalization based on randomized aggregation

Published: 14 November 2011 Publication History

Abstract

The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to node failures compared to existing aggregation methods. On a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method (rdmGS), which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms.

References

[1]
T. Aysal, M. Yildiz, A. Sarwate, and A. Scaglione. Broadcast gossip algorithms for consensus. IEEE Trans. Signal Processing, 57(7):2748 --2761, 2009.
[2]
S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, Special issue of IEEE Transactions on Information Theory and IEEE ACM Transactions on Networking, 52:2508--2530, 2006.
[3]
Z. Chen. Algorithm-based recovery for iterative methods without checkpointing. In Proceedings of the 20th international symposium on High performance distributed computing, HPDC '11, pages 73--84, New York, NY, USA, 2011. ACM.
[4]
Z. Chen and J. Dongarra. Algorithm-based fault tolerance for fail-stop failures. Parallel and Distributed Systems, IEEE Transactions on, 19(12):1628--1641, 2008.
[5]
A. Dimakis, A. Sarwate, and M. Wainwright. Geographic gossip: Efficient averaging for sensor networks. IEEE Trans. Signal Processing, 56(3):1205--1216, 2008.
[6]
C. Engelmann, H. H. Ong, and S. L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 27th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189--194. ACTA Press, Calgary, AB, Canada, 2009.
[7]
I. Eyal, I. Keidar, and R. Rom. LiMoSense -- Live Monitoring in Dynamic Sensor Networks. In 7th Int'l Symp. on Algorithms for Sensor Systems, Wireless Ad Hoc Networks and Autonomous Mobile Entities (ALGOSENSORS'11), 2011.
[8]
G. E. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, J. Pjesivac-Grbovic, and J. J. Dongarra. Process fault tolerance: Semantics, design and applications for high performance computing. International Journal of High Performance Computing Applications, 19(4):465--477, 2005.
[9]
K. Ferreira, R. Riesen, R. Oldfield, J. Stearley, J. Laros, K. Pedretti, and T. Brightwell. rMPI: increasing fault resiliency in a message-passing environment. Technical report, no. SAND2011--2488.
[10]
A. Guermouche, T. Ropars, E. Brunet, M. Snir, and F. Cappello. Uncoordinated checkpointing without domino effect for send-deterministic mpi applications. In Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 989--1000, 2011.
[11]
K.-H. Huang and J. Abraham. Algorithm-based fault tolerance for matrix operations. Computers, IEEE Transactions on, C-33(6):518--528, 1984.
[12]
M. Jelasity, A. Montresor, and O. Babaoglu. Gossip-based aggregation in large dynamic networks. ACM Trans. Comput. Syst., 23:219--252, 2005.
[13]
P. Jesus, C. Baquero, and P. S. Almeida. Fault-tolerant aggregation by flow updating. In Proceedings of the 9th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems, DAIS '09, pages 73--86, Berlin, Heidelberg, 2009. Springer-Verlag.
[14]
R. Karp, C. Schindelhauer, S. Shenker, and B. Vocking. Randomized rumor spreading. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 565--, Washington, DC, USA, 2000. IEEE Computer Society.
[15]
D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In FOCS '03: Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 482--491. IEEE Computer Society, 2003.
[16]
D. Kempe and F. McSherry. A decentralized algorithm for spectral analysis. Journal of Computer and System Sciences, 74(1):70 -- 83, 2008.
[17]
A.-M. Kermarrec and M. van Steen. Gossiping in distributed systems. SIGOPS Oper. Syst. Rev., 41:2--7, 2007.
[18]
D. Mosk-Aoyama and D. Shah. Computing separable functions via gossip. In Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, PODC '06, pages 113--122, New York, NY, USA, 2006. ACM.
[19]
A. Olshevsky and J. N. Tsitsiklis. Convergence speed in distributed consensus and averaging. SIAM J. Control Optim., 48:33--55, 2009.
[20]
J. Plank. Improving the performance of coordinated checkpointers on networks of workstations using raid techniques. In Reliable Distributed Systems, 1996. Proceedings., 15th Symposium on, pages 76 --85, 1996.
[21]
J. Plank, K. Li, and M. Puening. Diskless checkpointing. Parallel and Distributed Systems, IEEE Transactions on, 9(10):972--986, 1998.
[22]
B. Schroeder and G. A. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78(1):012022, 2007.
[23]
D. Shah. Gossip algorithms. Found. Trends Netw., 3:1--125, 2009.
[24]
H. Strakova and W. N. Gansterer. A distributed eigensolver based on randomized communication. Technical Report RLCTA-2011--1, University of Vienna, Research Group Theory and Applications of Algorithms, 2011. ( submitted).
[25]
H. Strakova, W. N. Gansterer, and T. Zemen. Distributed QR factorization based on randomized algorithms. In Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics, 2011. phto appear.
[26]
M. Varela, K. Ferreira, and R. Riesen. Fault-tolerance for exascale systems. In Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010 IEEE International Conference on, pages 1--4, 2010.

Cited By

View all
  • (2022)Resiliency in numerical algorithm design for extreme scale simulationsInternational Journal of High Performance Computing Applications10.1177/1094342021105518836:2(251-285)Online publication date: 1-Mar-2022
  • (2017)Aggregation protocols in light of reliable communication2017 IEEE 16th International Symposium on Network Computing and Applications (NCA)10.1109/NCA.2017.8171346(1-4)Online publication date: Oct-2017
  • (2015)Gossip-based spectral clustering of distributed data streams2015 International Conference on High Performance Computing & Simulation (HPCS)10.1109/HPCSim.2015.7237058(325-333)Online publication date: Jul-2015
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ScalA '11: Proceedings of the second workshop on Scalable algorithms for large-scale systems
November 2011
44 pages
ISBN:9781450311809
DOI:10.1145/2133173
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 November 2011

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. distributed orthogonalization
  2. distributed reduction
  3. fault tolerant orthogonalization
  4. push-sum algorithm
  5. randomized orthogonalization

Qualifiers

  • Short-paper

Conference

SC '11
Sponsor:

Acceptance Rates

Overall Acceptance Rate 12 of 20 submissions, 60%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 03 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2022)Resiliency in numerical algorithm design for extreme scale simulationsInternational Journal of High Performance Computing Applications10.1177/1094342021105518836:2(251-285)Online publication date: 1-Mar-2022
  • (2017)Aggregation protocols in light of reliable communication2017 IEEE 16th International Symposium on Network Computing and Applications (NCA)10.1109/NCA.2017.8171346(1-4)Online publication date: Oct-2017
  • (2015)Gossip-based spectral clustering of distributed data streams2015 International Conference on High Performance Computing & Simulation (HPCS)10.1109/HPCSim.2015.7237058(325-333)Online publication date: Jul-2015
  • (2013)A Distributed Eigensolver for Loosely Coupled NetworksProceedings of the 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing10.1109/PDP.2013.18(51-57)Online publication date: 27-Feb-2013
  • (2013)Fault Tolerance Properties of Gossip-Based Distributed Orthogonal Iteration MethodsProcedia Computer Science10.1016/j.procs.2013.05.18218(189-198)Online publication date: 2013
  • (2012)Improving Fault Tolerance and Accuracy of a Distributed Reduction AlgorithmProceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis10.1109/SC.Companion.2012.89(643-651)Online publication date: 10-Nov-2012
  • (2012)PosterProceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis10.1109/SC.Companion.2012.220Online publication date: 10-Nov-2012
  • (2012)AbstractProceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis10.1109/SC.Companion.2012.219(1405-1406)Online publication date: 10-Nov-2012

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media