
Leveraging endpoint flexibility in data-intensive clusters

Published: 27 August 2013

Abstract

Many applications do not constrain the destinations of their network transfers. New opportunities emerge when such transfers contribute a large share of a cluster's traffic: by choosing endpoints that avoid congested links, the completion times of these transfers, as well as those of others without similar flexibility, can be improved. In this paper, we focus on leveraging the flexibility in replica placement during writes to cluster file systems (CFSes), which account for almost half of all cross-rack traffic in data-intensive clusters. The replicas of a CFS write can be placed on any subset of machines, as long as they span multiple fault domains and preserve a balanced use of storage throughout the cluster.
We study CFS interactions with the cluster network, analyze optimizations for replica placement, and propose Sinbad, a system that identifies imbalance and adapts replica destinations to navigate around congested links. Experiments on EC2 and trace-driven simulations show that block writes complete 1.3× faster on EC2 (1.58× in simulation) as the network becomes more balanced. As a collateral benefit, end-to-end completion times of data-intensive jobs improve as well. Sinbad does so with little impact on long-term storage balance.
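To make the abstract's mechanism concrete, the sketch below shows one way a network-aware block placer of this kind could work: rank candidate racks by an estimate of their downlink congestion and greedily pick the least-loaded ones, subject to the fault-domain and storage-balance constraints the abstract mentions. This is a minimal illustration under assumptions of ours (racks as the only fault domains, a single utilization estimate per rack, a simple free-space threshold), not Sinbad's actual algorithm or API.

```python
# Sketch of network-aware replica placement ("constrained anycast").
# Assumptions (ours, not the paper's): racks are the only fault domains,
# each rack's congestion is summarized by one downlink-utilization number,
# and storage balance is enforced by a per-rack free-space threshold.

from dataclasses import dataclass

@dataclass
class Rack:
    name: str
    downlink_utilization: float  # 0.0 (idle) .. 1.0 (saturated), periodically measured
    free_space_frac: float       # fraction of storage still free in this rack

def place_replicas(racks, writer_rack, num_replicas=3, min_free=0.1):
    """Pick destination racks for one block's replicas.

    Mimics the common CFS policy (first replica near the writer, the rest
    spread across fault domains) but breaks ties by choosing the racks with
    the least-congested downlinks instead of choosing uniformly at random.
    """
    # Candidates must have room; keeping racks storage-balanced over the
    # long run is what makes chasing transient network imbalance safe.
    candidates = [r for r in racks if r.free_space_frac >= min_free]
    # Greedy choice: least-congested candidate racks first.
    by_load = sorted(candidates, key=lambda r: r.downlink_utilization)

    chosen = []
    if writer_rack in candidates:           # keep one replica near the writer
        chosen.append(writer_rack)
    for rack in by_load:                    # remaining replicas: avoid hotspots,
        if len(chosen) == num_replicas:     # at most one replica per rack
            break
        if rack not in chosen:
            chosen.append(rack)
    return chosen

if __name__ == "__main__":
    racks = [
        Rack("rack-a", downlink_utilization=0.9, free_space_frac=0.4),
        Rack("rack-b", downlink_utilization=0.2, free_space_frac=0.5),
        Rack("rack-c", downlink_utilization=0.5, free_space_frac=0.05),  # nearly full
        Rack("rack-d", downlink_utilization=0.1, free_space_frac=0.6),
    ]
    # A writer in congested rack-a keeps its local replica, but the two
    # remote replicas land on the least-loaded racks (rack-d, rack-b),
    # steering the cross-rack write traffic around the hotspot.
    print([r.name for r in place_replicas(racks, writer_rack=racks[0])])
```

The greedy choice pays off because replica placement is a frequent, fine-grained decision: steering individual writes toward momentarily idle links can rebalance the network in the short term while, over many placements, leaving long-term storage balance largely intact.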




Published In

ACM SIGCOMM Computer Communication Review, Volume 43, Issue 4
October 2013, 595 pages
ISSN: 0146-4833
DOI: 10.1145/2534169

Also appears in:
SIGCOMM '13: Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM
August 2013, 580 pages
ISBN: 9781450320566
DOI: 10.1145/2486001

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 August 2013
Published in SIGCOMM-CCR Volume 43, Issue 4

Author Tags

  1. cluster file systems
  2. constrained anycast
  3. data-intensive applications
  4. datacenter networks
  5. replica placement

Qualifiers

  • Research-article

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 77
  • Downloads (last 6 weeks): 17

Reflects downloads up to 20 Feb 2025.

Cited By

  • (2024) Achieving Tunable Erasure Coding with Cluster-Aware Redundancy Transitioning. ACM Transactions on Architecture and Code Optimization 21(3), 1-24. DOI: 10.1145/3672077. 10 Jun 2024.
  • (2024) Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks. ACM Transactions on Architecture and Code Optimization 21(3), 1-24. DOI: 10.1145/3664926. 13 May 2024.
  • (2024) Boosting Correlated Failure Repair in SSD Data Centers. IEEE Internet of Things Journal 11(8), 14228-14240. DOI: 10.1109/JIOT.2023.3339979. 15 Apr 2024.
  • (2024) SARS: Towards minimizing average Coflow Completion Time in MapReduce systems. Computer Networks 247, 110429. DOI: 10.1016/j.comnet.2024.110429. Jun 2024.
  • (2023) Boosting Erasure-Coded Multi-Stripe Repair in Rack Architecture and Heterogeneous Clusters: Design and Analysis. IEEE Transactions on Parallel and Distributed Systems 34(8), 2251-2264. DOI: 10.1109/TPDS.2023.3282180. 1 Aug 2023.
  • (2023) Towards Adaptive Adjusting and Efficient Scheduling Coflows Based on Deep Reinforcement Learning. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 2041-2048. DOI: 10.1109/ICPADS60453.2023.00278. 17 Dec 2023.
  • (2023) Deep Reinforcement Learning based Data Placement optimization in Data Center Networks. 2023 IEEE International Conference on Big Data (BigData), 2293-2302. DOI: 10.1109/BigData59044.2023.10386549. 15 Dec 2023.
  • (2022) Boosting Cross-rack Multi-stripe Repair in Heterogeneous Erasure-coded Clusters. Proceedings of the 51st International Conference on Parallel Processing (ICPP), 1-11. DOI: 10.1145/3545008.3545029. 29 Aug 2022.
  • (2022) XHR-Code: An Efficient Wide Stripe Erasure Code to Reduce Cross-Rack Overhead in Cloud Storage Systems. 2022 41st International Symposium on Reliable Distributed Systems (SRDS), 273-283. DOI: 10.1109/SRDS55811.2022.00033. Sep 2022.
  • (2022) GRPU: An Efficient Graph-based Cross-Rack Parallel Update Scheme for Cloud Storage Systems. 2022 IEEE 40th International Conference on Computer Design (ICCD), 154-161. DOI: 10.1109/ICCD56317.2022.00032. Oct 2022.
