
Leveraging endpoint flexibility in data-intensive clusters

Published: 27 August 2013

Abstract

Many applications do not constrain the destinations of their network transfers. New opportunities emerge when such transfers contribute a large share of a cluster's traffic: by choosing endpoints that avoid congested links, the completion times of these transfers, as well as those of others without similar flexibility, can be improved. In this paper, we focus on leveraging the flexibility in replica placement during writes to cluster file systems (CFSes), which account for almost half of all cross-rack traffic in data-intensive clusters. The replicas of a CFS write can be placed on any subset of machines, as long as they span multiple fault domains and preserve a balanced use of storage throughout the cluster.
We study CFS interactions with the cluster network, analyze optimizations for replica placement, and propose Sinbad, a system that identifies imbalance and adapts replica destinations to navigate around congested links. Experiments on EC2 and trace-driven simulations show that block writes complete 1.3× faster on EC2 (1.58× in simulation) as the network becomes more balanced. As a collateral benefit, end-to-end completion times of data-intensive jobs improve as well. Sinbad does so with little impact on long-term storage balance.
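To make the abstract's mechanism concrete, the sketch below shows one way a network-aware block placer of this kind could work: rank candidate racks by an estimate of their downlink congestion and greedily pick the least-loaded ones, subject to the fault-domain and storage-balance constraints the abstract mentions. This is a minimal illustration under assumptions of ours (racks as the only fault domains, a single utilization estimate per rack, a simple free-space threshold), not Sinbad's actual algorithm or API.

```python
# Sketch of network-aware replica placement ("constrained anycast").
# Assumptions (ours, not the paper's): racks are the only fault domains,
# each rack's congestion is summarized by one downlink-utilization number,
# and storage balance is enforced by a per-rack free-space threshold.

from dataclasses import dataclass

@dataclass
class Rack:
    name: str
    downlink_utilization: float  # 0.0 (idle) .. 1.0 (saturated), periodically measured
    free_space_frac: float       # fraction of storage still free in this rack

def place_replicas(racks, writer_rack, num_replicas=3, min_free=0.1):
    """Pick destination racks for one block's replicas.

    Mimics the common CFS policy (first replica near the writer, the rest
    spread across fault domains) but breaks ties by choosing the racks with
    the least-congested downlinks instead of choosing uniformly at random.
    """
    # Candidates must have room; keeping racks storage-balanced over the
    # long run is what makes chasing transient network imbalance safe.
    candidates = [r for r in racks if r.free_space_frac >= min_free]
    # Greedy choice: least-congested candidate racks first.
    by_load = sorted(candidates, key=lambda r: r.downlink_utilization)

    chosen = []
    if writer_rack in candidates:           # keep one replica near the writer
        chosen.append(writer_rack)
    for rack in by_load:                    # remaining replicas: avoid hotspots,
        if len(chosen) == num_replicas:     # at most one replica per rack
            break
        if rack not in chosen:
            chosen.append(rack)
    return chosen

if __name__ == "__main__":
    racks = [
        Rack("rack-a", downlink_utilization=0.9, free_space_frac=0.4),
        Rack("rack-b", downlink_utilization=0.2, free_space_frac=0.5),
        Rack("rack-c", downlink_utilization=0.5, free_space_frac=0.05),  # nearly full
        Rack("rack-d", downlink_utilization=0.1, free_space_frac=0.6),
    ]
    # A writer in congested rack-a keeps its local replica, but the two
    # remote replicas land on the least-loaded racks (rack-d, rack-b),
    # steering the cross-rack write traffic around the hotspot.
    print([r.name for r in place_replicas(racks, writer_rack=racks[0])])
```

The greedy choice pays off because replica placement is a frequent, fine-grained decision: steering individual writes toward momentarily idle links can rebalance the network in the short term while, over many placements, leaving long-term storage balance largely intact.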




Published In

ACM SIGCOMM Computer Communication Review, Volume 43, Issue 4
October 2013, 595 pages
ISSN: 0146-4833
DOI: 10.1145/2534169

Also appears in:
SIGCOMM '13: Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM
August 2013, 580 pages
ISBN: 9781450320566
DOI: 10.1145/2486001

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 August 2013
Published in SIGCOMM-CCR Volume 43, Issue 4

Author Tags

  1. cluster file systems
  2. constrained anycast
  3. data-intensive applications
  4. datacenter networks
  5. replica placement

Qualifiers

  • Research-article

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 77
  • Downloads (last 6 weeks): 17

Reflects downloads up to 20 Feb 2025.

Cited By

  • (2024) Achieving Tunable Erasure Coding with Cluster-Aware Redundancy Transitioning. ACM Transactions on Architecture and Code Optimization 21(3), 1-24. DOI: 10.1145/3672077. 10 Jun 2024.
  • (2024) Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks. ACM Transactions on Architecture and Code Optimization 21(3), 1-24. DOI: 10.1145/3664926. 13 May 2024.
  • (2024) Boosting Correlated Failure Repair in SSD Data Centers. IEEE Internet of Things Journal 11(8), 14228-14240. DOI: 10.1109/JIOT.2023.3339979. 15 Apr 2024.
  • (2024) SARS: Towards minimizing average Coflow Completion Time in MapReduce systems. Computer Networks 247, 110429. DOI: 10.1016/j.comnet.2024.110429. Jun 2024.
  • (2023) Boosting Erasure-Coded Multi-Stripe Repair in Rack Architecture and Heterogeneous Clusters: Design and Analysis. IEEE Transactions on Parallel and Distributed Systems 34(8), 2251-2264. DOI: 10.1109/TPDS.2023.3282180. 1 Aug 2023.
  • (2023) Towards Adaptive Adjusting and Efficient Scheduling Coflows Based on Deep Reinforcement Learning. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 2041-2048. DOI: 10.1109/ICPADS60453.2023.00278. 17 Dec 2023.
  • (2023) Deep Reinforcement Learning based Data Placement optimization in Data Center Networks. 2023 IEEE International Conference on Big Data (BigData), 2293-2302. DOI: 10.1109/BigData59044.2023.10386549. 15 Dec 2023.
  • (2022) Boosting Cross-rack Multi-stripe Repair in Heterogeneous Erasure-coded Clusters. Proceedings of the 51st International Conference on Parallel Processing (ICPP), 1-11. DOI: 10.1145/3545008.3545029. 29 Aug 2022.
  • (2022) XHR-Code: An Efficient Wide Stripe Erasure Code to Reduce Cross-Rack Overhead in Cloud Storage Systems. 2022 41st International Symposium on Reliable Distributed Systems (SRDS), 273-283. DOI: 10.1109/SRDS55811.2022.00033. Sep 2022.
  • (2022) GRPU: An Efficient Graph-based Cross-Rack Parallel Update Scheme for Cloud Storage Systems. 2022 IEEE 40th International Conference on Computer Design (ICCD), 154-161. DOI: 10.1109/ICCD56317.2022.00032. Oct 2022.
