ABSTRACT
The emergence of big data has created new challenges for researchers who transmit large data sets across campus networks to local (HPC) cloud resources, or over wide-area networks to public cloud services. Unlike conventional HPC systems, where the network is carefully architected (e.g., a high-speed local interconnect, or a wide-area connection between Data Transfer Nodes), today's big data communication often occurs over shared network infrastructure, where many external and uncontrolled factors influence performance.
This paper describes our efforts to understand and characterize the performance of big data transfer tools such as rclone, Cyberduck, and provider-specific CLI tools when moving data to and from public and private cloud resources. We analyze the parameter settings available in each tool and their impact on performance. Our experimental results give insight into the performance of cloud providers and transfer tools, and provide guidance on parameter settings for cloud transfer tools. We also explore performance from HPC Data Transfer Nodes (DTNs) as well as from researcher machines located deep in the campus network, and show that emerging SDN approaches such as the VIP Lanes system can deliver excellent performance even from researchers' machines.
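As a rough illustration of what a parameter study of this kind might look like, the sketch below builds `rclone copy` command lines over a small grid of concurrency settings and converts a timed transfer into throughput. It is not the authors' measurement harness. The abstract does not say which parameters or values were tested, so the choice of the `--transfers` and `--checkers` flags, the grid values, the `./dataset` path, and the `remote:bucket/dataset` destination are all assumptions made for the example.

```python
import itertools
import subprocess
import time


def build_rclone_command(source, dest, transfers, checkers):
    """Return an argv list for `rclone copy` with the given concurrency flags."""
    return [
        "rclone", "copy", source, dest,
        "--transfers", str(transfers),   # number of parallel file transfers
        "--checkers", str(checkers),     # number of parallel checkers
    ]


def throughput_mbps(num_bytes, seconds):
    """Convert a byte count and elapsed time to megabits per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_bytes * 8 / seconds / 1e6


def parameter_grid(transfers_values, checkers_values):
    """Return every (transfers, checkers) combination to be tested."""
    return list(itertools.product(transfers_values, checkers_values))


def run_trial(source, dest, num_bytes, transfers, checkers):
    """Time one transfer; requires rclone to be installed and configured."""
    cmd = build_rclone_command(source, dest, transfers, checkers)
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return throughput_mbps(num_bytes, time.monotonic() - start)


if __name__ == "__main__":
    # Hypothetical grid and endpoints; only prints the commands, does not run them.
    for transfers, checkers in parameter_grid([1, 4, 16], [8]):
        cmd = build_rclone_command("./dataset", "remote:bucket/dataset",
                                   transfers, checkers)
        print(" ".join(cmd))
```

Running the script only prints the candidate commands. Calling `run_trial` performs a real transfer, so on a shared network each setting would need to be repeated several times before the results could be compared.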
Index Terms
- Navigating the Unexpected Realities of Big Data Transfers in a Cloud-based World