Abstract
The rise of ad-hoc data-intensive computing has led to the development of data-parallel programming systems such as Map/Reduce and Hadoop, which achieve scalability by tightly coupling storage and computation. This coupling can be limiting when the ratio of computation to storage is not known in advance or changes over time. In this work, we examine decoupling storage and computation in Hadoop through SuperDataNodes, which are servers that contain an order of magnitude more disks than traditional Hadoop nodes. We found that SuperDataNodes are not only capable of supporting workloads with high storage-to-processing ratios, but in some cases can outperform traditional Hadoop deployments through better management of a large centralized pool of disks.
Index Terms
- Decoupling storage and computation in Hadoop with SuperDataNodes