ABSTRACT
The cloud computing paradigm provides the "illusion" of infinite resources and is therefore a promising candidate for large-scale data-intensive computing. In this paper, we explore experiment-driven performance models for data-intensive workloads executing in an infrastructure-as-a-service (IaaS) public cloud. The performance models help predict workload behaviour and serve as a key component of a larger framework for resource provisioning in the cloud. We determine a suitable prediction technique by comparing popular regression methods. We also enumerate the variables that contribute to variance in workload performance in a public cloud. Finally, we build a performance model for a multi-tenant data service in the Amazon cloud. We find that a linear classifier is sufficient in most cases. In a few cases, a linear classifier is unsuitable and non-linear modeling is required, which is time-consuming. Consequently, we recommend training the performance model with a linear classifier first and carrying out non-linear modeling as a second step only if the resulting model is unsatisfactory.
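The "linear first, non-linear only if needed" recommendation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: a single predictor, ordinary least squares as the linear model, k-nearest-neighbour averaging as a stand-in for non-linear modeling, mean relative error as the quality measure, and a hypothetical 10% threshold (`max_error`) for what counts as satisfactory.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Ordinary least squares for one predictor: y ~ a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_knn(xs, ys, k=3):
    """Non-linear stand-in (assumption): average the k nearest training points."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)

    return predict


def mean_relative_error(model, xs, ys):
    """Average of |prediction - observation| / |observation|."""
    return mean(abs(model(x) - y) / abs(y) for x, y in zip(xs, ys))


def train_performance_model(train, valid, max_error=0.10):
    """Try the linear model first; fall back only if its validation error is too high.

    max_error is a hypothetical threshold, not a value from the paper.
    """
    train_x, train_y = train
    valid_x, valid_y = valid
    linear = fit_linear(train_x, train_y)
    error = mean_relative_error(linear, valid_x, valid_y)
    if error <= max_error:
        return "linear", linear, error
    nonlinear = fit_knn(train_x, train_y)
    return "non-linear", nonlinear, mean_relative_error(nonlinear, valid_x, valid_y)


# Usage with made-up measurements (workload size, response time).
xs = list(range(1, 11))
kind, model, error = train_performance_model(
    (xs, [x ** 2 for x in xs]),            # clearly non-linear behaviour
    ([1.5, 5.5, 9.5], [2.25, 30.25, 90.25]),
)
print(kind, round(error, 3))
```

The cheaper linear fit is always attempted, and the costlier step runs only when the validation error exceeds the chosen threshold, which mirrors the ordering the abstract recommends.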
Towards building performance models for data-intensive workloads in public clouds