survey

Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions

Published: 13 September 2019

Abstract

Interest in processing big data has increased rapidly to gain insights that can transform businesses, government policies, and research outcomes. This has led to advancement in communication, programming, and processing technologies, including cloud computing services and technologies such as Hadoop, Spark, and Storm. This trend also affects the needs of analytical applications, which are no longer monolithic but composed of several individual analytical steps running in the form of a workflow. These big data workflows are vastly different in nature from traditional workflows. Researchers are currently facing the challenge of how to orchestrate and manage the execution of such workflows. In this article, we discuss in detail orchestration requirements of these workflows as well as the challenges in achieving these requirements. We also survey current trends and research that supports orchestration of big data workflows and identify open research challenges to guide future developments in this area.


          Reviews

          Dominik Strzalka

          When processing different big data workflows, many new and (so far) unknown patterns and performance requirements become visible. We are forced to search for new processing models and management techniques that can support the design of different aspects of big data workflows: infrastructure (hardware), platforms (software), and efficient methods for scheduling and deploying workflows. These large and serious scientific, technological, organizational, and technical problems lead to at least three important challenges (research questions) that are expanded and developed in this paper: (1) a description of "the different models and fundamental requirements of big data workflow applications"; (2) the new challenges related to cloud and edge data centers and this type of workflow application; and (3) the known "approaches, techniques, tools, and technologies" for developing "a new big data orchestration system." In successive sections, the authors present different research challenges, a survey of existing knowledge and approaches, and possible future development directions for orchestrating big data analysis workflows. They give a detailed overview of many different issues related to workflow orchestration, big data workflow requirements in the cloud, a new big data workflow application classification and research taxonomy, currently used techniques and approaches, different systems and examples with data workflow support, and some still-open challenges in the field. The paper is supported by many valuable literature references, which give a general outline of the state of the art in big data computer systems organization and computing methodologies. This survey is a comprehensive analysis of many important and sometimes secondary issues, which suggests we may be facing an important paradigm shift in computer systems processing.

          • Published in

            ACM Computing Surveys  Volume 52, Issue 5
            September 2020
            791 pages
            ISSN:0360-0300
            EISSN:1557-7341
            DOI:10.1145/3362097
            • Editor: Sartaj Sahni

            Copyright © 2019 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 13 September 2019
            • Accepted: 1 May 2019
            • Revised: 1 March 2019
            • Received: 1 October 2018

            Qualifiers

            • survey
            • Research
            • Refereed
