survey

Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions

Authors:
Mutaz Barika, University of Tasmania, Tasmania, Australia (ORCID: 0000-0002-9146-2459)
Saurabh Garg, University of Tasmania, Tasmania, Australia
Albert Y. Zomaya, University of Sydney, New South Wales, Australia
Lizhe Wang, China University of Geoscience (Wuhan), Wuhan, P. R. China
Aad Van Moorsel, Newcastle University, United Kingdom
Rajiv Ranjan, China University of Geoscience (Wuhan) and Newcastle University, United Kingdom

ACM Computing Surveys, Volume 52, Issue 5, Article No. 95, pp. 1–41. https://doi.org/10.1145/3332301

Published: 13 September 2019

Abstract

Interest in processing big data has grown rapidly, driven by the insights it can yield to transform businesses, government policies, and research outcomes. This has led to advances in communication, programming, and processing technologies, including cloud computing services and technologies such as Hadoop, Spark, and Storm. The trend also affects analytical applications, which are no longer monolithic but composed of several individual analytical steps running as a workflow. These big data workflows differ substantially from traditional workflows, and researchers currently face the challenge of how to orchestrate and manage their execution. In this article, we discuss in detail the orchestration requirements of these workflows and the challenges in meeting them. We also survey current trends and research that support the orchestration of big data workflows and identify open research challenges to guide future developments in this area.

Supplemental Material

barika.zip (994.4 KB): Supplemental movie, appendix, image and software files for Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions

Reviews

Reviewer: Dominik Strzalka

When processing different big data workflows, many new and (so far) unknown patterns and performance requirements become visible. We are forced to search for new processing models and management techniques that can support the design of different aspects of big data workflows: infrastructure (hardware), platforms (software), and efficient methods for scheduling and deploying workflows. These serious scientific, technological, organizational, and technical problems lead to at least three important challenges (research questions) that are expanded and developed in this paper: (1) a description of "the different models and fundamental requirements of big data workflow applications"; (2) the new challenges related to cloud and edge data centers and to this type of workflow application; and (3) the known "approaches, techniques, tools, and technologies" for developing "a new big data orchestration system." In successive sections, the authors present the research challenges, a survey of existing knowledge and approaches, and possible future development directions for orchestrating big data analysis workflows. They give a detailed overview of many issues related to workflow orchestration: big data workflow requirements in the cloud, a new classification of big data workflow applications and a research taxonomy, currently used techniques and approaches, systems and examples with data workflow support, and challenges that remain open in the field. The paper is supported by many valuable literature references, which give a general outline of the state of the art in big data computer systems organization and computing methodologies. This survey is a comprehensive analysis of many important and sometimes secondary issues, which suggests we may be facing an important paradigm shift in computer systems processing.

Published in
ACM Computing Surveys, Volume 52, Issue 5
September 2020
791 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3362097
Editor: Sartaj Sahni, Department of Computer and Information Science and Engineering
Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 September 2019
- Accepted: 1 May 2019
- Revised: 1 March 2019
- Received: 1 October 2018
Author Tags
Big data
approaches
techniques
cloud computing
research taxonomy
workflow orchestration
Qualifiers
- survey
- Research
- Refereed