Abstract
Interest in processing big data has grown rapidly, driven by the prospect of insights that can transform businesses, government policies, and research outcomes. This has led to advances in communication, programming, and processing technologies, including cloud computing services and technologies such as Hadoop, Spark, and Storm. The trend also changes the needs of analytical applications, which are no longer monolithic but are composed of several individual analytical steps that run as a workflow. These big data workflows differ vastly in nature from traditional workflows, and researchers currently face the challenge of how to orchestrate and manage their execution. In this article, we discuss in detail the orchestration requirements of these workflows and the challenges in meeting them. We also survey current trends and research that support the orchestration of big data workflows, and we identify open research challenges to guide future developments in this area.
Supplemental Material
Supplemental movie, appendix, image, and software files are available for download for "Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions".