PROV2R: Practical Provenance Analysis of Unstructured Processes

Published: 18 August 2017

Abstract

Information produced by Internet applications is inherently the result of processes that are executed locally. Think of a web server that uses a CGI script, or a content management system where a post was first edited in a word processor. Given the impact of these processes on the content published online, a consumer of that information may want to understand what those impacts were. For example, knowing where text was copied and pasted from to make a post, or whether the CGI script was updated with the latest security patches, may influence confidence in the published content. Capturing and exposing this information provenance is thus important for establishing trust in online content. Furthermore, providers of Internet applications may wish to have access to the same information for debugging or audit purposes. For processes following a rigid structure (such as databases or workflows), disclosed provenance systems have been developed that efficiently and accurately capture the provenance of the produced data. However, accurately capturing provenance from unstructured processes, for example, user-interactive computing used to produce web content, remains an open problem.
In this article, we address the problem of capturing and exposing provenance from unstructured processes. Our approach, called PROV2R (PROVenance Record and Replay), is composed of two parts: (a) the decoupling of provenance analysis from its capture; and (b) the capture of high-fidelity provenance from unmodified programs. We use techniques originating in the security and reverse engineering communities, namely, record and replay and taint tracking. Taint tracking fundamentally addresses the data provenance problem but is impractical to apply at runtime because of its extremely high overhead. With a number of case studies, we demonstrate that PROV2R enables the use of taint analysis for high-fidelity provenance capture, while keeping the runtime overhead at manageable levels. In addition, we show how captured information can be represented using the W3C PROV provenance model for exposure on the Web.
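The decoupling described in the abstract can be illustrated with a toy sketch. It is not the PROV2R implementation: the event log, the file and buffer names, and the functions `record`, `replay_with_taint`, and `to_prov_n` are all hypothetical, and a real system would record and replay a whole machine rather than a list of high-level events. The sketch only shows the idea of logging cheaply first, then running taint propagation over the log afterwards and emitting PROV-N-style derivation statements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One recorded operation (hypothetical format, for illustration only)."""
    op: str   # "read", "copy", or "write"
    src: str  # source file or buffer name
    dst: str  # destination buffer or file name


def record():
    """Recording phase: append events to a log, with no analysis attached."""
    return [
        Event("read", "notes.txt", "buf_a"),
        Event("read", "draft.txt", "buf_b"),
        Event("copy", "buf_a", "buf_b"),      # e.g. a copy-and-paste
        Event("write", "buf_b", "post.html"),
    ]


def replay_with_taint(log):
    """Replay phase: walk the log and propagate taint labels.

    Each buffer carries the set of input files it depends on. When a
    buffer is written to a file, every label becomes a derivation.
    """
    taint = {}           # buffer name -> set of originating input files
    derivations = set()  # (output file, input file) pairs
    for ev in log:
        if ev.op == "read":
            taint.setdefault(ev.dst, set()).add(ev.src)
        elif ev.op == "copy":
            taint.setdefault(ev.dst, set()).update(taint.get(ev.src, set()))
        elif ev.op == "write":
            for origin in taint.get(ev.src, set()):
                derivations.add((ev.dst, origin))
    return derivations


def to_prov_n(derivations):
    """Render derivations as PROV-N-style wasDerivedFrom statements."""
    return [f"wasDerivedFrom(ex:{out}, ex:{src})"
            for out, src in sorted(derivations)]


if __name__ == "__main__":
    for statement in to_prov_n(replay_with_taint(record())):
        print(statement)
```

Running the sketch prints two statements, one deriving `post.html` from `draft.txt` and one from `notes.txt`. The expensive step (taint propagation) runs only over the stored log, so it can be repeated or refined later without affecting the original execution.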

Cited By

  • (2019) Mal-Flux. Digital Investigation: The International Journal of Digital Forensics & Incident Response 28:C (83-95). DOI: 10.1016/j.diin.2019.01.004. Online publication date: 1-Mar-2019
  • (2018) Provenance of Dynamic Adaptations in User-Steered Dataflows. Provenance and Annotation of Data and Processes (16-29). DOI: 10.1007/978-3-319-98379-0_2. Online publication date: 6-Sep-2018

      Published In

      ACM Transactions on Internet Technology  Volume 17, Issue 4
      Special Issue on Provenance of Online Data and Regular Papers
      November 2017
      165 pages
      ISSN:1533-5399
      EISSN:1557-6051
      DOI:10.1145/3133307
      Editor: Munindar P. Singh
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 August 2017
      Accepted: 01 March 2017
      Revised: 01 February 2017
      Received: 01 August 2016
      Published in TOIT Volume 17, Issue 4

      Author Tags

      1. Data provenance
      2. PANDA
      3. W3C PROV
      4. introspection
      5. taint analysis

      Qualifiers

      • Research-article
      • Research
      • Refereed
