research-article

LAMP: data provenance for graph based machine learning algorithms through derivative computation

Authors:
Shiqing Ma

Purdue University, USA

Purdue University, USA
View Profile

,
Yousra Aafer

Purdue University, USA

Purdue University, USA
View Profile

,
Zhaogui Xu

Nanjing University, China

Nanjing University, China
View Profile

,
Wen-Chuan Lee

Purdue University, USA

Purdue University, USA
View Profile

,
Juan Zhai

Nanjing University, China

Nanjing University, China
View Profile

,
Yingqi Liu

Purdue University, USA

Purdue University, USA
View Profile

,
Xiangyu Zhang

Purdue University, USA

Purdue University, USA
View Profile

ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software EngineeringAugust 2017Pages 786–797https://doi.org/10.1145/3106237.3106291

Published:21 August 2017Publication History

ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering

Pages 786–797

ABSTRACT

Data provenance tracking determines the set of inputs related to a given output. It enables quality control and problem diagnosis in data engineering. Most existing techniques work by tracking program dependencies. They cannot quantitatively assess the importance of related inputs, which is critical to machine learning algorithms, in which an output tends to depend on a huge set of inputs while only some of them are of importance. In this paper, we propose LAMP, a provenance computation system for machine learning algorithms. Inspired by automatic differentiation (AD), LAMP quantifies the importance of an input for an output by computing the partial derivative. LAMP separates the original data processing and the more expensive derivative computation to different processes to achieve cost-effectiveness. In addition, it allows quantifying importance for inputs related to discrete behavior, such as control flow selection. The evaluation on a set of real world programs and data sets illustrates that LAMP produces more precise and succinct provenance than program dependence based techniques, with much less overhead. Our case studies demonstrate the potential of LAMP in problem diagnosis in data engineering.

References

Anti Spam. https://github.com/dinever/antispam.Google Scholar
Bayes Spam. https://github.com/SunnyMarkLiu/NaiveBayesSpamFilter.Google Scholar
Bayesian classifier. https://github.com/browning/comment-troll-classifier.Google Scholar
Bayesian Classifier Bug. http://stackoverflow.com/questions/19349567/ error-in-naive-bayes-classifier.Google Scholar
Bayesian Spam Filter. https://github.com/aashishsatya/Bayesian-Spam-Filter.Google Scholar
Bugs in ByeasianModel. https://github.com/pgmpy/pgmpy/issues/607.Google Scholar
Dirichlet Concentration : Error 73. https://github.com/bayespy/bayespy/issues/73.Google Scholar
Graph Tool. https://graph-tool.skewed.de.Google Scholar
KDD Cup 2012 Track1. https://www.kaggle.com/c/kddcup2012-track1.Google Scholar
LAMP. https://github.com/PythonLAMP/LAMP.Google Scholar
National Cancer Database - American College of Surgeons. https://www.facs.org/ quality%20programs/cancer/ncdb.Google Scholar
NetworkX. https://networkx.github.io.Google Scholar
PageRank toy example fails to converge. http://stackoverflow.com/questions/ 30017913/pagerank-toy-example-fails-to-converge.Google Scholar
A Plan For Spam. http://www.paulgraham.com/spam.html.Google Scholar
Python Memory Profiler. https://pypi.python.org/pypi/memory_profiler.Google Scholar
Scrapy. https://scrapy.org/.Google Scholar
SMS Spam Filter. https://github.com/revantkumar/SMS-Spam-Classifier.Google Scholar
Spam Filter. https://github.com/jameshwang2013/SpamFilter.Google Scholar
Spam Filter. https://github.com/SunnyMarkLiu/NaiveBayesSpamFilter.Google Scholar
Tencent 2012 KDD Cup. http://www.kddcup2012.org/.Google Scholar
TextRank using NetworkX. http://stackoverflow.com/questions/9247538/ textrank-complementing-pagerank-for-sentence-extraction-using-networkx/ 9247791.Google Scholar
Wikipedia Data Set. https://dumps.wikimedia.org/enwiki/20160601/.Google Scholar
Joel Andersson, Johan Åkesson, and Moritz Diehl. 2012. CasADi: A symbolic package for automatic differentiation and optimal control. In Recent Advances in Algorithmic Differentiation. Springer, 297–307.Google Scholar
Christian Bischof, Lucas Roh, and Andrew Mauer-Oats. 1997. ADIC: an extensible automatic differentiation tool for ANSI-C. Urbana 51 (1997), 61802.Google Scholar
Soumen Chakrabarti. 2007. Dynamic personalized pagerank in entity-relation graphs. In Proceedings of the 16th international conference on World Wide Web. ACM, 571–580. Google ScholarDigital Library
Hung-Hsuan Chen and C Lee Giles. 2013. ASCOS: an asymmetric network structure context similarity measure. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM, 442–449. Google ScholarDigital Library
Tianqi Chen, Linpeng Tang, Qin Liu, Diyi Yang, Saining Xie, Xuezhi Cao, Chunyang Wu, Enpeng Yao, Zhengyang Liu, Zhansheng Jiang, and others. 2012. Combining factorization model and additive forest for collaborative followee recommendation. KDD CUP (2012).Google Scholar
Juan José Conti and Alejandro Russo. 2010. A taint mode for python via a library. In Nordic Conference on Secure IT Systems. Springer, 210–222.Google Scholar
Sebastian Elbaum and David S. Rosenblum. 2014. Known Unknowns: Testing in the Presence of Uncertainty. In Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA, 833–836. Google ScholarDigital Library
David A Fournier, Hans J Skaug, Johnoel Ancheta, James Ianelli, Arni Magnusson, Mark N Maunder, Anders Nielsen, and John Sibert. 2012. AD Model Builder: using automatic differentiation for statistical inference of highly parameterized complex nonlinear models. Optimization Methods and Software 27, 2 (2012), 233–249.Google ScholarCross Ref
Edward I George and Robert E McCulloch. 1993. Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88, 423 (1993), 881–889.Google ScholarCross Ref
Olivier Gevaert, Frank De Smet, Dirk Timmerman, Yves Moreau, and Bart De Moor. 2006. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics 22, 14 (2006), e184–e190. Google ScholarDigital Library
Saptarshi Ghosh, Bimal Viswanath, Farshad Kooti, Naveen Kumar Sharma, Gautam Korlam, Fabricio Benevenuto, Niloy Ganguly, and Krishna Phani Gummadi. 2012. Understanding and Combating Link Farming in the Twitter Social Network. In Proceedings of the 21st International Conference on World Wide Web (WWW ’12). ACM, New York, NY, USA, 61–70. Google ScholarDigital Library
2187846Google Scholar
David Francis Gleich and Michael Saunders. 2009. Models and algorithms for pagerank sensitivity. __, Sept (2009).Google Scholar
G. H. Golub and C. Greif. 2006. An Arnoldi-type algorithm for computing page rank. BIT Numerical Mathematics 46, 4 (2006), 759–771.Google ScholarDigital Library
Andreas Griewank. 1991. Achieving Logarithmic Growth Of Temporal And Spatial Complexity In Reverse Automatic Differentiation. (1991).Google Scholar
Andreas Griewank. 2000. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. Google ScholarDigital Library
Andreas Griewank and others. 1989. On automatic differentiation. Mathematical Programming: recent developments and applications 6, 6 (1989), 83–107.Google Scholar
Andreas Griewank, David Juedes, and Jean Utke. 1996. Algorithm 755: ADOL-C: a package for the automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical Software (TOMS) 22, 2 (1996), 131–167. Google ScholarDigital Library
Vince Grolmusz. 2015. A note on the pagerank of undirected graphs. Inform. Process. Lett. 115, 6 (2015), 633–634. Google ScholarDigital Library
GrouhLens. MovieLens Datasets. http://grouplens.org/datasets/movielens/.Google Scholar
Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Deep Tetali, Tyson Condie, Todd Millstein, and Miryung Kim. 2016. Bigdebug: Debugging primitives for interactive big data processing in spark. In Proceedings of the 38th International Conference on Software Engineering. ACM, 784–795. Google ScholarDigital Library
Robert Ikeda, Hyunjung Park, and Jennifer Widom. 2011. Provenance for Generalized Map and Reduce Workflows. In CIDR.Google Scholar
Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, and Tyson Condie. 2015. Titian: Data Provenance Support in Spark. Proc. VLDB Endow. 9, 3 (Nov. 2015), 216–227. Google ScholarDigital Library
Glen Jeh and Jennifer Widom. 2002. SimRank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 538–543. Google ScholarDigital Library
Meng Jiang, Peng Cui, Rui Liu, Qiang Yang, Fei Wang, Wenwu Zhu, and Shiqiang Yang. 2012. Social contextual recommendation. In Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 45–54. Google ScholarDigital Library
Charles E Kahn, Linda M Roberts, Katherine A Shaffer, and Peter Haddawy. 1997. Construction of a Bayesian network for mammographic diagnosis of breast cancer. Computers in biology and medicine 27, 1 (1997), 19–29.Google Scholar
Sepandar Kamvar, Taher Haveliwala, and Gene Golub. 2004. Adaptive methods for the computation of PageRank. Linear Algebra Appl. (2004).Google Scholar
Gershon Kedem. 1980. Automatic differentiation of computer programs. ACM Transactions on Mathematical Software (TOMS) 6, 2 (1980), 150–165. Google ScholarDigital Library
Gyanendra Kumar, Neelam Duhan, and AK Sharma. 2011. Page ranking based on number of visits of links of Web page. In Computer and Communication Technology (ICCCT), 2011 2nd International Conference on. IEEE, 11–14.Google ScholarCross Ref
S. L. Lauritzen and D. J. Spiegelhalter. 1990. Readings in Uncertain Reasoning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, Chapter Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems, 415–448. http://dl.acm.org/citation.cfm?id=84628.85343 Google ScholarDigital Library
Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. (June 2014).Google Scholar
Dionysios Logothetis, Soumyarupa De, and Kenneth Yocum. 2013. Scalable lineage capture for debugging disc analytics. In Proceedings of the 4th annual Symposium on Cloud Computing. ACM, 17. Google ScholarDigital Library
Lu Lu, Xuanhua Shi, Yongluan Zhou, Xiong Zhang, Hai Jin, Cheng Pei, Ligang He, and Yuanzhen Geng. 2016. Lifetime-based Memory Management for Distributed Data Processing Systems. Proc. VLDB Endow. 9, 12 (Aug. 2016), 936–947. Google ScholarDigital Library
Yi Lu, James Cheng, Da Yan, and Huanhuan Wu. 2014. Large-scale Distributed Graph Computing Systems: An Experimental Evaluation. Proc. VLDB Endow. 8, 3 (Nov. 2014), 281–292. Google ScholarDigital Library
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. Association for Computational Linguistics.Google Scholar
Christian Murphy. 2010. Metamorphic testing techniques to detect defects in applications without test oracles. Columbia University.Google Scholar
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: A Not-so-foreign Language for Data Processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD ’08). ACM, New York, NY, USA, 1099–1110. Google ScholarDigital Library
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: bringing order to the web. (1999).Google Scholar
Louis B Rall. 1981. Automatic differentiation: Techniques and applications. (1981).Google Scholar
Florent Saudel and Jonathan Salwan. 2015. Triton: A Dynamic Symbolic Execution Framework. In Symposium sur la sécurité des technologies de l’information et des communications, SSTIC, France, Rennes, June 3-5 2015. SSTIC, 31–54.Google Scholar
M Berkan Sesen, Ann E Nicholson, Rene Banares-Alcantara, Timor Kadir, and Michael Brady. 2013. Bayesian networks for clinical decision support in lung cancer care. PloS one 8, 12 (2013), e82349.Google ScholarCross Ref
Neelam Tyagi and Simple Sharma. 2012. Weighted Page Rank algorithm based on number of visits of Links of web page. International Journal of Soft Computing and Engineering (IJSCE) ISSN (2012), 2231–2307.Google Scholar
Xiaoyuan Xie, Joshua W. K. Ho, Christian Murphy, Gail Kaiser, Baowen Xu, and Tsong Yueh Chen. 2011. Testing and Validating Machine Learning Classifiers by Metamorphic Testing. J. Syst. Softw. 84, 4 (April 2011), 544–558. Google ScholarDigital Library
Wenpu Xing and Ali Ghorbani. 2004. Weighted pagerank algorithm. In Communication Networks and Services Research, 2004. Proceedings. Second Annual Conference on. IEEE, 305–314. Google ScholarDigital Library
Zhaogui Xu, Peng Liu, Xiangyu Zhang, and Baowen Xu. 2016. Python Predictive Analysis for Bug Detection. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 121–132. Google ScholarDigital Library
Fan Yang, Jinfeng Li, and James Cheng. 2016. Husky: Towards a More Efficient and Expressive Distributed Computing Framework. Proc. VLDB Endow. 9, 5 (Jan. 2016), 420–431. Google ScholarDigital Library
Jonathan S Yedidia, William T Freeman, Yair Weiss, and others. 2000. Generalized belief propagation. In NIPS, Vol. 13. 689–695. Google ScholarDigital Library
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. 2012.Google Scholar
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’12). USENIX Association, Berkeley, CA, USA, 2–2. http://dl.acm.org/citation.cfm?id=2228298.2228301 Google ScholarDigital Library
Hui Zhang, Ashish Goel, Ramesh Govindan, Kahn Mason, and Benjamin Van Roy. 2004. Making Eigenvector-Based Reputation Systems Robust to Collusion.. In WAW (Lecture Notes in Computer Science), Stefano Leonardi (Ed.), Vol. 3243.Google Scholar
Springer, 92–104.Google Scholar

Index Terms

LAMP: data provenance for graph based machine learning algorithms through derivative computation
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Learning in probabilistic graphical models
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Recommendations

Data Provenance Based System for Classification and Linear Regression in Distributed Machine Learning
Structured Object-Oriented Formal Language and Method
Abstract
Nowadays, data provenance is widely used to increase the accuracy of machine learning models. However, facing the difficulties in information heredity, these models produce data association. Most of the studies in the field of data provenance are ...
Read More
A Service Selection Algorithm Based on the Trust of Data Provenance
DCABES '13: Proceedings of the 2013 12th International Symposium on Distributed Computing and Applications to Business, Engineering & Science

The existing service selection methods consideration trust, basically consider the trust of QoS. In fact, the trust of service selection is not only about that of QoS, but the trust of the output data of Web service. However, the trust of the output ...
Read More
Machine Learning: The State of the Art

The two fundamental problems in machine learning (ML) are statistical analysis and algorithm design. The former tells us the principles of the mathematical models that we establish from the observation data. The latter defines the conditions on which ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering
August 2017
1073 pages
ISBN:9781450351058
DOI:10.1145/3106237
General Chairs:
Eric Bodden
Paderborn University, Germany / Fraunhofer IEM, Germany
,
Wilhelm Schäfer
Paderborn University, Germany
,
Program Chairs:
Arie van Deursen
Delft University of Technology, Netherlands
,
Andrea Zisman
Open University, UK
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 August 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Data Provenance
Debugging
Machine Learning
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate112of543submissions,21%
Upcoming Conference
FSE '24

Sponsor:

sigsoft

32nd ACM International Conference on the Foundations of Software Engineering

July 15 - 19, 2024

Ipojuca (Pernambuco) , Brazil
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 15
  Total Citations
  View Citations
- 770
  Total Downloads
- Downloads (Last 12 months)38
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

LAMP: data provenance for graph based machine learning algorithms through derivative computation

ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering

ABSTRACT

References

Cited By

Index Terms

Recommendations

Data Provenance Based System for Classification and Linear Regression in Distributed Machine Learning

A Service Selection Algorithm Based on the Trust of Data Provenance

Machine Learning: The State of the Art