ABSTRACT
Literate programming tools are used by millions of programmers today, and are intended to facilitate presenting data analyses in the form of a narrative. We interviewed 21 data scientists to study how they code in a literate programming environment and how they keep track of the variants they explore. For participants who tried to keep a detailed history of their experimentation, both informal and formal versioning attempts led to problems, such as reduced notebook readability. During iteration, participants actively curated their notebooks into narratives, although primarily through cell structure rather than markdown explanations. Next, we surveyed 45 data scientists, asking them to envision how they might use their past history in a future version control system. Based on these results, we offer design guidance for future literate programming tools, such as history search keyed to how programmers recall their explorations, using contextual details including output images and parameter values.
The Story in the Notebook: Exploratory Data Science using a Literate Programming Tool