ABSTRACT
Despite the globalization of software development, relevant project documentation, such as requirements and design documents, is often still missing, incomplete, or outdated. However, parts of that documentation can be found outside the project, fragmented across hundreds of textual web documents such as blog posts, email messages, and forum posts, as well as multimedia documents such as screencasts and podcasts. Since dissecting and filtering multimedia information based on its relevance to a given project is an inherently difficult task, an automated approach for mining this crowd-based documentation is needed. In this paper, we focus on mining the speech part of YouTube screencasts, since this part typically contains the rationale and insights of a screencast. We introduce a methodology that transcribes the speech and analyzes the resulting text using various Information Extraction (IE) techniques, and we present a case study to illustrate the applicability of our mining methodology. In this case study, we extract use case scenarios from WordPress tutorial videos and show how their content can supplement existing documentation. We then evaluate how well existing rankings of video content are able to pinpoint the most relevant videos for a given scenario.
Index Terms
- On mining crowd-based speech documentation