ABSTRACT
Software developers are increasingly required to write big data code, yet they find big data software development challenging. To help these developers, it is necessary to understand which big data topics they are interested in and how difficult it is to find answers to questions on these topics. In this work, we conduct a large-scale study on Stack Overflow to understand the interests and difficulties of big data developers. To conduct the study, we develop a set of big data tags to extract big data posts from Stack Overflow; use topic modeling to group these posts into big data topics; group similar topics into categories to construct a topic hierarchy; analyze the popularity and difficulty of topics and their correlations; and discuss the implications of our findings for the practice, research, and education of big data software development, and investigate how they agree with the findings of previous work.
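The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the study's actual method or data: the tag set `BIG_DATA_TAGS`, the toy `posts`, the precomputed `topic` labels (which in the study would come from topic modeling), the use of average views as a popularity proxy, the share of questions without an accepted answer as a difficulty proxy, and Spearman rank correlation as the correlation measure.

```python
from collections import defaultdict

# Hypothetical tag set and toy posts; NOT data from the study.
BIG_DATA_TAGS = {"hadoop", "apache-spark", "mapreduce"}

posts = [
    {"id": 1, "tags": {"hadoop", "java"}, "views": 900, "accepted": True, "topic": "storage"},
    {"id": 2, "tags": {"apache-spark"}, "views": 1500, "accepted": False, "topic": "processing"},
    {"id": 3, "tags": {"python"}, "views": 5000, "accepted": True, "topic": "other"},
    {"id": 4, "tags": {"mapreduce", "hadoop"}, "views": 300, "accepted": False, "topic": "storage"},
    {"id": 5, "tags": {"apache-spark", "scala"}, "views": 2500, "accepted": True, "topic": "processing"},
    {"id": 6, "tags": {"hadoop"}, "views": 200, "accepted": False, "topic": "configuration"},
]


def extract_big_data_posts(all_posts, tags):
    """Keep posts that carry at least one tag from the given tag set."""
    return [p for p in all_posts if p["tags"] & tags]


def topic_metrics(selected):
    """Map each topic to (average views, share of posts without an accepted answer)."""
    grouped = defaultdict(list)
    for post in selected:
        grouped[post["topic"]].append(post)
    metrics = {}
    for topic, group in grouped.items():
        avg_views = sum(p["views"] for p in group) / len(group)
        unanswered = sum(1 for p in group if not p["accepted"]) / len(group)
        metrics[topic] = (avg_views, unanswered)
    return metrics


def ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (std_x * std_y)


big_data_posts = extract_big_data_posts(posts, BIG_DATA_TAGS)
metrics = topic_metrics(big_data_posts)
popularity = [m[0] for m in metrics.values()]
difficulty = [m[1] for m in metrics.values()]
print(metrics)
print(spearman(popularity, difficulty))
```

A negative coefficient on such data would suggest that more popular topics tend to be less difficult under these proxies; with only three toy topics the value carries no statistical weight.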
Index Terms
- Going big: a large-scale study on what big data developers ask