DOI: 10.1145/3338906.3338939
research-article

Going big: a large-scale study on what big data developers ask

Published: 12 August 2019

ABSTRACT

Software developers are increasingly required to write big data code. However, they find big data software development challenging. To help these developers, it is necessary to understand the big data topics that they are interested in and the difficulty of finding answers to questions on these topics. In this work, we conduct a large-scale study on Stack Overflow to understand the interests and difficulties of big data developers. To conduct the study, we develop a set of big data tags to extract big data posts from Stack Overflow; use topic modeling to group these posts into big data topics; group similar topics into categories to construct a topic hierarchy; analyze the popularity and difficulty of topics and their correlations; and discuss the implications of our findings for the practice, research, and education of big data software development and investigate how they agree with the findings of previous work.
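
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the tag set, the toy posts, the pre-assigned `topic` field (standing in for the output of topic modeling), the use of average views as a popularity measure, the share of questions without an accepted answer as a difficulty measure, and the simple Spearman rank correlation (which assumes no tied values).

```python
from collections import defaultdict

# Hypothetical big data tag set; the study's actual tag set is not given here.
BIG_DATA_TAGS = {"hadoop", "apache-spark", "mapreduce"}

# Toy posts. The "topic" field stands in for a label that topic modeling
# would assign; it is hard-coded here purely for illustration.
posts = [
    {"id": 1, "tags": {"hadoop", "java"}, "topic": "configuration", "views": 1000, "accepted": True},
    {"id": 2, "tags": {"apache-spark"}, "topic": "configuration", "views": 3000, "accepted": False},
    {"id": 3, "tags": {"python"}, "topic": "other", "views": 500, "accepted": True},
    {"id": 4, "tags": {"mapreduce"}, "topic": "debugging", "views": 400, "accepted": False},
    {"id": 5, "tags": {"apache-spark", "scala"}, "topic": "data-loading", "views": 5000, "accepted": True},
]


def extract_big_data_posts(all_posts, tags):
    """Keep only posts that carry at least one tag from the tag set."""
    return [p for p in all_posts if p["tags"] & tags]


def topic_metrics(topic_posts):
    """Return {topic: (average views, fraction without an accepted answer)}."""
    grouped = defaultdict(list)
    for p in topic_posts:
        grouped[p["topic"]].append(p)
    metrics = {}
    for topic, group in grouped.items():
        avg_views = sum(p["views"] for p in group) / len(group)
        unaccepted = sum(1 for p in group if not p["accepted"]) / len(group)
        metrics[topic] = (avg_views, unaccepted)
    return metrics


def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free samples with n >= 2."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


big_data_posts = extract_big_data_posts(posts, BIG_DATA_TAGS)
metrics = topic_metrics(big_data_posts)
popularity = [m[0] for m in metrics.values()]
difficulty = [m[1] for m in metrics.values()]
rho = spearman(popularity, difficulty)
print(metrics)
print(rho)
```

With real data, the `topic` labels would come from a topic model rather than being hard-coded, and a statistics library would handle tied ranks and significance testing.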

Published in

ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
August 2019
1264 pages
ISBN: 9781450355728
DOI: 10.1145/3338906

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 112 of 543 submissions, 21%
