ABSTRACT
In this paper, we present the results of an exploratory study that examined the problem of automating content analysis of student online discussion transcripts. We looked at the problem of coding discussion transcripts for the levels of cognitive presence, one of the three main constructs in the Community of Inquiry (CoI) model of distance education. Using Coh-Metrix and LIWC features, together with a set of custom features developed to capture discussion context, we developed a random forest classification system that achieved 70.3% classification accuracy and a Cohen's kappa of 0.63, which is significantly higher than the values reported in previous studies. Besides the improvement in classification accuracy, the developed system is also less prone to overfitting, as it uses only 205 classification features, around 100 times fewer than similar systems based on bag-of-words features. We also provide an overview of the classification features most indicative of the different phases of cognitive presence, which gives additional insight into the nature of the cognitive presence learning cycle. Overall, our results show the great potential of the proposed approach, with the added benefit of providing further characterization of the cognitive presence coding scheme.
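The two figures reported above, classification accuracy and Cohen's kappa, measure different things: accuracy is the raw share of messages coded correctly, while kappa discounts the agreement expected by chance given the label distributions. The following is a minimal sketch of both computations in plain Python, not the authors' evaluation code. The function names and the toy labels are assumptions made for illustration; the phase names follow the CoI literature and the label sequences are invented.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of items where the predicted label equals the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement (accuracy) and p_e is the agreement
    expected by chance from the marginal label frequencies.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over labels of P(true = c) * P(pred = c).
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both sequences use a single identical label.
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Invented labels for illustration only; not data from the study.
    human = ["triggering", "exploration", "exploration", "integration",
             "integration", "resolution", "other", "exploration"]
    model = ["triggering", "exploration", "integration", "integration",
             "integration", "resolution", "other", "triggering"]
    print(f"accuracy = {accuracy(human, model):.3f}")
    print(f"kappa    = {cohen_kappa(human, model):.3f}")
```

Because kappa subtracts chance agreement, it is typically lower than accuracy on the same predictions, which is consistent with the gap between the 70.3% and 0.63 values reported here.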