Measuring Predictive Performance of User Models: The Details Matter

ABSTRACT
Evaluation of user modeling techniques is often based on the predictive accuracy of models, which is quantified using performance metrics. We show that the choice of performance metric is important and that even the details of metric computation matter. We analyze in detail two commonly used metrics (AUC, RMSE) in the context of student modeling. We discuss different approaches to their computation (global, averaging across skills, averaging across students) and show that these approaches have different properties. An analysis of recent research papers shows that the reported descriptions of metric computation are often insufficient. To make research conclusions valid and reproducible, researchers need to pay more attention to the choice of performance metrics, and they need to describe the details of their computation more explicitly.
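The distinction between "global" and "averaged" metric computation mentioned in the abstract can be made concrete with a small sketch. The following is an illustrative example (not code from the paper; the data and names are assumptions): it computes AUC once over all pooled predictions and once per student with the results averaged, on a toy dataset where the two approaches disagree.

```python
def roc_auc(y_true, y_score):
    """AUC via the Mann-Whitney U statistic (ties counted as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions grouped by student:
# (correct answer?, predicted probability of a correct answer)
by_student = {
    "s1": [(1, 0.9), (0, 0.6)],  # scores for this student sit high
    "s2": [(1, 0.3), (0, 0.1)],  # same ordering, but shifted lower
}

# Global AUC: pool all observations and rank across students.
all_obs = [obs for obs_list in by_student.values() for obs in obs_list]
auc_global = roc_auc([y for y, _ in all_obs], [s for _, s in all_obs])

# Averaged AUC: compute AUC per student, then take the mean.
per_student = [roc_auc([y for y, _ in obs], [s for _, s in obs])
               for obs in by_student.values()]
auc_averaged = sum(per_student) / len(per_student)
```

Here each student's predictions are perfectly ordered (per-student AUC of 1.0), yet pooling mixes the two students' score ranges and the global AUC falls to 0.75 — the kind of discrepancy that makes underspecified metric computation a reproducibility problem.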