Multimodal Early Prediction of Student Problem Outcomes from Vision and Interaction Logs in an Intelligent Math Tutor

Margrit Betke1, Amelia Harper1
1Boston University, Boston, MA 02215, USA
DOI: https://doi.org/10.71448/bcds2342-3
Published: 30/08/2023
Cite this article as: Margrit Betke, Amelia Harper. Multimodal Early Prediction of Student Problem Outcomes from Vision and Interaction Logs in an Intelligent Math Tutor. Bulletin of Computer and Data Sciences, Volume 4 Issue 2. Page: 30-42.

Abstract

Vision-based engagement detection in intelligent tutoring systems has shown that facial expressions and head pose are informative for recognizing student affect. However, current approaches are mostly unimodal and often perform classification after the entire problem-solving attempt is completed, which limits the ability of the tutor to react in time. In this paper we propose a multimodal early-prediction framework that fuses visual affect indicators with interaction-level tutor logs (mouse and keyboard activity, hint requests, response latency, and problem difficulty metadata) to forecast problem outcomes such as correct, give up, and hint-dependent while only a fraction of the attempt has unfolded. Using a MathSpring-like scenario, we build synchronized sequences of video-derived features and platform events, and train temporal models to (i) quantify the incremental value of each modality, and (ii) determine how early reliable prediction is possible. Our experiments show that (1) adding simple interaction features to vision improves F1 for the give-up class by a large margin, (2) multimodal fusion maintains robust performance even when only the first 30% of the interaction is available, and (3) the relative importance of modalities changes over time: vision dominates in the first seconds, while interaction features become stronger as students start requesting hints or pausing. These findings provide an empirical basis for real-time, vision-triggered interventions in math tutoring environments and point to a practical recipe for classrooms where bandwidth and privacy constraints make full video processing difficult.

Keywords: intelligent tutoring systems, multimodal learning analytics, affect detection, early prediction, student engagement, math education
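To make the early-prediction setup concrete, the following is a minimal illustrative sketch (not the authors' implementation) of the core idea described in the abstract: truncate time-aligned vision and interaction feature sequences to their first 30%, pool each modality, concatenate them (a simple late-fusion variant), and fit a classifier over the three outcome classes. All feature dimensions and data here are synthetic placeholders.

```python
# Illustrative sketch (assumed, not the authors' code): early outcome
# prediction from the first fraction of a multimodal attempt sequence.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def pooled_features(vision_seq, interaction_seq, fraction=0.3):
    """Keep only the earliest `fraction` of each aligned sequence,
    then mean-pool per modality and concatenate the results."""
    cut = max(1, int(len(vision_seq) * fraction))
    v = vision_seq[:cut].mean(axis=0)        # e.g. head pose / expression stats
    a = interaction_seq[:cut].mean(axis=0)   # e.g. hint counts, response latency
    return np.concatenate([v, a])

# Synthetic stand-in data: 200 attempts, 50 synchronized time steps,
# 4 vision features and 3 interaction features per step.
X = np.stack([
    pooled_features(rng.normal(size=(50, 4)), rng.normal(size=(50, 3)))
    for _ in range(200)
])
y = rng.integers(0, 3, size=200)  # outcomes: correct / give up / hint-dependent

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape)  # → (200, 7)
```

Repeating the fit at several truncation fractions (10%, 30%, 50%, ...) yields the kind of earliness-versus-accuracy curve the paper reports; the paper's temporal models would replace the mean-pooling and logistic regression used here for brevity.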



Copyright and license

Copyright © 2023 Margrit Betke, Amelia Harper. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
