Vision-based engagement detection in intelligent tutoring systems has shown that facial expressions and head pose are informative for recognizing student affect. However, current approaches are mostly unimodal and often classify only after the entire problem-solving attempt has completed, which limits the tutor's ability to react in time. In this paper we propose a multimodal early-prediction framework that fuses visual affect indicators with interaction-level tutor logs (mouse and keyboard activity, hint requests, response latency, and problem difficulty metadata) to forecast problem outcomes such as "correct," "give up," and "hint-dependent" while only a fraction of the attempt has unfolded. Using a MathSpring-like scenario, we build synchronized sequences of video-derived features and platform events, and train temporal models to (i) quantify the incremental value of each modality and (ii) determine how early reliable prediction is possible. Our experiments show that (1) adding simple interaction features to the visual features substantially improves F1 for the give-up class, (2) multimodal fusion maintains robust performance even when only the first 30% of the interaction is available, and (3) the relative importance of the modalities shifts over time: visual features dominate in the first seconds, while interaction features become more predictive as students start requesting hints or pausing. These findings provide an empirical basis for real-time, vision-triggered interventions in math tutoring environments and point to a practical recipe for classrooms where bandwidth and privacy constraints make full video processing difficult.
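To make the setup concrete, the sketch below illustrates one plausible reading of the pipeline: per-timestep vision and interaction features are concatenated (early fusion), sequences are truncated to the earliest fraction of the attempt (e.g., the first 30%), and a recurrent classifier predicts the outcome class. All names, dimensions, the GRU architecture, and the synthetic inputs are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of early-fusion, early-prediction outcome classification.
# Assumptions (not from the paper): feature dimensions, a single-layer GRU,
# and random stand-in data; real inputs would be synchronized video-derived
# features and MathSpring-style platform events.
import torch
import torch.nn as nn


class EarlyFusionGRU(nn.Module):
    """Concatenate per-timestep vision and interaction features, then
    classify the (possibly truncated) sequence into an outcome class."""

    def __init__(self, vision_dim=16, interact_dim=8, hidden=64, n_classes=3):
        super().__init__()
        self.gru = nn.GRU(vision_dim + interact_dim, hidden, batch_first=True)
        # Three outcome classes: correct / give up / hint-dependent.
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, vision_seq, interact_seq):
        x = torch.cat([vision_seq, interact_seq], dim=-1)  # (B, T, Dv + Di)
        _, h = self.gru(x)                                 # final hidden state
        return self.head(h[-1])                            # (B, n_classes)


def truncate_to_fraction(seq, fraction=0.3):
    """Keep only the earliest `fraction` of timesteps, mirroring the
    early-prediction evaluation (e.g., first 30% of the attempt)."""
    t = max(1, int(seq.size(1) * fraction))
    return seq[:, :t]


if __name__ == "__main__":
    # Synthetic stand-in data: 4 attempts, 50 synchronized timesteps each.
    vision = torch.randn(4, 50, 16)
    interact = torch.randn(4, 50, 8)
    model = EarlyFusionGRU()
    logits = model(truncate_to_fraction(vision), truncate_to_fraction(interact))
    print(logits.shape)  # torch.Size([4, 3])
```

Sweeping `fraction` from small to large values is one straightforward way to measure how early reliable prediction becomes possible, and ablating either input stream isolates each modality's incremental value.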