Human gaze has been widely explored as a supervisory signal for fine-grained recognition and zero-shot learning (ZSL). However, existing models merely align machine attention with human gaze via auxiliary losses, leaving the causal role of gaze in human decision-making unexplored. The result is models that correlate with gaze but do not reason through it. We propose Causal-Gaze Zero-Shot Learning (CG-ZSL), the first causal framework that integrates human gaze as an explicit mediator in the visual-semantic reasoning pathway. We formalize the ZSL pipeline as a Structural Causal Model (SCM) in which human gaze acts as an intermediate variable linking visual attributes to class-level semantic embeddings. From this formulation we derive counterfactual attention invariance, which disentangles causal attribute regions from dataset-specific biases. We further introduce a Causal Attention Intervention (CAI) module and a Gaze-Mediated Semantic Alignment (GMSA) mechanism that together enforce bidirectional causal consistency between gaze, attributes, and predictions. Experiments on the CUB, SUN, and AWA2 datasets show significant improvements over state-of-the-art ZSL and GZSL models, especially under distribution shift and domain generalization. Unlike prior attention-alignment systems, CG-ZSL produces human-interpretable, causally grounded explanations and maintains performance under counterfactual perturbations.
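The gaze-as-mediator SCM can be illustrated with a toy sketch. All mechanisms and variable names below are our own illustrative assumptions, not the paper's implementation: attributes A cause gaze G, gaze mediates the prediction Y, and an intervention do(G = g) severs the attribute-to-gaze link, which is the kind of counterfactual perturbation the invariance property concerns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear mechanisms (illustrative assumptions, not the paper's model).
def f_gaze(attributes, noise):
    # Gaze concentrates on attribute regions.
    return 0.8 * attributes + noise

def f_predict(gaze):
    # The class score depends only on gaze-mediated evidence.
    return 1.5 * gaze

def sample(n, do_gaze=None):
    """Sample from the SCM; `do_gaze` implements the intervention do(G = g)."""
    A = rng.normal(size=n)
    if do_gaze is None:
        G = f_gaze(A, rng.normal(scale=0.1, size=n))  # observational regime
    else:
        G = np.full(n, do_gaze)                       # interventional regime
    Y = f_predict(G)
    return A, G, Y

# Under do(G = g) the prediction no longer varies with A, showing that
# gaze is the sole mediator on the attribute-to-prediction pathway.
_, _, y_obs = sample(1000)
_, _, y_do = sample(1000, do_gaze=0.5)
print(bool(np.allclose(y_do, 1.5 * 0.5)))  # → True
```

In this sketch, fixing G makes Y constant regardless of A; in the full model, the analogous check is that predictions remain stable when non-causal (dataset-bias) regions are perturbed while gaze-mediated attribute evidence is held fixed.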