Fine-grained visual classification (FGVC) becomes especially challenging when categories are organized hierarchically and the discriminative cues shrink from global shapes (order/family) to tiny parts (genus/species). Existing hierarchy-aware methods such as CHRF learn level-specific attention implicitly and use human gaze only as a post-hoc validation signal, leaving a rich source of supervision untapped. In this work we introduce GS-HAN, a gaze-supervised hierarchical attention network that explicitly aligns model attention with human fixation patterns at every level of the taxonomy. GS-HAN builds on a backbone feature extractor and CHRF-style region feature mining, but augments each hierarchy level with gaze-conditioned attention heads and a Hierarchical Gaze Alignment Loss that combines KL divergence and cosine similarity to match human gaze distributions. We further retain cross-hierarchical orthogonal fusion so that coarse-level, gaze-aligned context can enhance fine-level recognition. Evaluations on CUB-200-2011 with ARISTO gaze, as well as on Butterfly-200, VegFru, FGVC-Aircraft, and Stanford Cars, show that GS-HAN consistently outperforms strong FGVC baselines and hierarchy-aware methods, achieving 90.8% accuracy on CUB and clear gains at the most fine-grained (species) level. Ablations verify that (i) direct gaze supervision, not just hierarchy, drives the improvements; (ii) our loss improves quantitative gaze–attention similarity; and (iii) even partial gaze availability yields benefits. These results demonstrate that human gaze is an effective, underexploited supervisory signal for hierarchical FGVC, improving both accuracy and interpretability.
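The abstract describes a Hierarchical Gaze Alignment Loss that combines KL divergence and cosine similarity between per-level attention maps and human gaze distributions. The following is a minimal sketch of one plausible form of such a loss; the function name, the weighting factor `lam`, the per-level averaging, and the specific combination `KL(gaze || attention) + lam * (1 - cosine)` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def hierarchical_gaze_alignment_loss(attn_maps, gaze_maps, lam=0.5, eps=1e-8):
    """Hypothetical sketch of a gaze alignment loss.

    attn_maps, gaze_maps: lists of 2D arrays, one pair per hierarchy level
    (e.g. order, family, genus, species). Each map is normalized to a
    probability distribution; the loss sums a KL-divergence term
    KL(gaze || attention) and a (1 - cosine similarity) term per level.
    """
    total = 0.0
    for a, g in zip(attn_maps, gaze_maps):
        # Flatten and normalize both maps into probability distributions.
        a = a.ravel().astype(np.float64)
        g = g.ravel().astype(np.float64)
        a = a / (a.sum() + eps)
        g = g / (g.sum() + eps)
        # KL(gaze || attention): penalizes attention mass missing
        # from regions humans fixate on.
        kl = np.sum(g * (np.log(g + eps) - np.log(a + eps)))
        # Cosine similarity rewards directional agreement of the maps.
        cos = np.dot(a, g) / (np.linalg.norm(a) * np.linalg.norm(g) + eps)
        total += kl + lam * (1.0 - cos)
    return total / len(attn_maps)
```

When an attention map matches its gaze map exactly, both terms vanish, so the loss is (near) zero; mismatched maps yield a strictly larger value, which is the gradient signal that would pull model attention toward human fixations at each level.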