Gaze-Supervised Hierarchical Attention Networks for Fine-Grained Visual Classification

Edwin R. Hancock1
1School of Computer Science and Engineering, State Key Laboratory of Software, University of York, U.K.
DOI: https://doi.org/10.71448/bcds2341-1
Published: 30/04/2023
Cite this article as: Edwin R. Hancock. Gaze-Supervised Hierarchical Attention Networks for Fine-Grained Visual Classification. Bulletin of Computer and Data Sciences, Volume 4 Issue 1. Page: 1-14.

Abstract

Fine-grained visual classification (FGVC) becomes especially challenging when categories are organized hierarchically and the discriminative cues shrink from global shapes (order/family) to tiny parts (genus/species). Existing hierarchy-aware methods such as CHRF learn level-specific attentions implicitly, but they only use human gaze as a post-hoc validation signal, leaving a rich source of supervision unused. In this work we introduce GS-HAN, a gaze-supervised hierarchical attention network that explicitly aligns model attention with human fixation patterns at every level of the taxonomy. GS-HAN builds on a backbone feature extractor and CHRF-style region feature mining, but augments each hierarchy level with gaze-conditioned attention heads and a Hierarchical Gaze Alignment Loss that combines KL divergence and cosine similarity to match human gaze distributions. We further retain cross-hierarchical orthogonal fusion so that coarse-level, gaze-aligned context can enhance fine-level recognition. Evaluations on CUB-200-2011 with ARISTO gaze, as well as on Butterfly-200, VegFru, FGVC-Aircraft, and Stanford Cars, show that GS-HAN consistently outperforms strong FGVC baselines and hierarchy-aware methods, achieving 90.8% on CUB and clear gains at the most fine-grained (species) level. Ablations verify that (i) direct gaze supervision—not just hierarchy—drives the improvements, (ii) our loss improves quantitative gaze–attention similarity, and (iii) even partial gaze availability yields benefits. The results demonstrate that human gaze is an effective, underexploited supervisory signal for hierarchical FGVC, improving both accuracy and interpretability.

Keywords: fine-grained visual classification, hierarchical classification, human gaze supervision, attention alignment, CHRF, interpretability, bird recognition, ARISTO dataset, cross-hierarchical fusion, deep learning
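The Hierarchical Gaze Alignment Loss described in the abstract combines a KL-divergence term and a cosine-similarity term to pull the model's attention map toward the human gaze distribution at each hierarchy level. The paper's exact formulation is not reproduced here; the snippet below is a minimal sketch of that idea, assuming both maps are non-negative spatial heatmaps of shape (batch, H, W) and that the two terms are mixed with a hypothetical weight `lam`.

```python
import torch
import torch.nn.functional as F

def gaze_alignment_loss(attention: torch.Tensor,
                        gaze: torch.Tensor,
                        lam: float = 0.5,
                        eps: float = 1e-8) -> torch.Tensor:
    """Sketch of a gaze-alignment loss (illustrative, not the paper's code).

    attention, gaze: non-negative maps of shape (B, H, W).
    Returns KL(gaze || attention) + lam * (1 - cosine similarity),
    averaged over the batch.
    """
    b = attention.shape[0]
    att = attention.reshape(b, -1)
    gz = gaze.reshape(b, -1)

    # Normalize each map into a spatial probability distribution.
    att_p = att / (att.sum(dim=1, keepdim=True) + eps)
    gz_p = gz / (gz.sum(dim=1, keepdim=True) + eps)

    # KL divergence from the attention distribution to the gaze distribution.
    kl = (gz_p * (torch.log(gz_p + eps) - torch.log(att_p + eps))).sum(dim=1)

    # Cosine dissimilarity on the raw (unnormalized) maps.
    cos = F.cosine_similarity(att, gz, dim=1)

    return (kl + lam * (1.0 - cos)).mean()
```

At each taxonomy level this loss would be added to the usual classification loss, so coarse-level heads are matched to coarse gaze patterns and species-level heads to fine fixations; when a map perfectly matches its gaze target, both terms vanish.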


Copyright © 2023 Edwin R. Hancock. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
