Fine-grained visual classification (FGVC) becomes especially challenging when categories are organized hierarchically and the discriminative cues shrink from global shapes (order/family) to tiny parts (genus/species). Existing hierarchy-aware methods such as CHRF learn level-specific attention implicitly and use human gaze only as a post-hoc validation signal, leaving a rich source of supervision untapped. In this work we introduce GS-HAN, a gaze-supervised hierarchical attention network that explicitly aligns model attention with human fixation patterns at every level of the taxonomy. GS-HAN builds on a backbone feature extractor and CHRF-style region feature mining, but augments each hierarchy level with gaze-conditioned attention heads and a Hierarchical Gaze Alignment Loss that combines KL divergence and cosine similarity to match human gaze distributions. We further retain cross-hierarchical orthogonal fusion so that coarse-level, gaze-aligned context can enhance fine-level recognition. Evaluations on CUB-200-2011 with ARISTO gaze, as well as on Butterfly-200, VegFru, FGVC-Aircraft, and Stanford Cars, show that GS-HAN consistently outperforms strong FGVC baselines and hierarchy-aware methods, achieving 90.8% accuracy on CUB and clear gains at the most fine-grained (species) level. Ablations verify that (i) direct gaze supervision, not just hierarchy, drives the improvements; (ii) our loss improves quantitative gaze–attention similarity; and (iii) even partial gaze availability yields benefits. These results demonstrate that human gaze is an effective, underexploited supervisory signal for hierarchical FGVC, improving both accuracy and interpretability.
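The abstract describes a Hierarchical Gaze Alignment Loss that combines KL divergence and cosine similarity between per-level attention maps and human gaze distributions. The following is a minimal sketch of one plausible form of such a loss; the function name, the weighting factor `lam`, the per-level averaging, and the specific combination `KL(gaze || attention) + lam * (1 - cosine)` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def hierarchical_gaze_alignment_loss(attn_maps, gaze_maps, lam=0.5, eps=1e-8):
    """Hypothetical sketch of a gaze alignment loss.

    attn_maps, gaze_maps: lists of 2D arrays, one pair per hierarchy level
    (e.g. order, family, genus, species). Each map is normalized to a
    probability distribution; the loss sums a KL-divergence term
    KL(gaze || attention) and a (1 - cosine similarity) term per level.
    """
    total = 0.0
    for a, g in zip(attn_maps, gaze_maps):
        # Flatten and normalize both maps into probability distributions.
        a = a.ravel().astype(np.float64)
        g = g.ravel().astype(np.float64)
        a = a / (a.sum() + eps)
        g = g / (g.sum() + eps)
        # KL(gaze || attention): penalizes attention mass missing
        # from regions humans fixate on.
        kl = np.sum(g * (np.log(g + eps) - np.log(a + eps)))
        # Cosine similarity rewards directional agreement of the maps.
        cos = np.dot(a, g) / (np.linalg.norm(a) * np.linalg.norm(g) + eps)
        total += kl + lam * (1.0 - cos)
    return total / len(attn_maps)
```

When an attention map matches its gaze map exactly, both terms vanish, so the loss is (near) zero; mismatched maps yield a strictly larger value, which is the gradient signal that would pull model attention toward human fixations at each level.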