Few-shot image classification requires models to generalize from limited labeled examples. While metric-based approaches such as Prototypical Networks have demonstrated strong performance, they rely exclusively on visual features and ignore the rich semantic information encoded in class names. This paper presents a systematic empirical study of the interaction between visual and semantic modalities in few-shot learning. We introduce Asymmetric Cross-Modal Prototypical Networks (ACM-ProtoNet), a controlled experimental framework that augments standard prototypical learning with frozen CLIP text encoders to incorporate zero-cost linguistic priors. Our method explicitly models the symmetric relationship between visual and semantic modalities through learnable projection heads that map both image and text features into a shared embedding space. Image and text prototypes are fused via a learnable scalar gate α ∈ (0, 1), allowing adaptive balancing of the two modalities. Under our experimental setup (frozen CLIP encoders, scalar fusion gate, simple template-based prompts), comprehensive ablation studies on miniImageNet reveal an asymmetric pattern: cross-modal integration yields a statistically significant improvement in five-shot learning (+2.12 pp, p = 0.03125, Wilcoxon signed-rank test over five seeds) but not in one-shot learning (−0.09 pp, n.s.). Our key contribution is not state-of-the-art accuracy but controlled empirical evidence about cross-modal interaction patterns under specific design constraints.
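As a reading aid for the reported p = 0.03125: this value equals 1/32, the smallest p-value an exact one-sided Wilcoxon signed-rank test can return with five paired observations, reached when all five differences share the same sign. The sketch below enumerates the 2^5 sign patterns to show this. It assumes the one-sided exact variant with no ties or zero differences, which the abstract does not state, and the per-seed gains are hypothetical placeholders, not the paper's data.

```python
from itertools import product


def signed_rank_p_one_sided(diffs):
    """Exact one-sided p-value P(W+ >= observed) under the null hypothesis.

    Assumes no zero differences and no tied absolute values.
    """
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    # Observed statistic: sum of ranks belonging to positive differences.
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null, each of the 2^n sign patterns is equally likely.
    at_least_as_extreme = 0
    for signs in product([0, 1], repeat=n):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if w_plus >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / 2 ** n


# Hypothetical per-seed accuracy gains in pp (illustrative only, all positive).
example_diffs = [1.8, 2.4, 2.0, 2.3, 2.1]
p_value = signed_rank_p_one_sided(example_diffs)
print(p_value)
```

Under these assumptions, only the all-positive sign pattern is as extreme as the observed one, so the p-value is 1 out of 32 patterns.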
Further analysis shows that: (1) structured semantic information is essential, as random text features harm performance by 7.4–8.1 percentage points; (2) projection heads provide asymmetric benefits and are more critical in one-shot (−2.85 pp when removed) than in five-shot learning (−0.74 pp); (3) text-only prototypes achieve near-random performance (≈20%), suggesting that semantics alone are insufficient in our setup; (4) a shuffled-class-name ablation confirms genuine semantic binding: randomly permuting class-name assignments causes consistent degradation (five-shot: −5.74 pp, p<0.001; one-shot: −3.83 pp, p<0.001 across five seeds). These findings, specific to our simple fusion design, reveal an asymmetric pattern that is equally consistent with two hypotheses: (i) semantic priors may require sufficient visual context to be useful, or (ii) our scalar fusion gate may lack the capacity to leverage text in the extreme low-data regime of one-shot learning. This ambiguity motivates future work with more expressive fusion mechanisms and stronger text representations.
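To make the scalar gate in hypothesis (ii) concrete, the following is a minimal sketch of gated prototype fusion. The abstract states only that image and text prototypes are fused via a learnable scalar α ∈ (0, 1). The convex combination, the sigmoid parameterization of α, the choice that α weights the image side, and the toy two-dimensional features standing in for projected CLIP embeddings are all assumptions for illustration. Classification by nearest prototype under squared Euclidean distance follows standard Prototypical Networks.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (the visual prototype)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fuse(img_proto, txt_proto, gate_logit):
    """Assumed fusion rule: alpha * image + (1 - alpha) * text.

    alpha = sigmoid(gate_logit) keeps the gate inside (0, 1).
    """
    alpha = sigmoid(gate_logit)
    return [alpha * v + (1.0 - alpha) * t for v, t in zip(img_proto, txt_proto)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Index of the nearest prototype under squared Euclidean distance."""
    return min(range(len(prototypes)),
               key=lambda c: squared_distance(query, prototypes[c]))


# Toy 2-way, 2-shot episode; features are assumed to be already projected
# into the shared space (hypothetical values, not CLIP outputs).
support = [
    [[1.0, 0.0], [0.8, 0.2]],  # class 0 support embeddings
    [[0.0, 1.0], [0.2, 0.8]],  # class 1 support embeddings
]
text_protos = [[1.0, 0.0], [0.0, 1.0]]  # one text embedding per class
gate_logit = 0.0                         # alpha = 0.5 at initialization

prototypes = [
    fuse(mean_vector(class_support), text_proto, gate_logit)
    for class_support, text_proto in zip(support, text_protos)
]
prediction = classify([0.7, 0.3], prototypes)
print(prototypes, prediction)
```

Because a single scalar weights every class and every embedding dimension identically, the gate cannot down-weight text for some classes while relying on it for others, which is one way to read the capacity concern raised above.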
Shengyu Xie
Guobin Deng
Xingxing Yang
Symmetry
Hong Kong Baptist University
Minzu University of China
Nanning Normal University
Xie et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69e4741c010ef96374d8fd7c — DOI: https://doi.org/10.3390/sym18040670