What question did this study set out to answer?

This research investigates how integrating visual and semantic data affects few-shot image classification.

April 19, 2026

Open Access

Asymmetric Cross-Modal Prototypical Networks for Few-Shot Image Classification

Key Points

Developed Asymmetric Cross-Modal Prototypical Networks (ACM-ProtoNet) using frozen CLIP text encoders.
Implemented a scalar fusion gate to balance visual and semantic features.
Conducted experiments on miniImageNet in one-shot and five-shot settings, with ablation studies and statistical testing.
Found a significant improvement in five-shot learning (+2.12 percentage points) with cross-modal integration.
Noted that random text features reduced performance by 7.4–8.1 percentage points.
Observed that text-only prototypes performed near-random (≈20% accuracy).
Ablation studies confirmed semantic binding was essential, showing consistent degradation when class names were shuffled.
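
The scalar fusion gate named in the key points can be illustrated with a small sketch. The abstract describes the gate only as a learnable scalar α∈(0,1) that fuses image and text prototypes. The rest of the code is assumed for illustration: the sigmoid parameterization, the choice to put α on the image prototype and (1 − α) on the text prototype, the squared Euclidean distance (the usual choice in Prototypical Networks), and the toy 2-D embeddings. None of it is taken from the authors' implementation.

```python
import math


def sigmoid(x):
    """Squash an unconstrained scalar into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (the visual prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse(image_proto, text_proto, gate_logit):
    """Convex combination of two prototypes with a scalar gate alpha.

    Assumption: alpha weights the image prototype, (1 - alpha) the text one.
    """
    alpha = sigmoid(gate_logit)
    return [alpha * v + (1.0 - alpha) * t for v, t in zip(image_proto, text_proto)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Return the label whose prototype is nearest to the query."""
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))


# Toy embeddings, assumed to be already projected into the shared space.
# All values are made up for illustration.
support = {
    "cat": [[1.0, 0.0], [0.8, 0.2]],
    "dog": [[0.0, 1.0], [0.2, 0.8]],
}
text_embed = {"cat": [0.9, 0.3], "dog": [0.1, 0.9]}
gate_logit = 0.0  # sigmoid(0) = 0.5; in training this scalar would be learned

fused = {
    label: fuse(mean_vector(shots), text_embed[label], gate_logit)
    for label, shots in support.items()
}
prediction = classify([0.7, 0.3], fused)
print(fused, prediction)
```

Because the gate is a single scalar, every class and every embedding dimension shares the same mixing weight. That limited capacity is the concern behind hypothesis (ii) in the abstract.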

Abstract

Few-shot image classification requires models to generalize from limited labeled examples. While metric-based approaches such as Prototypical Networks have demonstrated strong performance, they rely exclusively on visual features and ignore the rich semantic information encoded in class names. This paper presents a systematic empirical study investigating the interaction between visual and semantic modalities in few-shot learning. We introduce Asymmetric Cross-Modal Prototypical Networks (ACM-ProtoNet), a controlled experimental framework that augments standard prototypical learning with frozen CLIP text encoders to incorporate zero-cost linguistic priors. Our method explicitly models the symmetric relationship between visual and semantic modalities through learnable projection heads that map both image and text features into a shared embedding space. Image and text prototypes are fused via a learnable scalar gate α∈(0,1), allowing adaptive balancing of modalities. Under our experimental setup (frozen CLIP encoders, scalar fusion gate, simple template-based prompts), we observe an asymmetric pattern in comprehensive ablation studies on miniImageNet: cross-modal integration yields a statistically significant improvement in five-shot (+2.12 pp, p=0.03125, Wilcoxon signed-rank test over five seeds) but not in one-shot (−0.09 pp, n.s.) learning. Our key contribution is not achieving state-of-the-art accuracy but rather providing controlled empirical evidence about cross-modal interaction patterns under specific design constraints.
Further analysis shows that: (1) structured semantic information is essential—random text features harm performance by 7.4–8.1 percentage points; (2) projection heads provide asymmetric benefits, more critical in one-shot (−2.85 pp when removed) than in five-shot learning (−0.74 pp); (3) text-only prototypes achieve near-random performance (≈20%), suggesting that semantics alone are insufficient in our setup; (4) shuffled-class-name ablation confirms genuine semantic binding, where randomly permuting class-name assignments causes consistent degradation (five-shot: −5.74 pp, p<0.001; one-shot: −3.83 pp, p<0.001 across five seeds). These findings, specific to our simple fusion design, reveal an asymmetric pattern that is equally consistent with two hypotheses: (i) semantic priors may require sufficient visual context to be useful, or (ii) our scalar fusion gate may lack the capacity to leverage text in the extreme low-data regime of one-shot learning. This ambiguity motivates future work with more expressive fusion mechanisms and stronger text representations.
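
The reported p=0.03125 equals 1/2^5. That is the smallest value an exact one-sided Wilcoxon signed-rank test can return with five paired observations, and it occurs when all five differences share the same sign. The abstract does not state whether the test was one-sided or exact, so the sketch below only illustrates that arithmetic and does not reconstruct the authors' analysis. The per-seed gains are hypothetical placeholders, not the paper's data, and the function assumes no zero differences and no tied magnitudes.

```python
from itertools import product


def wilcoxon_exact_one_sided(diffs):
    """Exact one-sided Wilcoxon signed-rank p-value (H1: differences > 0).

    Assumes no zero differences and no tied absolute values,
    so the ranks are simply 1..n.
    """
    n = len(diffs)
    # Rank absolute differences from smallest (rank 1) to largest (rank n).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    # Observed statistic: sum of the ranks attached to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 each of the 2**n sign patterns is equally likely; count those
    # whose statistic is at least as large as the observed one.
    at_least = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if w >= w_plus:
            at_least += 1
    return at_least / 2 ** n


# Hypothetical per-seed accuracy gains in percentage points (NOT the paper's data).
gains = [1.9, 2.4, 2.0, 2.3, 1.7]
p_value = wilcoxon_exact_one_sided(gains)
print(p_value)
```

With five seeds, one loss on any seed lifts the one-sided p-value to at least 0.0625, so a five-seed design can cross the 0.05 threshold only when the improvement holds on every seed.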

Authors

Shengyu Xie

Guobin Deng

Xingxing Yang

Journals

Symmetry

Institutions

Hong Kong Baptist University

Minzu University of China

Nanning Normal University
