Abstract
Lip reading interprets speech from visual lip movements and is a valuable complement to audio-based recognition in acoustically challenging environments. However, existing models rely on supervised, closed-set classification and cannot recognize out-of-vocabulary words, which severely limits their practical use. To address this zero-shot learning (ZSL) problem, we propose VSMatch-Lip, a non-generative visual-semantic matching framework. Our approach rests on the observation that lip movements are a visual manifestation of phonetics, which gives a strong physical correlation to generalize from, but that this correlation alone is insufficient because of ambiguities such as homophones. Our core innovation is therefore a multi-source fused semantic representation that integrates lexical meaning with phonetic cues. The phonetic component grounds the alignment in visual articulation, while the semantic component provides disambiguation, yielding a more robust and discriminative matching target. To optimize this matching, we design a tailored contrastive learning framework with specialized optimization strategies that address large intra-class variance and training instability. As a further contribution, we establish the first comprehensive ZSL benchmark on large-scale, in-the-wild lip reading datasets. Extensive experiments on this benchmark show that VSMatch-Lip achieves state-of-the-art performance, consistently outperforming all baselines, including contemporary generative models. Notably, under a 19:1 seen-to-unseen ratio on LRW, it surpasses the strongest generative baseline by nearly 9% in Top-1 unseen accuracy. To the best of our knowledge, this is the first successful and rigorous validation of a non-generative, direct-matching ZSL framework on large-scale, in-the-wild lip reading benchmarks.
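The matching idea summarized above can be illustrated with a minimal sketch. It is not the VSMatch-Lip implementation: the fusion rule (concatenating normalized lexical and phonetic vectors), the function names, the temperature, and the toy vectors are all assumptions made for illustration, and the visual embedding is assumed to have already been projected into the fused space by some encoder.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def fuse(semantic, phonetic):
    """Hypothetical fusion: concatenate the normalized parts, then renormalize."""
    return l2_normalize(l2_normalize(semantic) + l2_normalize(phonetic))

def cosine(a, b):
    """Dot product; equals cosine similarity because inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))

def predict(visual, targets):
    """Return the word whose fused target is closest to the visual embedding."""
    v = l2_normalize(visual)
    scores = {word: cosine(v, t) for word, t in targets.items()}
    return max(scores, key=scores.get), scores

def info_nce(visual, targets, positive, temperature=0.1):
    """Contrastive (InfoNCE-style) loss of one clip against all word targets."""
    v = l2_normalize(visual)
    logits = {w: cosine(v, t) / temperature for w, t in targets.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits.values()))
    return log_z - logits[positive]

# Toy vocabulary: "bare" and "bear" share phonetics but differ in meaning.
targets = {
    "bare": fuse([1.0, 0.0], [1.0, 0.0]),
    "bear": fuse([0.0, 1.0], [1.0, 0.0]),
    "cat":  fuse([0.0, 1.0], [0.0, 1.0]),
}
# Assumed output of a visual encoder for a clip of someone saying "bear".
visual = [0.1, 0.9, 1.0, 0.0]

word, scores = predict(visual, targets)
print(word, {w: round(s, 3) for w, s in scores.items()})
```

In this toy setup the phonetic half of each target cannot separate the two homophones, so the lexical half decides between them, which mirrors the disambiguation role the abstract assigns to the semantic component. Because prediction is a nearest-target lookup, a word unseen during training can be added simply by computing its fused target.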
Shen et al. (Thu,) studied this question.