Speech-driven face generation aims to synthesize a face image that matches a speaker’s identity from speech alone. However, existing methods typically trade identity fidelity for visual quality and rely on large end-to-end generators that are difficult to train and tune. We propose Vox2Face, a speech-driven face generation framework centered on an explicit identity space rather than a direct speech-to-image mapping. A pretrained speaker encoder first extracts speech embeddings, which are distilled and metric-aligned to the ArcFace hyperspherical identity space, turning cross-modal regression into a geometrically interpretable speech-to-identity alignment problem. On this unified identity representation, we reuse an identity-conditioned diffusion model as the generative backbone and synthesize diverse, high-resolution faces in the Stable Diffusion latent space. To better exploit this prior, we introduce a discriminator-free diffusion self-consistency loss that treats denoising residuals as an implicit critique of speech-predicted identity embeddings. This loss updates only the speech-to-identity mapping and lightweight LoRA adapters, encouraging speech-derived identities to lie on the high-probability identity manifold of the diffusion model. Experiments on the HQ-VoxCeleb dataset show that, relative to a strong diffusion baseline, Vox2Face improves the ArcFace cosine similarity from 0.295 to 0.322, boosts R@10 retrieval accuracy from 29.8% to 32.1%, and raises the VGGFace Score from 18.82 to 23.21. These results indicate that aligning speech to a unified identity space and reusing a strong identity-conditioned diffusion prior is an effective way to jointly improve identity fidelity and visual quality.
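To make the speech-to-identity alignment and the reported metrics concrete, the following is a minimal sketch, not the authors' implementation. It assumes that alignment can be approximated by a cosine objective between speech-predicted identity vectors and reference face identity vectors on the unit hypersphere, and that R@K is computed by ranking a gallery of face embeddings by cosine similarity. The abstract does not specify the exact distillation or metric-alignment losses, so the function names (`alignment_loss`, `recall_at_k`) and the toy 2-D vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def alignment_loss(pred_ids, face_ids):
    """Assumed objective: 1 minus the mean cosine similarity of matched pairs."""
    sims = [cosine_similarity(p, f) for p, f in zip(pred_ids, face_ids)]
    return 1.0 - sum(sims) / len(sims)


def recall_at_k(pred_ids, gallery_ids, k):
    """Fraction of queries whose true match (same index) ranks in the top k."""
    hits = 0
    for i, p in enumerate(pred_ids):
        ranked = sorted(
            range(len(gallery_ids)),
            key=lambda j: cosine_similarity(p, gallery_ids[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(pred_ids)


# Toy usage with hypothetical 2-D embeddings (real ArcFace vectors are much larger).
speech_pred = [[0.9, 0.1], [0.2, 0.8]]
face_ref = [[1.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(speech_pred, face_ref)
r_at_1 = recall_at_k(speech_pred, face_ref, k=1)
```

Because both embeddings are normalized before comparison, only their direction matters, which is what makes the alignment problem geometrically interpretable on the hypersphere.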
Ma et al. studied this question.
www.synapsesocial.com/papers/699405774e9c9e835dfd64d4 — DOI: https://doi.org/10.3390/info17020200
Qiming Ma
Yizhen Wang
Xiang Sun